Amharic N-Gram Language Model for Auto-Completion, Implemented in Python and NumPy from Scratch

yordanoswuletaw/amharic-ngram-autocomplete


Amharic N-Gram Language Model for Auto-Complete

This project implements an N-gram language model for Amharic auto-completion using only Python and NumPy. Given an input sequence, the model uses N-gram probabilities to predict and suggest the most probable next words.

Features

  • N-Gram Based Predictions: Supports unigram, bigram, trigram, and higher-order n-gram models to generate context-aware suggestions.
  • Amharic Language Support: Handles the structure and morphologically rich nature of Amharic text.
  • Tokenization: Includes an Amharic-specific tokenizer to handle words and punctuation correctly.
  • Smoothing Techniques: Implements smoothing methods (e.g., Laplace smoothing) so that unseen n-grams do not receive zero probability.
  • Scalable Design: Can be trained on large datasets for improved accuracy.
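
To illustrate how the tokenization, n-gram counting, and Laplace smoothing features fit together, here is a minimal sketch. It is not the code in src/: the function names (tokenize, count_ngrams, laplace_probability), the punctuation set, and the two-sentence toy corpus are assumptions made for this example, and it uses only the standard library for brevity.

```python
import re
from collections import Counter

# Assumed punctuation set: Ethiopic marks U+1360..U+1368 plus common ASCII marks.
PUNCT = "\u1360-\u1368.,!?;:"
TOKEN_RE = re.compile(rf"[{PUNCT}]|[^\s{PUNCT}]+")

def tokenize(text):
    """Split text into words, keeping each punctuation mark as its own token."""
    return TOKEN_RE.findall(text)

def count_ngrams(sentences, n):
    """Count n-grams, padding each sentence with <s> markers and a closing </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + tokenize(sentence) + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def laplace_probability(word, context, ngram_counts, vocab_size, k=1.0):
    """P(word | context) with add-k smoothing (k=1 is Laplace smoothing)."""
    context = tuple(context)
    # Total count of n-grams that start with this context.
    context_total = sum(c for ng, c in ngram_counts.items() if ng[:-1] == context)
    return (ngram_counts[context + (word,)] + k) / (context_total + k * vocab_size)

# Toy corpus, invented for illustration only.
corpus = ["አበበ በሶ በላ።", "አበበ በሶ ጠጣ።"]
bigrams = count_ngrams(corpus, 2)
vocab_size = len({ng[-1] for ng in bigrams})  # predictable tokens, including </s>

p_seen = laplace_probability("በላ", ["በሶ"], bigrams, vocab_size)
p_unseen = laplace_probability("አበበ", ["በሶ"], bigrams, vocab_size)
print(p_seen)    # 0.25  -> (1 + 1) / (2 + 6)
print(p_unseen)  # 0.125 -> (0 + 1) / (2 + 6)
```

The unseen bigram still gets a non-zero probability, which is the point of smoothing. A real implementation would precompute the context totals instead of scanning all n-grams on every call.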

Installation

  1. Clone this repository:
    git clone https://github.com/yordanoswuletaw/amharic-ngram-autocomplete.git
  2. Navigate to the project directory:
    cd amharic-ngram-autocomplete
  3. Install the required dependencies:
    pip install -r requirements.txt

Notebooks

  1. Amharic Auto Complete

Repository Structure

├── .vscode/
│   └── settings.json            # VS Code settings for environment setup
├── .github/
│   └── workflows/
│       └── unittests.yml        # CI/CD pipeline for unit tests
├── .gitignore                   # Ignored files and folders
├── requirements.txt             # Dependencies for the project
├── README.md                    # Documentation of the repository
├── data/                        # Dataset for training, dev and testing
├── src/                         # Source code for analysis and processing
├── notebooks/
│   ├── __init__.py              # Package initialization
│   └── README.md                # Documentation for the notebooks
├── tests/
│   └── __init__.py              # Test initialization
└── scripts/
    ├── __init__.py              # Scripts package initialization
    └── README.md                # Documentation for scripts

Requirements

  • Python 3.8+
  • Required Python libraries (see requirements.txt)

Examples

Input: አበበ በሶ
Suggestions: በላ
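
A suggestion like the one above could be produced roughly as follows. This is a hedged sketch rather than the project's actual API: train and suggest are hypothetical names, the tokenizer is a plain whitespace split, and the three-sentence corpus is made up so that the example is self-contained.

```python
from collections import Counter, defaultdict

def train(sentences, n=3):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split()
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - (n - 1):i])
            followers[context][tokens[i]] += 1
    return followers

def suggest(prefix, followers, n=3, top_k=3):
    """Return up to top_k most frequent continuations of the last n-1 words."""
    tokens = ["<s>"] * (n - 1) + prefix.split()
    context = tuple(tokens[len(tokens) - (n - 1):])
    return [word for word, _ in followers.get(context, Counter()).most_common(top_k)]

# Toy corpus, invented for illustration only.
corpus = ["አበበ በሶ በላ", "አበበ በሶ በላ", "አበበ በሶ ጠጣ"]
model = train(corpus, n=3)

print(suggest("አበበ በሶ", model))  # ['በላ', 'ጠጣ']
```

Ranking by raw counts is the simplest choice. Ranking by smoothed probabilities, or backing off to shorter contexts when the full context is unseen, are natural extensions.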

Future Work

  • Integrate neural language models for Amharic language tokenization.
  • Expand support for additional Amharic linguistic features and dialects.

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue to suggest improvements or report bugs.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

Special thanks to:

  • The Amharic NLP community for providing open-source datasets.
