This project implements an N-Gram language model for the Amharic language using only Python and NumPy, designed to provide auto-completion functionality. The model employs a simple N-Gram probabilistic approach to predict and suggest the most probable next words based on input sequences.
- N-Gram Based Predictions: Supports unigram, bigram, trigram, and higher-order n-gram models to generate context-aware suggestions.
- Amharic Language Support: Handles the structure and rich morphology of Amharic text.
- Tokenization: Includes an Amharic-specific tokenizer to handle words and punctuation correctly.
- Smoothing Techniques: Implements smoothing methods (e.g., Laplace smoothing) to address the issue of zero probabilities.
- Scalable Design: Can be trained on large datasets for improved accuracy.
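
As a rough illustration of the tokenization, counting, and Laplace smoothing features listed above, here is a minimal sketch using only the Python standard library. The names `tokenize`, `count_ngrams`, and `laplace_bigram_prob`, the token pattern, and the toy corpus are assumptions made for this example and are not taken from the repository's source code.

```python
import re
from collections import Counter

# Assumed token pattern: a run of Ethiopic letters, a single punctuation
# mark (including Ethiopic marks such as "።"), or a run of other word characters.
TOKEN_RE = re.compile(r"[\u1200-\u135F]+|[^\w\s]|\w+")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def count_ngrams(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def laplace_bigram_prob(prev, word, unigrams, bigrams):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[(prev,)] + vocab_size)


# Toy corpus, made up for illustration only.
tokens = tokenize("አበበ በሶ በላ። ጫላ በሶ በላ።")
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)

# A bigram seen in the corpus versus one that never occurs.
seen = laplace_bigram_prob("በሶ", "በላ", unigrams, bigrams)
unseen = laplace_bigram_prob("በሶ", "ጫላ", unigrams, bigrams)
print(seen, unseen)
```

With add-one smoothing the unseen bigram still receives a small non-zero probability, which is the zero-probability problem that smoothing is meant to address.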
- Clone this repository:
git clone https://github.com/yordanoswuletaw/amharic-ngram-autocomplete.git
- Navigate to the project directory:
cd amharic-ngram-autocomplete
- Install the required dependencies:
pip install -r requirements.txt
├── .vscode/
│   └── settings.json        # VS Code settings for environment setup
├── .github/
│   └── workflows/
│       └── unittests.yml    # CI/CD pipeline for unit tests
├── .gitignore               # Ignored files and folders
├── requirements.txt         # Dependencies for the project
├── README.md                # Documentation of the repository
├── data/                    # Dataset for training, dev and testing
├── src/                     # Source code for analysis and processing
├── notebooks/
│   ├── __init__.py          # Package initialization
│   └── README.md            # Documentation for the notebooks
├── tests/
│   └── __init__.py          # Test initialization
└── scripts/
    ├── __init__.py          # Scripts package initialization
    └── README.md            # Documentation for scripts
- Python 3.8+
- Required Python libraries (see requirements.txt)
Input: አበበ በሶ
Suggestions: በላ
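
The example above can be reproduced with a small sketch like the following. It is not the project's actual API: the `train` and `suggest` functions, the whitespace tokenization, the simple back-off to shorter contexts, and the three-sentence corpus are all assumptions chosen to keep the illustration short.

```python
from collections import Counter, defaultdict


def train(sentences, n=3):
    """Map each context of up to n-1 words to a Counter of following words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            # Record the word under every context length from 0 to n-1.
            for k in range(min(i, n - 1) + 1):
                context = tuple(tokens[i - k:i])
                model[context][word] += 1
    return model


def suggest(model, text, n=3, top_k=3):
    """Return up to top_k likely next words, backing off to shorter contexts."""
    tokens = text.split()
    for k in range(min(len(tokens), n - 1), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        if context in model:
            return [w for w, _ in model[context].most_common(top_k)]
    return []


# Hypothetical toy corpus, for illustration only.
corpus = ["አበበ በሶ በላ", "አበበ ወተት ጠጣ", "ጫላ በሶ በላ"]
model = train(corpus)

print(suggest(model, "አበበ በሶ"))   # ['በላ']
print(suggest(model, "ጫላ ወተት"))  # ['ጠጣ'], found via the one-word context
```

The second call shows the back-off: the two-word context never appears in the toy corpus, so the sketch falls back to the last word alone.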
- Integrate neural language models for Amharic language tokenization.
- Expand support for additional Amharic linguistic features and dialects.
Contributions are welcome! Please feel free to submit a pull request or open an issue to suggest improvements or report bugs.
This project is licensed under the MIT License. See the LICENSE file for details.
Special thanks to:
- The Amharic NLP community for providing open-source datasets.