This repository contains a collection of toy implementations and examples of key components from modern Transformer architectures. Each example is designed to be educational, well-documented, and easy to understand.
| Component | Description | Paper |
|---|---|---|
| Multi-Head Latent Attention (MLA) | An attention mechanism from DeepSeek V2 that compresses keys and values into a low-rank latent vector to shrink the KV cache, combined with decoupled Rotary Position Embeddings | DeepSeek V2 Technical Report |
| Multi-Head Attention | The original attention mechanism from the Transformer paper | Attention Is All You Need |
| Relative Multi-Head Attention | Attention with relative position representations | Self-Attention with Relative Position Representations |
| Absolute Positional Encoding | Sinusoidal positional encoding from the original Transformer | Attention Is All You Need |
| Rotary Position Embedding | Positional encoding that rotates query and key vectors by position-dependent angles | RoFormer |
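As a small taste of what the notebooks cover, here is a minimal sketch of the sinusoidal (absolute) positional encoding listed above. It is a generic NumPy illustration of the formula from the original Transformer paper, not the code in this repository:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return the (seq_len, d_model) sinusoidal encoding from 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=8, d_model=16)
print(pe.shape)  # (8, 16)
```

Even and odd embedding dimensions alternate between sine and cosine at different frequencies, so each position gets a unique, smoothly varying pattern.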
- Create and activate a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install the package in development mode:

  ```bash
  pip install -e .
  ```

- Install additional dependencies:

  ```bash
  pip install -r requirements.txt
  ```
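If the installation succeeded, a quick import check should work. This is a minimal sketch that assumes PyTorch is among the dependencies in `requirements.txt`; adjust the import to whatever framework the repository actually uses:

```python
# Quick sanity check after installation.
# Assumption: PyTorch is listed in requirements.txt; swap in the actual
# framework if this repository uses something else.
import torch

x = torch.randn(2, 4, 8)  # dummy (batch, seq_len, d_model) tensor
print(torch.__version__, x.shape)
```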
Each component has its own directory with:
- Implementation code
- Jupyter notebook with examples (and visualizations)
To run a notebook:
- Make sure Jupyter is installed:

  ```bash
  pip install jupyter
  ```

- Start Jupyter:

  ```bash
  jupyter notebook
  ```

- In your browser, navigate to the component you want to explore (e.g., `attention/mla_attention.ipynb`).
- Click on the notebook to open it.
- You can run cells individually by pressing `Shift+Enter`, or run all cells from the `Cell` menu.
For example, to explore Multi-Head Latent Attention from DeepSeek:

```bash
cd attention
jupyter notebook mla_attention.ipynb
```
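For reference, the core operation that all of the attention variants above build on is scaled dot-product attention. The following is a minimal, self-contained NumPy sketch for illustration only; it is not the repository's implementation, which the notebooks develop with multi-head projections and the positional schemes listed above:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """Generic single-head attention: q, k of shape (seq_len, d_k), v of shape (seq_len, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # pairwise similarity, (seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # hide masked positions before softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # weighted sum of value vectors

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(q, k, v).shape)  # (4, 8)
```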
Contributions are welcome! If you'd like to add a new component or improve an existing one, please feel free to submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.