- Data Collection: Collect a large dataset of text.
- Preprocessing: Tokenize the text into subwords, typically with the BPE (byte-pair encoding) algorithm (see the sketch after this list).
- Model Architecture: Choose a model architecture.
- Training: Pre-train the model on the dataset; decide on data splitting and training strategies.
- Fine-Tuning: Fine-tune the model on a smaller dataset.
- Inference: Use the model to generate text.
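
The BPE step above can be illustrated with a toy merge loop: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent one into a new subword. A minimal sketch, assuming a tiny whitespace-separated toy vocabulary; the merge budget of 10 is just illustrative, and real tokenizer libraries are far more optimized.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every adjacent occurrence of `pair` into a single new symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

# Toy corpus: each word is a sequence of character symbols with its frequency.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(10):                     # merge budget; real vocabularies use tens of thousands
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)    # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)   # learned merge rules, e.g. ('e', 's'), ('es', 't'), ...
```

Encoding new text then just replays these merges in order; the resulting subwords are the tokens the model actually sees.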
- Train V3-Base with RL for reasoning -> R1-Zero
- Create SFT data from R1-Zero using rejection sampling + synthetic data from V3 (see the sketch after this list).
- Re-train V3-Base (starting again from the base checkpoint) on the SFT data, followed by RL (reasoning + human preferences) -> R1
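
The rejection-sampling step can be pictured as: sample several candidate answers per prompt, keep only those that pass a checker, and turn the survivors into SFT examples. A minimal sketch; `generate` and `is_acceptable` are hypothetical stand-ins for the real sampling call and correctness/quality filter, not DeepSeek's pipeline.

```python
import random

def generate(model, prompt, num_samples=8):
    """Hypothetical sampling call: returns num_samples candidate completions."""
    return [model(prompt) for _ in range(num_samples)]

def is_acceptable(prompt, completion):
    """Hypothetical filter: e.g. final answer matches a reference, format checks pass."""
    return "Answer:" in completion

def build_sft_dataset(model, prompts, num_samples=8, keep_per_prompt=1):
    """Rejection sampling: keep only completions that pass the filter."""
    sft_examples = []
    for prompt in prompts:
        candidates = generate(model, prompt, num_samples)
        accepted = [c for c in candidates if is_acceptable(prompt, c)]
        # Keep at most keep_per_prompt accepted completions per prompt.
        for completion in random.sample(accepted, min(keep_per_prompt, len(accepted))):
            sft_examples.append({"prompt": prompt, "completion": completion})
    return sft_examples

# Usage with a dummy "model" that just returns a canned answer:
dummy_model = lambda prompt: "Reasoning... Answer: 42"
print(build_sft_dataset(dummy_model, ["What is 6 * 7?"]))
```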
- For each input question, sample multiple responses
- Compute a reward for each, and calculate its group-normalized advantage
- No need for a critic model (answers are rewarded relative to the group); see the sketch below
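
Those three bullets amount to a few lines of arithmetic: score every response in the group, then standardize each reward against the group's own mean and standard deviation, so no learned value/critic model is needed. A minimal sketch of this group-normalized advantage; `sample_responses` and `reward_fn` are hypothetical stand-ins, and this is a generic GRPO-style computation, not DeepSeek's training code.

```python
import statistics

def group_normalized_advantages(rewards):
    """Advantage of each response = (reward - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)          # population std over the group
    if std == 0:                              # every response scored the same
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def grpo_style_advantages(questions, sample_responses, reward_fn, group_size=4):
    """For each question: sample a group of responses, reward each one,
    and normalize the rewards within the group (no critic model involved)."""
    advantages = {}
    for q in questions:
        responses = sample_responses(q, group_size)       # hypothetical sampler
        rewards = [reward_fn(q, r) for r in responses]    # e.g. rule-based correctness reward
        advantages[q] = list(zip(responses, group_normalized_advantages(rewards)))
    return advantages

# Toy example: rewards for one question's group of four sampled responses.
print(group_normalized_advantages([1.0, 0.0, 1.0, 0.0]))
# -> [1.0, -1.0, 1.0, -1.0]: above-average answers get positive advantage.
```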
Foundational Models
Model | Parameters | Methods | Notes |
---|---|---|---|
V1 | 67B | Traditional Transformer | |
V2 | 236B | (MLA) Multi-head Latent Attention; (MoE) Mixture of Experts | MLA + MoE made the model faster and cheaper to run |
V3 | 671B | (RL) Reinforcement Learning; (MLA) Multi-head Latent Attention; (MoE) Mixture of Experts | Balances expert load across many GPUs |
R1-Zero | 671B | (RL) Reinforcement Learning; (MLA) Multi-head Latent Attention; (MoE) Mixture of Experts; (CoT) Chain of Thought | Reasoning model |
R1 | 671B | (RL) Reinforcement Learning; (SFT) Supervised Fine-Tuning; (MLA) Multi-head Latent Attention; (MoE) Mixture of Experts; (CoT) Chain of Thought | Reasoning model; human preferences |
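
Two of the mechanisms in the table, MoE routing and expert load balancing, can be illustrated with a toy top-k router. A minimal sketch in NumPy under assumed shapes; it shows the generic idea (route each token to its top-k experts and measure how unevenly experts are used), not DeepSeek's actual MLA/MoE implementation or its specific load-balancing scheme.

```python
import numpy as np

def topk_moe_route(token_logits, k=2):
    """token_logits: [num_tokens, num_experts] router scores.
    Returns, per token, the indices and softmax weights of its top-k experts."""
    topk_idx = np.argsort(token_logits, axis=-1)[:, -k:]               # [T, k]
    topk_scores = np.take_along_axis(token_logits, topk_idx, axis=-1)  # [T, k]
    weights = np.exp(topk_scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # renormalize over the k chosen experts
    return topk_idx, weights

def load_balance_penalty(topk_idx, num_experts):
    """Simple balance measure: how unevenly tokens are spread over experts.
    Equals 0 when every expert handles the same number of tokens."""
    counts = np.bincount(topk_idx.ravel(), minlength=num_experts).astype(float)
    frac = counts / counts.sum()
    return float(((frac - 1.0 / num_experts) ** 2).sum())

# Toy example: 6 tokens routed over 4 experts.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 4))
idx, w = topk_moe_route(logits, k=2)
print(idx)                              # which experts each token goes to
print(load_balance_penalty(idx, 4))     # larger when a few experts get most tokens
```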
Model | Full Name | Organization |
---|---|---|
BERT | Bidirectional Encoder Representations from Transformers | Google |
XLNet | Generalized Autoregressive Pretraining for Language Understanding | Google/CMU |
RoBERTa | A Robustly Optimized BERT Approach | Facebook AI |
DistilBERT | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | Hugging Face |
CTRL | Conditional Transformer Language Model for Controllable Generation | Salesforce |
GPT-2 | Generative Pre-trained Transformer 2 | OpenAI |
ALBERT | A Lite BERT for Self-supervised Learning of Language Representations | Google Research |
Megatron | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | NVIDIA |