v0.8.0
New Features
Sequence parallelism support via ring-flash-attn
This enables long-context training by distributing each sequence across GPUs, reducing per-device memory requirements while allowing near-linear scaling of context length with the number of GPUs. It complements the other parallelism features Axolotl offers, including FSDP and DeepSpeed. See our sequence parallelism documentation for details; a config sketch follows below.
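To give a sense of how this is enabled, here is a minimal config sketch. The `sequence_parallel_degree` option name, the example model, and the sequence length are assumptions on our part, so verify them against the sequence parallelism documentation before use.

```yaml
# Minimal sketch with assumed option names; verify against the sequence parallelism docs.
base_model: NousResearch/Meta-Llama-3-8B  # example model, not prescriptive
sequence_len: 32768                       # long-context target made feasible by SP
flash_attention: true                     # sequence parallelism builds on ring-flash-attn
sequence_parallel_degree: 4               # shard each sequence across 4 GPUs
micro_batch_size: 1
```

Training is then launched as usual (for example, `axolotl train config.yaml` on a multi-GPU node); each group of `sequence_parallel_degree` GPUs processes shards of the same sequence.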
Gemma-3 support has landed, alongside several features to help you fine-tune Gemma-3 models (a config sketch follows this list):
- Cut cross entropy
- Liger kernel
- Multimodal
- Fixed loss calculation for Gradient Accumulation
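As a rough illustration, the sketch below shows how cut cross entropy might be enabled for a Gemma-3 fine-tune. The plugin path and flag name are assumptions here; confirm them against the cut cross entropy integration docs (Liger kernels are enabled analogously via their own plugin).

```yaml
# Assumed plugin path and flag; confirm against the cut cross entropy integration docs.
base_model: google/gemma-3-4b-it
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true  # swaps the standard cross-entropy loss for the fused CCE kernel
```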
Multimodal (Beta)
Beta support for a variety of multimodal models:
- Mllama
- Pixtral
- Llava-1.5
- Mistral-Small-3.1
- Gemma-3
- Qwen2-VL
- Qwen2.5-VL
Additional Features
- Updated cut-cross-entropy patches for several models: Cohere, Cohere-2, Gemma, Gemma-2, Gemma-3, Mistral-3, and Mllama
- Support for the REX Learning Rate Scheduler - https://arxiv.org/abs/2107.04197
- Tokenizer overrides - you can now fine-tune with custom values in tokenizers by overriding their reserved tokens
- Single-GPU and DDP support for the Muon optimizer
- Sequential packing for curriculum learning
- Speeding up GRPO training with distributed vLLM - you can now run `axolotl vllm-serve path/to/config.yaml` to serve a separate vLLM instance that can utilize multiple GPUs to speed up trajectory generation during GRPO (a launch sketch follows this list).
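As a rough sketch of the intended workflow, the vLLM server and the GRPO trainer run as two separate processes against the same config. The GPU assignments below (and the use of `axolotl train` for the training side) are illustrative assumptions; see the GRPO documentation for the exact setup.

```bash
# Terminal 1: dedicate some GPUs to vLLM for trajectory generation.
CUDA_VISIBLE_DEVICES=0,1 axolotl vllm-serve path/to/config.yaml

# Terminal 2: run GRPO training on the remaining GPUs with the same config.
CUDA_VISIBLE_DEVICES=2,3 axolotl train path/to/config.yaml
```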
Notes
v0.8.x will be the last release series to officially support torch<=2.4.1. With the PyTorch 2.7 release this month, we aim to support the latest two stable PyTorch releases.
We expect FSDP2 support to be a fast follow; we plan to include it in v0.8.1 once we fix and validate remaining issues such as checkpoint saving.
What's Changed
- `train.py` refactor by @djsaunde in #2371
- fix(doc): add installation for cce to docs by @NanoCode012 in #2375
- chore(docs): remove phorm by @NanoCode012 in #2378
- feat(doc): add docker images explanation by @NanoCode012 in #2379
- feat(doc): document drop_system_message and clarify limitation by @NanoCode012 in #2381
- chore(doc): add clarification about mpi4py error on single gpu deepspeed by @NanoCode012 in #2383
- fix(doc): add missing low_cpu_mem_usage config to docs by @NanoCode012 in #2369
- feat(grpo): add reward_weights config and refactor by @NanoCode012 in #2365
- Add REX LR Scheduler by @xzuyn in #2380
- Update Tokenizer Overrides Handling in models.py by @mhenrichsen in #1549
- various fixes 20250305 by @winglian in #2384
- Optimizer refactor and add Muon support by @winglian in #2367
- remove lion-pytorch as it's already handled upstream by @winglian in #2389
- refactor: trl grpo configs to have descriptions by @NanoCode012 in #2386
- feat(doc): add more info on RewardModel datasets by @NanoCode012 in #2391
- chore(doc): add faq when having no default chat_template by @NanoCode012 in #2398
- Use Latest Cut Cross Entropy by @xzuyn in #2392
- fix: create mount folder on modal if not exist by @NanoCode012 in #2390
- include iproute2 and nvtop in cloud image by @winglian in #2393
- fix(modal): add git pull when getting branch files by @NanoCode012 in #2399
- pass additional info for fix untrained tokens when using distributed + offloading by @winglian in #2388
- use max of 32 dataset processes if not explicit by @winglian in #2403
- build cloud images with torch 2.6.0 by @winglian in #2413
- only validate hf user token on rank 0 by @winglian in #2408
- fixes against upstream main branches by @winglian in #2407
- chore(docs): add cookbook/blog link to docs by @NanoCode012 in #2410
- Feat: minor docs improvements for RLHF and faq on embeddings by @NanoCode012 in #2401
- Update README.md by @SicariusSicariiStuff in #2360
- use default torch fused adamw optimizer as default as adamw_hf is deprecated by @winglian in #2425
- bump HF versions except for trl by @winglian in #2427
- add 12.8.1 cuda to the base matrix by @winglian in #2426
- add run on novita ai by @liyiligang in #2421
- chore(doc): add instructions on adding custom integrations by @NanoCode012 in #2422
- Fixing KTO+QLoRA+multi-GPU by @SalmanMohammadi in #2420
- adding pre-commit auto-update GH action and bumping plugin versions by @djsaunde in #2428
- chore(doc): add explanation on fsdp_transformer_layer_cls_to_wrap by @NanoCode012 in #2429
- Autodoc generation with quartodoc by @djsaunde in #2419
- Sequence parallelism by @djsaunde in #2412
- installing axolotl prior to quartodoc build by @djsaunde in #2434
- Fix failing test by @djsaunde in #2436
- Feat: Add support for gemma3_text and add e2e for gemma2 by @NanoCode012 in #2406
- Feat: Rework multimodal support (mllama, llava, pixtral, qwen2, qwen25, gemma3, mistral3) by @NanoCode012 in #2435
- feat: add CCE for gemma3, cohere, and cohere2 by @NanoCode012 in #2443
- chore: minor optim changes (add apollo, improve docs, remove lion-pytorch) by @NanoCode012 in #2444
- fix(doc): document `do_causal_lm_eval` required to run `eval_causal_lm_metrics` by @NanoCode012 in #2445
- Set the pytorch_cuda_alloc_conf env in the train module by @winglian in #2447
- add override of upstream fix for multi-gpu orpo by @winglian in #2440
- hf offline decorator for tests to workaround rate limits by @winglian in #2452
- bump liger to 0.5.5 by @winglian in #2448
- use offline for precached stream dataset by @winglian in #2453
- fix streaming packing test by @winglian in #2454
- fix: minor patches for multimodal by @NanoCode012 in #2441
- Sequence parallelism quick follow-ups; remove ModelCallback by @djsaunde in #2450
- destroy process group on Ctrl+C / training or eval run by @djsaunde in #2457
- Ray train bugfix by @djsaunde in #2458
- Updates for trl 0.16.0 - mostly for GRPO by @winglian in #2437
- Fix(doc): Clarify doc on attention configs and missing pad_token by @NanoCode012 in #2455
- Sequential sample packing by @DreamGenX in #2404
- gemma3 packing fixes by @winglian in #2449
- Release update 20250331 by @winglian in #2460
- Fix(doc): Minor doc changes for peft and modal by @NanoCode012 in #2462
- Fix: remove the numerous sequential log by @NanoCode012 in #2461
- Validation for Muon optimizer with DS/FSDP by @winglian in #2464
- fixing eval for SP by @djsaunde in #2468
- fix: downgrade deepspeed to fix grad checkpoint oom by @NanoCode012 in #2465
- fix: set rl=None during inference by @NanoCode012 in #2463
- torch 2.7.0 base image for testing by @winglian in #2467
- fix: pydantic warning validator not returning self by @NanoCode012 in #2474
- feat: add support for multimodal in lora kernels by @NanoCode012 in #2472
- fix: gemma3 loss in forward pass by @NanoCode012 in #2473
- fix: disable SP during merge by @NanoCode012 in #2470
- fix: separate gemma3 text and vision example config by @NanoCode012 in #2471
- fix(doc): document offload gradient_checkpointing option by @NanoCode012 in #2475
- set release version 0.8.0 by @winglian in #2476
New Contributors
- @SicariusSicariiStuff made their first contribution in #2360
- @liyiligang made their first contribution in #2421
Full Changelog: v0.7.1...v0.8.0