
v0.8.0

Released by @github-actions on 02 Apr 13:51 · 3877c5c

New Features

Sequence parallelism support via ring-flash-attn

This enables long-context training by distributing each sequence across GPUs, reducing per-device memory requirements while allowing near-linear scaling of context length with the number of GPUs. This complements other parallelism features that Axolotl offers, including FSDP and DeepSpeed. See our documentation here.
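As a minimal sketch, enabling sequence parallelism is a small config change. The `sequence_parallel_degree` key follows the sequence-parallelism documentation; the model name is a hypothetical example, and exact key names may differ across Axolotl versions:

```yaml
# Hedged sketch: shard each sequence across 4 GPUs via ring-flash-attn.
# Verify key names against the sequence parallelism docs for your version.
base_model: meta-llama/Llama-3.1-8B   # hypothetical example model
sequence_len: 65536                   # long context made feasible by sharding
sequence_parallel_degree: 4           # number of GPUs each sequence is split across
flash_attention: true                 # ring-flash-attn builds on flash attention
micro_batch_size: 1
```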

Gemma-3 support has landed alongside several features to help you fine-tune Gemma-3 models:

  • Cut cross entropy
  • Liger kernel
  • Multimodal
  • Fixed loss calculation for Gradient Accumulation
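The features above are opt-in via config. A hedged sketch follows; the plugin module paths and flag names are assumptions based on Axolotl's existing integrations and may differ in your installed version:

```yaml
# Hedged sketch of a Gemma-3 fine-tune using the listed features.
# Plugin paths and flags are assumptions; check your Axolotl version.
base_model: google/gemma-3-4b-it
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin  # cut cross entropy
  - axolotl.integrations.liger.LigerPlugin                        # Liger kernels
cut_cross_entropy: true
liger_glu_activation: true
liger_rms_norm: true
gradient_accumulation_steps: 4  # loss calculation under GA is now fixed for Gemma-3
```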

Multimodal: beta support for a variety of multimodal models:

  • Mllama
  • Pixtral
  • Llava-1.5
  • Mistral-Small-3.1
  • Gemma-3
  • Qwen2-VL
  • Qwen2.5-VL

Additional Features

  • Updated cut-cross-entropy patches for several models: Cohere, Cohere-2, Gemma, Gemma-2, Gemma-3, Mistral-3, and Mllama
  • Support for the REX Learning Rate Scheduler - https://arxiv.org/abs/2107.04197
  • Tokenizer Overrides - you can now fine-tune with custom values for a tokenizer's reserved tokens
  • Single-GPU and DDP support for the Muon optimizer
  • Sequential packing for curriculum learning
  • Speeding up GRPO training with distributed vLLM - you can now use axolotl vllm-serve path/to/config.yaml to serve a separate vLLM instance which can utilize multiple GPUs to speed up trajectory generation during GRPO.
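A hedged sketch of the distributed-vLLM workflow: serve a standalone vLLM instance on some GPUs, then train against it on the rest. The `axolotl vllm-serve` command is from this release; the config keys below are illustrative assumptions and may differ in your version:

```yaml
# config.yaml fragment (key names are illustrative assumptions)
rl: grpo
trl:
  use_vllm: true            # generate GRPO trajectories via the external vLLM server
vllm:
  tensor_parallel_size: 2   # let the serving instance span multiple GPUs
```

Then serve in one terminal (e.g. `CUDA_VISIBLE_DEVICES=0,1 axolotl vllm-serve config.yaml`) and train in another (e.g. `CUDA_VISIBLE_DEVICES=2,3 axolotl train config.yaml`), keeping generation and training on separate GPUs.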

Notes

v0.8.x will be the last release series to officially support torch<=2.4.1. With the PyTorch 2.7 release this month, we aim to support the latest two stable releases of PyTorch.
We expect FSDP2 support to be a fast follow; we'll include it in v0.8.1 once we can fix and validate issues such as saving checkpoints.

What's Changed

New Contributors

Full Changelog: v0.7.1...v0.8.0