v0.8.0
New Features
Sequence parallelism support via ring-flash-attn
This enables long-context training by distributing each sequence across GPUs, reducing per-device memory requirements while allowing near-linear scaling of context length with the number of GPUs. It complements the other parallelism features Axolotl offers, including FSDP and DeepSpeed. See our sequence parallelism documentation for details; a config sketch follows below.
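To give a sense of how this is enabled, here is a minimal config sketch. The `sequence_parallel_degree` option name, the example model, and the sequence length are assumptions on our part, so verify them against the sequence parallelism documentation before use.

```yaml
# Minimal sketch with assumed option names; verify against the sequence parallelism docs.
base_model: NousResearch/Meta-Llama-3-8B  # example model, not prescriptive
sequence_len: 32768                       # long-context target made feasible by SP
flash_attention: true                     # sequence parallelism builds on ring-flash-attn
sequence_parallel_degree: 4               # shard each sequence across 4 GPUs
micro_batch_size: 1
```

Training is then launched as usual (for example, `axolotl train config.yaml` on a multi-GPU node); each group of `sequence_parallel_degree` GPUs processes shards of the same sequence.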
Gemma-3 support has landed, alongside several features to help you fine-tune Gemma-3 models (a config sketch follows this list):
- Cut cross entropy
- Liger kernel
- Multimodal
- Fixed loss calculation for Gradient Accumulation
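As a rough illustration, the sketch below shows how cut cross entropy might be enabled for a Gemma-3 fine-tune. The plugin path and flag name are assumptions here; confirm them against the cut cross entropy integration docs (Liger kernels are enabled analogously via their own plugin).

```yaml
# Assumed plugin path and flag; confirm against the cut cross entropy integration docs.
base_model: google/gemma-3-4b-it
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true  # swaps the standard cross-entropy loss for the fused CCE kernel
```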
Multimodal (Beta)
Beta support for a variety of multimodal models:
- Mllama
- Pixtral
- Llava-1.5
- Mistral-Small-3.1
- Gemma-3
- Qwen2-VL
- Qwen2.5-VL
Additional Features
- Updated cut-cross-entropy patches for several models: Cohere, Cohere-2, Gemma, Gemma-2, Gemma-3, Mistral-3, and Mllama
- Support for the REX Learning Rate Scheduler - https://arxiv.org/abs/2107.04197
- Tokenizer overrides - you can now fine-tune with custom values in tokenizers by overriding their reserved tokens
- Single-GPU and DDP support for the Muon optimizer
- Sequential packing for curriculum learning
- Speeding up GRPO training with distributed vLLM - you can now run `axolotl vllm-serve path/to/config.yaml` to serve a separate vLLM instance that can utilize multiple GPUs to speed up trajectory generation during GRPO (a launch sketch follows this list).
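As a rough sketch of the intended workflow, the vLLM server and the GRPO trainer run as two separate processes against the same config. The GPU assignments below (and the use of `axolotl train` for the training side) are illustrative assumptions; see the GRPO documentation for the exact setup.

```bash
# Terminal 1: dedicate some GPUs to vLLM for trajectory generation.
CUDA_VISIBLE_DEVICES=0,1 axolotl vllm-serve path/to/config.yaml

# Terminal 2: run GRPO training on the remaining GPUs with the same config.
CUDA_VISIBLE_DEVICES=2,3 axolotl train path/to/config.yaml
```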
Notes
v0.8.x will be the last release series to officially support torch<=2.4.1. With the PyTorch 2.7 release this month, we aim to support the latest two stable PyTorch releases.
We expect FSDP2 support to be a fast follow; we plan to include it in v0.8.1 once we fix and validate remaining issues such as checkpoint saving.
What's Changed
- `train.py` refactor by @djsaunde in #2371
- fix(doc): add installation for cce to docs by @NanoCode012 in #2375
- chore(docs): remove phorm by @NanoCode012 in #2378
- feat(doc): add docker images explanation by @NanoCode012 in #2379
- feat(doc): document drop_system_message and clarify limitation by @NanoCode012 in #2381
- chore(doc): add clarification about mpi4py error on single gpu deepspeed by @NanoCode012 in #2383
- fix(doc): add missing low_cpu_mem_usage config to docs by @NanoCode012 in #2369
- feat(grpo): add reward_weights config and refactor by @NanoCode012 in #2365
- Add REX LR Scheduler by @xzuyn in #2380
- Update Tokenizer Overrides Handling in models.py by @mhenrichsen in #1549
- various fixes 20250305 by @winglian in #2384
- Optimizer refactor and add Muon support by @winglian in #2367
- remove lion-pytorch as it's already handled upstream by @winglian in #2389
- refactor: trl grpo configs to have descriptions by @NanoCode012 in #2386
- feat(doc): add more info on RewardModel datasets by @NanoCode012 in #2391
- chore(doc): add faq when having no default chat_template by @NanoCode012 in #2398
- Use Latest Cut Cross Entropy by @xzuyn in #2392
- fix: create mount folder on modal if not exist by @NanoCode012 in #2390
- include iproute2 and nvtop in cloud image by @winglian in #2393
- fix(modal): add git pull when getting branch files by @NanoCode012 in #2399
- pass additional info for fix untrained tokens when using distributed + offloading by @winglian in #2388
- use max of 32 dataset processes if not explicit by @winglian in #2403
- build cloud images with torch 2.6.0 by @winglian in #2413
- only validate hf user token on rank 0 by @winglian in #2408
- fixes against upstream main branches by @winglian in #2407
- chore(docs): add cookbook/blog link to docs by @NanoCode012 in #2410
- Feat: minor docs improvements for RLHF and faq on embeddings by @NanoCode012 in #2401
- Update README.md by @SicariusSicariiStuff in #2360
- use default torch fused adamw optimizer as default as adamw_hf is deprecated by @winglian in #2425
- bump HF versions except for trl by @winglian in #2427
- add 12.8.1 cuda to the base matrix by @winglian in #2426
- add run on novita ai by @liyiligang in #2421
- chore(doc): add instructions on adding custom integrations by @NanoCode012 in #2422
- Fixing KTO+QLoRA+multi-GPU by @SalmanMohammadi in #2420
- adding pre-commit auto-update GH action and bumping plugin versions by @djsaunde in #2428
- chore(doc): add explanation on fsdp_transformer_layer_cls_to_wrap by @NanoCode012 in #2429
- Autodoc generation with quartodoc by @djsaunde in #2419
- Sequence parallelism by @djsaunde in #2412
- installing axolotl prior to quartodoc build by @djsaunde in #2434
- Fix failing test by @djsaunde in #2436
- Feat: Add support for gemma3_text and add e2e for gemma2 by @NanoCode012 in #2406
- Feat: Rework multimodal support (mllama, llava, pixtral, qwen2, qwen25, gemma3, mistral3) by @NanoCode012 in #2435
- feat: add CCE for gemma3, cohere, and cohere2 by @NanoCode012 in #2443
- chore: minor optim changes (add apollo, improve docs, remove lion-pytorch) by @NanoCode012 in #2444
- fix(doc): document `do_causal_lm_eval` required to run `eval_causal_lm_metrics` by @NanoCode012 in #2445
- Set the pytorch_cuda_alloc_conf env in the train module by @winglian in #2447
- add override of upstream fix for multi-gpu orpo by @winglian in #2440
- hf offline decorator for tests to workaround rate limits by @winglian in #2452
- bump liger to 0.5.5 by @winglian in #2448
- use offline for precached stream dataset by @winglian in #2453
- fix streaming packing test by @winglian in #2454
- fix: minor patches for multimodal by @NanoCode012 in #2441
- Sequence parallelism quick follow-ups; remove ModelCallback by @djsaunde in #2450
- destroy process group on Ctrl+C / training or eval run by @djsaunde in #2457
- Ray train bugfix by @djsaunde in #2458
- Updates for trl 0.16.0 - mostly for GRPO by @winglian in #2437
- Fix(doc): Clarify doc on attention configs and missing pad_token by @NanoCode012 in #2455
- Sequential sample packing by @DreamGenX in #2404
- gemma3 packing fixes by @winglian in #2449
- Release update 20250331 by @winglian in #2460
- Fix(doc): Minor doc changes for peft and modal by @NanoCode012 in #2462
- Fix: remove the numerous sequential log by @NanoCode012 in #2461
- Validation for Muon optimizer with DS/FSDP by @winglian in #2464
- fixing eval for SP by @djsaunde in #2468
- fix: downgrade deepspeed to fix grad checkpoint oom by @NanoCode012 in #2465
- fix: set rl=None during inference by @NanoCode012 in #2463
- torch 2.7.0 base image for testing by @winglian in #2467
- fix: pydantic warning validator not returning self by @NanoCode012 in #2474
- feat: add support for multimodal in lora kernels by @NanoCode012 in #2472
- fix: gemma3 loss in forward pass by @NanoCode012 in #2473
- fix: disable SP during merge by @NanoCode012 in #2470
- fix: separate gemma3 text and vision example config by @NanoCode012 in #2471
- fix(doc): document offload gradient_checkpointing option by @NanoCode012 in #2475
- set release version 0.8.0 by @winglian in #2476
New Contributors
- @SicariusSicariiStuff made their first contribution in #2360
- @liyiligang made their first contribution in #2421
Full Changelog: v0.7.1...v0.8.0