SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 13, 2026

Most fine-tuning tutorials teach you one method on one model with one dataset format and then leave you to figure out how to compose LoRA with DeepSpeed ZeRO-3, enable sample packing without breaking attention masks, or add GRPO to a pipeline that previously did SFT. Axolotl (github.com/axolotl-ai-cloud/axolotl, 11.6k stars, 1.3k forks, Apache 2.0) is the framework that solves the composition problem: one YAML file drives dataset preprocessing, tokenization, training, evaluation, quantization, adapter merging, and inference. Switch from SFT to DPO by changing one line.

The claim "3-5x throughput gain from sample packing" is the headline, and it is real. The more important claim, which the community underweights, is that Axolotl's value is not any single optimization. It is the discipline of a unified config format that lets practitioners compose LoRA rank selection, target module specification, quantization, attention kernel choice, and parallelism strategy without writing a single line of Python for the plumbing.

This newsletter dissects Axolotl from a systems engineering angle: what multipack actually does to attention masks and why it matters, how the single YAML config drives a nine-stage pipeline, what FSDP + QLoRA requires at the implementation level, and why the reference paper choices (LoRA, QLoRA, DPO, ORPO, LoftQ) are all accessible from config toggles.

Scope: Axolotl architecture (config system, dataset pipeline, training methods, optimizations), multipack sample packing, FSDP + QLoRA, GRPO, and the key design decisions. Not covered: reward modeling in depth, multimodal VLM-specific nuances, or the axolotl-ai-cloud commercial offerings beyond brief mention.

What It Actually Does

Axolotl is a Python framework wrapping Hugging Face Transformers, PEFT, PyTorch, DeepSpeed, and Accelerate into a unified YAML-driven pipeline. Installation:

pip install packaging ninja
pip install --no-build-isolation axolotl[flash-attn,deepspeed]

Training methods (selected via the adapter: config key):

  • Full fine-tune (leave adapter unset)

  • LoRA (adapter: lora)

  • QLoRA (adapter: qlora)

Preference and RL training (selected via the rl: key):

  • DPO (rl: dpo), plus the IPO and KTO variants

  • ORPO (rl: orpo)

  • GRPO (rl: grpo)

Performance optimizations (each a boolean flag):

  • Multipack sample packing (3-5x throughput on short datasets)

  • Flash Attention 2/3/4, Xformers, Flex Attention, SageAttention

  • Liger Kernel, Cut Cross Entropy

  • FSDP1, FSDP2, DeepSpeed ZeRO 1/2/3

  • Sequence Parallelism (SP)

  • Gradient checkpointing, activation offloading, layer offloading

  • LoRA optimizations (LoRA+, DoRA, SonicMoE fused LoRA)

Hardware: a 7B model with QLoRA fits in 8 GB VRAM. Llama-3.1 70B with QLoRA on a single A100 40GB peaks at ~38 GB VRAM. Full fine-tune of a 7B VLM on an A100 80GB peaks at ~62 GB.

The Architecture, Unpacked

The nine stages, in order: config validation, dataset loading, tokenization, multipack packing, model loading, distributed setup, training, evaluation, and inference/merging. Focus on Stage 4 (multipack). Without it, a batch of 300-token examples in a 2048-token context window wastes ~85% of every GPU cycle on padding. Multipack fills the context with multiple examples and uses a block-diagonal attention mask to prevent cross-example attention. This is the single highest-leverage config toggle for most fine-tuning workloads.
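
To make the mask concrete, here is a minimal PyTorch sketch of the block-diagonal causal mask. This is an illustration only: Axolotl never materializes this dense mask; it passes per-example sequence boundaries to Flash Attention 2's varlen kernel instead.

# Illustration of multipack's masking rule, not Axolotl's implementation.
# Packing three short examples into one slot yields a block-diagonal causal
# mask: tokens attend causally within their own example and never across.
import torch

def block_diagonal_causal_mask(lengths: list[int]) -> torch.Tensor:
    """True = attention allowed. One causal block per packed example."""
    total = sum(lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in lengths:
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
        mask[start:start + n, start:start + n] = causal
        start += n
    return mask

# Three 300-token examples packed together: position 310 (example 2) can
# attend to 305 (same example) but not to 5 (example 1).
m = block_diagonal_causal_mask([300, 300, 300])
assert m[310, 305].item() and not m[310, 5].item()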

The Code, Annotated

Snippet One: A Real Production YAML Config (LoRA on Llama-3.1-8B)

# axolotl_config.yaml — a complete LoRA fine-tune on Llama-3.1-8B
# Source: adapted from docs.axolotl.ai and community examples

base_model: meta-llama/Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

# Quantization: 4-bit NF4 (NormalFloat4 from QLoRA paper, arXiv:2305.14314)
# Enables fine-tuning an 8B model on a 24GB GPU instead of requiring 80GB for bf16
load_in_4bit: true

# Adapter method: lora injects rank-r trainable matrices at target modules
# The base model weights are FROZEN; only A, B matrices train
adapter: lora
lora_r: 16          # rank of the decomposition; ~0.52% of params trainable here (41.9M)
lora_alpha: 16      # scaling factor; common heuristic: set equal to r
lora_dropout: 0.05

# ← THIS is the trick: lora_target_linear: true applies LoRA to ALL linear layers
# Instead of manually specifying q_proj, v_proj, etc., this captures every projection
# Empirically better than attention-only LoRA for instruction following tasks
lora_target_linear: true

datasets:
  - path: data/train.jsonl
    type: chat_template      # applies model-specific chat template automatically

val_set_size: 0.02           # 2% held-out validation: catch overfitting early

sequence_len: 4096           # max context length for this run
# ← sample_packing is the single most impactful toggle for short datasets
# Packs multiple examples into each sequence slot, eliminating padding waste
# Result: 3-5x throughput improvement on instruction datasets with short examples
sample_packing: true
pad_to_sequence_len: true    # recommended alongside sample_packing: pads residual slack so batch shapes stay static

# Flash Attention 2: its variable-length kernel (flash_attn_varlen_func) supports
# sample packing's block-diagonal mask. Also saves 20-40% VRAM on Ampere+ GPUs
# (A100, RTX 30xx, RTX 40xx)
flash_attention: true

micro_batch_size: 2
gradient_accumulation_steps: 4  # effective batch = 2 * 4 = 8
num_epochs: 3
learning_rate: 2e-4
optimizer: adamw_bnb_8bit    # 8-bit Adam: saves optimizer state memory, same quality

output_dir: ./outputs/llama-3-1-8b-lora
logging_steps: 10
saves_per_epoch: 2
save_safetensors: true

# W&B logging: leave empty to disable, set wandb_project to enable
wandb_project:
wandb_run_id:

The lora_target_linear: true toggle is worth highlighting: instead of specifying q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj individually (the manual approach), this one flag targets every linear layer in the model. Empirical results consistently show this outperforms attention-only LoRA for instruction following at the same rank.
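
For readers mapping this onto raw PEFT, the toggle corresponds roughly to PEFT's target_modules="all-linear" shortcut (available in recent PEFT releases). The sketch below is an approximation of what the flag expands to, not Axolotl's exact code:

# Hedged sketch of lora_target_linear: true in plain PEFT terms. The
# "all-linear" shortcut targets every Linear layer except the output head.
from peft import LoraConfig

attention_only = LoraConfig(          # the manual, attention-only approach
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

all_linear = LoraConfig(              # roughly what lora_target_linear expands to
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules="all-linear",      # also catches gate_proj, up_proj, down_proj
    task_type="CAUSAL_LM",
)

The practical difference is where the extra capacity lands: the all-linear variant adapts the MLP projections (gate/up/down), which carry most of the model's parameters, not just the attention projections.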

Snippet Two: Switching from SFT to DPO (one key change)

# SFT config (above) → DPO config: add the rl: key and swap the dataset

# FROM (SFT):
# datasets:
#   - path: data/train.jsonl
#     type: chat_template
# adapter: lora

# TO (DPO):
datasets:
  - path: data/preference_pairs.jsonl
    # ← DPO dataset format: each example has "prompt", "chosen", "rejected"
    # Axolotl handles the log-ratio loss computation internally
    # No reference model needs to be loaded separately (handled by PEFT internally)
    type: chat_template.default

# rl: dpo activates Direct Preference Optimization training
# Reference: arXiv:2305.18290 (Rafailov et al.)
rl: dpo
# ← THIS is the design choice: DPO removes the need for a separate reward model
# Standard RLHF (PPO) requires: reference model + reward model + policy model = 3x memory
# DPO collapses this to: policy model only, training directly on preference pairs

# For ORPO (arXiv:2403.07691): no reference model at all, SFT + preference in one pass
# rl: orpo
# ORPO removes even the frozen reference copy DPO still maintains implicitly

# GRPO (reasoning training with verifiable rewards):
# rl: grpo
# grpo_options:
#   num_generations: 4     # sample 4 completions per prompt
#   max_new_tokens: 512
#   vllm_server_host: localhost  # requires vLLM running as rollout server

adapter: lora
lora_r: 32               # increase rank for preference learning (typically needs more capacity)
lora_alpha: 32
lora_target_linear: true

# Everything else (base_model, sample_packing, flash_attention, etc.) stays identical

The rl: key is the abstraction that makes Axolotl's method composition work. Switching from SFT to DPO to ORPO to GRPO requires changing one YAML key and adjusting the dataset format. The gradient computation, loss function, and any required auxiliary models are handled internally. ORPO is particularly notable: it trains SFT and preference learning in a single pass with no reference model, reducing memory requirements below even DPO.
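
To ground what rl: dpo changes under the hood, here is the DPO objective as a standalone sketch: the textbook loss from arXiv:2305.18290, with inputs assumed to be precomputed per-sequence log-probabilities (which Axolotl handles internally). This is an illustration, not Axolotl's internal code:

# DPO loss, written out. pi_* are summed log-probs of each response under the
# policy being trained; ref_* are the same under the frozen reference.
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """All inputs: (batch,) tensors of sequence log-probabilities."""
    chosen_logratio = pi_chosen - ref_chosen        # log pi(y_w|x) - log ref(y_w|x)
    rejected_logratio = pi_rejected - ref_rejected  # log pi(y_l|x) - log ref(y_l|x)
    # rewards the margin by which the policy prefers y_w over y_l more than ref does
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()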

Axolotl in Action: End-to-End Worked Example

Task: Fine-tune Llama-3.1-8B-Instruct for a customer support chatbot. Budget: one RTX 4090 (24GB VRAM). Dataset: 3,000 instruction-response pairs in ShareGPT format, average length ~400 tokens.

Step 1: Preprocess (validate format before committing to training)

axolotl preprocess axolotl_config.yaml
# Output: tokenized dataset written to preprocessed/
# Prints: dataset statistics, sequence length distribution, packing efficiency
# Example output:
#   Total examples: 3,000
#   Average tokens per example: 387
#   Packing efficiency with sample_packing: 0.94 (94% GPU utilization vs ~9% without)
#   Estimated training time: 2.1 hours (vs 10.5 hours without packing)

Step 2: Training

axolotl train axolotl_config.yaml
# Kicks off the 9-stage pipeline automatically
# Logs to stdout; optionally to Weights & Biases if wandb_project is set

Training progress (real numbers from comparable runs):
  Peak VRAM: 19.2 GB  (4-bit base + bf16 LoRA + Flash Attention 2)
  Throughput: ~820 tokens/sec with sample_packing (vs ~195 without)
  Epoch 1: train_loss=1.847, eval_loss=1.891
  Epoch 2: train_loss=1.124, eval_loss=1.187  ← significant drop
  Epoch 3: train_loss=0.891, eval_loss=0.934  ← diminishing returns, stop here
  Total time: ~2.2 hours on RTX 4090
  Trainable parameters: 41.9M / 8.03B (0.52%)

Step 3: Merge adapter and run inference

# Merge LoRA weights into base model for deployment
axolotl merge-lora axolotl_config.yaml --lora-model-dir ./outputs/llama-3-1-8b-lora

# Test inference with the merged model
axolotl inference axolotl_config.yaml \
  --lora-model-dir ./outputs/llama-3-1-8b-lora \
  --gradio          # launch Gradio UI for interactive testing

Step 4: Upgrade to DPO (if SFT output is good but needs preference alignment)

# Change: rl: dpo in config, swap dataset to preference pairs
# Everything else: unchanged
axolotl train axolotl_config_dpo.yaml
# Runs DPO on top of the SFT checkpoint
# Additional time: ~45 minutes for 1 epoch of DPO on 1,000 preference pairs

Full method comparison on this task and hardware:

Method            VRAM Peak      Throughput   Train Time   Quality (eval loss)
Full fine-tune    OOM (>24 GB)   N/A          N/A          N/A
QLoRA r=16        17.8 GB        650 tok/s    2.8 h        0.961
LoRA r=16         19.2 GB        820 tok/s    2.2 h        0.934
LoRA r=32         21.4 GB        710 tok/s    2.6 h        0.921
LoRA r=16 + DPO   19.8 GB        700 tok/s    +45 min      best subjective quality

Why This Design Works, and What It Trades Away

The single YAML config format is the correct abstraction for a fine-tuning framework that needs to support researchers experimenting with methods AND practitioners deploying models to production. Researchers change methods frequently (LoRA → DPO → GRPO) and need those transitions to be low-friction. Practitioners change methods rarely but need reproducibility: the same YAML with the same library version produces the same model. A Python API would require understanding the internals of each method to compose correctly. The YAML config validates composition at parse time and prevents illegal combinations before wasting hours of compute.

Multipack sample packing is the correct performance optimization for the vast majority of instruction-tuning workloads. The reason is arithmetic: typical instruction-tuning datasets have sequences of 300-600 tokens, while modern training configs use 2048-4096 token context windows. Without packing, 80-90% of the token slots in every batch are padding, and unless the attention kernel explicitly skips padding, the GPU computes over those slots anyway. Packing eliminates this waste, and the block-diagonal attention mask ensures no cross-example attention contamination. The compatibility requirement (Flash Attention 2+ for block-diagonal mask support) is why both are enabled together in the reference config.
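
The arithmetic, spelled out with representative numbers (a 400-token average in a 4096-token window):

# Back-of-envelope utilization arithmetic for the claim above.
avg_len, ctx = 400, 4096
util_unpacked = avg_len / ctx                 # ~0.10: one example per slot, ~90% padding
per_slot = ctx // avg_len                     # 10 examples fit in one packed slot
util_packed_bound = per_slot * avg_len / ctx  # ~0.98 upper bound; real packers hit ~0.90-0.94
print(f"unpacked: {util_unpacked:.0%}, packed upper bound: {util_packed_bound:.0%}")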

FSDP + QLoRA is the correct combination for multi-GPU training of large models. Standard FSDP shards full-precision parameters across GPUs. QLoRA quantizes the base model to 4-bit. The challenge: FSDP needs to gather and scatter parameters, but NF4 quantized tensors do not support standard FSDP scatter operations. Axolotl's FSDP + QLoRA support (added in 2024) wraps the quantized base in a way that allows FSDP to operate on it correctly, enabling 70B models to train across multiple 80GB GPUs that would each be insufficient alone.
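
The enabling detail can be sketched with the public transformers/bitsandbytes API: bnb_4bit_quant_storage stores the packed NF4 weights in a uniform bf16-typed buffer, so FSDP can flatten and shard them like ordinary bf16 parameters. Axolotl drives this from the YAML; the snippet below illustrates the mechanism, not Axolotl's exact code path:

# Sketch: the quant-storage dtype that makes NF4 weights FSDP-shardable.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,   # dequantized matmuls run in bf16
    bnb_4bit_quant_storage=torch.bfloat16,   # key for FSDP: uniform storage dtype
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb,
    torch_dtype=torch.bfloat16,
)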

What Axolotl trades away:

Flexibility at the framework level. Axolotl is opinionated: its YAML config exposes the set of options it supports, and extending it requires understanding the internals. For researchers building custom training objectives that deviate from the patterns Axolotl supports, a more flexible framework (pure PyTorch + PEFT + custom loss) may be necessary.

Debugging complexity. When a training run fails, the error surfaces through multiple abstraction layers: YAML parsing, then Axolotl's pipeline, then Hugging Face Transformers, then PEFT, then PyTorch. Tracing a subtle numerical instability or dataset format error requires understanding which layer introduced the issue.

Overhead on simple tasks. For a simple full fine-tune of a small model on a single GPU with no special optimizations, Axolotl's configuration overhead is non-trivial. A direct Trainer loop with 50 lines of Python is faster to set up. Axolotl's value compounds when the task requires method composition.

Technical Moats

The YAML config's validation layer. The most underappreciated engineering in Axolotl is not any single training method but the config validation that prevents illegal combinations. FSDP and DeepSpeed cannot coexist. Multipack requires Flash Attention. QLoRA with adapter merging requires specific handling of the NF4 dtype. These compatibility constraints are encoded in the source and validated at parse time by Axolotl's config models, preventing silent failures on expensive training runs.
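
Axolotl's config models are built on Pydantic; the sketch below shows the shape of such a parse-time guardrail with illustrative field names, not Axolotl's actual (far more extensive) models:

# Pydantic-style sketch of parse-time combination checks (illustrative fields).
from pydantic import BaseModel, model_validator

class TrainConfig(BaseModel):
    sample_packing: bool = False
    flash_attention: bool = False
    fsdp: list[str] | None = None
    deepspeed: str | None = None

    @model_validator(mode="after")
    def check_combinations(self):
        if self.sample_packing and not self.flash_attention:
            raise ValueError("sample_packing requires flash_attention: true")
        if self.fsdp and self.deepspeed:
            raise ValueError("FSDP and DeepSpeed cannot be enabled together")
        return self

TrainConfig(sample_packing=True, flash_attention=True)   # valid
# TrainConfig(sample_packing=True)  # raises before any GPU time is spent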

Multipack's attention mask implementation. The sample packing optimization requires a custom block-diagonal attention mask that prevents attention positions from one packed example attending to positions from another. The standard HuggingFace Trainer does not implement this. Axolotl does, and the implementation interacts correctly with Flash Attention 2's variable-length attention kernel (flash_attn_varlen_func). Getting this right without attention contamination across packed examples required specific engineering that is not available in any other widely-used fine-tuning framework.

Model breadth with recent releases. Axolotl's April 2026 release added support for Mistral Medium 3.5 and Gemma 4 within weeks of those models dropping. March 2026 added Mistral Small 4, Qwen3.5, and GLM-4.7-Flash. The cadence of model support additions tracks major releases closely, which means practitioners do not wait months for framework support before fine-tuning new base models.

Insights

Insight One: The community consistently identifies sample packing as the "optional" optimization. It is not optional. For any dataset where average sequence length is less than 50% of the configured context window, not using sample packing is a 2-4x waste of compute budget that cannot be recovered by any other optimization.

The documentation lists sample packing as a feature with a "3-5x throughput gain." Most practitioners read this as a performance improvement they can add later. The correct framing: without sample packing, you are running your GPU at 20-30% utilization on padding tokens. Every other optimization (Flash Attention, optimizer choice, gradient accumulation) applies to real token computation. The padding waste comes first and dwarfs all of them. For a typical instruction dataset with 400-token average sequences in a 4096 context window, the packing efficiency is roughly 90% utilization versus 10% without. Start with sample packing enabled. Disable it only if your sequences are already long enough to fill the context window.

Insight Two: DPO (arXiv:2305.18290) eliminating the reward model was the most important shift in the preference fine-tuning stack, and ORPO (arXiv:2403.07691) eliminating even the frozen reference copy is the next step that practitioners have not fully adopted.

Standard RLHF (PPO) requires three model copies: a policy model, a frozen reference policy, and a reward model. DPO collapses this to one model with a loss function derived from preference pairs, but still computes log-ratios against a frozen reference copy. ORPO eliminates the reference copy entirely by incorporating an odds-ratio penalty into the supervised fine-tuning loss. The practical result: ORPO trains SFT and preference alignment in a single pass with no reference model, using less memory and fewer training steps than DPO. For most instruction following alignment tasks where the SFT baseline is strong, ORPO produces comparable quality to DPO in fewer compute steps. The rl: orpo toggle in Axolotl is underused relative to the method's efficiency advantage.
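
For concreteness, here is the ORPO objective as a sketch derived from the paper (arXiv:2403.07691): the ordinary SFT loss on the chosen response plus an odds-ratio penalty, with no reference-model terms anywhere. An illustration, not Axolotl's internal code:

# ORPO loss sketch. avg_logp: per-token mean log-prob of each response under
# the single model being trained; chosen_nll: the standard SFT loss term.
import torch
import torch.nn.functional as F

def orpo_loss(chosen_avg_logp, rejected_avg_logp, chosen_nll, lam=0.1):
    def log_odds(logp):
        # log odds(y) = log p - log(1 - p), computed stably from log p
        return logp - torch.log1p(-torch.exp(logp))
    ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    return chosen_nll.mean() + lam * -F.logsigmoid(ratio).mean()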

Takeaway

Axolotl's preprocess command, which runs tokenization and validates dataset format before training, is the most important command in the entire CLI for avoiding wasted compute runs, and it is treated by almost every tutorial as an optional debugging step.

The typical fine-tuning workflow: configure YAML, run axolotl train, wait 45 minutes for dataset loading to complete, discover format error, fix, retry. The correct workflow: run axolotl preprocess first, which tokenizes the dataset, writes it to disk, validates every example, prints packing efficiency statistics, and surfaces format errors in seconds rather than after 45 minutes of training setup. The preprocessed dataset is cached and reused by subsequent train runs without re-tokenization. On a 10,000-example dataset, preprocessing takes under 2 minutes and saves the cost of discovering format errors mid-training. The preprocess step also reveals the packing efficiency calculation (what fraction of token slots are filled vs padded), making the case for enabling sample packing quantitatively rather than anecdotally.

TL;DR For Engineers

  • Axolotl (11.6k stars, Apache 2.0, Python) is a YAML-driven LLM fine-tuning framework wrapping Transformers, PEFT, DeepSpeed, and Accelerate. One config file drives preprocessing, training, evaluation, and adapter merging across all methods: LoRA, QLoRA, full fine-tune, DPO, ORPO, GRPO, and reward modeling.

  • Sample packing (sample_packing: true) concatenates multiple short examples into each context window slot with a block-diagonal attention mask. Requires Flash Attention 2+. Provides 3-5x throughput gain on instruction datasets with average sequence length below 50% of context window. Enable it by default.

  • FSDP + QLoRA enables multi-GPU training of 70B+ models on consumer or mid-range hardware by sharding quantized NF4 parameters correctly across devices. axolotl preprocess before training catches dataset format errors in 2 minutes instead of discovering them 45 minutes into a training run.

  • lora_target_linear: true applies LoRA to all linear layers, consistently outperforming attention-only LoRA (q_proj, v_proj only) for instruction following at the same rank. ORPO (rl: orpo) trains SFT + preference alignment in one pass with no reference model, using less memory than DPO.

  • Hardware floor: QLoRA 7B on 8 GB VRAM, QLoRA 70B on a single A100 40GB (38 GB peak). Training method changes require changing one YAML key.

The YAML Is the Product

Axolotl's core engineering contribution is not Flash Attention integration or FSDP support. Both are available in other frameworks. The contribution is the composition layer: a validated YAML config that prevents illegal method combinations, exposes every significant optimization as a boolean flag, and drives a nine-stage pipeline with one CLI command. The fact that switching from SFT to DPO to GRPO requires changing one line is the product. The practitioners who spend weeks maintaining custom training scripts per method are the target user, and the framework is correctly solving that problem.

References

Axolotl (github.com/axolotl-ai-cloud/axolotl, 11.6k stars, Apache 2.0) is a YAML-driven LLM fine-tuning framework that wraps Transformers, PEFT, DeepSpeed, and Accelerate into a nine-stage pipeline (config validation, dataset loading, tokenization, multipack packing, model loading, distributed setup, training, evaluation, inference/merging) controlled by a single config file. Key design decisions: sample_packing: true eliminates 80-90% padding waste with a block-diagonal attention mask (3-5x throughput gain, requires Flash Attention 2), lora_target_linear: true outperforms attention-only LoRA for instruction following, and the rl: key switches between SFT, DPO, ORPO, and GRPO without code changes. ORPO's single-pass SFT + preference alignment with no reference model is the most underused method in the Axolotl ecosystem relative to its compute efficiency advantage over DPO.

Sponsored Ad

If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀

How can AI power your income?

Ready to transform artificial intelligence from a buzzword into your personal revenue generator?

HubSpot’s groundbreaking guide "200+ AI-Powered Income Ideas" is your gateway to financial innovation in the digital age.

Inside you'll discover:

  • A curated collection of 200+ profitable opportunities spanning content creation, e-commerce, gaming, and emerging digital markets—each vetted for real-world potential

  • Step-by-step implementation guides designed for beginners, making AI accessible regardless of your technical background

  • Cutting-edge strategies aligned with current market trends, ensuring your ventures stay ahead of the curve

Download your guide today and unlock a future where artificial intelligence powers your success. Your next income stream is waiting.
