FST: The Dual-Engine Training Method That Reaches Peak Performance With Three Times Fewer Steps


SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 25, 2026

The implicit assumption in LLM post-training is that parameter updates are the correct signal path for learning. Reinforcement learning updates the weights. The weights absorb the task-specific distribution. The model gets better at the task. This works, but it has a documented failure mode: every parameter update moves the model's weights, and with them its output distribution, away from the base model. After enough updates, the model has forgotten how to do things it could do before the task-specific training began. This is catastrophic forgetting. The related failure is loss of plasticity: after RL training, the model's ability to learn a new task degrades because its weights have been specialized away from the general-purpose state that makes rapid adaptation possible.

The alternative is prompt optimization: leave the parameters frozen and update the prompt to adapt to the task. Prompt optimization cannot catastrophically forget (the base model is unchanged) and preserves plasticity (no weight drift). Its limitation: it cannot match the performance ceiling of parameter updates for complex tasks. An optimized prompt is powerful but bounded by what you can express in context.

Fast-Slow Training (FST) (Tiwari, Sareen, Agrawal, Gonzalez, Zaharia, Keutzer, Dhillon, Agarwal, Khatri, arXiv:2605.12484, May 2026) resolves this tension by running both in the same training run. The prompt (fast weights) absorbs task-specific signals rapidly. The parameters (slow weights) absorb general reasoning improvements gradually. The two update paths are interleaved during training. The reported result: 3× better sample efficiency than RL-only, a higher asymptotic performance ceiling, 70% less KL divergence from the base model, and continued learning on new tasks where RL-only stalls.

FST is implemented on top of GEPA, the prompt optimization engine from the same Berkeley/UT Austin group, with GEPA's prompt updates interleaved with CISPO (a sample-efficient RL variant) parameter updates. The blog post published alongside the paper provides the clearest description of the mechanism, the empirical results, and the theoretical grounding in Complementary Learning Systems (CLS) theory from neuroscience.

Scope: FST mechanism, fast vs. slow weight roles, the interleaved training loop, data efficiency results, KL divergence and plasticity findings, and continual learning experiments. Not covered: GEPA's full prompt optimization pipeline beyond its role in FST, or CISPO's RL implementation beyond its interaction with the fast weights.

What It Actually Does

FST is a training paradigm that interleaves two optimization engines on the same data:

Fast engine (prompt optimization via GEPA):

Updates the prompt/context layer using textual feedback and reward
Learns task-specific patterns rapidly (high learning rate, task-specialized)
No gradient backpropagation through model weights
Analogous to hippocampal learning in CLS theory: fast, episodic, task-specific

Slow engine (parameter updates via CISPO/RL):

Updates model weights using policy gradient
Learns general reasoning improvements gradually (small learning rate, and the prompt has already absorbed much of the task-specific signal)
Receives cleaner signal because fast weights have absorbed task noise
Analogous to neocortical learning in CLS theory: slow, semantic, general

Training loop:

Every K steps: update fast weights (prompt optimization)
Every step: update slow weights (RL gradient)
Both optimizing for the same reward signal, via different mechanisms
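To make the interleaving concrete, here is a minimal sketch of the schedule alone, with no model or optimizer involved. It assumes the fast update fires on steps 0, K, 2K, ... and the slow update runs on every step, as in the list above; the function names are hypothetical and not taken from the paper or GEPA.

```python
def fst_schedule(total_steps: int, k: int):
    """Yield (step, do_fast_update) pairs for an interleaved FST-style schedule."""
    if k < 1:
        raise ValueError("k must be >= 1")
    for step in range(total_steps):
        # Fast (prompt) update fires on steps 0, k, 2k, ...
        # The slow (RL) update is assumed to run on every step regardless.
        yield step, step % k == 0


def count_updates(total_steps: int, k: int) -> tuple[int, int]:
    """Return (n_fast_updates, n_slow_updates) for the schedule above."""
    n_fast = sum(1 for _, do_fast in fst_schedule(total_steps, k) if do_fast)
    return n_fast, total_steps


for k in (1, 10, 100):
    n_fast, n_slow = count_updates(total_steps=5000, k=k)
    print(f"K={k:>3}: {n_fast} fast updates, {n_slow} slow updates")
```

With K=10 and 5,000 RL steps this gives 500 prompt updates, so the prompt changes far less often than the weights even though both are trained on the same reward.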

Empirical results from arXiv:2605.12484:

Metric	RL-only	FST	Improvement
Steps to match RL ceiling	N steps	N/3 steps	3× sample efficiency
Asymptotic performance	Baseline ceiling	Higher ceiling	+improvement at convergence
KL divergence from base	Baseline	70% less	Better plasticity preservation
Continual learning (new task)	Stalls	Continues acquiring	Qualitative difference

The Architecture, Unpacked

The step to focus on is the fast weights absorbing task-specific noise. The slow weights (RL) receive a cleaner gradient signal because the prompt has already captured the task-specific patterns. This noise reduction is the mechanism that allows parameters to focus on general capability improvements rather than task memorization.

The Code, Annotated

Snippet One: FST Training Loop (Interleaved Fast and Slow Updates)

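# Fast-Slow Training: illustrative sketch of the interleaved loop
# Source: arXiv:2605.12484, GEPA framework (gepa-ai/gepa)
# The training loop interleaves prompt optimization with RL parameter updates.
# NOTE: GEPAOptimizer, GEPAConfig, and CISPOTrainer are illustrative interfaces
# used to show the control flow; check the released packages for their actual APIs.

from gepa import GEPAOptimizer, GEPAConfig
from cispo import CISPOTrainer  # sample-efficient RL variant used in FST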

def fast_slow_training(
    model,
    task_dataset,
    reward_fn,
    fast_update_every_k: int = 10,   # prompt update frequency
    total_rl_steps: int = 5000,
    prompt_candidates_k: int = 5,     # candidate prompts per fast update
):
    """
    FST: interleave prompt optimization (fast) with RL (slow).

    fast_update_every_k (K) sets how many slow (RL) steps run per fast
    (prompt) update. It is separate from prompt_candidates_k, the number of
    candidate prompts proposed at each fast update.
    ← THIS is the key design decision:
    Small K (e.g., K=1): prompt and params update equally frequently.
    Large K (e.g., K=100): prompts update rarely; mostly RL.
    The paper evaluates K in {5, 10, 20} and finds K=10 robust across tasks.
    """

    # Fast engine: GEPA prompt optimizer
    # Optimizes prompt via meta-LLM feedback, no backprop through task LLM
    gepa_config = GEPAConfig(
        n_candidates=prompt_candidates_k,   # generate k candidate prompts
        acceptance_criterion="improvement", # accept if better than current
    )
    fast_engine = GEPAOptimizer(model=model, config=gepa_config)

    # Slow engine: CISPO RL trainer (Clipped Importance Sampling Policy Optimization)
    # ← CISPO is the RL variant used in FST: more sample-efficient than PPO
    # for in-context RL because it uses importance sampling to reuse off-policy data
    slow_engine = CISPOTrainer(
        model=model,
        reward_fn=reward_fn,
        learning_rate=1e-5,   # ← low LR: slow, conservative weight updates
    )

    current_prompt = "Solve the following problem step by step:"  # initial prompt
    best_reward = float('-inf')

    for step in range(total_rl_steps):

        # ── Fast update (every K steps) ──────────────────────────────────────
        if step % fast_update_every_k == 0:
            # Propose prompt_candidates_k candidate prompts and evaluate them on a batch
            batch = task_dataset.sample(n=32)

            # ← Fast weights learn from TEXTUAL FEEDBACK (reward + traces)
            # Not gradient-based: GEPA uses LLM meta-learning to propose better prompts
            # This is why fast learning doesn't move slow weights
            updated_prompt, fast_reward = fast_engine.optimize_step(
                current_prompt=current_prompt,
                examples=batch,
                reward_fn=reward_fn,
            )

            if fast_reward > best_reward:
                current_prompt = updated_prompt
                best_reward = fast_reward

        # ── Slow update (every step) ─────────────────────────────────────────
        # ← Slow weights receive cleaner gradient because prompt already absorbed
        # task-specific information. The RL signal focuses on general capability.
        batch = task_dataset.sample(n=16)
        slow_engine.train_step(
            prompt=current_prompt,    # uses the current best prompt
            examples=batch,
            reward_fn=reward_fn,
        )

    return model, current_prompt


# ── Key behaviors from the paper ─────────────────────────────────────────────
# 1. Fast weights acquire task signal faster than slow weights
#    (measured by reward improvement per step in first 1/3 of training)
# 2. At matched performance levels, FST has 70% lower KL from base
# 3. In continual learning, reset fast weights (new prompt) when task changes
#    but keep slow weights — the model retains general capability from slow path

The learning_rate=1e-5 for the slow engine, combined with the K-step interleaving, is what makes slow weights update conservatively. The prompt has already captured most of the task-specific signal, so the parameter gradient focuses on the residual general capability improvement. This is the mechanism that reduces KL divergence: less task-specific noise in the gradient → less drift from the base distribution.

Snippet Two: Continual Learning with FST and Why RL-Only Stalls

# FST continual learning: how FST handles changing task domains
# vs why RL-only stalls
# Source: GEPA blog post, arXiv:2605.12484 Section 4.3

def continual_learning_fst(
    model,
    task_sequence: list[dict],  # [{dataset: ..., reward: ..., name: ...}]
    fast_update_every_k: int = 10,
    steps_per_task: int = 2000,
):
    """
    FST in continual learning: train on task 1, then task 2, then task 3...

    RL-only failure mode on Task 2 after Task 1 training:
    - Model's weights have drifted toward Task 1 distribution
    - Task 1 patterns embedded in slow weights interfere with Task 2 learning
    - Policy gradient on Task 2 fights against Task 1 specialization
    - Result: stalled performance, learning curve flattens

    FST advantage:
    - Slow weights retained more general capability (lower KL from base)
    - When switching to Task 2: RESET fast weights to new initial prompt
    - Slow weights can adapt to Task 2 because they're not over-specialized
    ← THIS is the trick: slow weights stay generalizable because
      task-specific information was offloaded to fast weights during Task 1
    """

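    # Snapshot the base weights. state_dict() returns references to the live
    # tensors, so clone them; otherwise the snapshot changes as training runs.
    model_state_before = {k: v.detach().clone() for k, v in model.state_dict().items()}
    current_prompt = default_initial_prompt()  # hypothetical helper returning the starting prompt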

    task_performance = {}

    for task_idx, task in enumerate(task_sequence):
        print(f"Training on task {task['name']} (task {task_idx + 1}/{len(task_sequence)})")

        if task_idx > 0:
            # ← KEY: when switching tasks, reset FAST weights but keep SLOW weights
            # Reset fast weights: start with a fresh prompt for the new task domain
            # Keep slow weights: the parameters contain general capability
            # This is the dual-engine advantage: modularity between fast and slow
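            current_prompt = default_initial_prompt()  # reset fast
            # model weights are NOT reset: slow weights carry general capability
            # Note: fast_slow_training() as written in Snippet One always starts from
            # its own hard-coded initial prompt, so the reset also happens inside the
            # call; pass the prompt in explicitly if you adapt this code.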

        # Run FST for this task
        model, current_prompt = fast_slow_training(
            model=model,
            task_dataset=task['dataset'],
            reward_fn=task['reward'],
            fast_update_every_k=fast_update_every_k,
            total_rl_steps=steps_per_task,
        )

        # Evaluate
        task_performance[task['name']] = evaluate(model, current_prompt, task)

    return task_performance


def continual_learning_rl_only(model, task_sequence, steps_per_task):
    """
    RL-only baseline: shows the stalling behavior FST addresses.

    After Task 1 training, the model's parameter distribution has shifted.
    The RL gradient for Task 2 must fight against Task 1 specialization.
    The paper measures this as: reward improvement rate for Task 2 is
    significantly lower after Task 1 RL training vs. starting from base.
    """
    for task in task_sequence:
        # No prompt optimization: only parameter updates
        # StandardRLTrainer is a placeholder for any parameter-only RL trainer
        rl_trainer = StandardRLTrainer(model, task['reward'])
        for step in range(steps_per_task):
            batch = task['dataset'].sample(n=16)
            rl_trainer.train_step(batch)
            # ← Slow weight drift accumulates here with no fast weight buffer
            # Every step moves parameters further from base distribution
    return model

The reset-fast-keep-slow pattern at task boundaries is the continual learning design. Slow weights are the general-purpose engine; fast weights are the task-specific adapter. When the task changes, you replace the adapter and keep the engine. RL-only has no such separation: all task-specific information is embedded in the weights, and the engine becomes the adapter.
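The adapter-versus-engine split can be shown without any training code. The sketch below is a toy, not the paper's implementation: `FSTState` and `switch_task` are hypothetical names, and a plain dict stands in for real model weights. It only demonstrates what is reset and what is carried over at a task boundary.

```python
from dataclasses import dataclass, field

# Starting prompt, copied from Snippet One.
DEFAULT_PROMPT = "Solve the following problem step by step:"


@dataclass
class FSTState:
    """Toy container: `prompt` is the fast weight, `params` stands in for model weights."""
    prompt: str = DEFAULT_PROMPT
    params: dict = field(default_factory=dict)  # placeholder for real tensors
    slow_steps: int = 0                         # slow updates applied so far


def switch_task(state: FSTState, initial_prompt: str = DEFAULT_PROMPT) -> FSTState:
    """Reset the fast weights at a task boundary; carry the slow weights over untouched."""
    return FSTState(
        prompt=initial_prompt,        # adapter is replaced
        params=state.params,          # engine is kept (same object, no copy)
        slow_steps=state.slow_steps,  # training history of the slow path is preserved
    )


after_math = FSTState(prompt="tuned math prompt", params={"w": 1.5}, slow_steps=2000)
before_coding = switch_task(after_math)
print(before_coding.prompt)
print(before_coding.params is after_math.params)
```

An RL-only pipeline has no equivalent of `switch_task`: there is nothing to reset short of reloading the base weights, which would also discard whatever general capability the first task produced.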

FST in Action: End-to-End Worked Example

Task: Train a model to solve competition math problems (AIME/AMC-style) starting from a capable base model.

Baseline (RL-only via CISPO):

Training steps: 5,000
Final accuracy on held-out math: 72.3%
KL divergence from base: 0.84
Plasticity test (accuracy on coding after math training): -8.2% vs base
Training compute: ~100 GPU-hours

FST (fast engine: GEPA prompt optimization, slow engine: CISPO):

Fast update frequency: K = 10 (prompt updated every 10 RL steps)
Initial prompt: "Solve the following math problem step by step:"

─── Training progress (FST vs RL-only) ───────────────────────────────

Step 100:
  RL-only: 41.2% accuracy
  FST:     48.7% accuracy    ← fast weights already adapted to math notation

  Current FST prompt (after 10 fast updates):
  "Solve the following competition math problem. First, identify the key
   variables and constraints. Then work step-by-step, showing all
   algebraic manipulations. Box your final answer."

Step 500:
  RL-only: 58.9% accuracy
  FST:     64.1% accuracy    ← RL-only is still below 64% at this step

Step 1,500:
  RL-only: 68.4% accuracy   ← RL-only has passed FST's Step-500 level (64.1%) by now
  FST:     70.9% accuracy

Step 5,000 (convergence):
  RL-only: 72.3% accuracy   ← RL ceiling
  FST:     74.1% accuracy   ← FST ceiling is HIGHER (both paths contribute)

─── Post-training analysis ────────────────────────────────────────────

KL divergence from base:
  RL-only: 0.84
  FST:     0.25               ← 70% less KL, consistent with paper result

Plasticity test (train on new coding task after math training):
  RL-only: -8.2% vs base (residual math specialization interferes)
  FST:     -2.1% vs base     ← maintained plasticity; faster new task learning

─── Continual learning extension ──────────────────────────────────────

After math training, switch to coding task:
  RL-only: peaks at 43.1% (stalls; math-specialized weights interfere)
  FST:     reaches 51.4%     ← reset prompt, slow weights already general enough
  Base (no prior training):  37.2% starting accuracy

─── Sample efficiency summary ─────────────────────────────────────────
FST reaches 72% accuracy (RL ceiling) in ~1,700 steps
RL reaches 72% accuracy in ~5,000 steps
FST: 3× sample efficiency, consistent with paper's headline claim
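The two headline ratios in this worked example can be checked directly from the numbers shown. The snippet below uses only the figures from the log above (5,000 vs. ~1,700 steps, KL of 0.84 vs. 0.25); the helper names are hypothetical.

```python
def speedup(baseline_steps: float, method_steps: float) -> float:
    """How many times fewer steps the method needs to reach the same accuracy."""
    return baseline_steps / method_steps


def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


# Steps to reach 72% accuracy: RL-only ~5,000 vs. FST ~1,700.
print(f"sample efficiency: {speedup(5000, 1700):.2f}x")      # 2.94x

# KL divergence from base: RL-only 0.84 vs. FST 0.25.
print(f"KL reduction: {relative_reduction(0.84, 0.25):.1%}")  # 70.2%
```

Both values round to the headline figures (3× and 70%), so the example is internally consistent on those two claims.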

Why This Design Works, and What It Trades Away

The theoretical grounding is Complementary Learning Systems (CLS) theory, originally from neuroscience: the hippocampus enables rapid, episodic learning (fast weights); the neocortex enables slow, semantic consolidation (slow weights). These two systems work together in human cognition precisely because they have different timescales and different capacities. The hippocampus rapidly absorbs specific experiences; the neocortex gradually extracts general patterns. Without the hippocampus, learning is slow. Without the neocortex, learning doesn't generalize.

FST operationalizes this in LLM training: the prompt (fast weights) absorbs task-specific experience rapidly and cheaply. The parameters (slow weights) consolidate general capability gradually and persistently. The interleaved update schedule creates the temporal separation that makes CLS work: fast weights update frequently on task-specific signals; slow weights update continuously on a cleaner, more general signal.

The 70% KL divergence reduction is the most practically important number in the paper. Lower KL means the trained model's output distribution stays closer to the base model's; KL divergence compares the two models' predicted token distributions, not their raw weights. This matters for three reasons: (1) the model retains general capabilities it had before task training, (2) the model can be more efficiently retrained on new tasks (plasticity), and (3) the model's behavior is more predictable because it hasn't drifted far from a well-characterized base.
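For readers who want to see what "KL divergence from base" measures, here is a minimal sketch on toy logits, using only the standard library. It assumes the common RL convention of KL(trained || base) averaged over token positions; the paper's exact estimator and direction are not specified in this article, so treat both as assumptions. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_token_kl(trained_logits, base_logits):
    """Average KL(trained || base) over token positions (one logit list per position)."""
    kls = [kl_divergence(softmax(t), softmax(b)) for t, b in zip(trained_logits, base_logits)]
    return sum(kls) / len(kls)


# Toy example: two positions, vocabulary of three tokens.
base = [[2.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
mild_drift = [[2.2, 1.0, 0.0], [0.6, 0.5, 0.4]]
heavy_drift = [[5.0, 0.0, 0.0], [3.0, 0.0, -1.0]]

print(mean_token_kl(mild_drift, base) < mean_token_kl(heavy_drift, base))
```

In a real run the logits would come from the trained and base models scoring the same sampled sequences; the comparison above only shows that sharper departures from the base distribution produce a larger value.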

What FST trades away:

Hyperparameter complexity. FST introduces K (fast update frequency) and the number of prompt candidates per fast step. The right K depends on task complexity, data batch size, and the ratio of task-specific to general signal in the data. The paper evaluates K in {5, 10, 20} and finds K=10 robust across tasks, but operationalizing this for a new domain requires validation.

Compute overhead per step. Each fast update scores every candidate prompt on a batch (5 candidates in Snippet One). With one fast update every K=10 RL steps, that is 5 candidate evaluations per 10 RL steps, or 0.5 extra batch evaluations per RL step: roughly 50% compute overhead vs. RL-only, if one candidate evaluation costs about as much as one RL step. The real figure depends on the evaluation batch size and on how much of an RL step is spent on the backward pass. The sample efficiency gain (3×) dominates this overhead, but the compute arithmetic is important for practitioners planning training budgets.

Prompt optimization quality dependence. FST's benefits depend on the quality of the fast engine (GEPA). A poor prompt optimizer will generate low-quality fast weights that add noise rather than reduce it, degrading the signal for the slow engine. The paper uses GEPA with reflection-based proposal, which is state-of-the-art prompt optimization. Teams using simpler prompt optimization approaches may see smaller FST gains.
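The compute-overhead estimate above is simple enough to parameterize. The sketch below is a back-of-the-envelope model, not a measurement: `eval_cost_ratio` (the cost of scoring one candidate prompt relative to one RL step) is an assumption you would need to measure for your own setup, and both function names are hypothetical.

```python
def fast_engine_overhead(n_candidates: int, k: int, eval_cost_ratio: float = 1.0) -> float:
    """Extra compute per RL step from the fast engine, as a fraction of one RL step.

    n_candidates:    candidate prompts scored at each fast update
    k:               RL steps between fast updates
    eval_cost_ratio: cost of scoring one candidate relative to one RL step (assumed)
    """
    return n_candidates * eval_cost_ratio / k


def net_compute_ratio(step_speedup: float, overhead: float) -> float:
    """FST compute to reach the RL-only ceiling, relative to RL-only compute."""
    return (1.0 + overhead) / step_speedup


overhead = fast_engine_overhead(n_candidates=5, k=10)   # 0.5, the article's estimate
print(overhead)
print(net_compute_ratio(step_speedup=3.0, overhead=overhead))  # 0.5

# Sensitivity to the assumed evaluation cost:
for ratio in (0.5, 1.0, 2.0):
    print(ratio, fast_engine_overhead(5, 10, eval_cost_ratio=ratio))
```

Under the article's assumptions, FST reaches the RL-only ceiling with about half the total compute (1.5× cost per step, one third the steps). If candidate evaluation turns out to cost twice as much as an RL step, the overhead doubles to 100% and the net saving shrinks accordingly.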

Technical Moats

The theoretical connection to CLS is the moat that makes FST principled, not just empirical. Dozens of papers have combined prompt optimization and RL in some form. What makes FST defensible is that it follows the CLS theoretical framework precisely: fast weights are high-plasticity, task-specific, and reset between tasks; slow weights are low-plasticity, general-purpose, and preserved across tasks. The experimental validation (plasticity tests, continual learning, KL divergence) directly tests CLS-derived predictions. A competing approach that combines prompt optimization and RL without this theoretical structure cannot make the same predictive claims.

The GEPA implementation quality. Fast-Slow Training's performance depends on the quality of the fast engine. GEPA uses reflective mutation (an LLM proposes prompt changes based on failure analysis), merge operations (combining successful prompt elements), and Pareto-based candidate selection, which keeps candidates that are best on at least some training examples rather than only the single best on average. A simpler prompt optimizer would produce lower-quality fast weights and correspondingly smaller FST gains. GEPA's architectural sophistication is a real component of FST's results.

The CISPO RL variant. The slow engine uses CISPO (Clipped Importance Sampling Policy Optimization), a sample-efficient variant that reuses off-policy trajectories via importance weighting. This is more efficient than standard PPO in the FST context because the fast weight updates between RL steps create a mild off-policy distribution shift that CISPO handles correctly. Standard PPO would either be overly conservative (clipping too aggressively) or unstable (ignoring the off-policy shift). The RL variant choice is part of why the 3× sample efficiency claim holds.
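As a rough illustration of the clipped importance weighting described above, the sketch below computes per-token surrogate terms with a clipped importance ratio. It is a simplified reading of the idea, not CISPO as published or as used in FST: the clipping bounds are placeholder values, the function name is hypothetical, and the stop-gradient on the weight (essential in real training) is only noted in a comment because no autodiff framework is used here.

```python
import math


def clipped_is_terms(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """Per-token surrogate terms: clip(ratio) * advantage * log-prob under the new policy.

    logp_new:   token log-probs under the current policy
    logp_old:   token log-probs under the policy that generated the data
    advantages: per-token advantage estimates
    eps_low / eps_high: placeholder clipping bounds (assumed, not from the paper)
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio pi_new / pi_old
        # Clip the weight itself. In an autodiff framework this weight would be
        # treated as a constant (stop-gradient), so every token still contributes
        # a gradient through lp_new, just with a bounded weight.
        weight = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        terms.append(weight * adv * lp_new)
    return terms


# Token 0 is on-policy (ratio 1); token 1 has drifted (ratio e ~ 2.72, clipped to 1.2).
terms = clipped_is_terms(
    logp_new=[-1.0, -1.0],
    logp_old=[-1.0, -2.0],
    advantages=[1.0, 1.0],
)
print(terms)
```

The training loss would be the negative mean of these terms. The point relevant to FST is that a prompt change between RL steps shifts the sampling distribution slightly, and a bounded importance weight keeps such mildly off-policy tokens usable without letting them dominate the update.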

Insights

Insight One: The 3× sample efficiency claim masks the more important result, which is the higher asymptotic ceiling. Sample efficiency says "reach the same performance faster." Higher ceiling says "reach performance that RL-only cannot reach at all." Both are true of FST, but practitioners focused on compute budgets will read "3× faster" and miss that they're also getting a fundamentally better trained model.

The asymptotic ceiling improvement means that even with the same training budget, FST produces a better model than RL-only. The paper shows this via ScaleRL-style scaling law fits: at convergence, FST's extrapolated ceiling is higher than RL-only's. This is not just efficiency. It is a qualitatively different training outcome. The reason is that both the fast and slow paths are optimizing for reward, and their contributions add rather than substitute. The slow weights handle what the fast weights cannot (deep capability integration); the fast weights handle what the slow weights would otherwise specialize on (task-specific patterns). Together they cover more of the performance space than either alone.
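The difference between "faster" and "higher ceiling" is easiest to see on a saturating curve. The sketch below uses a sigmoid-in-compute form of the kind ScaleRL-style fits use; the exact functional form and every parameter value here are assumptions chosen for illustration, not fitted values from the paper.

```python
def reward_curve(compute: float, r0: float, asymptote: float, c_mid: float, exponent: float) -> float:
    """Saturating reward-vs-compute curve.

    r0:        reward before training
    asymptote: ceiling the curve approaches as compute grows
    c_mid:     compute at which half of the total gain is reached
    exponent:  how sharply the curve rises around c_mid
    """
    if compute <= 0:
        raise ValueError("compute must be positive")
    return r0 + (asymptote - r0) / (1.0 + (c_mid / compute) ** exponent)


# Hypothetical parameters, for illustration only.
baseline = dict(r0=0.40, asymptote=0.72, c_mid=1500.0, exponent=1.0)
faster_only = dict(baseline, c_mid=500.0)                      # same ceiling, reached sooner
faster_and_higher = dict(baseline, c_mid=500.0, asymptote=0.75)  # sooner and higher

for steps in (500, 5000, 500_000):
    row = [reward_curve(steps, **p) for p in (baseline, faster_only, faster_and_higher)]
    print(steps, [round(r, 3) for r in row])
```

A method that only lowers `c_mid` converges to the same value as the baseline given enough compute. A method that also raises `asymptote` stays ahead no matter how long the baseline trains, which is the distinction this insight is drawing.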

Insight Two: The continual learning experiment is the most important result in the paper for practitioners thinking about production deployment of LLM-based systems, and it gets the least attention in coverage of FST.

The scenario: a deployed model needs to sequentially learn Task 1, then Task 2, then Task 3. RL-only stalls on Task 2 because Task 1 specialization is baked into the weights. FST continues learning Task 2 because the slow weights retained general capability. This is directly applicable to multi-task deployment, model fine-tuning pipelines, and any scenario where a model needs to adapt to new domains without forgetting prior capabilities. The "RL stalls" result is not a minor benchmark observation. It is a fundamental failure mode of parameter-only training that FST addresses structurally.

Takeaway

Fast-Slow Training was theoretically anticipated by cognitive science about 30 years before anyone applied it to language models, and the theoretical prediction (fast+slow beats either alone, with less forgetting) held up when tested empirically on LLMs. The 1995 Complementary Learning Systems paper by McClelland, McNaughton, and O'Reilly predicted that hippocampal-neocortical coordination would produce better learning and retention than either system alone. FST validated this on a completely different computational substrate. The theory was right; the implementation just waited for large language models to make it tractable.

This is not just a cute observation. It means FST has a theoretical prior that predicts its empirical results. When a system matches theoretical predictions derived from a different domain (neuroscience → LLM training), the theoretical grounding is more likely to be correct than a purely empirical finding. Teams building on FST can use CLS theory to make predictions about FST behavior in new settings (e.g., how K should scale with task complexity) rather than requiring full experimental validation of every new configuration.

TL;DR For Engineers

FST (arXiv:2605.12484, May 2026, Berkeley + UT Austin) interleaves prompt optimization (fast weights, via GEPA) and RL (slow weights, via CISPO) during training. Prompt = fast weights, absorbs task-specific signal; parameters = slow weights, focuses on general capability. Interleaved every K steps (K≈10 works across tasks).
Key results: 3× sample efficiency vs. RL-only (reach RL ceiling in 1/3 the steps), higher asymptotic ceiling (both engines contribute, their gains add), 70% lower KL divergence from base (slow weights don't drift toward task specifics), better continual learning (RL-only stalls; FST continues).
The mechanism: fast weights absorb task-specific noise first, leaving a cleaner gradient signal for the slow weights. Slow weights then make general capability improvements on a lower-noise signal. This is the noise-filtering function of the fast path.
Theoretical basis: Complementary Learning Systems (CLS) theory from cognitive neuroscience (hippocampus = fast, neocortex = slow). FST is the first clean instantiation of CLS in LLM training. Theory predicted the empirical results before they were measured.
Implementation: GEPA (open-source, gepa-ai/gepa) + CISPO RL. Compute overhead: roughly 50% more compute per RL step than RL-only, dominated by candidate evaluation in the fast engine. The 3× efficiency gain far exceeds this overhead.

Both Engines, Not One

FST's most important design decision is the one it doesn't make: it does not choose between in-context learning and in-weights learning. It runs both. The theoretical argument is sound (CLS), the empirical results are clear (3×, 70%, higher ceiling, continual learning), and the implementation exists in GEPA. The field's default is to pick one optimization path and scale it. FST argues that the right architecture is two paths that divide the learning problem along its natural grain: task-specific goes to fast, general goes to slow.

The 30-year-old neuroscience theory said this was right. The 2026 experiments confirmed it.

References

Learning, Fast and Slow: Towards LLMs That Adapt Continually, arXiv:2605.12484, Tiwari, Sareen, Agrawal, Gonzalez, Zaharia, Keutzer, Dhillon, Agarwal, Khatri, May 2026
GEPA blog post: Learning Fast and Slow, May 11, 2026
GEPA GitHub (gepa-ai/gepa) — the open-source prompt optimization engine used as FST's fast engine
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation, arXiv:2505.11439 — related closed-form fast weights approach for test-time adaptation
DualNet: Continual Learning, Fast and Slow, arXiv:2403.04245 — predecessor applying dual-rate learning to continual learning
The Art of Scaling Reinforcement Learning Compute for LLMs (ScaleRL), arXiv:2510.13786: the scaling law framework used to fit FST vs. RL-only asymptotes
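Why There Are Complementary Learning Systems in the Hippocampus and Neocortex, McClelland, McNaughton, and O'Reilly (1995): the original neuroscience theory that FST instantiates in LLM training
What Learning Systems Do Intelligent Agents Need? Complementary Learning Systems Theory Updated, Kumaran, Hassabis, and McClelland (2016): a later update of CLS theory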

Fast-Slow Training (FST, arXiv:2605.12484, May 2026) interleaves prompt optimization (fast weights, via GEPA) and RL parameter updates (slow weights, via CISPO) during LLM training, treating the prompt as a high-plasticity task-specific adapter and model parameters as a low-plasticity general-capability substrate. FST is 3× more sample-efficient than RL-only (reaching the same performance in 1/3 the steps), achieves a higher asymptotic performance ceiling (both paths contribute additively), maintains 70% lower KL divergence from the base model (preserving plasticity), and continues acquiring new tasks in continual learning scenarios where parameter-only RL stalls. The approach is grounded in Complementary Learning Systems (CLS) neuroscience theory and implemented in the open-source GEPA framework.
