The core argument is not that prompts are better than weights. It is that a scalar reward throws away almost everything a language model could learn from a single rollout, and natural language reflection does not.
SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 18, 2026
When you train an LLM-based system with reinforcement learning, every rollout produces a rich trace: the reasoning the model went through, the tool calls it made, the tool outputs it received, sometimes even compiler error messages or evaluator diagnostics. GRPO and similar RL methods compress all of that into a single scalar reward and use it to estimate a policy gradient. Everything except the final number is discarded before learning happens.
GEPA (Agrawal, Khattab et al., UC Berkeley, Stanford, MIT, Databricks, ICLR 2026 Oral) asks a different question: what if the system kept the language instead of collapsing it to a number? The trace already contains the diagnostic information needed to understand why a rollout succeeded or failed. An LLM can read that trace and propose a specific, targeted edit to the prompt that caused the problem. This is reflection as a learning mechanism, evolutionary search as the optimization loop, and Pareto-frontier sampling as the mechanism that keeps the search from getting stuck.
The numbers are not marginal. Across six tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, using up to 35x fewer rollouts. It beats the leading prompt optimizer, MIPROv2, by over 10%, including a 12-point accuracy gain on AIME-2025.
Scope: the formal compound AI system definition GEPA optimizes over, the three-part algorithm (genetic prompt evolution, reflective mutation, Pareto-based candidate selection), the feedback function design that extracts diagnostic text from evaluation traces, and the actual prompt evolution example from the paper. Not covered: GEPA's preliminary results on inference-time code optimization (NPUEval, KernelBench) beyond a brief mention, or the System Aware Merge operator in detail.
What It Actually Does
GEPA optimizes prompts, not weights, for what the paper calls a compound AI system: any modular pipeline of one or more LLM calls, interleaved with tool calls, orchestrated by arbitrary control flow. This covers agents, multi-agent systems, and general scaffolding patterns like ReAct.
Formally, a system Φ = (M, C, X, Y) consists of language modules M, control flow C, and global input/output schemas X, Y. Each module Mi = (πi, θi, Xi, Yi) has a prompt πi and weights θi. GEPA optimizes Π (the prompts) while typically leaving Θ (the weights) frozen. The optimization target:
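⟨Π*, Θ*⟩ = argmax_{⟨Π,Θ⟩} E_{(x,m)∼T}[ μ(Φ(x; ⟨Π,Θ⟩), m) ] subject to #rollouts ≤ B
Here (x, m) is a task instance drawn from the task distribution T, where x is the input and m is the evaluator metadata (for example, a gold answer), and μ is the scalar evaluation metric.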
This is the same objective reinforcement learning methods optimize. The difference is entirely in how the budget B of rollouts gets converted into improvement.
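One way to picture the formalization above is as plain data structures. The sketch below is an illustration only, not the paper's code or the gepa library's API: the names `Module`, `CompoundSystem`, `with_prompt`, and `two_hop_flow` are assumptions introduced here, and the weights are represented by an opaque placeholder string because GEPA typically leaves them frozen.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Module:
    """One language module M_i = (pi_i, theta_i, X_i, Y_i)."""
    name: str
    prompt: str           # pi_i: the text GEPA rewrites
    weights_id: str       # theta_i: placeholder handle to frozen model weights
    input_fields: tuple   # X_i
    output_fields: tuple  # Y_i


@dataclass(frozen=True)
class CompoundSystem:
    """Phi = (M, C, X, Y)."""
    modules: dict           # M: module name -> Module
    control_flow: Callable  # C: orchestrates module and tool calls
    input_fields: tuple     # X
    output_fields: tuple    # Y

    def with_prompt(self, module_name: str, new_prompt: str) -> "CompoundSystem":
        """Return a new candidate that differs from this one in a single prompt."""
        updated = dict(self.modules)
        updated[module_name] = replace(updated[module_name], prompt=new_prompt)
        return replace(self, modules=updated)


def two_hop_flow(modules, question):
    """Hypothetical control flow; a real one would call an LLM and a retriever."""
    raise NotImplementedError


seed = CompoundSystem(
    modules={
        "summarize": Module(
            "summarize", "Summarize the passages.", "qwen3-8b",
            ("question", "passages"), ("summary_1",),
        ),
        "query_hop2": Module(
            "query_hop2",
            "Given the fields question, summary_1, produce the fields query.",
            "qwen3-8b", ("question", "summary_1"), ("query",),
        ),
    },
    control_flow=two_hop_flow,
    input_fields=("question",),
    output_fields=("answer",),
)

# A mutation touches one prompt; weights and control flow are carried over as-is.
child = seed.with_prompt("query_hop2", "Target information missing from summary_1.")
```

Treating each candidate as an immutable value makes it cheap to keep a pool of candidates that differ in a single module's prompt, which is the shape of search space the rest of the algorithm assumes.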
Evaluated on four core tasks, two models:
| Task | Type | Models tested |
|---|---|---|
| HotpotQA | Multi-hop reasoning | Qwen3 8B, GPT-4.1 mini |
| HoVer | Retrieval-augmented verification | Qwen3 8B, GPT-4.1 mini |
| IFBench | Instruction following | Qwen3 8B, GPT-4.1 mini |
| PUPA | Privacy-aware delegation | Qwen3 8B, GPT-4.1 mini |
Code: github.com/gepa-ai/gepa
The Architecture, Unpacked

Focus on the feedback function μf in Step 3. The standard evaluation metric μ produces a scalar. GEPA's insight is that most evaluation pipelines already generate rich diagnostic text on the way to that scalar (a code evaluator runs compilation, then execution, then profiling, each producing logs) and that text is normally thrown away. μf is the operationalization of "don't throw it away."
The Code, Annotated
Snippet One: The Feedback Function and Reflective Mutation Core
# GEPA's reflective prompt mutation mechanism
# Reconstructed from arXiv:2507.19457 Section 3.2, Algorithm 1 (lines 9-11)
# The key departure from standard RL: keep the trace, don't collapse it early
from dataclasses import dataclass


@dataclass
class RolloutTrace:
    """Everything generated during one execution of the compound system."""
    module_inputs: dict      # what each module received
    module_outputs: dict     # what each module produced
    reasoning_chains: dict   # intermediate reasoning per module
    tool_calls: list         # any external tool invocations
    tool_outputs: list       # results from those tool calls


@dataclass
class FeedbackResult:
    """
    The output of μf, NOT just μ.
    ← THIS is the trick: standard RL only keeps `score`. GEPA keeps both.
    """
    score: float         # the same scalar a standard metric μ would return
    feedback_text: str   # diagnostic text describing WHY the score is what it is
def feedback_function_code_eval(
    rollout: RolloutTrace,
    test_cases: list,
) -> FeedbackResult:
    """
    Example feedback function for a code-generation module.
    ← Standard metric μ would just return pass_rate (a float).
    μf additionally captures the compiler/runtime diagnostics that
    would normally be discarded the moment the scalar is computed.
    `execute_code` and `CompileError` are assumed to be supplied by the
    evaluation harness; they are not defined in this snippet.
    """
    results = []
    for test in test_cases:
        try:
            output = execute_code(rollout.module_outputs["code"], test.input)
            passed = (output == test.expected)
            # Record wrong answers too, so a failed test is never reported as a pass
            message = None if passed else (
                f"Wrong output on input {test.input}: "
                f"expected {test.expected!r}, got {output!r}"
            )
            results.append((passed, message))
        except CompileError as e:
            # ← THIS diagnostic text is exactly what a scalar reward discards
            # A standard RL pipeline would just see "reward = 0" here
            results.append((False, f"Compilation failed: {e}"))
        except RuntimeError as e:
            results.append((False, f"Runtime error on input {test.input}: {e}"))

    n_passed = sum(1 for passed, _ in results if passed)
    # Guard against an empty test suite (avoids division by zero)
    pass_rate = n_passed / len(results) if results else 0.0
    failure_diagnostics = [msg for passed, msg in results if not passed and msg]
    if failure_diagnostics:
        # Cap at three diagnostics to keep the reflection prompt short
        feedback_text = (
            f"Passed {n_passed}/{len(results)} tests. "
            f"Failures: {'; '.join(failure_diagnostics[:3])}"
        )
    elif results:
        feedback_text = "All tests passed."
    else:
        feedback_text = "No test cases were provided."
    return FeedbackResult(score=pass_rate, feedback_text=feedback_text)
def reflective_prompt_mutation(
    current_prompt: str,
    module_name: str,
    minibatch_traces: list[tuple[RolloutTrace, FeedbackResult]],
    reflection_llm,
) -> str:
    """
    The core mutation step (Algorithm 1, line 11: UPDATEPROMPT).
    ← This is an LLM call, not a gradient step. The reflection_llm reads
    the actual traces and feedback_text, and proposes a SPECIFIC edit
    grounded in what it observed, rather than a direction in weight-space
    derived from backpropagating a scalar.
    """
    trace_summary = "\n\n".join(
        f"Rollout {i}: input={t.module_inputs}, output={t.module_outputs}\n"
        f"Feedback: {f.feedback_text} (score: {f.score:.2f})"
        for i, (t, f) in enumerate(minibatch_traces)
    )
    # The prompt body is kept flush-left so no stray indentation reaches the LLM
    reflection_prompt = f"""
You are improving the prompt for the module "{module_name}".
CURRENT PROMPT:
{current_prompt}
OBSERVED ROLLOUTS AND FEEDBACK:
{trace_summary}
TASK: Identify patterns in what went wrong (or right). Propose a new,
more specific version of this module's prompt that addresses the
diagnosed failure modes. Be concrete: reference the specific kinds
of mistakes you observed.
"""
    # ← The reflection_llm performs IMPLICIT CREDIT ASSIGNMENT here:
    # attributing the final outcome to specific elements of the prompt
    # This is exactly the kind of reasoning a policy gradient cannot do,
    # because a policy gradient only sees "reward went up or down"
    new_prompt = reflection_llm.complete(reflection_prompt)
    return new_prompt
The distinction between μ and μf is the single most important design decision in the paper. Any existing evaluation pipeline that produces a scalar score almost always has diagnostic text available somewhere in its execution before that score is computed. Wrapping μ into μf to surface that text is a small implementation change with a large effect on what the optimizer can learn from each rollout.
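To make the "small implementation change" concrete outside the code-execution case, here is a minimal sketch of wrapping a scalar retrieval metric into a feedback function. It is an illustration under stated assumptions, not the paper's μf for any of its tasks: `retrieval_recall`, `retrieval_recall_with_feedback`, and the document identifiers are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class FeedbackResult:
    score: float
    feedback_text: str


def retrieval_recall(retrieved: set, gold: set) -> float:
    """Plain scalar metric (mu): fraction of gold documents that were retrieved."""
    if not gold:
        return 1.0
    return len(retrieved & gold) / len(gold)


def retrieval_recall_with_feedback(retrieved: set, gold: set) -> FeedbackResult:
    """Feedback function (mu_f): same score, plus the text the scalar hides."""
    score = retrieval_recall(retrieved, gold)
    missing = sorted(gold - retrieved)
    irrelevant = sorted(retrieved - gold)
    if missing:
        text = (
            f"Retrieved {len(gold) - len(missing)}/{len(gold)} gold documents. "
            f"Missing: {', '.join(missing)}."
        )
    else:
        text = "All gold documents were retrieved."
    if irrelevant:
        text += f" Retrieved but not relevant: {', '.join(irrelevant)}."
    return FeedbackResult(score=score, feedback_text=text)


result = retrieval_recall_with_feedback(
    retrieved={"Madeira", "Funchal"},
    gold={"Madeira", "Madeira archipelago"},
)
print(result.score)          # 0.5
print(result.feedback_text)
```

The scalar is unchanged, so the wrapped metric can still be used anywhere μ was; the extra field is only consumed by the reflection step.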
Snippet Two: Pareto-Based Candidate Selection
# GEPA's Pareto-frontier candidate selection
# Reconstructed from arXiv:2507.19457 Algorithm 2
# Prevents the optimizer from getting stuck mutating one dominant strategy
import random
from collections import defaultdict


def select_candidate_pareto(
    candidate_pool: list,   # list of candidate systems Φ
    scores_matrix: dict,    # scores_matrix[candidate_idx][instance_idx] = score
    train_instances: list,
) -> int:
    """
    Pareto-based candidate selection (Algorithm 2 in the paper).
    ← WHY not just pick the globally best candidate every time?
    A naive "always mutate the best" strategy converges to ONE strategy
    and then exhausts the rollout budget trying and failing to improve it
    further, never exploring alternative strategies that might generalize
    better even if they score slightly lower on average today.
    """
    n_candidates = len(candidate_pool)
    n_instances = len(train_instances)

    # ── Step 1: find the best score achieved on EACH instance, across ALL candidates ──
    best_score_per_instance = {}
    winners_per_instance = defaultdict(set)
    for instance_idx in range(n_instances):
        scores_here = [
            scores_matrix[c][instance_idx] for c in range(n_candidates)
        ]
        best = max(scores_here)
        best_score_per_instance[instance_idx] = best
        # ← THIS is the trick: a candidate doesn't need to be globally best,
        # it just needs to be tied for best on AT LEAST ONE instance
        # to enter the Pareto frontier
        for c in range(n_candidates):
            if scores_matrix[c][instance_idx] == best:
                winners_per_instance[instance_idx].add(c)

    # ── Step 2: collect the union of all per-instance winners ────────────────────
    pareto_candidates = set()
    for winners in winners_per_instance.values():
        pareto_candidates.update(winners)

    # ── Step 3: remove dominated candidates ───────────────────────────────────────
    # Simplified to standard Pareto dominance: c1 is dominated if some c2 scores
    # at least as well on EVERY instance and strictly better on at least one.
    # (The paper's actual dominance check is more involved; this captures the
    # core idea.)
    non_dominated = set(pareto_candidates)
    for c1 in list(pareto_candidates):
        for c2 in pareto_candidates:
            if c1 == c2:
                continue
            if all(
                scores_matrix[c2][i] >= scores_matrix[c1][i]
                for i in range(n_instances)
            ) and any(
                scores_matrix[c2][i] > scores_matrix[c1][i]
                for i in range(n_instances)
            ):
                non_dominated.discard(c1)  # c1 is dominated by c2
                break

    # ── Step 4: sample proportional to frequency in the Pareto front ─────────────
    # Fix an explicit order so each weight lines up with its candidate
    survivors = sorted(non_dominated)
    frequency = {
        c: sum(1 for winners in winners_per_instance.values() if c in winners)
        for c in survivors
    }
    # ← Candidates that are "winners" on MANY instances get sampled more often,
    # but EVERY non-dominated candidate has nonzero probability
    # This is the diversity-preserving mechanism: a candidate that's uniquely
    # good at ONE hard instance type still gets explored, not discarded
    total = sum(frequency.values())
    weights = [frequency[c] / total for c in survivors]
    selected = random.choices(survivors, weights=weights, k=1)[0]
    return selected


# ── WHY THIS MATTERS: comparing the two strategies ────────────────────────────
# Naive greedy: candidate pool converges to ONE lineage, search tree is a
# single deep branch, budget exhausted trying to improve one strategy
# (paper's Figure 6a shows exactly this failure mode empirically)
# Pareto-based: candidate pool maintains MULTIPLE viable lineages,
# search tree branches, different strategies for different instance types
# survive and get recombined via System Aware Merge
The dominance check in Step 3 is doing real work: in this simplified version, a per-instance winner is dropped only when some other candidate scores at least as well on every instance and strictly better on at least one, so any candidate that is uniquely best somewhere always survives. This is what keeps a search from collapsing onto a single "good enough" strategy too early, which is exactly the local-optimum failure mode the paper demonstrates empirically when comparing against naive always-pick-best selection.
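A toy score matrix shows how the sampling distribution differs from greedy selection. The sketch below is a compact, self-contained re-implementation of the simplified selection logic described above (per-instance winners, standard Pareto dominance, frequency-weighted sampling); the function name `pareto_sampling_weights` and the scores are made up for illustration and do not come from the paper.

```python
def pareto_sampling_weights(scores):
    """scores[c][i] is candidate c's score on instance i.

    Returns {candidate_index: sampling_probability} for surviving candidates.
    """
    n_candidates = len(scores)
    n_instances = len(scores[0])

    # Per-instance winners: every candidate tied for the best score on instance i
    winners = []
    for i in range(n_instances):
        best = max(scores[c][i] for c in range(n_candidates))
        winners.append({c for c in range(n_candidates) if scores[c][i] == best})
    frontier = set().union(*winners)

    def dominates(a, b):
        """a is at least as good as b everywhere and strictly better somewhere."""
        return (
            all(scores[a][i] >= scores[b][i] for i in range(n_instances))
            and any(scores[a][i] > scores[b][i] for i in range(n_instances))
        )

    survivors = sorted(
        c for c in frontier
        if not any(dominates(other, c) for other in frontier if other != c)
    )
    # Weight each survivor by the number of instances on which it is a winner
    frequency = {c: sum(c in w for w in winners) for c in survivors}
    total = sum(frequency.values())
    return {c: frequency[c] / total for c in survivors}


scores = [
    [1.0, 1.0, 0.0, 0.0],  # candidate 0: solves instances 0 and 1
    [0.0, 0.0, 0.0, 1.0],  # candidate 1: only solves instance 3
    [1.0, 0.0, 0.0, 0.0],  # candidate 2: dominated by candidate 0
    [0.5, 0.5, 1.0, 0.5],  # candidate 3: highest mean score (0.625)
]
weights = pareto_sampling_weights(scores)
print(weights)  # {0: 0.5, 1: 0.25, 3: 0.25}
```

A greedy selector would pick candidate 3 every time because it has the highest mean. Here candidate 3 receives only a quarter of the probability mass, candidate 1 stays alive despite a mean of 0.25 because it is the only one that solves instance 3, and candidate 2 is removed because candidate 0 matches or beats it everywhere.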
It In Action: End-to-End Worked Example
Task: Evolve the second-hop query generation prompt in a multi-hop QA system (HotpotQA), from the paper's own example
Seed prompt (the starting point, before any optimization):
Given the fields question, summary_1, produce the fields query.
This is a minimal, generic instruction. It tells the module almost nothing about what makes a good second-hop query in a multi-hop retrieval system.
Step 1: Rollouts on minibatch reveal a pattern
GEPA executes this seed prompt across several HotpotQA training instances. The traces show: the module frequently generates second-hop queries that simply restate or lightly paraphrase the original question, rather than targeting the specific missing information that the first-hop retrieval didn't cover. The feedback function (extending the retrieval evaluation metric) surfaces this as diagnostic text: queries that overlap heavily with first-hop queries, queries that fail to retrieve new relevant documents.
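As a rough illustration of how "the query just restates the question" could be turned into diagnostic text, the sketch below flags high token overlap between the original question and the second-hop query. This is an assumed heuristic for exposition, not the feedback function used in the paper; `token_set`, `paraphrase_feedback`, and the 0.6 threshold are hypothetical.

```python
import re


def token_set(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def paraphrase_feedback(question: str, second_hop_query: str, threshold: float = 0.6):
    """Return (jaccard_overlap, feedback_text) for a second-hop query."""
    q_tokens = token_set(question)
    s_tokens = token_set(second_hop_query)
    union = q_tokens | s_tokens
    overlap = len(q_tokens & s_tokens) / len(union) if union else 0.0
    if overlap >= threshold:
        text = (
            f"Second-hop query shares {overlap:.0%} of its tokens with the "
            "original question; it likely restates the question instead of "
            "targeting information missing from the first hop."
        )
    else:
        text = (
            f"Second-hop query shares {overlap:.0%} of its tokens with the "
            "original question; it appears to target new information."
        )
    return overlap, text


question = "What is the population of the region?"
restated, restated_text = paraphrase_feedback(question, "what is the population of the region")
targeted, targeted_text = paraphrase_feedback(question, "Madeira archipelago population in 2011")
print(restated_text)
print(targeted_text)
```

A lexical check like this is crude, but it shows the pattern: the evaluator already has what it needs to say something more specific than "score = 0", and that sentence is what the reflection model reads.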
Step 2: Reflective LLM diagnoses the failure and proposes a new prompt
GEPA's optimized prompt (excerpted from the paper, Figure 2):
You will be given two input fields: question and summary_1. Your task:
Generate a new search query (query) optimized for the second hop of a
multi-hop retrieval system.
Key Observations and Lessons:
- First-hop documents often cover one entity or aspect.
- Remaining relevant documents often involve connected or higher-level
concepts mentioned in summary_1 but not explicitly asked in the
original question. The query should target these missing, but
logically linked, documents.
- Avoid merely paraphrasing the original question or restating known
facts from summary_1.
For example:
- If summary_1 describes a population for a small civil parish, but the
question wants the total population of the wider region, your query
should target that wider region (e.g., "Madeira archipelago population
in 2011").
- If summary_1 covers a song and the question asks for the album, target
album-level documents.
Step 3: What changed and why it works
The evolved prompt does three things the seed prompt could not: it names the specific failure pattern (paraphrasing instead of targeting missing information), it gives a structural heuristic (target broader or related entities mentioned but not the focus of summary_1), and it grounds that heuristic in a concrete worked example pulled directly from an actual failure case (the Madeira archipelago population example). This is not a prompt a human engineer would necessarily write from first principles. It is a prompt distilled from observing actual failures.
Step 4: Measured results
HotpotQA, Qwen3 8B:
  Baseline (seed prompt): lowest score on the validation curve
  GRPO (24,000 rollouts): improves over baseline, but requires the full 24,000-rollout budget to reach its plateau
  MIPROv2: improves faster than GRPO early on, plateaus below GEPA
  GEPA: matches or exceeds GRPO's final score using a small fraction of the rollout budget; continues improving with additional rollouts
Aggregate across four tasks (Qwen3 8B):
  GEPA vs GRPO: +10% average, up to +19% on individual tasks
  GEPA vs MIPROv2: +14% aggregate gain for GEPA, versus +7% for MIPROv2
  Rollout efficiency: up to 35x fewer rollouts than GRPO for comparable or better performance
Why This Design Works, and What It Trades Away
The reflective mutation step works because natural language is a higher-bandwidth learning signal than a scalar reward, for the specific failure modes that show up in compound AI systems. When a multi-hop QA system fails, the failure is rarely "the policy needs a small nudge in every dimension." It's usually something specific and nameable: the second-hop query paraphrases instead of targeting new information. An LLM reading the trace can name that failure directly. A policy gradient, derived from a scalar reward across a high-dimensional prompt space, cannot localize the problem nearly as precisely, which is part of why RL methods need so many more rollouts to triangulate the same fix.
The Pareto-based selection mechanism works because compound AI systems often have genuinely different optimal strategies for different instance types, and a search that always mutates the single best-scoring candidate will converge on whichever strategy happens to dominate the training distribution on average, even if a different strategy would have generalized better to instance types underrepresented in that average. By keeping multiple non-dominated candidates alive and recombining them through System Aware Merge, GEPA preserves strategic diversity that a greedy search discards.
The minibatch-then-full-eval acceptance gate (Algorithm 1, lines 13-18) is a pragmatic cost control: most proposed mutations don't help, and testing every mutation on the full Pareto validation set would be wasteful. Testing cheaply on a small minibatch first, and only paying the cost of full evaluation when the minibatch result looks promising, is what makes the genetic search affordable within a realistic rollout budget.
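The acceptance gate described above can be sketched in a few lines. This is a simplified reading of the idea, not a transcription of Algorithm 1: `gated_evaluation`, the `evaluate` callable, and the strict "sum must improve" rule are assumptions made for the example, and candidates are plain strings standing in for full systems.

```python
def gated_evaluation(parent_minibatch_scores, child, minibatch, validation_set, evaluate):
    """Evaluate a mutated candidate cheaply first, fully only if it looks better.

    Returns (accepted, validation_scores_or_None, rollouts_used).
    """
    child_scores = [evaluate(child, example) for example in minibatch]
    rollouts_used = len(minibatch)

    # Reject without touching the validation set if the minibatch did not improve
    if sum(child_scores) <= sum(parent_minibatch_scores):
        return False, None, rollouts_used

    # Promising: pay for the full evaluation so the candidate can join the pool
    validation_scores = [evaluate(child, example) for example in validation_set]
    return True, validation_scores, rollouts_used + len(validation_set)


def toy_evaluate(prompt: str, example: str) -> float:
    """Toy metric: a 'prompt' solves an example if it mentions it."""
    return float(example in prompt)


minibatch = ["a", "b"]
validation_set = ["a", "b", "c", "d"]
parent_scores = [toy_evaluate("a", ex) for ex in minibatch]  # [1.0, 0.0]

rejected = gated_evaluation(parent_scores, "a", minibatch, validation_set, toy_evaluate)
accepted = gated_evaluation(parent_scores, "ab", minibatch, validation_set, toy_evaluate)
print(rejected)  # (False, None, 2)
print(accepted)  # (True, [1.0, 1.0, 0.0, 0.0], 6)
```

In this toy run the rejected mutation costs 2 rollouts instead of 6. Because most proposed mutations fail the gate, the saving compounds over the course of a run.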
What GEPA trades away:
GEPA optimizes Π (prompts) and typically does not touch Θ (weights). For tasks where the necessary capability genuinely requires new knowledge the base model lacks, no amount of prompt engineering, however well-reflected, will close that gap. RL methods that fine-tune weights can in principle learn capabilities a frozen model cannot express through prompting alone.
The reflection step itself costs LLM calls, and the quality of the resulting mutation is bounded by the reflection LLM's ability to correctly diagnose the failure from the trace. A weak or miscalibrated reflection model could propose plausible-sounding but wrong fixes, and the minibatch acceptance gate only catches mutations that fail to improve the visible minibatch score, not mutations that overfit to it.
The feedback function μf requires more engineering than a standard scalar metric μ. Extracting useful diagnostic text from an evaluation pipeline is straightforward for code execution (compiler errors are already text) but less obvious for tasks where the evaluator is itself just a learned scorer with no natural textual byproduct. The paper's strongest results come from tasks where this diagnostic text is naturally available.
Technical Moats
The compound system formalization as a foundation for general-purpose prompt optimization. Defining Φ = (M, C, X, Y) with explicit module-level prompts and weights, separable from the control flow that orchestrates them, is what allows GEPA to optimize arbitrary multi-module pipelines rather than being specialized to one task structure. This formalization, building on prior DSPy-lineage work (Khattab et al., Opsahl-Ong et al.), is a genuine prerequisite for the round-robin module selection and module-level reflective mutation that GEPA depends on.
Pareto-frontier maintenance at the candidate-pool level. Implementing per-instance dominance checking and frequency-weighted sampling correctly, at the scale of potentially hundreds of candidates across the full optimization run, requires careful bookkeeping of the scores matrix and ancestry tracking shown in the paper's Algorithm 1. Getting this wrong (for example, by approximating dominance incorrectly) reintroduces the local-optimum failure mode the mechanism exists to prevent.
The feedback function abstraction. Generalizing from "evaluation metric returns a scalar" to "evaluation metric returns a scalar plus diagnostic text" sounds like a small API change but requires rethinking how evaluators are built across very different task types: code execution, retrieval verification, instruction-following compliance, privacy-aware delegation. The paper's four-task evaluation suite each required a custom μf design, which is non-trivial domain-specific engineering, not a generic wrapper.
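The round-robin module selection mentioned above is simple to sketch once modules are addressable by name. The class below is a hypothetical illustration, not the gepa library's implementation; in particular, tracking the rotation separately per candidate is an assumption made here, and the module names are invented.

```python
from collections import defaultdict


class RoundRobinModuleSelector:
    """Cycle through a system's modules so every prompt gets its turn at mutation."""

    def __init__(self, module_names):
        self._names = list(module_names)
        # Position in the rotation, tracked separately for each candidate id
        self._next_index = defaultdict(int)

    def select(self, candidate_id) -> str:
        index = self._next_index[candidate_id]
        self._next_index[candidate_id] = (index + 1) % len(self._names)
        return self._names[index]


selector = RoundRobinModuleSelector(["summarize_hop1", "query_hop2", "answer"])
picks_for_candidate_0 = [selector.select(0) for _ in range(4)]
first_pick_for_candidate_1 = selector.select(1)
print(picks_for_candidate_0)
# ['summarize_hop1', 'query_hop2', 'answer', 'summarize_hop1']
print(first_pick_for_candidate_1)  # summarize_hop1
```

The point of the explicit module structure is visible here: without named, separable prompts there would be nothing to rotate over, and reflection would have to rewrite the whole pipeline at once.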
Insights
Insight One: GEPA's sample efficiency advantage is fundamentally about credit assignment, not about prompts being inherently better than weights as a representation. The paper's argument is precise: language provides a richer learning medium because an LLM reflecting on a trace can perform targeted, localized credit assignment ("this specific module's prompt caused this specific failure") that a scalar-reward policy gradient cannot do nearly as precisely. If you replaced the reflection step with a weaker LLM that produces vague or generic feedback, GEPA's advantage would likely shrink toward GRPO's. The mechanism, not the parameter type, is the source of the gain.
Insight Two: The paper's own ablation, comparing Pareto-based selection against naive greedy best-candidate selection, is arguably more important than the headline GRPO comparison, because it demonstrates that even within the prompt-evolution paradigm, the search strategy matters enormously. A team that implements reflective mutation correctly but uses naive greedy candidate selection will hit the same local-optimum wall the paper explicitly shows in Figure 6a. The "35x fewer rollouts than GRPO" headline obscures that GEPA's own internal ablations show meaningful gains are also available purely from better search strategy design, independent of the reflection mechanism.
Surprising Takeaway
GEPA shows promising preliminary results as an inference-time search strategy for code optimization, not just an offline prompt-training procedure. This means the same genetic-Pareto reflective search loop that optimizes a system's prompts before deployment can, in principle, run at inference time to search over candidate code implementations, using compiler and profiler feedback as the diagnostic signal in place of a held-out task metric. The boundary between "optimize the system once, then deploy it" and "search for the best solution at the moment you need it" starts to blur when the search mechanism itself, reflective mutation guided by rich textual feedback plus Pareto-preserving selection, is general enough to apply to both regimes. This reframes GEPA less as a prompt-tuning tool and more as a general-purpose reflective search algorithm that happens to have been validated first on prompts.
TL;DR For Engineers
GEPA (arXiv:2507.19457, ICLR 2026 Oral, github.com/gepa-ai/gepa) optimizes prompts for compound AI systems (Φ = modules + control flow + I/O schemas) using natural language reflection on rollout traces instead of policy gradients from scalar rewards.
Results: beats GRPO by 6% on average (up to 20%) using up to 35x fewer rollouts across six tasks. Beats MIPROv2 (leading prompt optimizer) by 10%+, including +12% accuracy on AIME-2025. On Qwen3 8B specifically: +10% on average and up to +19% over GRPO, which was trained with LoRA on a 24,000-rollout budget.
Core mechanism: feedback function μf returns (score, feedback_text) instead of just a scalar, surfacing diagnostic text (compiler errors, retrieval misses) that standard RL discards. An LLM reflects on this text plus execution traces to propose targeted, module-specific prompt edits.
Pareto-based candidate selection prevents local optima: candidates are sampled proportional to how often they appear in the per-instance Pareto frontier, not just by global best score. Naive greedy selection gets stuck mutating one dominant strategy and exhausts the rollout budget without exploring alternatives (paper's Figure 6a).
Tasks evaluated: HotpotQA (multi-hop reasoning), HoVer (retrieval verification), IFBench (instruction following), PUPA (privacy-aware delegation), across Qwen3 8B and GPT-4.1 mini. Also shows preliminary promise as an inference-time search strategy for code optimization (NPUEval, KernelBench).
The Scalar Reward Was Always Throwing Information Away
GEPA's contribution is showing, with rigorous head-to-head comparisons against both GRPO and the leading prompt optimizer, that the information discarded when collapsing a rollout trace into a scalar reward is recoverable, and recovering it produces both better final performance and dramatically better sample efficiency. This is not a claim that RL is obsolete. It is a claim that for the large and growing class of compound AI systems built from LLM modules with natural-language-interpretable behavior, reflection is a more efficient way to extract the available learning signal than gradient estimation from sparse rewards.
The Pareto-based selection mechanism is the part of the paper that deserves more attention than it usually gets. Sample-efficient reflection alone is not sufficient; a search strategy that preserves diverse, non-dominated strategies is what prevents that efficient learning signal from collapsing onto a single, possibly suboptimal, local strategy.
References
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, arXiv:2507.19457, Agrawal, Khattab et al., ICLR 2026 Oral
MIPROv2, Opsahl-Ong et al. The prior state-of-the-art prompt optimizer GEPA outperforms.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, Khattab et al. The compound AI system formalization GEPA builds on.
Group Relative Policy Optimization (GRPO), Shao et al. The RL baseline GEPA compares against.


