SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 5, 2026

The narrative around SWE-bench runs in two contradictory directions. One direction: AI can now resolve 50%+ of real GitHub issues autonomously, signaling imminent disruption of software development. The other: 30%+ of "successful" patches on the benchmark pass tests by recalling memorized solutions from training data, not by reasoning about the problem. Both claims have evidence. Understanding which is which requires understanding the benchmark's architecture at the engineering level, not the marketing level.

SWE-bench (Jimenez et al., ICLR 2024 oral, arXiv:2310.06770) is a benchmark for evaluating language models on real-world software engineering tasks collected from GitHub. Given a codebase and an issue description, a model generates a patch. The benchmark's evaluation harness applies that patch in a Docker container and runs the repository's original test suite to determine if the issue is resolved. This sounds simple. The engineering required to make it reproducible, valid, and resistant to gaming is the story.

This newsletter dissects SWE-bench as an infrastructure document: how the Docker-based evaluation harness works, what fail-to-pass and pass-to-pass invariants mean in practice, what the contamination research has actually measured, what "Saving SWE-bench" through mutation testing means, and what the SWE-bench++ generation framework implies for the future of software engineering benchmarks.

Scope: SWE-bench original (arXiv:2310.06770), Verified, Lite, the evaluation harness, SWE-bench+ (arXiv:2410.06992), SWE-bench++ (arXiv:2512.17419), the "Saving SWE-bench" mutation paper (arXiv:2510.08996), SWE Context Bench (arXiv:2602.08316), and the leaderboard dissection (arXiv:2506.17208). Not covered: specific agent architectures (SWE-agent, Agentless) in depth, or SWE-bench Multimodal beyond brief mention.

What It Actually Does

SWE-bench provides 2,294 task instances (full benchmark) and three curated splits:

  • SWE-bench Lite: 300 instances, curated to exclude overly large or ambiguous tasks. The most-used split for ablations.

  • SWE-bench Verified (August 2024): 500 instances manually confirmed solvable by human software engineers in collaboration with OpenAI Preparedness. This is the current standard split for serious submissions.

  • SWE-bench Multimodal: extends tasks to include visual elements (screenshots, error UIs), private test split evaluated via sb-cli.

Task instance structure: Each instance contains:

  • instance_id: unique identifier (e.g., django__django-12345)

  • repo: the GitHub repository name

  • version: the repository version the issue targets (e.g., 4.2); the exact buggy commit is stored separately as base_commit

  • problem_statement: the GitHub issue text

  • hints_text: additional context from issue comments posted before the fix's first commit

  • patch: the gold-standard human-authored fix (hidden from the model)

  • test_patch: a test that fails before the fix and passes after (hidden from the model)

  • FAIL_TO_PASS: tests that must change from failing to passing

  • PASS_TO_PASS: tests that must continue to pass (regression prevention)

The metric is resolution rate: an instance counts as resolved only if every FAIL_TO_PASS test now passes and every PASS_TO_PASS test still passes. Both conditions must hold. A patch that makes the target tests pass but breaks unrelated functionality fails.

Repositories represented (original benchmark): 12 Python repositories including Django, Flask, Scikit-learn, Matplotlib, Pylint, Pytest, Requests, Seaborn, Sphinx, Sympy, Astropy, and xarray. These were selected for test coverage quality and active issue management.
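A quick way to inspect that composition directly (a sketch; it assumes the full split is published under the same Hugging Face org as the Verified split used later in this piece):

# Count task instances per repository in the full benchmark.
# ASSUMPTION: dataset name mirrors the Verified naming used below.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("SWE-bench/SWE-bench", split="test")
print(Counter(row["repo"] for row in ds).most_common())
# django/django is the largest contributor by a wide margin.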

The Architecture, Unpacked

The key architectural decision is the hidden test_patch. The model cannot see which specific test it needs to pass. This is what makes SWE-bench harder than HumanEval: the model cannot work backward from the test. It must understand the issue, locate the bug, and produce a correct fix by reasoning about the codebase, not by targeting a visible test.

The Code, Annotated

Snippet One: Task Instance Loading and Evaluation Harness

import json
import subprocess

from datasets import load_dataset

# ← Load SWE-bench Verified: the 500-instance human-confirmed split
# This is the correct split to use for serious submissions.
# SWE-bench Lite (300 instances) is faster but less representative.
# Full SWE-bench (2,294 instances) is for ablations and analysis only.
dataset = load_dataset("SWE-bench/SWE-bench_Verified", split="test")

# Task instance structure (what the agent sees):
for instance in dataset.select(range(1)):
    print(instance['instance_id'])       # e.g., "django__django-12345"
    print(instance['repo'])              # "django/django"
    print(instance['version'])           # "4.2" (release series of the buggy code)
    print(instance['problem_statement']) # The GitHub issue text
    # NOT available to agent: instance['patch'], instance['test_patch']
    # ← hiding test_patch is the key design decision that prevents
    #   agents from cheating by reverse-engineering the test

# ← THIS is what the agent must produce: a git-format patch
# The format matters: the harness applies it via `git apply`
# Incorrect formatting causes harness errors, not just wrong answers
# Each prediction record needs instance_id, model_name_or_path, and model_patch
example_predictions = [
    {
        "instance_id": "django__django-12345",
        "model_name_or_path": "my-agent-v1",
        "model_patch": """\
--- a/django/utils/html.py
+++ b/django/utils/html.py
@@ -50,7 +50,7 @@ def format_html(format_string, *args, **kwargs):
-    return format_string.format(*args, **kwargs)
+    return format_string.format(*map(conditional_escape, args), **kwargs)
""",
    },
]

# ← Save predictions as JSON (a list of records, one per instance)
with open("model_patches.json", "w") as f:
    json.dump(example_predictions, f)

# ← Run evaluation: this launches Docker containers per instance
# Each container = one isolated evaluation environment
# --max_workers controls parallelism (limited by available RAM + CPUs)
# Rule of thumb: each container needs ~2-4GB RAM during test execution
result = subprocess.run([
    "python", "-m", "swebench.harness.run_evaluation",
    "--predictions_path", "model_patches.json",
    "--dataset_name", "SWE-bench/SWE-bench_Verified",
    "--max_workers", "4",   # ← 4 parallel Docker containers
    "--run_id", "my_evaluation",
], capture_output=True, text=True)

# The harness writes a summary report named {model_name_or_path}.{run_id}.json
# plus per-instance logs under logs/run_evaluation/{run_id}/
# The report lists resolved/unresolved instances and per-test status

The --max_workers parameter is the primary compute lever. Each worker runs one Docker container with a full test suite. SWE-bench Verified at 500 instances with 4 workers takes approximately 4-6 hours on a modern server. At 1 worker: ~20 hours. The original benchmark at 2,294 instances: multiply accordingly.
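A back-of-envelope check on those wall-clock numbers (a sketch; the ~3-minute average per instance is an assumption at the slow end of the 4-6 hour figure above, not a measured constant):

# Rough evaluation wall-clock estimate.
# ASSUMPTION: ~3 minutes average per instance (image setup + test run);
# real times vary widely by repository (django and sympy suites run long).
instances = 500          # SWE-bench Verified
workers = 4              # parallel Docker containers
minutes_per_instance = 3

total_hours = instances * minutes_per_instance / workers / 60
print(f"Estimated wall-clock: ~{total_hours:.1f} hours")  # ~6.2 hours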

Snippet Two: The Fail-to-Pass / Pass-to-Pass Invariant in Detail

# This is what the harness actually checks for each prediction.
# Source: swebench/harness/grading.py (reconstructed from docs + paper)

def grade_instance(
    instance: dict,
    test_output: dict,
) -> dict:
    """
    Determine if a model patch resolves the issue.

    Two conditions must BOTH be satisfied:
    1. FAIL_TO_PASS: tests that were failing with the bug now pass with the fix
    2. PASS_TO_PASS: tests that were passing with the bug still pass with the fix

    ← FAIL_TO_PASS ensures the fix actually addresses the reported issue.
    ← PASS_TO_PASS ensures the fix doesn't break existing functionality.
    Both are necessary. Either alone is insufficient to claim resolution.
    """
    fail_to_pass_tests = instance['FAIL_TO_PASS']  # list of test IDs
    pass_to_pass_tests = instance['PASS_TO_PASS']  # list of test IDs

    # Parse test output: which tests passed
    passed_tests = {t for t, status in test_output['tests'].items() if status == 'PASS'}

    # Condition 1: FAIL_TO_PASS invariant
    # ← all tests that were failing due to the bug must now pass
    # If even ONE test in fail_to_pass still fails → not resolved
    fail_to_pass_satisfied = all(
        t in passed_tests for t in fail_to_pass_tests
    )

    # Condition 2: PASS_TO_PASS invariant
    # ← all tests that were passing before must STILL pass
    # A patch that fixes the bug but breaks other functionality fails here
    # ← THIS is what catches over-fitted patches that pass the target test
    #   by deleting the functionality being tested (a real failure mode)
    pass_to_pass_satisfied = all(
        t in passed_tests for t in pass_to_pass_tests
    )

    resolved = fail_to_pass_satisfied and pass_to_pass_satisfied

    return {
        'resolved': resolved,
        'fail_to_pass_satisfied': fail_to_pass_satisfied,
        'pass_to_pass_satisfied': pass_to_pass_satisfied,
        'fail_to_pass_tests': fail_to_pass_tests,
        'pass_to_pass_tests': pass_to_pass_tests,
    }

# ← THE CONTAMINATION PROBLEM (from arXiv:2510.08996 + empirical analysis):
# Some patches "pass" via memorization, not reasoning.
# Evidence: LLMs can identify the correct file to edit at up to 76% accuracy
# on SWE-bench tasks using ONLY the issue title (no codebase access).
# This suggests file-level localization is partially a recall task,
# not purely a reasoning task on the specific codebase.

# ← THE MUTATION FIX (Saving SWE-bench, arXiv:2510.08996):
# Apply semantic-preserving mutations to the gold patches to create
# variants that pass the same tests but have different code structure.
# Models that "memorized" the original fix cannot pass mutations.
# True reasoners can.
def create_mutation(gold_patch: str, mutation_type: str) -> str:
    """
    Create a semantically equivalent but syntactically different patch.
    Mutation types: variable renaming, loop refactoring, equivalent operators.
    ← THIS is the test: can the agent fix the issue, or did it recall the patch?
    If an agent resolves the original but not the mutation, it memorized.
    If it resolves both, it understood the problem.
    """
    # ... mutation logic elided; see the standalone sketch below
    raise NotImplementedError

The PASS_TO_PASS invariant catches the most insidious failure mode: patches that delete functionality to make the specific test pass. The mutation approach in arXiv:2510.08996 catches a different failure mode: patches that reproduce the memorized human fix without understanding the bug.
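To make the mutation idea concrete, here is a minimal, self-contained sketch of one mutation type (variable renaming) using Python's ast module. This illustrates the concept only; it is not the paper's implementation, and it operates on source text rather than on diff hunks:

import ast

class RenameLocal(ast.NodeTransformer):
    """Rename one identifier everywhere it appears; semantics unchanged."""
    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node

# A toy "gold fix" whose behavior we keep while changing its surface form
fixed_source = '''
def visible_columns(columns, deferred):
    kept = [c for c in columns if c not in deferred]
    return kept
'''

mutated = RenameLocal("kept", "remaining").visit(ast.parse(fixed_source))
print(ast.unparse(mutated))
# Behavior is identical, so behavioral tests still pass; a memorized
# patch string no longer matches, so recall alone no longer suffices.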

In Action: End-to-End Worked Example

Input: Task instance django__django-15061

Problem statement (what the agent sees):

Bug: QuerySet.select_related() doesn't work with deferred fields
When using defer() followed by select_related(), the deferred fields
are included in the SELECT query even when they should be deferred.
Steps to reproduce:
  qs = Article.objects.defer('headline').select_related('reporter')
  # Expected: headline not in SQL
  # Actual: headline included in SQL query

Agent's task: Navigate the Django codebase (~600,000 lines), locate the bug in the ORM query compiler, produce a correct patch.

Step 1: Agent localizes the bug (typical agent approach)

# Agent uses bash/grep to navigate codebase
grep -r "select_related" django/db/models/query.py | head -20
grep -r "deferred_to_data\|deferred_fields" django/db/models/sql/compiler.py
# Identifies: django/db/models/sql/compiler.py, get_select_from_parent() method

Step 2: Agent proposes a patch

--- a/django/db/models/sql/compiler.py
+++ b/django/db/models/sql/compiler.py
@@ -1142,2 +1142,6 @@ class SQLCompiler:
     def get_select_from_parent(self, obj):
-        return [field for field in obj._meta.concrete_fields]
+        deferred = self.query.deferred_loading[0]
+        return [
+            field for field in obj._meta.concrete_fields
+            if field.name not in deferred
+        ]

Step 3: Harness evaluation

Instance: django__django-15061
Docker image: swebench/sweb.eval.x86_64.django__django-15061:latest
Apply patch: git apply model_patch → success

Run FAIL_TO_PASS tests:
  tests/queryset_pickle/tests.py::PickleTests::test_select_related_defer
  → PASS ✓

Run PASS_TO_PASS tests (subset):
  tests/queryset_pickle/tests.py::PickleTests::test_add_q
  → PASS ✓
  tests/db_models/tests.py::SelectRelatedTests::test_field_none
  → FAIL ✗  ← regression: fix broke a related test

Result: resolved = False (PASS_TO_PASS violated)
Reason: The patch fixes the reported case, but filtering on field.name
  (rather than attname) also drops columns that the broader
  select_related logic still needs.

This is the correct outcome. A patch that resolves the reported issue but introduces a regression is not a valid fix. The agent must iterate.

Step 4: Agent iterates, produces a more precise fix

--- a/django/db/models/sql/compiler.py
+++ b/django/db/models/sql/compiler.py
@@ -1142,6 +1142,11 @@ class SQLCompiler:
     def get_select_from_parent(self, obj):
-        return [field for field in obj._meta.concrete_fields]
+        deferred = self.query.deferred_loading[0] if self.query.deferred_loading else set()
+        return [
+            field for field in obj._meta.concrete_fields
+            if field.attname not in deferred
+        ]

Step 5: Final evaluation result

Instance: django__django-15061
Apply patch: success
FAIL_TO_PASS: tests/queryset_pickle/tests.py::PickleTests::test_select_related_defer → PASS ✓
PASS_TO_PASS (all 47 tests in scope): ALL PASS ✓

Result: resolved = True
Counted toward resolution rate: Yes

Typical resolution timing: For a capable agent on a medium-complexity Django issue: localization 2-5 minutes, first patch attempt 3-8 minutes, iteration 5-15 minutes, total 10-30 minutes per instance.

Why This Design Works, and What It Trades Away

The Docker-based evaluation harness is the correct engineering choice for a benchmark that claims reproducibility. Software test environments are notoriously brittle: dependency versions, OS-level differences, Python version behavior, and floating point behavior can all affect test outcomes. The Docker image locks all of these at the exact state when the issue was filed. The harness checks both fail-to-pass and pass-to-pass invariants, preventing two distinct classes of patch quality failures.
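To see that pinning from the outside, you can pull one instance image and poke at it directly. A minimal sketch using the Docker SDK for Python (pip install docker); the image tag follows the pattern shown in the worked example above, and exact naming is an assumption that can differ across harness versions and architectures:

# Inspect one pinned evaluation environment directly.
# ASSUMPTION: tag pattern as in the worked example; real published tags
# may differ by harness version and CPU architecture.
import docker

client = docker.from_env()
image = "swebench/sweb.eval.x86_64.django__django-15061:latest"
client.images.pull(image)

# Everything inside is frozen at issue-filing time: OS packages, the
# Python toolchain, dependency pins, and the repository checkout itself.
out = client.containers.run(image, "python --version", remove=True)
print(out.decode().strip())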

The hidden test_patch design is the correct choice for measuring reasoning ability, not test-targeting ability. An agent that can see the failing test can potentially reverse-engineer the fix from the test assertions without truly understanding the bug. Hiding the test forces genuine comprehension of the issue description and codebase.

The SWE-bench Verified split's manual human confirmation is the correct meta-benchmark design. The original full benchmark contains many issues where the associated tests are ambiguous, the expected behavior is unclear, or the "correct" fix is debatable. Human engineers confirmed each of the 500 Verified instances is genuinely solvable with a clearly correct solution. This reduces the noise in evaluation that comes from inherently ambiguous tasks.

What SWE-bench trades away:

Contamination resistance. SWE-bench Verified overlaps significantly with LLM pretraining data: models localize bug locations 3-6x more accurately on this benchmark than on held-out or decontaminated sets, and reported contamination rates reach 8-10%. This suggests that apparent generalization may instead reflect training recall. Frameworks such as UTBoost and PatchDiff have shown that leaderboard success rates may be inflated by 6-7 absolute percentage points due to latent test inadequacies and behavioral divergences between model and human patches.

Language diversity. All 12 original repositories are Python. Java (SWE-bench-java-verified, 91 instances) and other languages are emerging but not yet at the scale of the Python benchmark.

Real-time dynamics. Static benchmarks cannot capture the ongoing flow of new issues. SWE-bench-Live (arXiv:2505.23419) addresses this with continuously updated tasks from post-cutoff GitHub issues. The performance gap is stark: the same agent setup achieves 43.20% on SWE-bench Verified but only 19.25% on SWE-bench-Live, confirming that a significant portion of benchmark performance comes from training data overlap.

The Contamination Problem, Precisely

The contamination critique requires precision. It has three distinct components that the community often conflates:

Component 1: Solution leakage. Some GitHub issues include the fix in their comment thread (a maintainer posts "try this patch"). An LLM that saw this during training can recall the fix without reasoning. Estimate: without further filtering, direct solution leakage accounts for 30%+ of successful "passes".

Component 2: File-level localization memorization. LLMs can identify the correct file to edit at up to 76% accuracy using only the issue title. This means file localization, which was thought to require codebase reasoning, is partially a recall task. A model that "knows" bugs in Django's ORM tend to be in django/db/models/sql/compiler.py has an advantage not available to a model reasoning purely from the issue.
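A sketch of how such a measurement works in practice (ask_model is a hypothetical stand-in for any LLM call, not a real API; gold-file extraction is plain unified-diff parsing):

import re

def gold_files(patch: str) -> set[str]:
    """Files touched by the gold patch, read from unified-diff headers."""
    return set(re.findall(r"^\+\+\+ b/(\S+)", patch, flags=re.MULTILINE))

def title_only_localization(instance: dict, ask_model) -> bool:
    """Can the model name the right file from the issue title ALONE?

    ask_model: hypothetical callable (prompt: str) -> str. The model gets
    no codebase access, so a hit is evidence of recall, not reasoning.
    """
    title = instance["problem_statement"].splitlines()[0]
    prompt = (
        f"Repository: {instance['repo']}. Issue title: {title}. "
        "Name the single file most likely to need the fix; reply with a path only."
    )
    return ask_model(prompt).strip() in gold_files(instance["patch"])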

Component 3: Test suite insufficiency. The UTBoost framework reveals that approximately 41% of Lite and 24% of Verified leaderboard entries were mis-scored due to inadequate or incorrectly parsed test suites, affecting up to 345 unique patch assessments. Some patches "resolve" the benchmark task but are actually wrong fixes that happen to pass the limited test suite.

The mutation approach in "Saving SWE-bench" (arXiv:2510.08996) addresses components 1 and 2 by creating semantically equivalent patches that cannot be recalled. A mutation renames variables, refactors loops, or substitutes equivalent operators. The gold test still passes (semantic equivalence), but a memorized patch fails (syntactic difference). An agent that resolves 50% of original instances but only 30% of their mutations is carrying significant memorization contamination.

The Leaderboard Dissection

The first comprehensive analysis of all SWE-bench submissions (Martinez et al., arXiv:2506.17208) covers 79 Lite and 99 Verified entries representing 80 unique approaches. Key findings:

The findings reveal the dominance of proprietary LLMs (especially Claude 3.5), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.

In SWE-bench Verified, entries associated with publicly available products (PAP) have consistently achieved state-of-the-art results to date. The trend in SWE-bench Lite differs: since early 2025, few PAP-related entries have been published, with the majority originating from non-commercial systems (NCS).

No single workflow dominates; high-performing systems typically blend elements (retrieval, orchestration, self-critique).

The submission process does not require detailed documentation, meaning the architectural design and origin of many solutions remain unclear. This is a governance gap that limits the scientific value of the leaderboard for understanding what actually works.

Technical Moats

The Docker image build pipeline is the most underappreciated technical contribution. Each of the 2,294 original task instances required building a Docker image that precisely captures the repository state at the exact commit when the bug was filed, with pinned dependency versions and the repository's build toolchain. This is non-trivial: Python package versions from 2018-2023 have complex dependency resolution, some dependencies are no longer on PyPI, and some build systems require specific OS-level packages. The SWE-bench team's work building and maintaining these images is what makes reproducible evaluation possible.

The FAIL_TO_PASS test mining pipeline. Identifying task instances where a human-authored PR (a) fixes a GitHub issue (b) with an associated test that (c) was failing before the PR and passes after is a complex multi-stage data collection process involving GitHub API access, automated test execution, and manual validation. The SWE-bench++ framework (arXiv:2512.17419) extends this to scalably generate new benchmark instances from any open-source repository.
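Conceptually, the mining step reduces to differencing test outcomes before and after the human PR. A simplified sketch (run_test_suite is a hypothetical helper mapping a checkout to {test_id: 'PASS' | 'FAIL'}; the real pipeline also filters on issue linkage and executes inside the pinned Docker environment):

def mine_test_invariants(run_test_suite, base_checkout, pr_checkout):
    """Derive FAIL_TO_PASS / PASS_TO_PASS for one candidate instance."""
    before = run_test_suite(base_checkout)  # buggy state at base_commit
    after = run_test_suite(pr_checkout)     # state with the human PR applied

    fail_to_pass = sorted(
        t for t, s in after.items() if s == "PASS" and before.get(t) == "FAIL"
    )
    pass_to_pass = sorted(
        t for t, s in after.items() if s == "PASS" and before.get(t) == "PASS"
    )
    # Keep the candidate only if at least one test flips fail -> pass.
    return (fail_to_pass, pass_to_pass) if fail_to_pass else None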

SWE-bench Verified's human validation. Confirming that 500 tasks are genuinely solvable required software engineers to attempt each task and verify that a correct solution exists. This is expensive to replicate at scale, which is why automated approaches like SWE-bench++ are necessary for expanding the benchmark's scope.

Insights

Insight One: SWE-bench measures agent-harness performance, not model performance, and the community has only recently started separating these cleanly.

The evaluation harness (file navigation strategy, tool use, iterative refinement approach) accounts for a substantial fraction of the benchmark result. The same underlying LLM with a better agentic harness can achieve dramatically higher resolution rates. This means SWE-bench scores are not comparable across submissions unless the harness is held constant. The recent trend toward reporting results with standardized minimal harnesses (like mini-swe-agent) is the correct methodology, but it is not yet universal. A submission reporting 70% Verified with a highly optimized multi-agent system is not comparable to a submission reporting 55% with a simple bash-only harness, even if both use the same LLM.

Insight Two: The benchmark's 12-repository Python focus was a deliberate, defensible tradeoff, and it is now the primary obstacle to measuring real-world software engineering capability.

The choice to focus on 12 popular Python repositories with high test coverage was the correct decision for a research benchmark published in 2023. It enabled reproducible evaluation infrastructure, consistent measurement, and a manageable data collection process. In 2026, it is the primary limitation. These repositories have been in LLM training data for years. Their bug patterns are well-represented in the pretraining corpus. SWE-bench-Live, SWE-bench Pro, and SWE-bench++ all exist specifically because the community recognizes this limitation. The field is not abandoning SWE-bench; it is building around its known limitations, which is the correct response.

Takeaway

OpenHands with Claude 3.7 Sonnet achieves 43.20% on SWE-bench Verified but only 19.25% on SWE-bench-Live: the same system on genuinely novel tasks performs at less than half its benchmark score.

SWE-bench-Live uses GitHub issues filed and resolved after model training cutoffs, making solution recall impossible. The gap between 43.20% and 19.25% is not fully explained by the higher difficulty of fresh tasks. It is partially explained by the degree to which SWE-bench Verified performance reflects training data overlap rather than generalization. This does not mean SWE-bench Verified scores are meaningless. It means they measure a mixture of reasoning and recall, and the current best estimate is that the memorization component inflates scores by 6-7 absolute percentage points at the benchmark level, and up to 19+ percentage points for specific agents on specific tasks.

TL;DR For Engineers

  • SWE-bench evaluates models by applying their generated patches in Docker containers and running the repository's original test suite. Resolution requires both FAIL_TO_PASS (the bug is fixed) and PASS_TO_PASS (no regressions). The test_patch is hidden from the model to prevent test-targeting cheating.

  • Use SWE-bench Verified (500 instances, human-confirmed solvable) for serious submissions. SWE-bench Lite (300 instances) for fast ablations. Avoid comparing scores across different agent harnesses: the harness accounts for a substantial fraction of the result.

  • Contamination is real and measured: scores are inflated by 6-7 absolute percentage points from test inadequacies (UTBoost finding), and 30%+ of successful passes may involve solution recall. SWE-bench-Live (post-cutoff issues) halves the performance of top agents, quantifying the memorization effect.

  • The mutation approach (arXiv:2510.08996) provides the cleanest contamination test: create semantically equivalent patches with different syntax. Agents that fail mutations but pass originals are memorizing.

  • SWE-bench++ (arXiv:2512.17419) provides the framework to generate new benchmark instances at scale from any OSS repository, addressing the 12-repository Python bottleneck. SWE-bench-java-verified (91 instances) is the first non-Python extension.

The Benchmark Is Working. The Benchmark Is Broken. Both Statements Are Correct.

SWE-bench did what good benchmarks do: it created a concrete, reproducible evaluation standard that focused the field's attention, produced real engineering innovations (better agents, better harnesses, better file localization), and revealed the gap between what LLMs claim to do and what they actually do when faced with real codebases and real tests.

It is also saturating in its current form. Top agents exceed 75% on Verified. Contamination has been measured and is not negligible. The repository set is too small and too thoroughly covered by training data to measure genuine generalization. The field knows this, and the response (SWE-bench-Live, SWE-bench Pro, SWE-bench++, mutation testing, multi-language extensions) is the correct one: not to replace the benchmark, but to extend it, stress-test it, and build next-generation versions that preserve what works and address what does not. This is how scientific measurement tools should evolve.

References

SWE-bench (Jimenez et al., ICLR 2024, arXiv:2310.06770) evaluates LLMs on real-world GitHub issues from 12 Python repositories by applying model-generated patches in Docker containers and checking both FAIL_TO_PASS (target tests now pass) and PASS_TO_PASS (regressions prevented) invariants against the repository's original test suite. The 500-instance Verified split (human-confirmed solvable) is the current standard. The benchmark has two documented validity concerns: contamination (scores inflated 6-7 absolute percentage points from test inadequacies; top agents achieve 43.20% on Verified but only 19.25% on SWE-bench-Live with post-cutoff issues), and harness confounding (the agentic scaffold accounts for a substantial fraction of results, making cross-submission comparisons invalid unless the harness is held constant). The community response (SWE-bench++, SWE-bench-Live, mutation testing per arXiv:2510.08996, and multi-language extensions) represents the correct approach to evolving an infrastructure-heavy benchmark while preserving its reproducibility guarantees.

