SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 1, 2026
The conventional assumption about AI coding agents is that they are powerful for self-contained tasks (write this function, fix this bug, generate a test) and fall apart on complex, long-horizon projects. Nicholas Carlini, researcher on Anthropic's Safeguards team, designed an experiment to test exactly where that ceiling is. He gave 16 parallel Claude Opus 4.6 instances a clean-room C compiler implementation task, walked away for two weeks, and documented everything that went wrong and everything that worked.
The result: a 100,000-line, dependency-free, Rust-based C compiler capable of building Linux 6.9 on x86, ARM, and RISC-V, passing 99% of the GCC torture test suite, and compiling QEMU, FFmpeg, SQLite, PostgreSQL, Redis, and Doom. The compiler is limited in specific documented ways. It is also the most ambitious autonomous agent project published to date by any major AI lab.
This newsletter dissects the agent team harness as an engineering system: the infinite loop scaffold, the parallel work coordination mechanism, the context window design principles, and the specific failure modes that reveal the limits of current multi-agent development.
Scope: the agent harness design (not the compiler's internal architecture), the coordination protocol, test harness design principles, parallelization strategies, and the documented limitations. Not covered: the C compiler's internal IR design, or SWE-bench/AgentBench methodology beyond context.
Nicholas Carlini is a prominent researcher in AI security and adversarial machine learning. He currently works at Anthropic, where he focuses on identifying vulnerabilities in advanced AI systems and stress-testing models for safety risks.
Before joining Anthropic, he was a researcher at Google DeepMind, contributing to foundational work on adversarial attacks and model robustness. Carlini earned his PhD in computer science from the University of California, Berkeley, where his research helped shape the field of adversarial machine learning.
He is widely known for co-developing the Carlini & Wagner attack, a breakthrough method that exposed weaknesses in supposedly secure neural networks. His work has played a key role in advancing how researchers evaluate and improve the safety, privacy, and reliability of modern AI systems.
What It Actually Does
Claude's C Compiler is a Rust implementation of a C compiler, written entirely by Claude Opus 4.6 agent instances, with no internet access, depending only on the Rust standard library. The agent team ran for approximately two weeks, consuming:
~2,000 Claude Code sessions
~2 billion input tokens
~140 million output tokens
~$20,000 in API costs
The compiler passes 99% of the GCC torture test suite (a comprehensive compiler stress test of edge cases), compiles Linux 6.9 on x86-64, ARM, and RISC-V, and successfully builds QEMU, FFmpeg, SQLite, PostgreSQL, Redis, Lua, libjpeg, MQuickJS, and Doom.
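Averaged across the run, those totals work out to roughly one million input tokens, 70,000 output tokens, and about $10 per Claude Code session. A quick sanity check of that arithmetic, derived only from the figures above:
# Per-session averages derived from the reported totals (nothing beyond arithmetic).
sessions = 2_000
input_tokens = 2_000_000_000
output_tokens = 140_000_000
cost_usd = 20_000

print(f"input tokens per session:  {input_tokens / sessions:,.0f}")   # ~1,000,000
print(f"output tokens per session: {output_tokens / sessions:,.0f}")  # ~70,000
print(f"cost per session:          ${cost_usd / sessions:,.2f}")      # ~$10.00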
What the compiler cannot do (documented failures, not speculation):
No 16-bit x86 code generator. The 16-bit real mode boot sequence for Linux x86 requires code under 32KB; Claude's implementation via 66/67 opcode prefixes generates over 60KB. For x86 boot, it delegates to GCC. ARM and RISC-V compile fully independently.
No own assembler or linker. These components were started but remain incomplete. The demonstration uses GCC's assembler and linker.
Generated code is less efficient than GCC with all optimizations disabled, even with all of Claude's optimizations enabled.
Not a drop-in GCC replacement: builds many projects but not all.
The cost comparison is explicit in the blog: "$20,000 is a fraction of what it would cost to produce this myself, let alone an entire team." This is the economic argument for autonomous agent development on long-horizon tasks.
The Architecture, Unpacked

Focus on the git-as-coordination-primitive. There is no orchestration agent, no message queue, no shared memory. Agents coordinate through git commits, lock files in current_tasks/, and the test harness results. The test harness is the actual communication channel between agents and their goals.
The coordination model is deliberately minimal: each Claude session starts in a fresh container, reads the current repository state, picks the next obvious task, works on it, and pushes. The git synchronization handles task conflict resolution. The test harness provides the quality signal. There is no high-level goal management.
The Code
Snippet One: The Infinite Agent Loop (the complete harness)
#!/bin/bash
# The entire agent harness. This is not pseudocode.
# Source: https://www.anthropic.com/engineering/building-c-compiler
while true; do
  # ← Every iteration creates a fresh Claude Code session.
  # COMMIT-based log naming means we can trace which code state
  # each session was working on — critical for post-hoc debugging.
  COMMIT=$(git rev-parse --short=6 HEAD)
  LOGFILE="agent_logs/agent_${COMMIT}.log"

  # ← --dangerously-skip-permissions: allows Claude Code to run
  # arbitrary shell commands without per-command approval.
  # This is the "run in a container, not your actual machine" warning.
  # Without this, the agent would block waiting for human approval
  # on every file write, every test execution, every git command.
  # With this, the agent is fully autonomous. Real risk: intentional
  # or accidental damage to the container (see: pkill -9 bash incident).
  # ← The prompt file passed via -p carries all task context in one file.
  claude --dangerously-skip-permissions \
    -p "$(cat AGENT_PROMPT.md)" \
    --model claude-opus-X-Y &> "$LOGFILE"

  # ← Session ends when Claude decides it's done.
  # Loop immediately spawns a new session in a fresh container.
  # Fresh container = no context carryover from previous session.
  # Each session orients itself from scratch via READMEs + test results.
done
This is 8 lines. The orchestration is the test harness, not the loop. The loop is only useful because Claude can tell how to make progress, and it can tell because the test harness communicates results clearly.
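The blog is clear that each session runs in a fresh container but does not publish the container wiring. A minimal sketch of one way to wrap the loop, assuming Docker and a hypothetical local image named agent-sandbox with the Claude Code CLI preinstalled (not the project's actual setup):
# Sketch only: run each session inside a throwaway Docker container.
# "agent-sandbox" is a hypothetical image name, not from Carlini's harness.
import datetime
import pathlib
import subprocess

REPO = pathlib.Path.cwd()

def run_one_session() -> None:
    commit = subprocess.run(
        ["git", "rev-parse", "--short=6", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    log_dir = REPO / "agent_logs"
    log_dir.mkdir(exist_ok=True)
    stamp = datetime.datetime.now().strftime("%H%M%S")
    logfile = log_dir / f"agent_{commit}_{stamp}.log"
    with open(logfile, "w") as out:
        # --rm destroys the container after the session; the agent only ever
        # sees the mounted repo checkout, never the rest of the host.
        subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{REPO}:/work", "-w", "/work",
             "agent-sandbox",
             "claude", "--dangerously-skip-permissions",
             "-p", (REPO / "AGENT_PROMPT.md").read_text()],
            stdout=out, stderr=subprocess.STDOUT,
        )

if __name__ == "__main__":
    while True:  # same infinite loop as the bash harness above
        run_one_session()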
Snippet Two: The Lock-Based Task Coordination Protocol
# Task coordination via git-synchronized text files.
# No orchestration agent. No RPC. No message queue.
# This is inside AGENT_PROMPT.md (the agent's operating instructions).
# AGENT INSTRUCTIONS (paraphrased from Carlini's description):
# Step 1: Find an available task
# ← List all .txt files in current_tasks/ that don't have a corresponding lock
# If parse_if_stmt.txt exists but parse_if_stmt.lock does not, the task is available.
ls current_tasks/*.txt | while read TASK; do
  LOCKFILE="${TASK%.txt}.lock"
  if [ ! -f "$LOCKFILE" ]; then
    echo "Available: $TASK"
  fi
done
# Step 2: Claim a task by writing a lock file
# ← git push enforces mutual exclusion:
# If two agents write the same lock file, the second push fails.
# The losing agent sees the conflict and picks a different task.
echo "Agent $(hostname) claiming at $(date)" > current_tasks/parse_if_stmt.lock
git add current_tasks/parse_if_stmt.lock
git commit -m "claim: parse_if_stmt"
git push # ← If another agent already pushed this lock, this fails.
# Claude handles the failure by choosing a different task.
# Step 3: Work on the task
# ... (Claude writes code, runs tests, iterates)
# Step 4: Sync and release
git pull --rebase origin main # ← pull other agents' changes
# Merge conflicts: Claude resolves them autonomously. Frequently.
git add -A
git commit -m "feat: implement parse_if_stmt - passes torture tests 1234-1289"
git push
rm current_tasks/parse_if_stmt.lock
git add current_tasks/parse_if_stmt.lock
git commit -m "release lock: parse_if_stmt"
git push
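The bash above shows the happy path. The fallback behavior the instructions describe, losing the push race and moving on to a different task, is not spelled out as code in the blog; the following is a hypothetical Python sketch of that retry logic, not project code:
# Sketch only: claim a task by pushing a lock file; if another agent's push
# for the same lock lands first, ours fails, so drop the claim and move on.
import socket
import subprocess
from datetime import datetime, timezone
from pathlib import Path

TASK_DIR = Path("current_tasks")

def git(*args: str) -> subprocess.CompletedProcess:
    return subprocess.run(["git", *args], capture_output=True, text=True)

def claim_next_task() -> Path | None:
    for task in sorted(TASK_DIR.glob("*.txt")):
        lock = task.with_suffix(".lock")
        if lock.exists():
            continue  # already claimed by another agent
        lock.write_text(
            f"{socket.gethostname()} {datetime.now(timezone.utc).isoformat()}\n"
        )
        git("add", str(lock))
        git("commit", "-m", f"claim: {task.stem}")
        if git("push").returncode == 0:
            return task  # our lock landed first: the task is ours
        # Lost the race: discard our claim commit, resync, try the next task.
        git("reset", "--hard", "HEAD~1")
        git("pull", "--rebase")
    return None  # nothing claimable right now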
Snippet Three: Context Window Management in the Test Harness
# What the test harness outputs to Claude
# (design principles from Carlini's blog, not exact source code)
import os
import socket
import zlib
# compile_and_run, sample_tests, get_compiler_output, get_runtime_output
# are assumed harness helpers, not shown here.

# WRONG: verbose test output pollutes context window
def run_tests_naive(test_dir: str) -> str:
    results = []
    for test_file in os.listdir(test_dir):
        result = compile_and_run(test_file)
        results.append(f"Test {test_file}: {'PASS' if result.success else 'FAIL'}")
        results.append(f"  compiler output: {get_compiler_output(test_file)}")
        results.append(f"  runtime output: {get_runtime_output(test_file)}")
    return "\n".join(results)  # ← Thousands of bytes. Context window pollution.

# RIGHT: aggregate statistics + grep-able error format
def run_tests_optimized(test_dir: str, fast: bool = True) -> str:
    # ← --fast flag: 1% or 10% sample, deterministic per agent (seeded by hostname)
    # Each agent covers different tests; collectively they cover all.
    sample_rate = 0.01 if fast else 0.10
    # ← stable per-hostname seed (crc32 rather than Python's randomized hash()):
    # deterministic per agent, different across VMs
    seed = zlib.crc32(socket.gethostname().encode())
    tests = sample_tests(test_dir, rate=sample_rate, seed=seed)
    passed, failed = [], []
    for test in tests:
        result = compile_and_run(test)
        if result.success:
            passed.append(test)
        else:
            # ← ERROR on same line as reason: grep finds it instantly
            # Claude can: grep ERROR agent_logs/agent_abc123.log
            failed.append(f"ERROR {test.name}: {result.error_msg}")
    # ← Pre-computed aggregate: Claude doesn't recompute this
    summary = f"PASS: {len(passed)}/{len(tests)} ({len(passed)/len(tests)*100:.1f}%)"
    summary += f"\nFAIL: {len(failed)}"
    # ← Only print failures, not all passes. Passes add no information.
    output = summary + "\n" + "\n".join(failed[:20])  # first 20 failures only
    return output  # ← 5-10 lines instead of thousands
The --fast flag with deterministic per-agent seeding is the key design insight for parallelizing testing. Each agent tests a different random sample of the test suite. Together, they cover all tests. Individually, each agent avoids the context pollution of running the full suite and avoids duplicating another agent's work.
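The blog describes the sampling idea, not its implementation. A minimal sketch of what the sample_tests helper used above could look like under those principles (hypothetical, not project code):
# Sketch only: deterministic per-agent sampling of the test suite.
# Same hostname → same seed → the same subset on every run, so an agent's
# results stay comparable across its own iterations; different hostnames →
# different subsets, so the fleet collectively covers far more of the suite.
import os
import random

def sample_tests(test_dir: str, rate: float, seed: int) -> list[str]:
    all_tests = sorted(f for f in os.listdir(test_dir) if f.endswith(".c"))
    rng = random.Random(seed)                # seeded: reproducible per agent
    k = max(1, int(len(all_tests) * rate))   # e.g. 1% of the suite with --fast
    return rng.sample(all_tests, k)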
In Action: An End-to-End Worked Example
The Linux Kernel Parallelization Problem and Its Fix
The blog describes the most instructive failure mode in the project, and the fix for it.
The problem (agents reach 99% test suite pass rate, then get stuck on Linux):
Status: 99% pass rate on GCC torture tests (thousands of independent C programs)
Next goal: compile Linux 6.9 kernel
Each agent starts kernel compilation:
Agent 1: runs make on Linux 6.9 → fails at file arch/x86/kernel/head64.c → fixes bug
Agent 2: runs make on Linux 6.9 → fails at same file → fixes same bug (overwriting Agent 1)
Agent 3: runs make on Linux 6.9 → fails at same file → fixes same bug (overwriting Agent 2)
...
Agent N: same failure, same fix, same overwrite
Result: 16 agents working, zero net progress. Unlike the torture test suite
(hundreds of independent failing tests → each agent picks a different one),
the Linux kernel is ONE task. All agents hit the same bug. None can parallelize.
The fix: GCC oracle binary search
# New test harness for kernel compilation parallelization
# Source: Carlini's blog description, reconstructed
# Step 1: Compile MOST kernel files with GCC (known good), few with Claude's compiler
# The split is random but controlled: each run uses a different random seed
TOTAL_FILES=$(find linux-6.9 -name "*.c" | wc -l)
CLAUDE_FRACTION=0.10 # start with 10% of files compiled by Claude's compiler
# ← THIS is the oracle approach:
# If kernel boots correctly with 10% Claude-compiled files → bug not in those files
# If kernel fails → binary search: which Claude-compiled file causes the failure?
compile_kernel_split() {
  local claude_fraction=$1
  local seed=$AGENT_SEED  # each agent gets different seed → different file split
  find linux-6.9 -name "*.c" | sort \
    | awk -v frac="$claude_fraction" -v seed="$seed" \
        'BEGIN{srand(seed)} {if (rand() < frac) print "claude:" $0; else print "gcc:" $0}' \
    | while IFS=: read -r compiler file; do
        if [ "$compiler" = "claude" ]; then
          claude-cc -c "$file" -o "${file%.c}.o"
        else
          gcc -c "$file" -o "${file%.c}.o"
        fi
      done
  link_and_boot_kernel
}
# Each agent gets different random file assignment → parallel bug isolation
# Agent 1: fixes bug in arch/x86/kernel/head64.c (its assigned Claude-compiled file)
# Agent 2: fixes bug in mm/memory.c (its assigned Claude-compiled file)
# Agent 3: fixes bug in fs/ext4/super.c (its assigned Claude-compiled file)
# Result: genuine parallel progress across different files
Real numbers:
Before oracle fix:
Test suite: 99% pass rate (10,000 tests, each independent)
Kernel: 0 agents making progress (all hitting same bug)
Effective parallelism: 1 (despite 16 agents running)
After oracle fix:
Each agent gets different random file subset
Each agent's bugs are in different files → no overwrite conflicts
Delta debugging: pairs of files that fail together but work independently → identified
Final result: Claude's compiler builds ALL of Linux 6.9 without GCC oracle for kernel files
(except 16-bit real mode boot: delegates to GCC due to 60KB vs. 32KB size constraint)
Key insight: The test harness design is the orchestration.
Not an orchestration agent. The test harness.
Why This Design Works, and What It Trades Away
The infinite loop plus git-as-coordination-primitive is the correct minimal architecture for this use case because it makes exactly one assumption: Claude can read the current repository state and determine what to do next. Every other orchestration mechanism (message queues, orchestration agents, explicit task assignment) adds complexity and new failure modes without addressing this core assumption. If Claude can orient itself from the repo state and test results, the harness works. If it cannot, no orchestration layer will fix that.
The test harness as the primary coordination mechanism is the most important design insight in the blog. Carlini writes: "Most of my effort went into designing the environment around Claude, the tests, the environment, the feedback, so that it could orient itself without me." The test harness is not a quality gate. It is the communication channel between the agent and the task state. Bad test harness design (verbose output, ambiguous error messages, slow feedback loops) directly translates to agent confusion and wasted API calls.
The specialization of agents (deduplicator, performance agent, code quality agent, docs agent) is the correct use of parallelism once the primary development agents have reduced the marginal gain from additional generalist agents. At 99% pass rate, adding a seventeenth generalist agent produces minimal additional coverage. Adding an agent specialized in finding and removing duplicate code, or improving output code efficiency, produces targeted improvements that generalist agents would not prioritize.
What this approach trades away:
Code quality. Carlini explicitly states: "The Rust code quality is reasonable, but is nowhere near the quality of what an expert Rust programmer might produce." LLM-written code at this scale accumulates technical debt that is difficult to reverse. The code quality agent helped but did not close the gap.
Reliability near the ceiling. "New features and bugfixes frequently broke existing functionality." The CI pipeline was added specifically to address this, but the problem persisted (a sketch of what such a regression gate might look like follows this list). At the edges of the model's capability, autonomous agents cannot reliably maintain consistency across a 100,000-line codebase. This is the actual ceiling, not the specific missing features.
Verification. The generated code compiles Linux. It has not been audited for security vulnerabilities or undefined behavior. Carlini's background in penetration testing makes this concern explicit: "The thought of programmers deploying software they've never personally verified is a real concern."
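The blog mentions the CI pipeline but not its mechanics. As an assumption of one way a minimal regression gate could be wired into the harness (helper names and paths are hypothetical), an agent could diff the set of failing tests before pushing and refuse to push when previously passing tests break:
# Sketch only (assumed design, not from the blog): refuse to push when a change
# makes tests fail that were passing before. A real gate would run the full
# suite; --fast is shown here only for brevity.
import json
import subprocess
from pathlib import Path

BASELINE = Path("ci/known_failures.json")  # hypothetical path, committed to the repo

def failing_tests() -> set[str]:
    # Relies on the harness's grep-able "ERROR <test>: <reason>" output format.
    out = subprocess.run(
        ["./run_tests.sh", "--fast"], capture_output=True, text=True
    ).stdout
    return {
        line.split()[1].rstrip(":")
        for line in out.splitlines()
        if line.startswith("ERROR ")
    }

def regression_gate() -> bool:
    known = set(json.loads(BASELINE.read_text())) if BASELINE.exists() else set()
    now = failing_tests()
    regressions = now - known
    if regressions:
        print("REGRESSION, refusing to push:", ", ".join(sorted(regressions)))
        return False
    BASELINE.write_text(json.dumps(sorted(now)))  # ratchet: record remaining failures
    return True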
Technical Moats
The test harness is the hard part, not the loop. Anyone can write an infinite bash loop. Writing a test suite that enables Claude to orient itself without human intervention across 2,000 sessions, covering a 100,000-line codebase, on a project with one giant integration test (the Linux kernel build), required designing the GCC oracle binary search, the deterministic per-agent sampling, the grep-able error format, and the aggregate summary statistics. This is where most of the engineering effort went and where most attempts to replicate this would fail first.
The GCC oracle technique generalizes. For any project where there is a known-good reference implementation, the binary search oracle approach enables genuine parallelization of integration testing. Each agent gets a different random partition of the codebase to compile with the experimental compiler; the oracle detects which partition contains the bug; agents fix different bugs in different partitions. This technique is applicable to any "replace X with Y" development task where X is a working baseline.
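The blog describes the oracle and the binary search as a strategy rather than code. A minimal sketch of the bisection step under those assumptions, with a hypothetical build_and_boot helper standing in for the mixed reference/experimental build plus boot test:
# Sketch only: bisect the set of files compiled with the experimental compiler
# down to a single culprit, using the reference compiler as the oracle for
# everything else. Assumes one deterministic culprit; the pairs-of-files
# interactions mentioned in the worked example need the fuller delta-debugging
# treatment.
from typing import Callable, Sequence

def isolate_culprit(
    files: Sequence[str],
    build_and_boot: Callable[[Sequence[str]], bool],  # True = kernel boots
) -> str:
    assert not build_and_boot(files), "the full experimental set must fail"
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if not build_and_boot(half):
            candidates = half  # failure reproduces with the first half alone
        else:
            candidates = candidates[len(candidates) // 2:]  # culprit is in the rest
    return candidates[0]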
Opus 4.6 specifically crosses a capability threshold. Carlini tested the same task across the Claude 4 series. Previous Opus 4 models were "barely capable of producing a functional compiler." Opus 4.5 "was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but was still incapable of compiling any real large projects." The jump from "passes test suites" to "compiles Linux kernel" is not incremental. It is a threshold crossing. The same harness on an earlier model would not produce a comparable result.
Insights
Insight One: The $20,000 cost is not the point. The cost-effectiveness argument is, and the community discussion is getting it backwards.
Most commentary on this project focuses on the $20,000 API cost as either impressive (it's a bargain for a compiler) or alarming (AI development is expensive). The actual argument in the blog is more precise: "$20,000 is a fraction of what it would cost me to produce this myself, let alone an entire team." A senior compiler engineer costs $300,000 to $500,000 per year fully loaded. Two weeks of their time, plus supporting staff, for a project of this complexity would cost substantially more than $20,000, and would produce a more polished result. The comparison is not "AI vs. free." It is "AI vs. human expert team on the same timeline." At $20,000 for two weeks, the economic case for autonomous long-horizon agent development is already compelling for the right class of tasks, even with the current limitations.
Insight Two: The agent harness is not a product. It is a research methodology, and treating it as the former misses what Carlini is actually demonstrating.
The blog is explicit: "This project was designed as a capability benchmark. I am interested in stress-testing the limits of what LLMs can just barely achieve today in order to help us prepare for what models will reliably achieve in the future." The harness has serious limitations: no high-level goal management, no inter-agent communication beyond git, no orchestration layer. These are deliberate research choices to isolate model capability from scaffolding sophistication. The question being answered is: what can Opus 4.6 do with a minimal harness? The answer is "build a C compiler that boots Linux." The question not being answered is: what is the best harness design for autonomous software development? Those are different questions, and conflating them produces wrong conclusions in both directions.
Takeaway
The single most dangerous line in the entire experiment is claude --dangerously-skip-permissions, and the incident that reveals why is the one where Claude killed its own session with pkill -9 bash.
Carlini mentions this almost in passing: "in one instance, I did see Claude pkill -9 bash on accident, thus killing itself and ending the loop." This is a direct consequence of --dangerously-skip-permissions, which allows Claude to run arbitrary shell commands without approval. In a container, the blast radius is contained. On a real machine, an autonomous agent with this permission level and a process management bug could cause serious damage. The design decision to run in containers is not a best-practice suggestion. It is a safety requirement for this architecture. The flag's "dangerously" name and Carlini's warning to "run this in a container, not your actual machine" are both load-bearing constraints, not stylistic caveats.
TL;DR For Engineers
The agent harness is 8 lines of bash: an infinite loop running Claude Code with --dangerously-skip-permissions and a prompt file. The coordination mechanism is git lock files in current_tasks/. There is no orchestration agent, no message queue, no shared memory. Run in containers, not on your actual machine.
The test harness is the hard part. Context window management (grep-able ERROR format, aggregate statistics, --fast sampling), the GCC oracle binary search for kernel parallelization, and the CI pipeline for regression prevention all required significant engineering effort and are the actual differentiators.
Opus 4.6 crossed a threshold: Opus 4.5 could pass test suites but could not compile real projects. Opus 4.6 compiled Linux 6.9 on three architectures. The jump is not incremental.
Real limitations: no 16-bit x86 (delegates to GCC), no own assembler/linker, generated code is less efficient than GCC -O0, not a drop-in GCC replacement, breaks existing functionality when adding features near the capability ceiling.
Cost: $20,000 for 2,000 sessions over two weeks. 2 billion input tokens, 140 million output tokens. The economic case for long-horizon autonomous development is already compelling against human expert team alternatives for the right class of tasks.
The Ceiling Is Visible, and That Is the Most Valuable Part
Carlini built this project to find the ceiling, and he found it. Opus 4.6 can produce a 100,000-line Rust compiler that boots Linux on three architectures. It cannot implement a working 16-bit x86 code generator within the size constraint. It cannot reliably maintain consistency across the codebase as new features are added near its capability limits. The Rust code quality is below expert level. These are not vague "AI has limitations" disclaimers. They are specific, documented failure modes, identified by a researcher who was actively trying to push past them.
This is the correct way to evaluate autonomous agent capability: design the hardest task you can, run it until it fails, document exactly where and why it fails. The result is not "Claude can do anything" and not "Claude is useless for complex tasks." It is a precise capability envelope. That envelope is more useful than either of the vague alternatives, because it tells the next researcher exactly where to push.
References
Building a C compiler with a team of parallel Claudes, Nicholas Carlini, Anthropic Engineering Blog, February 5, 2026
Claude's C Compiler GitHub Repository, 100,000 lines of Rust, Apache-2.0
GCC Torture Tests documentation, the test suite the compiler passes at 99%
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770, Jimenez et al., 2023
AgentBench: Evaluating LLMs as Agents, arXiv:2308.03688, Liu et al., 2023
From Code Foundation Models to Agents and Applications, arXiv:2511.18538, survey of LLM coding agent landscape
An Overview of Distributed Multi-Agent Coordination, arXiv:1207.3231, coordination theory context
Claude Code documentation, the underlying tool used by each agent session
Nicholas Carlini (Anthropic) ran 16 parallel Claude Opus 4.6 instances for two weeks across ~2,000 Claude Code sessions at a cost of $20,000 to produce a 100,000-line, dependency-free, Rust-based C compiler capable of building Linux 6.9 on x86, ARM, and RISC-V, passing 99% of the GCC torture test suite. The agent harness is a minimal 8-line bash loop with git lock files for task coordination and no orchestration agent; the primary engineering effort went into designing the test harness to communicate results clearly to agents without context window pollution, including a GCC oracle binary search technique that enabled genuine parallel bug isolation during kernel compilation. The compiler has documented limitations (no 16-bit x86, no own assembler/linker, less efficient code generation than GCC -O0) that represent the current capability ceiling of Opus 4.6 on long-horizon autonomous development tasks.