SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 9, 2026
The conventional ML research workflow has a structural bottleneck that compute cannot fix. A researcher designs an experiment, runs it, waits for results, interprets them, designs the next experiment. The loop is serial. Sleep, meetings, context switching, and cognitive overhead impose a hard ceiling on how fast it turns. More GPUs do not help when the rate-limiting step is human judgment at each iteration.
Andrej Karpathy's autoresearch (43.3k stars, 6k forks, MIT, March 2026) removes the human from the experiment loop while keeping the human in control of the research direction. The distinction matters. The agent edits code, runs training, evaluates results, and decides what to keep. The human writes a Markdown file that defines what good research looks like, which directions to explore, and what constraints the agent must not violate. Then the human goes to sleep.
The result: approximately 12 experiments per hour, roughly 100 overnight on a single GPU. Each experiment trains a small language model for exactly 5 minutes, measures a single vocabulary-size-independent quality metric (val_bpb, validation bits per byte, lower is better), and either commits the improvement to a git branch or reverts the change. When you wake up, the git log is a complete record of every decision the agent made and why.
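The loop the agent runs can be sketched in a few lines (every name here is illustrative; in the real system the "loop" is an agent reading program.md and driving git):

```python
def autoresearch_loop(propose, run_experiment, best_bpb, n_experiments):
    """Minimal sketch of the commit-or-revert loop described above.

    `propose` suggests one candidate change given the history so far;
    `run_experiment` trains it for the fixed budget and returns val_bpb,
    or None if the run crashed. Both are stand-ins for the agent's work.
    """
    history, kept = [], []
    for _ in range(n_experiments):
        change = propose(history)
        bpb = run_experiment(change)           # 5-minute training run
        if bpb is not None and bpb < best_bpb:
            kept.append(change)                # -> git commit
            best_bpb = bpb
        # else: -> git checkout train.py (revert)
        history.append((change, bpb))          # every outcome is logged
    return best_bpb, kept
```

Crashes (None) and regressions take the same path: revert, log, move on. Only strict improvements advance `best_bpb`, which is what makes the git history monotone.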
This newsletter dissects autoresearch from a systems engineering perspective: what the three-file contract enforces, why the immutable evaluation harness is the most important design decision in the entire system, how the program.md TSV history gives the agent cross-session memory, and what the bilevel autoresearch paper (arXiv:2603.23420) reveals about the next step.
Scope: autoresearch architecture (prepare.py, train.py, program.md), val_bpb evaluation harness design, agent loop protocol, git-based experiment tracking, multi-GPU parallel search, and adjacent research (arXiv:2603.23420, arXiv:2603.07114). Not covered: production nanochat (the 8xH100 version) beyond brief mention.
What It Actually Does
karpathy/autoresearch is a minimal autonomous ML experimentation framework built on top of nanochat (Karpathy's minimal LLM training codebase), stripped to a single-GPU, single-file training loop of approximately 630 lines. The design philosophy is extreme simplicity: no orchestration framework, no custom agent protocol, no infrastructure beyond git and a single Python script.
The complete system has exactly three files that matter:
prepare.py (immutable): Handles one-time data preparation. Downloads TinyStories training data, trains a BPE tokenizer with 8,192-token vocabulary, provides the dataloader and the evaluation function. This file is never modified by the human or the agent. It locks the evaluation metric in place so no experiment can game the benchmark.
train.py (agent's sandbox): Contains the full GPT model definition, the optimizer (Muon + AdamW, Karpathy's preferred combination), and the complete training loop. Everything is fair game for the agent: swap activation functions, restructure attention heads, change learning rate schedules, modify weight initialization, change model depth, change batch size. The only constraint: the code must run without crashing and must finish within the 5-minute wall-clock budget.
program.md (human's interface): Written in plain markdown. Contains the research strategy, the operating instructions for the agent, the history of previous experiments in TSV format, and the criteria for keeping versus discarding a result. This is the only file the human touches. Karpathy notes that program.md is already "90% AI written I ain't writing all that" even in his own workflow.
The architecture is a three-party contract: prepare.py is the immutable judge, train.py is the mutable candidate, and program.md is the research direction.
The Architecture, Unpacked

Focus on the immutable prepare.py. This is the architectural decision that makes the entire system valid. The agent can change anything in train.py, but it cannot change the evaluation harness in prepare.py. val_bpb is locked in place. Every experiment is measured against the same yardstick, regardless of architectural changes. Without this constraint, the agent could trivially "improve" val_bpb by changing what it measures.
The Code, Annotated
Snippet One: The program.md Protocol (the human's interface)
# program.md (the human's only interface to the autoresearch system)
# Source: github.com/karpathy/autoresearch/blob/master/program.md
# This file serves three functions simultaneously:
# 1. Agent operating instructions (how to run experiments)
# 2. Research strategy (what directions to explore)
# 3. Experiment history (TSV log of every run)
## Operating Instructions
You are an autonomous ML researcher. Your job is to improve val_bpb in train.py
by proposing and testing code changes. One change at a time. Never touch prepare.py.
### Experiment Loop
# ← THIS is the agent's main loop, written in natural language markdown
# The agent reads this and executes it literally as its operating procedure
1. Read current train.py to understand the baseline
2. Propose ONE change based on the experiment history below
3. Edit train.py with the proposed change
4. Run: uv run train.py > run.log 2>&1
(redirect everything — do NOT use tee or let output flood your context)
5. Read results: grep "^val_bpb:\|^peak_vram_mb:" run.log
- Empty output = crash. Run tail -n 50 run.log and attempt a fix.
- If you can't fix it after 3 attempts, revert and try a different change.
6. Compare val_bpb to the previous best in the history table below
7. If improved (lower val_bpb): git commit with a description of the change
If not improved: git checkout train.py (revert the change)
8. Add the result to the experiment_history table below (do NOT commit results.tsv)
9. GOTO 1
### Research Strategy
# ← The human customizes this section to guide the agent's search direction
# This is where Karpathy's insight lives: program.md is the human's only lever
Priorities:
- Architecture changes first (attention patterns, layer configs, model width/depth)
- Optimizer changes second (LR schedules, warmup/warmdown, batch size)
- Simplicity criterion: all else being equal, prefer simpler code
- VRAM is a soft constraint: acceptable increases for meaningful val_bpb gains
Forbidden:
- Modifying prepare.py or the evaluate_bpb function
# ← Key constraint: agent cannot change the judge, only the candidate
## Experiment History
# ← Every run is logged here. This is the agent's memory across sessions.
# format: commit | val_bpb | memory_gb | status | description
# ← THIS is the trick: the TSV in program.md is how the agent knows
# what has been tried and what worked, persisting across sessions
| commit | val_bpb | memory_gb | status | description |
|---------|----------|-----------|---------|-------------------------------|
| a1b2c3d | 0.997900 | 44.0 | keep | baseline |
| b2c3d4e | 0.993200 | 44.2 | keep | increase LR to 0.04 |
| c3d4e5f | 1.005000 | 44.0 | discard | switch to GeLU activation |
| d4e5f6g | 0.000000 | 0.0 | crash | double model width (OOM) |
The TSV experiment history embedded in program.md is the agent's memory. Every experiment result, including discards and crashes, is logged here. The agent reads this at the start of every session and uses it to avoid repeating failed experiments and to understand which directions are promising. Without this history, the agent would explore randomly rather than adaptively.
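That memory is also trivially machine-readable. A small sketch of how the current best could be recovered from the table (the pipe-table layout above is the only assumption; this helper is mine, not the repo's):

```python
def best_val_bpb(program_md: str) -> float:
    """Return the lowest val_bpb among rows marked 'keep' in the
    experiment-history pipe table embedded in program.md.

    Assumes the column order shown above:
    commit | val_bpb | memory_gb | status | description
    """
    best = float("inf")
    for line in program_md.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # header and separator rows fail the status check and are skipped
        if len(cells) >= 4 and cells[3] == "keep":
            best = min(best, float(cells[1]))
    return best
```

The same one-pass scan is what the agent does implicitly at step 6 of the loop: the table is the database, and "query" is reading markdown.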
Snippet Two: The Three-File Contract and Agent Loop (bash harness)
#!/bin/bash
# The complete autoresearch harness setup
# Source: karpathy/autoresearch README
# Step 1: Setup (human does this once)
uv sync # install dependencies
uv run prepare.py # download TinyStories, train BPE tokenizer (one-time)
# Step 2: Create a dedicated experiment branch
git checkout -b autoresearch/mar5-gpu0
# ← Each session gets its own branch: date + GPU index
# Multiple GPUs = multiple branches = parallel search streams
# The agent accumulates commits on this branch, never on main
# Step 3: Point the agent at program.md and let it loop
# ← THIS is the entire "harness": Claude Code or Codex, pointed at program.md
# The agent reads the file, understands the operating instructions,
# and executes the experiment loop without any additional scaffolding
claude --dangerously-skip-permissions -c -p "$(cat program.md)"
# OR:
# codex exec resume --last --json "$(cat program.md)"
# The agent now runs indefinitely:
# Each iteration = read train.py + history → propose change → edit → run → evaluate → keep/discard → log → repeat
# ~12 experiments per hour = ~100 overnight
# Step 4: Read results in the morning
git log --oneline autoresearch/mar5-gpu0 # see all kept improvements
cat results.tsv # see full experiment history
# ← The agent accumulated these commits while you slept
# Verify the best result:
uv run train.py # run the final train.py with all improvements
grep "^val_bpb:" run.log # should be lower than your starting baseline
# For parallel GPU search (one branch per GPU, different random seeds):
for i in 0 1 2 3; do
git checkout -b "autoresearch/mar5-gpu${i}"
# ← Each GPU starts from the same baseline but explores independently
# Population-based evolution: best branches can be merged at the end
CUDA_VISIBLE_DEVICES=$i claude --dangerously-skip-permissions -c -p "$(cat program.md)" &
done
wait
The multi-GPU pattern (one agent per GPU, one branch per GPU) is the emergent scaling property. Each agent independently explores the search space with a different random seed. The best improvements from each branch can be cherry-picked and merged. The git branch structure is the population tracking mechanism.
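The morning-after merge step can be sketched the same way (the results.tsv column layout here is assumed to mirror the program.md table, tab-separated; the helper is illustrative, not part of the repo):

```python
def best_branch(branch_results: dict[str, str]) -> tuple[str, float]:
    """Rank parallel search streams by their best kept val_bpb.

    Each value is the text of one branch's results.tsv; columns assumed:
    commit, val_bpb, memory_gb, status, description (tab-separated).
    """
    def branch_best(tsv: str) -> float:
        best = float("inf")
        for row in tsv.strip().splitlines():
            fields = row.split("\t")
            if len(fields) >= 4 and fields[3] == "keep":
                best = min(best, float(fields[1]))
        return best
    winner = min(branch_results, key=lambda b: branch_best(branch_results[b]))
    return winner, branch_best(branch_results[winner])
```

The winning branch's commits can then be cherry-picked onto main by hand, which is exactly the manual merging the design trades for simplicity.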
In Action: An End-to-End Worked Example
Input: A single H100 GPU, the autoresearch repo, overnight run (8 hours).
Starting state:
Baseline train.py: GPT architecture, Muon + AdamW optimizer, default settings
Baseline val_bpb: 0.9979 (from prepare.py evaluation on TinyStories held-out split)
GPU: single H100
Session: autoresearch/mar7-gpu0 branch
Agent run (8 hours, reconstructed from Karpathy's published experiment log):
Hour 1 (experiments 1-12):
Exp 1: increase LR 0.03 → 0.04 val_bpb: 0.9932 KEEP (-0.0047)
Exp 2: switch to GeLU activation val_bpb: 1.0050 DISCARD (worse)
Exp 3: double model width val_bpb: 0.0000 CRASH (OOM)
Exp 4: halve total batch size 524K → 262K val_bpb: 0.9895 KEEP (-0.0037)
Exp 5: add warmup 5% val_bpb: 0.9871 KEEP (-0.0024)
...
Hour 4 (experiments 37-48):
Exp 42: SSSSL sliding window pattern val_bpb: 0.9733 KEEP (-0.0138)
Exp 44: short window 1/4 context (512 tokens) val_bpb: 0.9698 KEEP (-0.0035)
Exp 47: short window 1/8 context (256 tokens) val_bpb: 0.9671 KEEP (-0.0027)
Hour 8 (experiments 81-96):
Exp 83: RoPE base frequency 10000 → 50000 val_bpb: 0.9621 KEEP (-0.0050)
Exp 88: embedding LR 0.6 → 0.8 val_bpb: 0.9598 KEEP (-0.0023)
Exp 92: unembedding LR 0.004 → 0.008 val_bpb: 0.9571 KEEP (-0.0027)
Final result (from Karpathy's published figure):
83 experiments run
15 improvements kept (18% keep rate)
val_bpb improved: 0.9979 → ~0.957 (approximately -4.1% improvement overnight)
Agent correctly discarded 68 changes and kept 15 real improvements
On the 8xH100 prod run (run simultaneously):
276 experiments
29 kept improvements
Karpathy: "I'll just leave this running for a while"
The 18% keep rate is meaningful. A random agent accepting all changes would reduce val_bpb 18% of the time and increase it 82% of the time, producing net degradation. The agent's ability to compare against the previous best and revert is the mechanism that ensures the search is monotonically improving rather than random walking.
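The arithmetic behind that claim can be checked with a toy simulation (the step size and probabilities are assumed round numbers, purely illustrative):

```python
import random

def simulate(n_exps=83, p_improve=0.18, step=0.003, seed=0):
    """Toy model of the dynamics above: each proposed change improves
    val_bpb with probability p_improve, otherwise hurts it by a
    similar amount.

    Returns the final val_bpb under two policies: accept every change
    (random walk) versus keep only changes that beat the current best.
    """
    rng = random.Random(seed)
    accept_all = revert_based = 0.9979   # starting baseline
    for _ in range(n_exps):
        delta = -step if rng.random() < p_improve else +step
        accept_all += delta              # no reverting: bad steps accumulate
        if delta < 0:
            revert_based += delta        # keep improvements only
    return accept_all, revert_based

walk, kept = simulate()
print(f"accept-all: {walk:.4f}  revert-based: {kept:.4f}")
```

Under the accept-everything policy the 82% of bad steps dominate and val_bpb drifts upward; the revert policy turns the very same proposal stream into a monotone descent.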
Why This Design Works, and What It Trades Away
The fixed 5-minute wall-clock budget is what makes experiments comparable at all. It creates a single, unambiguous success criterion: lower val_bpb after exactly 5 minutes of training. This solves several problems simultaneously. A model that trains for longer will produce lower val_bpb regardless of architectural quality; the time budget equalizes all experiments, so a deeper model and a shallower model trained for the same time get a fair comparison. Architectural changes are not penalized for changing parameter count, because both the original and modified architectures get the same compute budget.
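Enforcing the contract is mechanically simple. A sketch of what a wall-clock-budgeted loop looks like (illustrative names; train.py's actual internals are the agent's to rewrite):

```python
import time

def train_with_budget(step_fn, evaluate_fn, budget_s=300.0):
    """Sketch of the fixed wall-clock contract: run as many optimizer
    steps as fit in the budget, then evaluate exactly once. Deep and
    shallow models compete on equal time, not equal step counts.
    """
    t0 = time.monotonic()
    steps = 0
    while time.monotonic() - t0 < budget_s:
        step_fn()   # one optimizer step; its cost depends on the architecture
        steps += 1
    val_bpb = evaluate_fn()
    print(f"val_bpb: {val_bpb:.6f}")   # the grep-able line the agent reads
    return val_bpb, steps
```

A cheaper architecture simply gets more steps inside the same window; the evaluation happens once, after time is up, on the frozen harness.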
The val_bpb metric (bits per byte, lower is better) is vocabulary-size-independent. This matters because some architectural changes (changing tokenizer vocabulary, changing the model's internal dimensionality) would produce confounded results with a vocabulary-size-dependent metric. Karpathy chose val_bpb specifically to allow architectural experiments that change the tokenizer or model depth.
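The normalization is worth spelling out: loss is summed in nats over however many tokens the tokenizer happened to produce, then divided by the byte count of the raw text, which no tokenizer change can alter. A sketch of the computation (the idea, not prepare.py's literal code):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Summed next-token cross-entropy (in nats) over a held-out set,
    converted to bits and normalized by the raw byte count. Retokenizing
    redistributes the total loss across tokens but leaves the denominator
    untouched, so the metric stays comparable across vocabulary sizes."""
    return total_loss_nats / (math.log(2) * total_bytes)
```

A model that averages exactly ln 2 nats of loss per byte of text scores 1.0 bits per byte.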
The git-based experiment tracking is the correct persistence mechanism. Each kept improvement is a git commit. The history in results.tsv (untracked) captures every run including discards. After an overnight run, the agent's git history is a complete record of its research. This is better than a log file because it is structured, atomic, and reversible.
What autoresearch trades away:
Experiment quality. The agent proposes one change at a time based on program.md and the recent history. It does not run ablations, does not identify confounded experiments, and does not distinguish between improvements that generalize and improvements that overfit to the TinyStories training distribution. The 15 kept improvements from an 83-experiment run are real improvements on the training metric, but they may not all transfer to other datasets.
Exploration breadth. A single agent proposing one change at a time explores the search space sequentially. The multi-GPU variant (one agent per GPU, one branch per GPU) addresses this partially, but it requires manual merging of improvements. The bilevel autoresearch paper (arXiv:2603.23420) is the research direction that addresses exploration quality by having an LLM optimize program.md itself.
Sample efficiency. An experienced ML researcher might propose the same 15 improvements across 30 experiments rather than 83, by using prior knowledge to avoid obviously bad directions. The agent has some prior knowledge (from its training data) but less than a domain expert. The tradeoff: the agent runs 24/7, never gets bored, and never needs to sleep.
Bilevel Autoresearch: Optimizing the Optimizer
The research community immediately extended autoresearch with the obvious follow-up question: if an AI agent optimizes train.py, can a second AI agent optimize program.md? This is the bilevel approach.
arXiv:2603.23420 (Bilevel Autoresearch: Meta-Autoresearching Itself) proposes exactly this: an outer agent that evaluates different program.md variants based on which versions produce the fastest val_bpb improvement rate, and an inner agent that executes experiments according to each program.md variant. The outer agent treats program.md as a hyperparameter and optimizes it using the inner agent's experimental results as a signal.
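A toy sketch of that outer loop (structure and names are my own illustration, not the paper's code):

```python
def bilevel_search(variants, run_inner, hours=8.0):
    """Outer loop: treat each program.md variant as a hyperparameter and
    score it by the val_bpb improvement rate its inner-agent session
    achieves over a fixed time window.

    `run_inner` stands in for a full inner autoresearch session and
    returns (starting val_bpb, final val_bpb) for that variant.
    """
    def improvement_rate(variant):
        start_bpb, end_bpb = run_inner(variant, hours)
        return (start_bpb - end_bpb) / hours   # bpb improvement per hour
    return max(variants, key=improvement_rate)
```

The expensive part is hidden inside `run_inner`: each score costs a full GPU-session, which is why the outer loop's sample budget is tiny compared to the inner one's.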
This is the natural extension of the three-file contract: once you accept that program.md defines the research strategy, the obvious next step is to optimize the research strategy itself. Karpathy explicitly anticipated this: "You can imagine comparing the research progress of different prompts, different agents, etc."
arXiv:2603.07114 (AutoResearch-RL) proposes a reinforcement learning extension where the agent receives a reward signal based on cumulative val_bpb improvement over time, not just per-experiment improvement. This changes the optimization target from "find the next improvement" to "find the fastest sequence of improvements," which is a fundamentally different research strategy.
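The difference between the two targets is easy to make concrete (a sketch of the idea; the paper's actual reward formulation may differ):

```python
def per_step_rewards(bpb_trace):
    """Per-experiment reward: improvement over the previous best,
    zero for discarded or regressive changes."""
    best, rewards = bpb_trace[0], []
    for bpb in bpb_trace[1:]:
        rewards.append(max(0.0, best - bpb))
        best = min(best, bpb)
    return rewards

def cumulative_reward(bpb_trace, discount=0.99):
    """Discounted cumulative reward: earlier improvements count more,
    so the optimal policy is the fastest *sequence* of improvements,
    not merely the next single one."""
    return sum((discount ** t) * r
               for t, r in enumerate(per_step_rewards(bpb_trace)))
```

With discounting, an agent that finds the same total improvement sooner scores strictly higher, which is precisely the shift from per-experiment greed to research-velocity optimization.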
Technical Moats
The immutable prepare.py is both the moat and the constraint. The system works because prepare.py cannot be gamed. Every experiment is measured against the same fixed evaluation function. This is easy to understand, hard to maintain as the system scales. At production scale (8xH100, weeks of running), the question becomes: does the fixed evaluation metric remain a valid proxy for the actual research goal? Karpathy's current setup uses TinyStories val_bpb, which is a reasonable proxy for language model quality on a small corpus. For more complex research goals, the evaluation harness design becomes the primary research challenge.
The simplicity criterion in program.md. "All else being equal, simpler is better." This is the human's primary contribution to research taste beyond the initial experimental direction. It prevents the agent from accumulating architectural complexity that improves val_bpb on the training distribution through overfitting-like mechanisms. The agent will tend toward complexity (more parameters, more architectural details) without explicit pressure toward simplicity. The criterion in program.md is Occam's razor as a markdown instruction.
1,500 stars in the first days. The community adoption signal is strong, and the pattern is immediately reproducible. The barrier to entry is exactly one NVIDIA GPU, Python 3.10+, and the ability to write a program.md. The system has been forked and adapted to RTX consumer GPUs, Windows, different training corpora, and multi-agent variants within weeks of release.
Insights
Insight One: Autoresearch does not automate ML research. It automates ML experimentation. The distinction matters because the hard part of research is not running experiments. It is knowing which experiments to run and what the results mean.
Karpathy is explicit about this: "The goal is to engineer your agents to make the fastest research progress indefinitely." The agent runs experiments. The human (or a second agent in the bilevel setup) decides what kinds of experiments to run. A domain expert writes program.md in a way that avoids clearly bad directions and focuses on promising ones. A novice writes program.md in a way that has the agent explore randomly. The productivity multiplier from autoresearch is proportional to the quality of program.md, which is determined by human research taste. The bottleneck is not compute. It is your program.md.
Insight Two: The git branch as experiment tracking is a more important design decision than any single algorithmic choice, and it is the most under-discussed aspect of the architecture.
Every kept improvement is a git commit. Every discarded change is a git checkout train.py. The agent's entire research history is encoded in the git log, which is atomic, reversible, and human-readable. When Karpathy wants to understand why a particular improvement worked, he reads the git diff. When he wants to share a result, he shares a commit hash. When he wants to run a parallel experiment from a different starting point, he creates a new branch. Git is not just version control in this system. It is the primary data structure for experiment tracking. This is the "obvious in hindsight" design decision that makes the whole system inspectable and reproducible.
Surprising Takeaway
Karpathy wrote the fictional prologue to autoresearch in past tense, from the perspective of a future where autonomous agent research is the norm, and then published it in March 2026, the very present it describes:
"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies."
He means us. He means group meetings. The prologue is written from a future perspective about the present. He is running autoresearch on a single H100 and simultaneously running the larger version on 8xH100 and leaving both running indefinitely. His framing, "Part code, part sci-fi, and a pinch of psychosis," is not a joke. It is a precise description of the subjective experience of being the person who builds the tool that replaces the thing you used to do.
TL;DR For Engineers
Three files, strict ownership: prepare.py (immutable judge, owns val_bpb), train.py (agent's sandbox, owns the model), program.md (human's interface, owns research strategy). The agent can rewrite anything in train.py. It cannot touch prepare.py. The human writes program.md.
Fixed 5-minute wall-clock budget per experiment: vocabulary-size-independent val_bpb ensures fair comparison across architectural changes. The time budget equalizes experiments regardless of model depth or optimizer choice.
~12 experiments per hour, ~100 overnight. 83 experiments, 15 kept improvements (18% keep rate) in the Karpathy demo run. val_bpb: 0.9979 → ~0.957 overnight. The agent correctly discards 68/83 experiments.
Git is the experiment tracking data structure: each kept improvement is a git commit on a feature branch. Multi-GPU search = multiple branches, each exploring independently. Merge the best improvements across branches manually.
The bottleneck is your program.md, not your GPU. The system scales with the quality of the research strategy the human encodes in program.md. Bilevel autoresearch (arXiv:2603.23420) addresses this by having a second agent optimize program.md itself.
The Science Is Now in the Markdown File
Autoresearch makes a precise engineering claim: you can separate the "what to try" (program.md, human territory) from the "how to try it" (the agent loop) and from "whether it worked" (prepare.py, immutable). This separation is not obvious, but it is the entire design. The immutable judge prevents metric gaming. The locked time budget equalizes experiments. The git branch makes every decision inspectable and reversible. The TSV in program.md gives the agent memory across sessions without a database.
What makes the system hard to dismiss is the repo's own README. Karpathy writes the fictional future in past tense: agents have taken over research, the codebase is now a self-modifying binary in its 10,205th generation, no one can tell if that's right or wrong. Then he adds: "This repo is the story of how it all began." Published March 2026. The prologue is a design document. The architecture it describes is already running.
References
autoresearch GitHub Repository, Andrej Karpathy, 43.3k stars, 6k forks, MIT, March 7, 2026
Karpathy's X announcement thread, March 7, 2026
Karpathy Just Turned One GPU Into a Research Lab, Garry Tan, Garry's List, March 8, 2026
nanochat repository, the parent codebase that autoresearch is built on
A Survey on Large Language Model based Autonomous Agents, arXiv:2308.11432, Wang et al., 2023
Exploring Andrej Karpathy's Autoresearch: AI Agents Driving Autonomous ML Experimentation, Ken Huang, March 2026
A Guide to Andrej Karpathy's AutoResearch, DataCamp, March 2026
The Ralph Wiggum Technique: AI Agent Loop Pattern, Geoffrey Huntley
autoresearch (Karpathy, MIT, March 2026) is a minimal autonomous ML experimentation system built on a three-file contract: prepare.py is the immutable judge that locks val_bpb (validation bits per byte, vocabulary-size-independent) as the fixed evaluation metric; train.py is the agent's 630-line sandbox containing the full GPT model, Muon+AdamW optimizer, and training loop; and program.md is the human's only interface, encoding research strategy, operating instructions, and a TSV of every experiment run. An AI agent (Claude or Codex) proposes one change at a time to train.py, trains for a fixed 5-minute wall-clock budget, keeps the change via git commit if val_bpb improves, and reverts it otherwise. Running ~12 experiments per hour, a single H100 overnight produces approximately 100 experiments at a 15-18% keep rate, yielding a monotonically improving model with a complete git-tracked audit trail. The compute is not the bottleneck; the quality of program.md is, which is the architectural insight that Bilevel Autoresearch directly addresses by having a second outer agent optimize program.md itself, treating research strategy as a hyperparameter subject to the same empirical optimization as architecture and optimizer choices.
Sponsored Ad
If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀
Get what you want from TV advertising
When growth is often measured at the last click, you’re paying to compete for demand that was created somewhere else.
Reach people in the purchase planning phase before your competitors know these customers even exist.
With high-intent Pinterest signals on Performance TV you can reach audiences earlier where they watch the most.