An analysis of Claude Code's TypeScript source code (arXiv:2604.14228, MBZUAI + UCL, April 2026) provides strong evidence: "The core of the system is a simple while-loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop." This newsletter covers both halves: what those systems are, using the Claude Code paper's five-layer architecture with its seven-mode permission system and five-stage compaction pipeline, and how to build your own, using the Loop Engineering playbook published in June 2026.
SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | July 04, 2026
Loop engineering is the practice of building a small system that finds the work, hands it to the agent, checks the result, records the outcome, and decides the next move, entirely on its own. You design the system once. The system prompts the agent from then on.
The practitioners doing this at scale, including Anthropic's own engineers, now report merging approximately eight times as much code per day as they did in 2024. Anthropic notes this figure is "almost certainly an overstatement of true productivity gains," which is a useful caveat: the underlying mechanism is real, but the headline is the optimistic end of the distribution.
What the Claude Code architectural analysis (Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, Zhiqiang Shen, arXiv:2604.14228) makes precise is why this works: a production coding agent is not a loop with some prompt engineering around it. It is a five-layer system where the loop is the smallest component.
Scope: the Claude Code five-layer architecture from the arXiv paper, the Loop Engineering 14-step framework from the Medium guide, the specific subsystems that matter most (permission system, compaction pipeline, skill system, maker-vs-checker sub-agents, state files), and how these compose into a production loop. Not covered: the OpenClaw comparison from the paper, or the governance and policy directions in the paper's future work section.
What It Actually Does
Loop engineering has two definitions depending on who you ask. The surface definition: run an agent on a schedule instead of manually. The correct definition: design the context management, verification gates, state persistence, permission boundaries, and tool access that make unattended agent execution reliable rather than catastrophic.
The distinction matters because the first definition is a weekend project. The second is what Anthropic shipped as Claude Code.
The Loop Engineering 4-Condition Test (before you build anything):
A loop only earns its setup cost if all four conditions hold:
| Condition | Pass criteria | Fail case |
|---|---|---|
| Task Repeats | Weekly at minimum | One-time job: use a manual prompt |
| Verification Is Automated | Test suite, type checker, linter, or passing build can reject bad output WITHOUT a human | No automated check: you are still reading every diff |
| Token Budget Fits | Enterprise or team plan absorbs exploration and retry costs | Solo developer on $20 plan: bill arrives before gains do |
| Agent Has Senior Tools | Logs, reproduction environment, ability to run the code it just wrote | Blind iteration: "Ralph Wiggum" failure mode, the agent agrees with itself |
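The four conditions can be written down as a small gate that runs before any loop is built. The following is a minimal sketch, not part of the Loop Engineering guide: the `LoopCandidate` fields and the `failed_conditions` helper are hypothetical names chosen for illustration, and "weekly at minimum" is interpreted here as at least one run per week.

```python
from dataclasses import dataclass


@dataclass
class LoopCandidate:
    """Answers to the four questions for one candidate task (hypothetical fields)."""
    runs_per_week: float          # how often the task recurs
    has_automated_check: bool     # tests, type checker, linter, or build can reject output
    budget_covers_retries: bool   # plan absorbs exploration and retry costs
    agent_can_run_code: bool      # logs, repro environment, ability to execute its output


def failed_conditions(c: LoopCandidate) -> list[str]:
    """Return the names of the conditions the candidate fails (empty list = all four hold)."""
    checks = {
        "task repeats": c.runs_per_week >= 1,
        "verification is automated": c.has_automated_check,
        "token budget fits": c.budget_covers_retries,
        "agent has senior tools": c.agent_can_run_code,
    }
    return [name for name, passed in checks.items() if not passed]


nightly_ci_triage = LoopCandidate(7, True, True, True)
one_off_migration = LoopCandidate(0, True, True, False)

print(failed_conditions(nightly_ci_triage))   # []
print(failed_conditions(one_off_migration))   # ['task repeats', 'agent has senior tools']
```

Returning the list of failed conditions, rather than a single boolean, makes it obvious which piece of infrastructure is missing before the loop is worth building.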
The Architecture, Unpacked

Focus on Layer 2 and Layer 3. The while-loop in Layer 2 is six lines. The compaction pipeline it depends on is five stages. The safety system in Layer 3 has seven permission modes, an ML classifier, four extensibility mechanisms, and a four-step authorization pipeline. The loop is not the engineering.
The Code, Annotated
Snippet One: The Minimum Viable Loop (Four Pillars Plus a Maker-vs-Checker Split)
# Loop Engineering: minimum viable loop (MVL) architecture
# Source: Loop Engineering guide (Medium, Neyzis, Jun 2026)
# Design intent: automation + skill + state file + connector (four pillars only)
# Start here, NOT with multi-agent swarms
import json
from datetime import datetime, timezone
from pathlib import Path

# ─── STATE FILE: The Agent Forgets, The File Remembers ────────────────────────
# ← LLM sessions are stateless. Without this, the loop re-derives everything
#   from scratch on every run: context, decisions, progress, lessons learned.
#   The state file is the structural backbone of every production loop.
STATE_FILE = Path("loop_state.json")


def load_state() -> dict:
    """Load the persisted loop state, or return a fresh one on the first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {
        "loop_id": "ci-triage",
        "last_run": None,
        "status": {"failures_classified": 0, "fixes_drafted": 0, "escalated": 0},
        "in_progress": [],
        "lessons_learned": [],  # ← accumulated across runs, compounds over time
    }


def save_state(state: dict) -> None:
    """Stamp the run time (timezone-aware UTC) and write the state back to disk."""
    state["last_run"] = datetime.now(timezone.utc).isoformat()
    STATE_FILE.write_text(json.dumps(state, indent=2))
# ─── SKILL: Write Once, Used Every Run ────────────────────────────────────────
SKILL_MD = """
# CI Triage Skill
## Classification Rules
- env: Missing secrets or unprovisioned infrastructure. → Escalate to human.
- flake: Test passes on a clean retry without code changes. → File a report.
- bug: Deterministic failure tied to a recent commit. → Draft a fix.
## Fix Patterns
- Auth tests → Verify src/auth/middleware first.
- Database tests → Check if recent migrations were applied in CI env.
## Never Do
- Never disable a failing test to pass the build.
- Never touch src/payments/ or src/billing/.
"""
# ← This skill prevents the agent from re-learning your architecture on every
# run. Without skills, a loop wastes token budget re-deriving conventions.
# With skills, intent compounds. Historical "don't do this" notes are
# available on every cycle without occupying conversation history.
# ─── AUTOMATION WITH GOAL-DRIVEN TRIGGER ─────────────────────────────────────
def build_agent_prompt(state: dict, ci_failures: list[str]) -> str:
    """
    Build the structured prompt sent to the agent each loop iteration.
    ← The loop designs the prompt; humans stop typing.
    """
    lessons = "\n".join(f"- {lesson}" for lesson in state.get("lessons_learned", []))
    failures = "\n".join(ci_failures)
    return f"""
{SKILL_MD}
Current State:
{json.dumps(state['status'], indent=2)}
Lessons from prior runs:
{lessons}
New CI Failures to Triage:
{failures}
Goal: Classify all failures, draft fixes for bugs, escalate env failures.
Success condition (verified by independent checker, not you):
All deterministic failures have a draft fix PR with passing CI.
HARD LIMITS:
- Do not touch: src/payments/, src/billing/, auth/crypto modules.
- Max iterations: 10.
- Human approval required before any merge.
"""
# NOTE: fetch_ci_failures_via_mcp, run_agent_with_goal, and notify_slack are
# placeholders for your own connector and agent-runner code; this snippet
# does not define them.
async def run_ci_triage_loop():
    """
    The minimum viable loop: automation trigger → skill + state → agent → verify
    """
    state = load_state()

    # ─── STEP 1: CONNECTOR fetches work from the real world ──────────────────
    # ← Without MCP/connectors, the agent can only see your local filesystem.
    #   Connectors are what make the agent open PRs, update tickets, and alert
    #   Slack when done, instead of just saying "here's the fix."
    ci_failures = await fetch_ci_failures_via_mcp()  # MCP connector: GitHub CI
    if not ci_failures:
        print("No failures to triage.")
        return

    # ─── STEP 2: GOAL-DRIVEN TRIGGER (maker-vs-checker split) ────────────────
    # ← THIS is the trick: the /goal trigger runs until the CHECKER confirms done
    #   The AGENT (maker) writes the fixes.
    #   A SEPARATE SMALLER MODEL (checker) verifies the stop condition.
    #   "The model that wrote the code is way too nice grading its own homework"
    #   (Addy Osmani, quoted in Loop Engineering guide)
    prompt = build_agent_prompt(state, ci_failures)
    result = await run_agent_with_goal(
        prompt=prompt,
        goal="All deterministic failures have a draft fix PR with passing CI.",
        checker_model="claude-haiku-4-5",  # ← fast, cheap checker
        maker_model="claude-opus-4-6",     # ← high-reasoning maker
        max_iterations=10,                 # ← HARD STOP: loops need limits
        require_human_approval_before_merge=True,  # ← never autopilot to prod
    )

    # ─── STEP 3: UPDATE STATE with outcomes ──────────────────────────────────
    state["status"]["failures_classified"] += len(ci_failures)
    state["status"]["fixes_drafted"] += result.fixes_drafted
    state["status"]["escalated"] += result.escalated_count
    if result.new_lessons:
        state["lessons_learned"].extend(result.new_lessons)
    save_state(state)

    # ─── STEP 4: CONNECTOR notifies humans ───────────────────────────────────
    await notify_slack(
        f"CI Triage complete: {result.fixes_drafted} fixes drafted, "
        f"{result.escalated_count} escalated to humans."
    )
The maker_model/checker_model split in the goal trigger is the architectural invariant the entire loop depends on. A single model evaluating its own stop condition produces the Ralph Wiggum failure mode: the agent declares success because it is biased toward accepting its own output. The evaluator-optimizer pattern guards against this by routing verification to a separate model with a different prompt and a different role: judging the stop condition, not producing the fix.
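The control flow of that split is small enough to sketch without any model API. In the sketch below, `run_until_checked`, `toy_maker`, and `toy_checker` are hypothetical stand-ins: the two callables take the place of the maker and checker models, and the toy versions exist only to make the example runnable. The checker sees the goal and the draft, never the maker's reasoning.

```python
from typing import Callable


def run_until_checked(
    make: Callable[[str, list[str]], str],
    check: Callable[[str, str], tuple[bool, str]],
    goal: str,
    max_iterations: int = 10,
) -> tuple[bool, str, int]:
    """Alternate maker and checker until the checker accepts or the budget runs out."""
    feedback: list[str] = []
    draft = ""
    for iteration in range(1, max_iterations + 1):
        draft = make(goal, feedback)            # maker sees the goal plus prior objections
        accepted, reason = check(goal, draft)   # checker sees only the goal and the draft
        if accepted:
            return True, draft, iteration
        feedback.append(reason)
    return False, draft, max_iterations


def toy_maker(goal: str, feedback: list[str]) -> str:
    # Stand-in for the maker model: produces one new revision per objection received.
    return f"patch-v{len(feedback) + 1}"


def toy_checker(goal: str, draft: str) -> tuple[bool, str]:
    # Stand-in for the checker model: only the third revision passes.
    ok = draft == "patch-v3"
    return ok, ("" if ok else f"{draft} still fails CI")


goal = "All deterministic failures have a draft fix PR with passing CI."
print(run_until_checked(toy_maker, toy_checker, goal))
# (True, 'patch-v3', 3)
print(run_until_checked(toy_maker, toy_checker, goal, max_iterations=2))
# (False, 'patch-v2', 2)
```

The second call shows why the hard iteration limit belongs in the harness: when the checker never accepts, the loop stops and reports failure instead of running indefinitely.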
Snippet Two: Claude Code's Compaction Pipeline Logic and Permission System
// Claude Code: compaction pipeline and permission architecture
// Source: arXiv:2604.14228 analysis of TypeScript source v2.1.88
// Design intent: context window is the binding constraint; the loop exists to manage it
// NOTE: Message, tokenCount, and the per-stage helpers are not defined in this snippet.

// ─── THE FIVE-STAGE COMPACTION PIPELINE ──────────────────────────────────────
// Each stage runs as a "pre-model context shaper" before every model call.
// ← The compaction pipeline is what makes long-horizon tasks possible.
//   Without it, the context fills up with tool results, file contents, and
//   scaffolding, and the model gets worse as the session progresses.
async function runCompactionPipeline(
  messages: Message[],
  tokenBudget: number,
): Promise<Message[]> {
  let current = messages;

  // Stage 1 (budget reduction): prune to token budget via importance scoring
  // ← Simplest stage: score each message by recency and relevance, drop lowest
  current = await budgetReduction(current, tokenBudget * 0.95);

  // Stage 2 (snip): remove irrelevant sections from tool outputs
  // ← Bash output often contains large irrelevant sections: log noise, headers.
  //   Snip removes these surgically without summarizing the whole tool result.
  current = await snipToolOutputs(current);

  // Stage 3 (microcompact): summarize low-signal recent exchanges
  // ← Short back-and-forth clarification rounds compress well.
  //   "What's the function signature?" / "Here it is." → one-liner summary.
  current = await microcompactExchanges(current);

  // Stage 4 (context collapse): full history compression into a summary
  // ← Used when stages 1-3 are insufficient. The full message history is
  //   summarized into a single long-form context block.
  //   Trade-off: summarization loses detail but enables longer sessions.
  if (tokenCount(current) > tokenBudget * 0.8) {
    current = await contextCollapse(current, tokenBudget * 0.7);
  }

  // Stage 5 (auto-compact): emergency compression when near limit
  // ← Last resort before context overflow. Aggressive but better than crashing.
  if (tokenCount(current) > tokenBudget * 0.95) {
    current = await autoCompact(current, tokenBudget * 0.5);
  }

  return current;
}
// ─── THE SEVEN-MODE PERMISSION SYSTEM ────────────────────────────────────────
// The authorization pipeline runs before EVERY tool execution.
// ← Permissions are not static config; they are evaluated per tool call.
enum PermissionMode {
  DEFAULT = "default", // standard interactive use
  AUTO = "auto",       // ML classifier decides per-action (the interesting one)
  PLAN = "plan",       // plan-only, no execution
  REVIEW = "review",   // human reviews each action
  NORMAL = "normal",   // standard permissions
  BYPASS = "bypass",   // maximum autonomy (CI/headless use)
  DOCKER = "docker",   // container-isolated execution
}

async function authorizeToolCall(
  toolName: string,
  toolInput: unknown,
  mode: PermissionMode,
): Promise<"allow" | "deny" | "ask_user"> {
  // Stage 1: Pre-filtering (fast static deny list)
  if (isPreFiltered(toolName)) return "deny";

  // Stage 2: PreToolUse hook (extensibility point for custom logic)
  const hookResult = await runPreToolUseHook(toolName, toolInput);
  if (hookResult === "block") return "deny";

  // Stage 3: Rule evaluation (allowlist/denylist from config)
  const ruleResult = evaluateRules(toolName, toolInput, mode);
  if (ruleResult !== "continue") return ruleResult;

  // Stage 4: Permission handler (mode-specific decision)
  if (mode === PermissionMode.AUTO) {
    // ← THIS is the trick: AUTO mode uses an ML-based classifier
    //   The model predicts whether this specific tool call with this specific
    //   input is safe to auto-approve given the current session context.
    //   This is NOT a simple allowlist. It is per-action ML inference.
    //   Trade-off: adds latency (one extra model call) but enables hands-free
    //   operation without blanket bypass permissions.
    return await mlSafetyClassifier(toolName, toolInput);
  }
  if (mode === PermissionMode.BYPASS) return "allow";
  if (mode === PermissionMode.REVIEW) return "ask_user";
  return defaultPermissionHandler(toolName, toolInput, mode);
}
The PermissionMode.AUTO ML classifier is the architectural decision that makes unattended operation safe without requiring bypass permissions. A human reviewing every action (REVIEW mode) is not unattended. Blanket bypass (BYPASS mode) is safe only when the action space is already constrained. AUTO mode is the middle ground: the model infers safety from context rather than applying a static policy. This is what makes Claude Code's permission system qualitatively different from a simple allowlist.
See It in Action: End-to-End CI Failure Triage Loop
Setup: A team with 47 tests, CI running on every merge, 6 nightly failures on average.
Step 1: 30-Second Loop Check
✓ Task occurs at least weekly (CI runs nightly)
✓ Test suite rejects bad output without human
✓ Agent has live environment (CI environment access via MCP)
✓ Hard stop set (10 iterations, 2-hour timeout)
✓ Human approval required before any merge
→ LOOP APPROVED. Build the MVL.
Step 2: Schedule trigger fires (2 AM)
Automation: /loop cadence=nightly
State file loaded: {failures_classified: 47 cumulative, fixes_drafted: 23}
Lessons loaded: 3 entries
- "PowerShell runner hits TLS issues; always fallback to bash."
- "E2E checkouts require stripe webhook secret; skip if missing."
- "Database tests after 2026-05-01 need migration check first."
MCP connector: fetches 6 new CI failures from GitHub Actions
Step 3: Maker agent triages failures
Failure 1: auth_refresh_token_test.py — deterministic failure
Classification: BUG (tied to commit 4a9f2b, recent change to token expiry)
Action: draft fix branch claude/fix-auth-refresh
Failure 2: e2e_checkout_test.py — flaky
Classification: FLAKE (lesson: "E2E checkouts need stripe webhook; skip if missing")
Action: file flake report, do not draft fix
Failure 3: db_migration_test.py — env issue
Classification: ENV (migration not applied in CI env)
Action: ESCALATE to human. Never touch payments or billing.
Failures 4-6: lint violations
Classification: BUG (deterministic, recent style update)
Action: draft fix PR claude/lint-batch-fix
Maker agent iterations used: 7 of 10
Model: claude-opus-4-6 (high reasoning, fixes drafted)
Approximate tokens: ~45,000 input + 8,000 output
Step 4: Checker model verifies stop condition (SEPARATE model)
Checker model: claude-haiku-4-5 (fast, cheap)
Verification: "Do all deterministic failures have a draft fix PR with passing CI?"
claude/fix-auth-refresh → CI: PASSING ✓
claude/lint-batch-fix → CI: PASSING ✓
e2e_checkout_test.py → FLAKE, no fix expected ✓
db_migration_test.py → ENV, escalated ✓
Stop condition: MET
Checker tokens: ~2,000 input + 200 output
Step 5: State file updated, connector notifies
State file after run:
{
  "last_run": "2026-06-22T02:17:43Z",
  "status": {
    "failures_classified": 53,   ← +6 from tonight
    "fixes_drafted": 25,         ← +2 from tonight
    "escalated": 2               ← +1 from tonight (db env issue)
  },
  "lessons_learned": [
    "PowerShell runner hits TLS issues; always fallback to bash.",
    "E2E checkouts require stripe webhook secret; skip if missing.",
    "Database tests after 2026-05-01 need migration check first."
  ]
}
Slack notification sent:
"CI Triage (2026-06-22): 6 failures classified.
2 fix PRs ready for review. 1 escalated (db env issue). 1 flake filed.
Review: github.com/team/repo/pulls (filter: claude/*)"
Human time to review: estimated 15 minutes (vs 90 minutes manual triage)
Human approval required before any merge: YES. Loop does not merge.
Why This Design Works, and What It Trades Away
The Claude Code paper's most useful framing is the three recurring design choices that appear across all five layers: graduated layering over monolithic mechanisms, append-only designs that favor auditability over query power, and model judgment within a deterministic harness. These are not just Claude Code design choices. They are the correct design choices for any production agent loop.
Graduated layering means the permission system has seven modes that escalate progressively, not two modes (on/off). The compaction pipeline has five stages that escalate from cheap (budget reduction, snip) to expensive (context collapse, auto-compact), not one "compress everything" operation. This pattern keeps the common path fast and cheap while making the expensive path available when needed.
Append-only session storage means every tool call, every model response, and every sub-agent sidechain is written to the session log but never modified. Auditability takes priority over query performance. For a system that runs unattended and makes real changes to real files, the ability to audit exactly what happened and in what order is more valuable than the ability to query the session history efficiently.
Model judgment within a deterministic harness means the model makes reasoning decisions (what to do, how to approach it) inside an infrastructure that enforces hard constraints deterministically (permission checks, stop conditions, iteration limits). The model cannot override the permission system. The harness cannot reason about novel situations. The split between these two types of authority is what makes the loop trustworthy.
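The third choice, model judgment inside a deterministic harness, can be illustrated with a short sketch. Everything here is an assumption made for the example rather than Claude Code's implementation: the `Action` type, the `run_harness` function, and the scripted `propose` callable that stands in for the model. The protected paths mirror the "Never Do" list in Snippet One.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable

PROTECTED = ("src/payments", "src/billing")  # mirrors the skill's "Never Do" list
MAX_ITERATIONS = 10


@dataclass
class Action:
    kind: str       # "edit" or "done"
    path: str = ""


def is_protected(path: str) -> bool:
    """True if the normalized path sits inside a protected directory."""
    normalized = PurePosixPath(posixpath.normpath(path))
    return any(normalized.is_relative_to(root) for root in PROTECTED)


def run_harness(propose: Callable[[int], Action],
                max_iterations: int = MAX_ITERATIONS) -> list[str]:
    """Ask `propose` for actions until it is done or the iteration limit is reached."""
    log: list[str] = []
    for step in range(max_iterations):
        action = propose(step)             # model judgment: what to do next
        if action.kind == "done":
            log.append("done")
            return log
        if is_protected(action.path):      # deterministic rule: the model cannot override it
            log.append(f"denied {action.path}")
            continue
        log.append(f"applied {action.path}")
    log.append("stopped: iteration limit")
    return log


script = [
    Action("edit", "src/auth/middleware.py"),
    Action("edit", "src/auth/../payments/charge.py"),  # traversal into a protected dir
    Action("done"),
]
print(run_harness(lambda step: script[step]))
# ['applied src/auth/middleware.py', 'denied src/auth/../payments/charge.py', 'done']
```

The proposer decides what to attempt; the path check and the iteration limit are plain code that runs the same way regardless of what the proposer outputs.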
What loop engineering trades away:
Comprehension debt: the Loop Engineering guide identifies this explicitly. Running loops fast produces code you understand slowly. Teams that automate their CI triage and dependency updates may find they have accumulated changes they cannot explain because the loop produced them. Human review remains mandatory, not only as a safety gate but also as a comprehension gate.
Context efficiency versus transparency: the Claude Code paper names this tension directly. Compaction pipelines make long sessions possible. They also make it harder to understand why the model made a specific decision late in a session, because the context that informed that decision may have been summarized away. Auto-compact in particular is aggressive.
Security tax of unattended loops: every loop that touches the real world, opens PRs, posts to Slack, queries databases, is a privileged process running on a schedule. The Loop Engineering guide calls this "the security tax of unattended loops." Credentials, access tokens, and permissions for loop connectors require the same security discipline as any other privileged service. This is not a one-time cost; it is an ongoing operational overhead.
Technical Moats
The ML-based auto-permission classifier. The Claude Code paper identifies this as a per-action safety classifier running as part of the authorization pipeline. This is not a static allowlist; it is an ML inference call that evaluates whether a specific tool invocation in a specific session context is safe to auto-approve. Building this for your own agent loop requires training data, inference infrastructure, and continuous evaluation. Most teams cannot replicate this and instead choose between REVIEW mode (slow) and BYPASS mode (risky).
The five-stage compaction pipeline. The sequence of budget reduction → snip → microcompact → context collapse → auto-compact is a cascading system where each stage addresses a different failure mode of the previous one. Independently implementing snip (surgical tool output pruning) or microcompact (short exchange summarization) is straightforward. Building all five stages with the right thresholds and triggers for production use requires significant investment and evaluation against real session data.
The CLAUDE.md hierarchy + Skills pattern. The cascading instruction hierarchy (repo root CLAUDE.md, subdirectory CLAUDE.md files) combined with the skill system's SKILL.md write-once-read-every-run pattern is a context engineering mechanism that compounds over time. Competitors without this pattern accumulate the cost of context re-derivation on every session. Teams with it build an organizational knowledge base that improves agent behavior on every loop run.
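A cascading instruction hierarchy of this kind can be approximated in a few lines. The sketch below is an assumption about how such a lookup could work, not a description of how Claude Code loads its files: `collect_instructions` is a hypothetical helper, and the root-first ordering (general instructions before more specific ones) is a choice made for the example.

```python
import tempfile
from pathlib import Path


def collect_instructions(repo_root: Path, working_dir: Path,
                         name: str = "CLAUDE.md") -> list[Path]:
    """List instruction files from repo_root down to working_dir, most general first."""
    repo_root = repo_root.resolve()
    working_dir = working_dir.resolve()
    relative = working_dir.relative_to(repo_root)  # raises ValueError if outside the repo
    found: list[Path] = []
    current = repo_root
    for part in ("", *relative.parts):
        if part:
            current = current / part
        candidate = current / name
        if candidate.is_file():
            found.append(candidate)
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src" / "auth").mkdir(parents=True)
    (root / "CLAUDE.md").write_text("Run the test suite before proposing a fix.\n")
    (root / "src" / "auth" / "CLAUDE.md").write_text("Verify src/auth/middleware first.\n")

    files = collect_instructions(root, root / "src" / "auth")
    names = [f.relative_to(root).as_posix() for f in files]
    print(names)  # ['CLAUDE.md', 'src/auth/CLAUDE.md']
```

Because the lookup depends only on the directory the agent is working in, instructions written once for a subtree are picked up on every later run without being repeated in the prompt.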
Insights
Insight One: The Loop Engineering guide's "9 out of 10 builders have never written a loop" claim is more meaningful as a barrier analysis than as an adoption stat. The barrier is not technical. Writing a Python script that calls an agent API in a while loop takes thirty minutes. The barrier is that the supporting systems (state file, skill library, automated verification gate, permission boundaries, connector security) each require engineering investment that the loop itself does not make obvious. The paper's observation, "Most of the code, however, lives in the systems around this loop," explains why adoption is low: developers build the loop, find that it is unreliable without the surrounding infrastructure, and conclude that loops do not work. The correct conclusion is that the surrounding infrastructure is the product.
Insight Two: The maker-vs-checker split in loop engineering is not a performance optimization. It is a correctness requirement. The same model that drafted the fix cannot be the model that certifies the fix is complete. This is not because large models are bad at self-evaluation. It is because the model's output biases its subsequent evaluations: having written the fix, it is predisposed to find the fix adequate. The Loop Engineering guide quotes Addy Osmani on this: "The model that wrote the code is way too nice grading its own homework." The Claude Code paper's compaction pipeline reflects a similar separation: compaction runs as its own pipeline step, not as a self-assessment by the main loop model. Architectural separation of maker and checker is not a team preference; it is an invariant.
Surprising Takeaway
The Claude Code paper's most underappreciated finding is that session storage is append-oriented by design, not as a storage optimization but as an auditability guarantee. Claude Code never modifies a session record. Every model response, every tool execution, every sub-agent sidechain is appended. This means the full audit trail of every agent action survives the session, regardless of what compaction did to the context window the model saw. The compaction pipeline operates on what the model sees; the session log captures what actually happened. These are two different data structures maintained in parallel. Most teams building agent loops use a single mutable conversation history. When something goes wrong in a long-running unattended loop, they discover they cannot reconstruct what the agent did or why. Claude Code's append-only architecture is not obvious from first principles, but it is exactly the right design for systems that make autonomous changes to production codebases.
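The two parallel data structures can be sketched as follows. This is a minimal illustration under stated assumptions, not Claude Code's storage format: `SessionLog` and `compacted_view` are hypothetical names, the log is a JSON Lines file, and "compaction" is reduced to replacing older events with a placeholder summary.

```python
import json
import tempfile
from pathlib import Path


class SessionLog:
    """Append-only JSON Lines record of everything that happened in a session."""

    def __init__(self, path: Path):
        self.path = path

    def append(self, event: dict) -> None:
        # Opening in "a" mode only ever adds lines; existing records are never rewritten.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def read_all(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def compacted_view(events: list[dict], keep_last: int) -> list[dict]:
    """What the model would see: older events replaced by one summary placeholder."""
    if len(events) <= keep_last:
        return list(events)
    dropped = len(events) - keep_last
    summary = {"role": "summary", "content": f"{dropped} earlier events summarized"}
    return [summary, *events[-keep_last:]]


with tempfile.TemporaryDirectory() as tmp:
    log = SessionLog(Path(tmp) / "session.jsonl")
    for i in range(5):
        log.append({"role": "tool", "content": f"result {i}"})

    all_events = log.read_all()
    view = compacted_view(all_events, keep_last=2)

print(len(all_events), len(view))  # 5 3
print(view[0]["content"])          # 3 earlier events summarized
```

Compaction here is a pure function of the log, so the context can be shrunk as aggressively as needed while the file on disk still records every event in order.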
TL;DR For Engineers
The Claude Code architectural analysis (arXiv:2604.14228, MBZUAI + UCL, Apr 2026) documents five layers around a simple while-loop: surface, core (loop + five-stage compaction pipeline), safety/action (seven-mode permission system with ML auto-classifier, four extensibility mechanisms: MCP, plugins, skills, hooks), state (CLAUDE.md hierarchy, append-only session storage, sidechains), and backend (subagent delegation with worktree isolation).
Before building any loop: apply the 4-condition test (task repeats, verification is automated, token budget fits, agent has senior tools). Miss one condition and the loop costs more than it returns. Good first loops: CI failure triage, dependency maintenance, lint-and-fix. Bad first loops: auth/crypto/payments, production deployments, architectural work.
Five building blocks for any production loop: Automation (schedule or goal trigger), Worktrees (parallel agents without file collisions), Skills (SKILL.md write-once, used every run), Connectors (MCP for real-world reach), Sub-agents (maker-vs-checker: evaluator-optimizer pattern, separate models, same objective, different roles).
State file is not optional. LLM sessions are stateless. Without a persistent state record outside the conversation, the loop restarts its mental model from zero on every run. Lessons learned, in-progress work, and historical decisions all belong in the state file.
Three recurring design choices from the Claude Code paper apply to any agent loop: graduated layering (escalate through modes, not monolithic on/off), append-only auditability (session log is never modified, even when context is compacted), model judgment within a deterministic harness (the model reasons; the infrastructure enforces hard constraints).
The Loop Is Not the Engineering. The Infrastructure Is.
The Claude Code source code analysis supports what experienced loop engineers already know: the while-loop fits in six lines. The systems that make it reliable at scale (the ML-based permission classifier, the five-stage compaction pipeline, the append-only session audit trail, the maker-vs-checker sub-agent split, and the skill system that compounds organizational knowledge over time) account for most of the rest of the codebase.
Loop engineering is systems engineering for AI-native workflows. The builders who recognize this and invest in the surrounding infrastructure will find that the 8x productivity claim, even with Anthropic's caveat about overstatement, becomes achievable. The builders who stop at the while-loop will find that loops are unreliable, which is correct, because a bare while-loop is.
References
Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems, arXiv:2604.14228, Liu, Zhao, Shang, Shen, MBZUAI + UCL, April 2026
Loop Engineering: The 14-Step Roadmap from Prompter to Loop Designer, Neyzis, Coinmonks, June 2026
Loop engineering, the practice of building systems that prompt agents automatically rather than typing prompts by hand, is proven viable by the Claude Code architectural analysis (arXiv:2604.14228, MBZUAI + UCL, Apr 2026), which found that Claude Code's core is a six-line while-loop surrounded by a five-stage compaction pipeline, a seven-mode permission system with an ML-based per-action safety classifier, four extensibility mechanisms (MCP, plugins, skills, hooks), subagent delegation with worktree isolation, and append-only session storage. The Loop Engineering framework (Jun 2026) provides the practitioner's implementation guide: a 4-condition test to qualify whether a loop is warranted, five building blocks (automations, worktrees, skills, connectors, sub-agents), a state file for cross-session persistence, and the evaluator-optimizer pattern (maker-vs-checker sub-agent split) as an architectural invariant that prevents the Ralph Wiggum failure mode where the agent grades its own homework.