Vibe Code Bench: The Benchmark That Finally Asks If AI Can Build Software, Not Just Write Code

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 4, 2026

Every leaderboard in AI coding benchmarks is a lie by omission. HumanEval tests a single function. SWE-bench tests patching an existing codebase. LiveCodeBench tests competitive algorithms. None of them ask the question that actually matters in 2026: can an AI take a natural-language spec and ship a running web application, from zero to deployed, with authentication, payments, email, and a real database?

Vibe Code Bench (VCB) asks exactly that question. The answer: even the best model in the world, GPT-5.3-Codex, passes only 61.8% of end-to-end workflows. Claude Opus 4.6 hits 57.6%. The median across 16 frontier models is below 25%.

Vibe coding is not solved. The benchmark proves it with a 5-hour execution harness, 10,131 substeps, and a browser agent clicking through every deployed app.

What It Actually Does

Vibe Code Bench, published by Vals AI (Tran et al., ACM CAIS '26), is a benchmark of 100 realistic web application specifications, split into 50 public validation tasks and 50 held-out test tasks. Each specification is written in plain non-technical language, the kind a founder or product manager might type into a chat box. Models receive the spec and nothing else.

Each model runs inside a sandboxed development environment based on a modified OpenHands agentic scaffold. It has terminal access, a browser, and live integrations with Supabase (PostgreSQL, auth, storage), MailHog (email), and Stripe (payments in test mode). The model has up to 5 hours (wall clock) and 1,000 turns to produce a working application.
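The dual budget can be pictured as a loop that stops at whichever limit is hit first. The sketch below is an illustration only, not the harness's code: `run_with_budget`, `agent_step`, and the injectable `clock` are hypothetical names, and only the 5-hour and 1,000-turn limits come from the description above.

```python
import time
from typing import Callable

MAX_WALL_CLOCK_S = 5 * 60 * 60  # 5 hours, from the benchmark description
MAX_TURNS = 1_000               # turn cap, from the benchmark description


def run_with_budget(
    agent_step: Callable[[int], bool],
    max_seconds: float = MAX_WALL_CLOCK_S,
    max_turns: int = MAX_TURNS,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call agent_step(turn) until it returns True (finished) or a budget runs out.

    agent_step is a hypothetical stand-in for one model turn.
    """
    start = clock()
    turns = 0
    reason = "turn_limit"  # default if the loop ends without a break
    while turns < max_turns:
        if clock() - start >= max_seconds:
            reason = "time_limit"
            break
        turns += 1
        if agent_step(turns):
            reason = "finished"
            break
    return {"turns": turns, "stopped_because": reason}
```

Passing the clock in as a parameter keeps the sketch testable without waiting five hours; a real harness would also need to interrupt a turn that is still running when the deadline passes, which this loop does not attempt.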

The generated app is then handed off to an autonomous browser agent (Browser Use) that executes 964 predefined end-to-end workflows, 10,131 substeps total. A workflow passes when 90% or more of its substeps succeed. Application accuracy is the percentage of workflows that pass.

This is the first benchmark that evaluates the complete zero-to-one loop: spec in, deployed working application out, with user-visible behavior as the only judge.

The Architecture, Unpacked

The core of the design is a two-agent loop: the generating LLM and the evaluating browser agent never share state, which keeps evaluation independent of how each app is implemented.

Three design decisions define this architecture:

Docker-in-Docker isolation. The model runs in an outer container. When it writes and starts its application (inner containers via Docker Compose), those inner containers are isolated per task. This gives strong task isolation, reproducible startup for evaluation, and portable artifacts that can be replayed.

Mandated tech stack. The system prompt locks models into React + Vite frontend, Tailwind CSS, Supabase backend, and Docker Compose packaging. Without this constraint, heterogeneous app stacks make consistent evaluation at scale impossible.

Browser agent as judge. Evaluation uses point-and-click browser interactions rather than DOM selectors or unit tests. This is deliberately implementation-agnostic: the agent doesn't care whether the button uses a <button> or a <div>. It evaluates what a user would evaluate.

The Code, Annotated

Snippet 1: The Generation System Prompt Core Constraints

The system prompt design is the harness's most opinionated artifact. The paper describes iterative development addressing failure modes. Here is a reconstruction of the critical structural constraints:

SYSTEM_PROMPT = """
You are building a complete web application from the following specification.

# MANDATORY TECHNOLOGY STACK
# These are not suggestions. Deviating causes evaluation failure.
- Frontend: React + Vite                  # ← Mandated for Docker build determinism
- Styling: Tailwind CSS
- Backend/Auth/DB: Supabase (self-hosted)  # ← Pre-wired; credentials in env
- Container: Docker Compose               # ← Evaluator starts via `docker compose up`

# ENVIRONMENT VARIABLES
# CRITICAL: Frontend env vars must go in frontend/.env (build-time)
# DO NOT use env_file in docker-compose for frontend — runtime-only, won't work
# Use build.args in docker-compose.yml to pass VITE_* vars               # ← THIS is the trick
# This is the #1 misconfiguration failure mode across all models

# SELF-TEST REQUIREMENT
# Before submitting, open your browser and verify:
# 1. App starts at localhost without errors
# 2. Auth flow works (sign up, sign in, sign out)
# 3. Core CRUD features function end-to-end
# This correlates with final benchmark score (r=0.72)                    # ← Key performance signal

# SUBMISSION
# Use the finish tool ONLY when:
# (a) You have tested the running app in the browser, AND
# (b) No critical features are broken
# Early submission is penalized; you have 5 hours.
"""

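The env var constraint (build.args vs env_file) is the most common misconfiguration failure for mid-tier models. Vite inlines VITE_* values at build time, so variables supplied only at container runtime never reach the bundle; the frontend cannot connect to Supabase, and the break cascades into all-zero workflow scores.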
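The build-time vs runtime distinction can be checked mechanically before submission. The following is a minimal sketch, assuming the compose file has already been parsed into a plain dict (for example by a YAML loader); `find_frontend_env_issues`, the `frontend` service name, and the example URL are hypothetical and not part of the VCB harness.

```python
def find_frontend_env_issues(compose: dict, service: str = "frontend") -> list[str]:
    """Flag the env_file-instead-of-build.args pattern for one service."""
    svc = compose.get("services", {}).get(service)
    if svc is None:
        return [f"service '{service}' not found"]

    issues = []
    build = svc.get("build")
    # `build` may be a plain context string or a mapping that carries `args`.
    args = (build.get("args") or {}) if isinstance(build, dict) else {}
    # Compose accepts args as a mapping or as a list of "KEY=value" strings.
    if isinstance(args, list):
        arg_names = {a.split("=", 1)[0] for a in args}
    else:
        arg_names = set(args)

    if not any(name.startswith("VITE_") for name in arg_names):
        issues.append("no VITE_* build args: values will be missing at build time")
    if "env_file" in svc:
        issues.append("env_file is runtime-only and does not reach the Vite build")
    return issues


# Hypothetical compose fragments, already parsed into dicts.
broken = {"services": {"frontend": {"build": "./frontend", "env_file": ".env"}}}
fixed = {
    "services": {
        "frontend": {
            "build": {
                "context": "./frontend",
                "args": {"VITE_SUPABASE_URL": "http://localhost:8000"},
            }
        }
    }
}

print(find_frontend_env_issues(broken))
print(find_frontend_env_issues(fixed))
```

A check like this only covers the compose side. The Dockerfile must also declare matching ARG lines for the values to be visible during the build.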

Snippet 2: Evaluation Workflow Scoring Logic

# Evaluation pipeline: scoring a single workflow
# Input: list of substep results from browser agent
# Output: dict with the pass/fail verdict and substep counts

def score_workflow(substep_results: list[dict]) -> dict:
    """
    A workflow is pass/fail — not partial credit at the workflow level.
    The 90% threshold tolerates minor non-critical UI glitches
    while still requiring near-complete correctness.
    """
    total = len(substep_results)
    passed = sum(1 for s in substep_results if s["status"] == "pass")

    # Assumption: a workflow with no recorded substeps counts as failed.
    # This also avoids a ZeroDivisionError on an empty list.
    pass_rate = passed / total if total else 0.0  # ← Substep accuracy for this workflow

    workflow_pass = pass_rate >= 0.90  # ← Hard threshold, not averaged away
    # Key design: evaluating at workflow level prevents cross-workflow masking.
    # A broken auth substep in workflow 1 cannot be compensated by
    # 10 passing substeps in workflow 2. Each workflow stands alone.  # ← THIS is the trick
    
    return {
        "passed": workflow_pass,
        "substep_accuracy": pass_rate,
        "substeps_total": total,
        "substeps_passed": passed,
    }

def score_application(workflow_results: list[dict]) -> float:
    """
    Application accuracy = % workflows passing.
    Mean of per-application results across 50 test tasks.
    """
    if not workflow_results:
        return 0.0  # Deployment failure: all workflows marked failed
    
    passing = sum(1 for w in workflow_results if w["passed"])
    return passing / len(workflow_results)  # ← Final benchmark metric

The 90% substep threshold plus per-workflow scoring is a deliberate anti-gaming measure: a model cannot mask a broken feature by excelling at unrelated substeps.
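A small worked example makes the masking argument concrete. The substep results below are hypothetical and chosen only to show the effect; `workflow_passes` is an illustrative helper, not the benchmark's code.

```python
THRESHOLD = 0.90  # workflow pass threshold from the benchmark description


def workflow_passes(statuses: list[str]) -> bool:
    """True when at least 90% of the substeps passed."""
    return bool(statuses) and statuses.count("pass") / len(statuses) >= THRESHOLD


# Hypothetical app: login is broken, two other workflows are flawless.
workflows = {
    "login": ["pass", "fail", "fail", "fail", "fail"],
    "browse": ["pass"] * 10,
    "search": ["pass"] * 10,
}

all_steps = [s for steps in workflows.values() for s in steps]
pooled_substep_accuracy = all_steps.count("pass") / len(all_steps)
workflow_accuracy = sum(workflow_passes(s) for s in workflows.values()) / len(workflows)

print(f"pooled substeps: {pooled_substep_accuracy:.1%}")
print(f"workflow-level:  {workflow_accuracy:.1%}")
```

Pooling all substeps gives 84.0%, which hides the broken login. Scoring per workflow gives 66.7%, because the login workflow fails outright no matter how well the other two perform.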

Vibe Code Bench In Action: End-to-End Worked Example

Task: "Zeeter" — a Twitter-like short-message platform

This is one of the hosted VCB applications available at the Vals AI leaderboard page.

Input spec (excerpt):

Build a website called Zeeter where users can write short-form messages 
("Zeets") visible to their followers. Core features:
- Sign up / log in with email
- Create Zeets (short messages)
- Follow / unfollow other users
- View a feed of followed users' Zeets
- Like and comment on Zeets
- Explore page to discover users
- Search functionality

Step 1: Model receives spec, initializes stack (0–5 min). Model scaffolds the React + Vite + Supabase project, writes docker-compose.yml with frontend and Supabase containers, and sets up build.args for VITE_SUPABASE_URL.

Step 2: Core auth and schema (5–25 min). Model writes Supabase migrations (users, zeets, follows, likes, comments tables), configures Row Level Security policies, and implements sign-up and sign-in flows.

Step 3: Feature implementation loop (25–90 min). Model implements each feature surface: feed, profile, explore, search, like/comment/follow. Claude Opus 4.6 spent 26.1% of tool calls on browser self-testing during this phase (vs 13.2% for GPT-5.3-Codex).

Step 4: Browser self-test (90–100 min). Model opens the app in the browser, verifies the auth flow, creates test Zeets, and checks the feed. Across the benchmark, GPT-5.3-Codex runs far longer overall than Opus 4.6 (75.8 min vs 21.3 min average latency) and scores 61.8% vs 57.6%.

Step 5: Evaluator runs 6–23 workflows. Each workflow gets a fresh headless browser session in which the Browser Use agent (Claude Sonnet 4.5) executes substeps such as: sign up with a test email address, create a Zeet saying 'hello', verify it appears in the feed.

Actual results for Zeeter across models:

GPT-5.3-Codex:     ~71% workflow pass rate (v1.1 run)
Claude Opus 4.6:   ~57% workflow pass rate
Claude Sonnet 4.6: ~51% workflow pass rate
Kimi-K2.5:         ~17% workflow pass rate

Average cost and latency per generated app, across the benchmark rather than for Zeeter alone:

GPT-5.3-Codex: ~$11.91, 75.8 min latency
Claude Opus 4.6: ~$8.69, 21.3 min latency ← Better cost-to-accuracy ratio than GPT-5.3-Codex
Gemini 3 Flash: ~$0.94, 13.4 min, 20.2% accuracy
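The cost-to-accuracy comparison is simple arithmetic on the figures quoted in this article. The sketch below uses dollars per accuracy point as the ratio, which is an assumption about what "cost-to-accuracy" means here; `dollars_per_point` is an illustrative helper.

```python
# (average cost in USD, accuracy in percent), as quoted in this article
models = {
    "GPT-5.3-Codex": (11.91, 61.8),
    "Claude Opus 4.6": (8.69, 57.6),
    "Gemini 3 Flash": (0.94, 20.2),
}


def dollars_per_point(cost_usd: float, accuracy_pct: float) -> float:
    """Cost in USD for each percentage point of accuracy."""
    return cost_usd / accuracy_pct


# Cheapest per accuracy point first.
ranking = sorted(models, key=lambda name: dollars_per_point(*models[name]))
for name in ranking:
    print(f"{name}: ${dollars_per_point(*models[name]):.3f} per accuracy point")
```

On this ratio Claude Opus 4.6 comes in below GPT-5.3-Codex (about $0.151 vs $0.193 per point), while Gemini 3 Flash is cheapest of all at about $0.047. The ratio says nothing about absolute accuracy, so a cheap model that ships mostly broken apps can still look good on it.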

The most common behavioral failure modes, measured across all models rather than for Zeeter alone: missing features (46.7% of behavioral failures), authorization issues (20.4%), and validation or policy blocks such as misconfigured RLS policies (14.8%).

Why This Design Works (and What It Trades Away)

What works:

Browser-agent evaluation is the right call for heterogeneous apps. A DOM-selector-based eval would break every time a model chose a different CSS framework or component library. An LLM-as-judge with vision and click capability evaluates what a user evaluates.

The 5-hour wall-clock budget with no turn-count normalization allows diverse harnesses (including those without instrumentation) and mirrors real development pressure. The Docker Compose artifact requirement means every submission is reproducible and portable.

Mandating the tech stack sacrifices model freedom but gains evaluation consistency at scale. Without it, 100 apps in 16 different frameworks would be unscoreable.

What it trades away:

Code quality is invisible. A model can ship a working app with hardcoded passwords, zero error handling, and spaghetti schema design and score 100%. Security vulnerabilities are not measured.

Single stack (React + Supabase) means results don't generalize to Next.js, Django, Rails, or mobile. A model optimized for this exact stack has an artificial advantage.

28 of 100 apps require Stripe and/or email. These integrations are far harder (GPT-5.3-Codex drops from 71.25% on no-integration apps to 29.58% on apps requiring both), but 72% of apps test pure CRUD, which may not reflect the real distribution of "vibe coding" requests.

Technical Moats

Self-testing correlation is hard to replicate cheaply. The Pearson r=0.72 between browser tool calls during generation and final accuracy is not a model architecture property; it's a behavior property. You can't buy it by scaling parameters. Models that test their own running application before submission converge on a qualitatively different development loop. This is the closest thing to a "taste for correctness" observable in a benchmark.
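For readers who want to run the same kind of analysis on their own agent trajectories, here is a minimal sketch of a Pearson correlation and a partial correlation that controls for a third variable such as latency. All data values are made up for illustration and are not the paper's data; `pearson` and `partial_corr` are illustrative helpers.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def partial_corr(x: list[float], y: list[float], z: list[float]) -> float:
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))


# Made-up values for illustration only; NOT the paper's data.
browser_call_share = [0.02, 0.05, 0.08, 0.13, 0.20, 0.26]
accuracy = [0.10, 0.08, 0.25, 0.62, 0.40, 0.58]
latency_min = [15.0, 40.0, 22.0, 76.0, 30.0, 21.0]

print(round(pearson(browser_call_share, accuracy), 2))
print(round(partial_corr(browser_call_share, accuracy, latency_min), 2))
```

If the partial correlation stays close to the raw one, the link between self-testing and accuracy is not simply a side effect of running longer.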

Evaluator choice is a hidden variable in every benchmark. The human alignment study is the most underappreciated contribution. GPT-5.2 as evaluator aligns with human graders at 36.1%. Claude Sonnet 4.5 aligns at 86.4%. These two evaluators, applied to the same apps, produce structurally different leaderboards. Any benchmark using LLM-as-judge without an evaluator alignment study is publishing results with an unknown bias.

The bimodal distribution is the real finding. The most common score buckets for GPT-5.3-Codex using 12.5-point bins are 0–12.5% and 87.5–100%. The model either builds a working app or it doesn't; it doesn't consistently build mediocre apps. Model improvement is driven by reducing complete failures, not incremental gains. This means the capability is threshold-gated: something causes hard failures (networking, RLS misconfiguration, early termination), and fixing those is more valuable than general capability improvements.
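The binning behind that observation is easy to reproduce on any list of per-app scores. The sketch below uses the 12.5-point bin width from the text; the scores are hypothetical, picked to look bimodal, and `bin_label` and `histogram` are illustrative helpers.

```python
from collections import Counter

BIN_WIDTH = 12.5  # bin width quoted in the text
NUM_BINS = int(100 / BIN_WIDTH)


def bin_label(score_pct: float) -> str:
    """Map a 0-100 score to its 12.5-point bin, putting 100 in the top bin."""
    index = min(int(score_pct // BIN_WIDTH), NUM_BINS - 1)
    low = index * BIN_WIDTH
    return f"{low:g}-{low + BIN_WIDTH:g}%"


def histogram(scores_pct: list[float]) -> Counter:
    """Count how many scores fall into each bin."""
    return Counter(bin_label(s) for s in scores_pct)


# Hypothetical per-app scores, chosen to look bimodal; NOT benchmark data.
scores = [0, 0, 5, 10, 45, 90, 95, 100, 100, 100]
counts = histogram(scores)
for label in sorted(counts, key=lambda lab: float(lab.split("-")[0])):
    print(f"{label:>10}  {'#' * counts[label]}")
```

A bimodal result shows up as tall bars at both ends and little in between, which is the shape the text describes for GPT-5.3-Codex.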

Insights

Insight 1: Thinking mode does not pay off for app development.

Claude Opus 4.6 (Thinking) achieves 53.50% at $8.28 and 23.1 min. Claude Opus 4.6 (non-thinking) achieves 57.57% at $8.69 and 21.3 min. The thinking variant is slightly cheaper ($0.41 less) but slower and about 4 points worse. For vibe coding tasks, chain-of-thought reasoning may actively hurt by burning time and tokens on planning that could be spent on self-testing and iteration. The extended reasoning loop that helps on math and logic benchmarks does not appear to transfer to multi-hour agentic development sessions.

Insight 2: Open-weight models are not competitive at all, not just a little behind.

The leaderboard gap between the best closed model and the best open-weight model is not the typical 5–15 point gap seen on SWE-bench. MiniMax M2.5 (the best open-weight model) scores 14.85%. GPT-5.3-Codex scores 61.8%. That's a 47-point gap. The paper notes that the gap between MiniMax M2.5 and Claude Opus 4.6 on VCB is 42.7 percentage points vs 2.8 points on SWE-bench. VCB is not a harder version of existing benchmarks; it measures a qualitatively different capability where open models currently fail by a wide margin. Anyone claiming open-weight parity for coding tasks should qualify that statement to "isolated task parity."

Surprising Takeaway

The evaluator is as important as the model being evaluated.

The human alignment study shows pairwise agreement ranging from 31.8% to 93.6% across evaluator pairs. GPT-5.2 as evaluator disagrees with Claude Sonnet 4.5 (86.4% human-aligned) on 64% of substeps. That means a leaderboard run with GPT-5.2 as judge produces rankings that are structurally different from a leaderboard run with Claude Sonnet 4.5 as judge. Evaluator selection is not a methodological footnote; it is a first-order variable that can invert rankings. The benchmark community has been treating LLM-as-judge as interchangeable. It is not.
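Pairwise agreement of this kind can be computed directly from per-substep verdicts. The sketch below assumes agreement means simple percent agreement on pass/fail verdicts, which this article does not spell out; the verdicts are made up, and `agreement` and the judge names are hypothetical.

```python
from itertools import combinations


def agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of substeps on which two graders give the same verdict."""
    if len(a) != len(b) or not a:
        raise ValueError("verdict lists must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Made-up verdicts on the same eight substeps; NOT data from the study.
verdicts = {
    "human": [True, True, False, True, False, True, True, False],
    "judge_a": [True, True, False, True, False, True, False, False],
    "judge_b": [False, True, True, False, False, False, True, True],
}

for left, right in combinations(verdicts, 2):
    rate = agreement(verdicts[left], verdicts[right])
    print(f"{left} vs {right}: {rate:.1%}")
```

With these toy verdicts, judge_a agrees with the human on 7 of 8 substeps and judge_b on only 3 of 8, so the same apps would receive very different scores depending on which judge is used.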

TL;DR For Engineers

The best model in the world (GPT-5.3-Codex) passes 61.8% of end-to-end web app workflows. Reliable vibe coding is not solved.
Self-testing during generation (browser usage while building) predicts final accuracy with Pearson r=0.72. It's not just about more time; partial correlation controlling for latency stays at 0.72.
The most common failure is missing features (46.7% of behavioral failures), not startup crashes. Models write code for hours and still omit core product surfaces.
Evaluator choice shifts leaderboard outcomes by up to 50 percentage points. GPT-5.2 as judge = structurally different ranking from Claude Sonnet 4.5 as judge.
Open-weight models top out at 14.85% (MiniMax M2.5) vs 52–62% for top closed models on this benchmark, a far wider gap than the one reported on SWE-bench.

The Question Is Not Whether AI Can Write Code

The question Vibe Code Bench answers is not whether models can write a function or patch a bug. It's whether they can build software: scaffold a project, wire external services, configure deployment, implement authorization, and ship something a user can actually click through.

The answer today is: sometimes, for the easiest tasks, with the most expensive models. The 61.8% ceiling is not a rounding error. It reflects a hard boundary in current model capabilities around multi-service coordination, long-horizon consistency, and self-correction under real deployment conditions.

The benchmark's most important contribution is not the leaderboard. It's the infrastructure: a reproducible harness, a browser-agent evaluator validated against human graders, and a variance decomposition showing that generation quality, not evaluator noise, explains the rank differences. That infrastructure enables the community to track progress toward a bar that actually matters.

Every point above 61.8% earned on this benchmark is a point earned against real software complexity.

References

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development — Tran, Nashold, Krishnan, Bigeard, Gu (Vals AI / MIT, ACM CAIS '26)
Vibe Code Bench Leaderboard v1.1 — Live results, hosted app demos, trajectory analysis
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Jimenez et al., 2024
OpenHands: An Open Platform for AI Software Developers as Generalist Agents — Wang et al., 2024
SWE-Lancer: Can Frontier LLMs Earn $1M from Real-World Freelance Software Engineering? — Miserendino et al., 2025
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces — Merrill et al., 2025
Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations — Miller, 2024

Vibe Code Bench is the first benchmark that evaluates whether frontier AI models can build a complete, deployed web application from a plain-language spec, with real auth, payments, email, and database integration. Across 16 models, the top performer (GPT-5.3-Codex) passes 61.8% of browser-verified end-to-end workflows, and self-testing during generation correlates with success at r=0.72. The most counterintuitive finding: evaluator choice shifts leaderboard outcomes by up to 50 percentage points, making evaluator alignment a first-class benchmark design concern.
