gstack: Why the Y Combinator CEO Turned His Claude Code Setup Into a Software Factory With 23 Specialist Skills


SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 1, 2026

Free-form prompting treats Claude as a generalist who can do anything in one message. gstack treats Claude as a team of specialists who pass work between each other. The difference is not cosmetic. When you tell a language model "you are the engineering manager who locks architecture and reviews PRs," you constrain the model's attention to a specific role's priorities and outputs. When you tell it "fix this bug," it tries to wear every hat simultaneously and does each one worse.

The problem with most AI coding setups is that they are collections of prompts, not workflows. Individual prompts can be good. Without process structure connecting them, work accumulates without review, bugs get shipped, documentation drifts, and the model's outputs are only as consistent as the user's discipline in applying the prompts correctly.

gstack is the most concrete public implementation of an alternative approach: role-based agent orchestration with a fixed development lifecycle, shipped as a production tool by someone who uses it to ship real code. The repo itself was co-authored with Claude Opus 4.7, a strong signal that the workflow is used to build itself.

Scope: gstack architecture (SKILL.md format, the 23 skills, the Think/Plan/Build/Review/Test/Ship/Reflect loop), GBrain persistent memory, the role isolation design decision, deployment on Claude Code and 8 other runtimes, and what the research literature says about why role identity improves LLM output quality. Not covered: individual skill implementation details beyond the key ones (/ship, /qa, /document-release), or comparison with LangGraph and CrewAI multi-agent frameworks beyond design philosophy.

What It Actually Does

gstack is a collection of SKILL.md files that install into Claude Code (or 8 other AI coding runtimes) as slash commands. Each skill is a structured Markdown prompt that assigns Claude a specific role identity, a set of priorities, constraints, and a defined output format.

Installation:

git clone --single-branch --depth 1 \
  https://github.com/garrytan/gstack.git \
  ~/.claude/skills/gstack

cd ~/.claude/skills/gstack && ./setup

The seven roles and their primary skills:

| Role | Key Slash Commands | What It Does |
| --- | --- | --- |
| CEO / Founder | `/plan-ceo-review`, `/office-hours` | Scope review: expansion vs reduction vs hold; product thinking |
| Eng Manager | `/eng-review`, `/plan-eng-review` | Architecture lock, PR review, sprint planning |
| Designer | `/design-review` | UI/UX review, catches "AI slop" visual patterns |
| QA Lead | `/qa` | Real Chromium browser automation, regression test generation |
| Release Manager | `/ship` | Sync + test + audit + push + open PR in one chain |
| Doc Engineer | `/document-release` | Cross-references diff, updates README/ARCHITECTURE/CLAUDE.md |
| Security Officer | `/cso-review` | OWASP + STRIDE security audit |

Plus utility commands: /careful, /guard, /autoplan, /pair-agent, /context-save, /context-restore, /canary, /sync-gbrain, /gstack-upgrade

Supported runtimes: Claude Code (primary), OpenAI Codex CLI, Cursor, Factory Droid, and five others

The Architecture, Unpacked

The part to focus on is the smart review routing. gstack automatically determines which review roles to invoke based on what changed: the CEO role does not review infrastructure bug fixes, and design review is skipped for pure backend diffs. This role-appropriate routing is the part most teams skip when building their own AI workflows, and it is what prevents review fatigue and keeps each specialist role focused.
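To make the routing idea concrete, here is a minimal sketch of diff-based review selection. It is not gstack's implementation: the `DiffSummary` shape, the `routeReviews` function, the `scopeChanged` flag, and the path patterns are all assumptions chosen for illustration.

```typescript
// Hypothetical sketch: none of these names or patterns come from the gstack repo.
type ReviewSkill = "/plan-ceo-review" | "/eng-review" | "/design-review" | "/cso-review";

interface DiffSummary {
  changedPaths: string[];
  scopeChanged: boolean; // assumed flag: did the agreed scope change?
}

// Assumed path conventions; a real project would tune these patterns.
const FRONTEND_PATTERN = /\.(tsx|jsx|css|scss)$/;
const SECURITY_PATTERN = /(^|\/)(auth|session|permissions)(\/|\.)/;

function routeReviews(diff: DiffSummary): ReviewSkill[] {
  const selected = new Set<ReviewSkill>();
  // Any code change gets an engineering review.
  if (diff.changedPaths.length > 0) selected.add("/eng-review");
  // Design review only when a frontend file changed.
  if (diff.changedPaths.some((p) => FRONTEND_PATTERN.test(p))) selected.add("/design-review");
  // Security review only when security-sensitive paths changed.
  if (diff.changedPaths.some((p) => SECURITY_PATTERN.test(p))) selected.add("/cso-review");
  // CEO review only when the scope itself moved.
  if (diff.scopeChanged) selected.add("/plan-ceo-review");
  return Array.from(selected);
}

const reviews = routeReviews({
  changedPaths: ["src/components/ImageUpload.tsx", "src/api/upload/route.ts"],
  scopeChanged: false,
});
console.log(reviews); // /eng-review and /design-review; CEO and security reviews are skipped
```

The point of the sketch is that routing is a cheap, deterministic filter that runs before any model call, so skipped reviews cost nothing.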

The Code, Annotated

Snippet One: SKILL.md Format and Role Identity Architecture

<!-- gstack/eng-review/SKILL.md (reconstructed from pattern) -->
<!-- Source: garrytan/gstack (MIT) -->
<!-- SKILL.md is the format: role identity + constraints + output format -->

# Engineering Manager Review

## Role Identity
You are a senior engineering manager conducting a production code review.
Your job title is Engineering Manager. You are NOT a developer right now.
You are NOT doing creative exploration. You are enforcing standards.

<!-- ← THIS is the trick: explicit NOT constraints prevent role leakage -->
<!-- Without "You are NOT a developer," Claude hedges and produces a hybrid -->
<!-- With the NOT constraint, the role stays bounded and produces manager-level output -->

## Your Priorities (in order)
1. Architecture correctness: does this compound or fix technical debt?
2. Production safety: will this cause incidents at scale?
3. Maintainability: can the next engineer understand and change this?
4. Performance: are there O(n²) patterns or missing indexes?
5. Test coverage: is the behavior tested?

## Your Constraints
- Do NOT suggest features that are out of scope for this PR
- Do NOT praise the code unless specific praise is warranted
- Do NOT be diplomatic about production risks: be direct
- Do NOT review formatting or style issues (that's the linter's job)

<!-- ← Explicit constraint list prevents the "good job, but..." pattern -->
<!-- that makes AI reviews useless: starts with praise, buries the issue -->

## Output Format
### Architecture Assessment
[One sentence: does this PR improve or degrade the architecture?]

### Production Risks
[Numbered list. Any item here is a blocker. Leave blank if none.]

### Required Changes Before Merge
[Numbered list. These are blockers. Non-negotiable.]

### Suggested Improvements
[Optional. These are nice-to-haves. Non-blocking.]

### Test Coverage
[Pass/Fail. What is covered? What is missing?]

<!-- ← Structured output means the next skill in the chain can parse results -->
<!-- /ship can check: are there Required Changes? If yes, block the PR. -->
<!-- Role-based output chaining: one skill's output becomes next skill's input -->

The explicit NOT constraints are the single most important design decision in the SKILL.md format. Without them, Claude defaults to diplomatic output that hedges every critique with praise and softens every requirement. The NOT constraints enforce role identity by defining what the role explicitly refuses to do.
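The structured output format is what lets a later skill act on a review mechanically. As a rough sketch of that chaining, assuming the review arrives as plain text using the section headings from the reconstructed SKILL.md above, a gate could look for numbered items under the two blocking sections. The `sectionItems` and `isBlocked` helpers are hypothetical and not part of gstack.

```typescript
// Hypothetical helpers, not part of gstack: read the numbered items under a
// "### <heading>" section of a review that follows the output format above.
function sectionItems(review: string, heading: string): string[] {
  const lines = review.split("\n");
  const start = lines.findIndex((l) => l.trim() === `### ${heading}`);
  if (start === -1) return [];
  const items: string[] = [];
  for (const line of lines.slice(start + 1)) {
    if (line.startsWith("### ")) break; // the next section begins here
    const match = line.match(/^\s*\d+\.\s+(.*)$/);
    if (match) items.push(match[1]);
  }
  return items;
}

// Per the format, any Production Risk or Required Change is a blocker.
function isBlocked(review: string): boolean {
  return (
    sectionItems(review, "Production Risks").length > 0 ||
    sectionItems(review, "Required Changes Before Merge").length > 0
  );
}

const sampleReview = [
  "### Architecture Assessment",
  "Improves: presigned upload keeps file bytes off the server.",
  "### Production Risks",
  "1. File validation missing on API route",
  "### Required Changes Before Merge",
  "1. Add MIME type check and file size validation in route.ts",
  "### Suggested Improvements",
  "",
].join("\n");

console.log(isBlocked(sampleReview)); // true
```

Suggested Improvements are deliberately ignored by the gate, which mirrors the blocking versus non-blocking split in the format.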

Snippet Two: /ship and /qa (The Power Commands)

# /ship: the release chain (one command, full pipeline)
# Source: garrytan/gstack/ship/SKILL.md (MIT)

# What /ship does:
# 1. git pull --rebase (sync with upstream)
# 2. Run full test suite
# 3. Coverage audit: output percentage, flag regressions
# 4. If tests pass: git push
# 5. Open PR with generated title + description
# 6. Trigger /document-release on the diff

# ← THIS is the design: /ship is a chain, not just a push
# Each step is a gate: if tests fail, stop. If coverage drops, warn.
# The PR description is auto-generated from the diff + commit history.

# Why one command for the whole chain?
# Without /ship, engineers skip steps under pressure:
#   "I'll run tests after this PR" → never happens
#   "I'll update the README next sprint" → README drifts for 6 months
# /ship makes the correct process the path of least resistance.

# /qa: real Chromium browser automation (not simulated)
# Source: garrytan/gstack/qa/SKILL.md (MIT)
# Uses Playwright or Puppeteer inside Claude Code's execution environment

# What /qa does:
# 1. Opens a REAL Chromium browser (not simulated)
# 2. Navigates to the local dev server or specified URL
# 3. Runs through user flows defined in the task
# 4. Finds bugs (broken flows, JS errors, visual regressions)
# 5. Fixes bugs it finds (writes the code change)
# 6. Generates regression tests for each bug fixed

# Example /qa invocation:
# User: /qa "Test the checkout flow for a new user"
#
# Claude actions:
#   1. browser.goto('http://localhost:3000')
#   2. browser.click('[data-testid="new-user-signup"]')
#   3. browser.fill('[name="email"]', 'new-user@example.com')
#   4. ... complete checkout flow
#   5. Detects: "Cart total shows $0.00 after applying coupon code"
#   6. Traces bug to: /src/cart/CartSummary.tsx line 47 (coupon math)
#   7. Fixes the bug
#   8. Generates regression test:
#      it('cart total updates correctly after coupon', async () => { ... })

# ← Why real browser instead of unit tests?
# Unit tests test the implementation. Browser tests test the experience.
# LLMs can reason about what a user would see and do.
# A real browser catches integration failures that unit tests miss entirely.
# /qa bootstraps test frameworks if none exist ("100% test coverage is the goal")

# gstack tracks all QA runs: what was tested, what failed, what was fixed
# The coverage audit at /ship cross-references the QA history

The real Chromium browser in /qa is not a convenience feature. It is an architectural choice: LLMs can reason about user experience in ways that static analysis cannot. A QA agent that can click, scroll, and observe what breaks is running the same cognitive process a human QA engineer runs, and the regression tests it writes cover exactly the paths that failed.
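The regression test in the /qa example above is elided, so the following is only a sketch of what such a test might pin down. The `cartTotalCents` function, the `Coupon` shape, and the percent-based discount are assumptions made for illustration; the article does not show the real CartSummary code or the actual cause of the bug.

```typescript
// Hypothetical reconstruction: the real CartSummary logic is not shown in the article.
interface Coupon {
  percentOff: number; // e.g. 20 means 20% off
}

// Work in integer cents to avoid floating-point drift in money math.
function cartTotalCents(subtotalCents: number, coupon?: Coupon): number {
  if (!coupon) return subtotalCents;
  // Forgetting the "/ 100" here is one plausible (assumed) way to end up with
  // a $0.00 total: the discount would exceed the subtotal and clamp to zero.
  const discount = Math.round((subtotalCents * coupon.percentOff) / 100);
  return Math.max(0, subtotalCents - discount);
}

// Framework-free stand-in for it('cart total updates correctly after coupon', ...).
function check(label: string, actual: number, expected: number): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${expected}, got ${actual}`);
  }
}

check("no coupon leaves total unchanged", cartTotalCents(5000), 5000);
check("20% coupon reduces total", cartTotalCents(5000, { percentOff: 20 }), 4000);
check("100% coupon floors at zero", cartTotalCents(5000, { percentOff: 100 }), 0);
```

A test like this complements the browser run rather than replacing it: the browser session finds the failure, and the small unit test keeps it from coming back.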

It In Action: End-to-End Worked Example

Scenario: Ship a new feature: user profile photo upload with image resizing

Sprint setup (before the loop):

# Clone and setup gstack
git clone --single-branch --depth 1 \
  https://github.com/garrytan/gstack.git ~/.claude/skills/gstack
cd ~/.claude/skills/gstack && ./setup

# Index codebase into GBrain (persistent memory)
gbrain sources add .
gbrain sync --strategy code
# → Indexes all source files into PGLite vector DB
# → Injects ## GBrain Search Guidance into CLAUDE.md

Step ①: THINK

/office-hours "I want to add user profile photo upload with image resizing"

Claude (as YC partner):
  "Who specifically requested this? What's the evidence it matters?
   Drag race: can you ship a simpler version (gravatar) in 1 hour
   that serves 80% of the use case, versus spending a sprint on uploads?
   The engineering is straightforward. Is this the right use of time?"

Step ②: PLAN

/plan-ceo-review

CEO Assessment:
  Mode: SELECTIVE (implement, but scope-locked)
  Scope: Upload + resize only. No filters, no crop UI, no social sharing.
  Dependency risk: S3 or equivalent required. Do we have this configured?
  Definition of done: User can upload JPEG/PNG ≤5MB, see 128×128 thumbnail.

/plan-eng-review

Eng Manager Plan:
  Architecture: presigned S3 URL upload (client → S3 direct, no server memory)
  Not: server-side upload proxying (memory risk at scale)
  Required: sharp for resize, multer for validation, 3 unit tests minimum
  PR scope: ImageUpload component + resize API route + S3 config

Step ③: BUILD

Claude Code builds:
  src/components/ImageUpload.tsx     (350 lines)
  src/api/upload/route.ts            (120 lines)
  src/lib/image-resize.ts            (45 lines)
  src/__tests__/image-resize.test.ts (89 lines)

Build time: ~25 minutes (Garry's workflow: 10K-20K LOC/day total across parallel sprints)

Step ④: REVIEW

Smart routing detects: frontend component + API route change
Routes to: /eng-review (backend) + /design-review (frontend component)
Skips: /cso-review (no authentication changes), /plan-ceo-review (no scope change)

/eng-review output:
  Architecture: PASS (presigned URL correct)
  Production Risk: File validation missing on API route (can upload non-image)
  Required Change: Add MIME type check + file size validation in route.ts
  Test Coverage: PASS (3 tests present)

/design-review output:
  Upload progress indicator: MISSING (UX regression vs current app patterns)
  Required: Add progress state, match existing spinner component style

Step ⑤: TEST

/qa "Test the profile photo upload: upload a valid JPEG, then try an invalid file type, then try a file over 5MB"

Browser opens → navigates to /settings/profile
Test 1 (valid JPEG): PASS
Test 2 (invalid file): FAIL — no error message shown for .gif upload
Test 3 (file >5MB): FAIL — silent failure, no feedback to user

/qa auto-fixes:
  route.ts: adds .gif to rejected MIME types
  ImageUpload.tsx: adds error state display
  Writes regression tests:
    it('rejects gif uploads with error message', ...)
    it('rejects files over 5MB with error message', ...)
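The fixes above can be sketched as a single validation function. This is an illustration under assumptions, not the code /qa would write: `validateUpload`, `UploadCandidate`, and the error messages are hypothetical. It also swaps the reject list described above for an allowlist of JPEG and PNG, which matches the definition of done and fails closed on unknown types.

```typescript
// Hypothetical validator; names and messages are illustrative, not from gstack.
const MAX_BYTES = 5 * 1024 * 1024; // "≤5MB" from the definition of done
const ALLOWED_MIME_TYPES = ["image/jpeg", "image/png"];

interface UploadCandidate {
  mimeType: string;
  sizeBytes: number;
}

type ValidationResult = { ok: true } | { ok: false; error: string };

function validateUpload(file: UploadCandidate): ValidationResult {
  // Allowlist rather than a reject list: anything not listed is refused.
  if (!ALLOWED_MIME_TYPES.includes(file.mimeType)) {
    return { ok: false, error: "Only JPEG and PNG images are supported." };
  }
  if (file.sizeBytes > MAX_BYTES) {
    return { ok: false, error: "Image must be 5MB or smaller." };
  }
  return { ok: true };
}

// The returned error string is what the UI's error state would display.
console.log(validateUpload({ mimeType: "image/gif", sizeBytes: 1024 }));
```

Returning a message rather than throwing keeps the two failures from the test run (no error for .gif, silent failure over 5MB) on the same code path as the success case.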

Step ⑥: SHIP

/ship

git pull --rebase → no conflicts
npm run test → 5 tests pass (3 original + 2 regression)
Coverage: 94% (up from 91%)
git push → pushed
PR opened: "feat: user profile photo upload with 128x128 resize"
PR description: auto-generated from diff + commit history

/document-release
  CLAUDE.md: updated with new S3_PRESIGNED_URL_BUCKET env var requirement
  README: added upload feature to feature list
  ARCHITECTURE.md: updated data flow diagram for file uploads

Total time (Think → Ship): ~2.5 hours
Lines written: ~604 (the four files from the build step, tests included)
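The /ship run above behaves like a gated chain: a failing step stops everything after it, while a coverage drop only warns. A minimal sketch of that control flow follows. The `runChain` runner, its types, and the stubbed steps are assumptions for illustration; the real skill drives git and the test runner through Claude Code rather than through a function like this.

```typescript
// Hypothetical sketch of a gated chain; step names mirror the article's list,
// but the runner and its types are assumptions, not gstack code.
type StepResult = { ok: true; warning?: string } | { ok: false; reason: string };

interface Step {
  name: string;
  run: () => StepResult;
}

interface ChainReport {
  completed: string[];
  warnings: string[];
  failedAt?: string;
  reason?: string;
}

function runChain(steps: Step[]): ChainReport {
  const chainReport: ChainReport = { completed: [], warnings: [] };
  for (const step of steps) {
    const result = step.run();
    if (!result.ok) {
      // Hard gate: record the failure and run nothing after it.
      chainReport.failedAt = step.name;
      chainReport.reason = result.reason;
      return chainReport;
    }
    // Soft gate: keep going, but surface the warning.
    if (result.warning) chainReport.warnings.push(`${step.name}: ${result.warning}`);
    chainReport.completed.push(step.name);
  }
  return chainReport;
}

// Stubbed steps standing in for real git and test commands.
const coverageBefore: number = 91;
const coverageAfter: number = 94;
const shipReport = runChain([
  { name: "sync", run: () => ({ ok: true }) },
  { name: "test", run: () => ({ ok: true }) },
  {
    name: "coverage",
    run: () =>
      coverageAfter < coverageBefore
        ? { ok: true, warning: `coverage dropped to ${coverageAfter}%` }
        : { ok: true },
  },
  { name: "push", run: () => ({ ok: true }) },
  { name: "open-pr", run: () => ({ ok: true }) },
]);
console.log(shipReport.completed.join(" -> "));
```

Putting the push after the test gate in one chain is what removes the option of pushing first and testing later.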

Step ⑦: REFLECT

/retro

Sprint retrospective:
  ✓ Smart routing correctly skipped CEO/security review
  ✓ /qa found 2 bugs that /eng-review missed
  ✓ /document-release prevented README drift
  ! Design review finding (progress indicator) could have been caught in /plan-eng-review
  Action: Add "UI feedback patterns" checklist to /plan-eng-review scope

Why This Design Works, and What It Trades Away

The role isolation design is the correct answer to the problem of context leakage between phases. A language model that generates code, reviews that code, and decides whether to ship it in the same conversation context tends to undervalue the review, because everything in that context is oriented toward finishing the generation task. Assigning separate roles with explicit constraints forces the model into a different priority ordering for each phase.

The SKILL.md format is the right implementation of this idea. Plain text Markdown files are readable by any LLM, version-controllable, shareable, and forkable. They require no framework, no SDK, and no API. The design philosophy is: the skill is the prompt, the format is the contract, and the loop is the process. Any team can read, understand, and modify every gstack skill in an afternoon.
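One way to see "the format is the contract" is to treat a skill as data and the Markdown as a rendering of it. The sketch below is an assumption built on the reconstructed SKILL.md shown earlier: the `Skill` interface and `renderSkill` function are hypothetical, and gstack itself ships hand-written Markdown rather than generated files.

```typescript
// Hypothetical model of the SKILL.md layout reconstructed earlier in this article.
interface Skill {
  title: string;
  identity: string;
  notRoles: string[]; // rendered as explicit "You are NOT ..." statements
  priorities: string[]; // rendered in order, numbered from 1
  constraints: string[]; // each rendered as "- Do NOT ..."
  outputSections: string[];
}

function renderSkill(skill: Skill): string {
  const lines: string[] = [
    `# ${skill.title}`,
    "",
    "## Role Identity",
    skill.identity,
    ...skill.notRoles.map((r) => `You are NOT ${r}.`),
    "",
    "## Your Priorities (in order)",
    ...skill.priorities.map((p, i) => `${i + 1}. ${p}`),
    "",
    "## Your Constraints",
    ...skill.constraints.map((c) => `- Do NOT ${c}`),
    "",
    "## Output Format",
    ...skill.outputSections.map((s) => `### ${s}`),
  ];
  return lines.join("\n");
}

const engReview: Skill = {
  title: "Engineering Manager Review",
  identity: "You are a senior engineering manager conducting a production code review.",
  notRoles: ["a developer right now", "doing creative exploration"],
  priorities: ["Architecture correctness", "Production safety"],
  constraints: ["suggest features that are out of scope for this PR"],
  outputSections: ["Architecture Assessment", "Production Risks"],
};

console.log(renderSkill(engReview));
```

Modeling the skill this way makes the four parts of the contract explicit, which is useful when a team forks a skill and wants to check that nothing was dropped.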

The GBrain persistent memory system addresses a real limitation of LLM context windows: once a codebase grows beyond the window, the model can no longer hold the whole thing in view and needs an index to find the relevant parts. GBrain indexes the codebase into a local vector database (PGLite) that Claude can query through MCP. The per-repo trust triad (read-write / read-only / deny) is the correct access control model for a coding agent: the agent should have write access to source, read-only access to credentials, and no access to sensitive configuration.
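The trust triad can be sketched as a small policy lookup. GBrain's real configuration format is not shown in this article, so everything here is an assumption: the `repoTrust` map, the example repo names, the `isAllowed` function, and the choice to default unknown repos to deny.

```typescript
// Hypothetical sketch; GBrain's actual configuration is not shown in this article.
type Trust = "read-write" | "read-only" | "deny";
type Operation = "read" | "write";

// Assumed example repos, keyed by name.
const repoTrust: Record<string, Trust> = {
  "acme/web-app": "read-write",
  "acme/shared-config": "read-only",
  "acme/secrets": "deny",
};

function isAllowed(repo: string, op: Operation): boolean {
  // Assumption: unknown repos fall back to "deny" so new sources must be opted in.
  const trust: Trust = repoTrust[repo] ?? "deny";
  if (trust === "deny") return false;
  if (trust === "read-only") return op === "read";
  return true; // read-write permits both operations
}

console.log(isAllowed("acme/web-app", "write")); // true
console.log(isAllowed("acme/shared-config", "write")); // false
```

Three levels are enough to express the common cases while staying small enough to audit by reading one map.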

What gstack trades away:

Self-reported metrics and individual workflow fit. The figures of 600K LOC in 60 days and 10K-20K LOC/day come from Garry Tan's own reports. They represent what one specific person achieved running one specific workflow in a specific context (YC CEO, experienced developer, specific tech stack). Your results will vary based on project complexity, tech stack, test suite quality, and how much you adapt the prompts.
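As a consistency check on the self-reported figures, the 60-day total implies an average daily rate at the low end of the quoted daily range:

```latex
\[
\frac{600{,}000\ \text{LOC}}{60\ \text{days}} = 10{,}000\ \text{LOC/day}
\]
```

The two numbers agree with each other, but both still come from the same self-reported source.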

Prompt engineering as a maintenance burden. Every SKILL.md file is a structured prompt. Prompts need maintenance: as Claude's capabilities change, as your codebase evolves, as new edge cases emerge. Teams that adopt gstack are committing to owning and updating 23 prompt files, not installing a library.

One-person workflow optimized for one person's preferences. The CEO/founder role in /plan-ceo-review and /office-hours is grounded in YC's framework for evaluating startups. For teams that do not share that framework, these skills need rewriting to reflect their own product decision-making priorities.

Technical Moats

The loop is the product. Most AI coding tools ship individual features: autocomplete, code generation, chat. gstack ships a development lifecycle. The loop (Think → Plan → Build → Review → Test → Ship → Reflect) is what makes individual skills coherent rather than isolated. A team that only uses /qa in isolation gets browser test automation. A team that runs the full loop gets a workflow where bugs caught in /qa produce regression tests that appear in the /ship coverage audit. The integration of skills into a loop is harder to replicate than any individual skill.
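The loop can be written down as an ordered list with a guard against skipping phases. This is an illustrative sketch only: the phase names come from the article, while `nextPhase`, `canAdvance`, and the wrap-around from Reflect back to Think are assumptions about how one might encode the lifecycle.

```typescript
// Phase names come from the article; the helpers are illustrative assumptions.
const PHASES = ["Think", "Plan", "Build", "Review", "Test", "Ship", "Reflect"] as const;
type Phase = (typeof PHASES)[number];

// Returns the phase that follows `current`. Assumption: Reflect wraps around
// to Think, so the next sprint starts a new pass through the loop.
function nextPhase(current: Phase): Phase {
  const index = PHASES.indexOf(current);
  return PHASES[(index + 1) % PHASES.length];
}

// Guard that refuses to jump ahead, e.g. Build -> Ship without Review and Test.
function canAdvance(from: Phase, to: Phase): boolean {
  return nextPhase(from) === to;
}

console.log(nextPhase("Build")); // "Review"
console.log(canAdvance("Build", "Ship")); // false
```

Encoding the order as data is what turns a set of independent skills into a process: the guard makes the skipped-review path an explicit violation rather than a silent shortcut.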

Self-referential development. The repo's own commit history shows Claude Opus 4.7 as a co-author on most commits. gstack is built using gstack. This produces a flywheel: the skills improve because they are used to improve the skills. Teams adopting gstack and contributing back to the repo are improving the tool using the tool. This self-referential quality is rare in developer tooling and steadily raises the quality floor.

The 105k-star social proof compounds. The skill files are shareable SKILL.md text. Every community contribution (new skill, modified role, better output format) is visible to everyone following a repo with 105k stars. The community is doing quality improvement at a scale no individual team can match.

Insights

Insight One: gstack is not a productivity tool. It is a process specification written in prompt form. The productivity numbers (600K LOC, 100 PRs/week) are a byproduct of having a consistent process. Teams that adopt the skills without adopting the loop will get some value. Teams that adopt the loop without adapting the skills to their context will get friction. The process is the point, not the prompts.

The CAMEL paper (arXiv:2303.17760) on communicative agents found that role-playing LLM agents achieve tasks that single agents cannot. The key mechanism: role identity limits the action space and focuses attention on role-appropriate outputs. gstack operationalizes this insight in a practical developer workflow: the engineering manager role's action space excludes feature suggestions and style feedback, which means its output density on architecture and production risk is higher than an unconstrained review.

Insight Two: The /qa skill is the hardest to replicate and the most valuable to adopt. Real Chromium browser automation that finds bugs and writes regression tests requires browser control from inside Claude Code (the /qa snippet above names Playwright or Puppeteer) combined with a well-structured QA role identity. Teams that try to replicate this without the role constraints will get generic browser scripts. Teams that add the QA role identity without the real browser will get simulated tests that miss integration failures. The combination is what makes /qa qualitatively different from standard LLM-assisted testing.

The Voyager paper (arXiv:2305.16291) demonstrated that skill accumulation in an iterative agent loop produces emergent capabilities that single-pass agents cannot achieve. /qa's regression test accumulation is the gstack implementation of this: each QA run adds tests that make the next QA run faster and more reliable. The skill library grows with the codebase.

Surprising Takeaway

gstack's most impressive signal is not the star count or the LOC numbers. It is that Garry Tan reports shipping 600,000 lines of production code while running Y Combinator full-time, using a 23-skill workflow that is entirely public and simple enough to explain in a README. The surprise is that the tool is not sophisticated. It is 23 Markdown files following a consistent format. The sophistication is in the loop design and the role constraints, not the technology. Plain Markdown files have no runtime of their own and few ways to fail, even though the prompts inside them still need upkeep as models and codebases change. That makes Markdown a sound technology choice for a workflow tool that needs to be reliable at the most critical moments of a development sprint.

The contrast with LangGraph, CrewAI, and AutoGen is instructive. Those frameworks are powerful and complex. gstack is simple and specific. Simple tools survive contact with production workflows. Complex tools often do not.

TL;DR For Engineers

gstack (garrytan/gstack, MIT, 105k stars, March 2026) is the personal Claude Code workflow of Garry Tan (YC CEO): 23 slash commands that assign specialist roles (CEO, Eng Manager, Designer, QA Lead, Release Manager, Doc Engineer, Security Officer) to Claude through a fixed Think → Plan → Build → Review → Test → Ship → Reflect loop. Bun + TypeScript, SKILL.md format. Works on Claude Code, Codex CLI, Cursor, and 6 other runtimes.
The role isolation design is the core: each skill has explicit NOT constraints that prevent role leakage, explicit output format for chain parsing, and role-appropriate priority ordering. /qa uses a real Chromium browser, finds bugs, fixes them, and generates regression tests. /ship is sync + test + audit + push + PR in one chain.
GBrain: PGLite local (or Supabase) vector DB that indexes the codebase, MCP-registered to Claude Code, injected into CLAUDE.md. Per-repo trust triad: read-write / read-only / deny.
Smart review routing: CEO doesn't review infra bug fixes, design review skipped for backend-only diffs. The loop determines which roles are invoked based on what changed.
Self-reported metrics: 600,000 LOC in 60 days, 10K-20K LOC/day, 100 PRs/week. Treat as a directional signal, not a guarantee. The workflow's value is the process consistency, not any specific throughput number.

The Loop Is the Product

gstack's lasting contribution is not any individual skill. It is the demonstration that a fixed development lifecycle specified as structured prompts and applied consistently produces qualitatively different outcomes than free-form AI coding assistance. The Think → Plan → Build → Review → Test → Ship → Reflect loop is not novel as a software process. What is novel is that it is implemented entirely in Markdown, runs on any LLM-powered coding tool, and has been validated by one of the most publicly credentialed developers currently shipping code with AI.

The 105k stars say the rest of the engineering community recognized the value. Whether you clone gstack directly or extract the design principles to build your own loop, the underlying insight is transferable: AI coding agents need process, not just prompts.

References

gstack GitHub Repository, Garry Tan, MIT, 105k stars
ReAct: Synergizing Reasoning and Acting in Language Models, arXiv:2210.03629 — the reasoning+acting loop that underpins gstack's skill execution
CAMEL: Communicative Agents for "Mind" Exploration, arXiv:2303.17760 — role-playing multi-agent systems; foundational for gstack's role identity design
Voyager: An Open-Ended Embodied Agent with Large Language Models, arXiv:2305.16291 — skill accumulation in iterative agent loops; gstack's /qa regression test library mirrors this pattern
Claude Code Documentation — the primary runtime gstack is built for
Augment Code analysis of gstack — detailed external analysis of the architecture

gstack (garrytan/gstack, MIT, 105k stars, March 12 2026) is Garry Tan's (Y Combinator CEO) personal Claude Code workflow, comprising 23 specialist slash commands (CEO, Eng Manager, Designer, QA Lead, Release Manager, Doc Engineer, Security Officer) implemented as SKILL.md Markdown files that run through a fixed Think → Plan → Build → Review → Test → Ship → Reflect loop on Claude Code, Codex CLI, Cursor, and 6 other AI coding runtimes. Key technical decisions: explicit NOT constraints for role identity, smart review routing (role-appropriate review selection per diff type), /qa with real Chromium browser automation and regression test generation, /ship as a full release chain, and GBrain (PGLite/Supabase vector DB) for persistent codebase memory. Self-reported: 600,000 LOC in 60 days. The repo itself was co-authored with Claude Opus 4.7.
