Feynman: The AI Research Agent That Verifies Before It Summarizes

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 2, 2026

The standard AI research workflow is: input a question, get an LLM to recall what it knows, present the recall as a summary. This is fast. It is also structurally unreliable. The model's training data has a cutoff. The model confabulates. Citations are fabricated or misattributed. The summary may be wrong in ways that look completely correct.

The research community has been building alternatives. Perplexity adds web search. OpenAI's Deep Research adds planning and multi-step browsing. DeepResearcher (arXiv:2504.03160) trained an RL agent end-to-end in real web environments and improved open-domain research scores by up to 28.9 points over prompt engineering baselines and 7.2 points over RAG-based RL agents. The empirical evidence is clear: grounding in real-time sources with verification produces better research outputs than retrieval from training data.

Feynman (companion-inc, MIT, 7k stars, v0.2.17, April 2026) is the open-source implementation of this philosophy. It is a CLI-native research agent built for scientists, ML researchers, and engineers who need sourced, reproducible research outputs rather than hallucinated summaries. The positioning is direct: "Claude Code for research." The architecture: a primary coordinator dispatching four specialized sub-agents in parallel, each operating under behavioral contracts defined in Markdown, every output verified and linked.

Scope: Feynman's six-command interface, four-agent architecture, five-phase pipeline (Planning, Dispatch, Extraction, Synthesis, Verification), and its paper audit and experiment replication capabilities. Not covered: the Pi runtime internals beyond their role in agent coordination, or the hosted feynman.is service beyond a brief mention.

What It Actually Does

Feynman is a CLI research agent with six primary commands:

| Command | What It Does | Output |
| --- | --- | --- |
| `feynman "query"` | Research brief: papers + web, structured synthesis | Summary, Background, Key Findings, Open Questions, References |
| `feynman deepresearch "topic"` | Multi-agent parallel investigation + synthesis + verification | Full research report with live citations |
| `feynman lit "topic"` | Literature review: consensus, disagreements, open questions | Scope, Consensus, Disagreements, Timeline, Bibliography |
| `feynman audit 2401.12345` | Compares paper claims against public codebase | Claim-by-claim match/mismatch report |
| `feynman replicate "claim"` | Runs experiments locally (Docker) or cloud (Modal, RunPod) | Replication result with full execution log |
| `feynman recipe "task"` | Finds ranked implementable ML training recipes | Ranked recipes from papers, datasets, docs, code |

Installation:

npm install -g @companion-inc/feynman
feynman --version  # → v0.2.17

# Or run directly:
npx @companion-inc/feynman "what do we know about scaling laws for language models?"

Key design principle: every claim in every output links to a direct URL (paper, doc, repo). No source-free assertions. The Verifier sub-agent checks every citation before output is finalized.
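One way to picture the "no source-free assertions" rule is as a guard that separates findings carrying a direct http(s) URL from findings that do not. The sketch below is illustrative only: `Claim`, `isSourceGrounded`, and `partitionClaims` are hypothetical names chosen for this example, not Feynman's API, and a real check would presumably also resolve the URL rather than just inspect its shape.

```typescript
// Hypothetical sketch: these names are illustrative, not Feynman's actual code.
interface Claim {
  text: string;
  sourceUrl?: string; // direct link to a paper, doc, or repo
}

// A claim counts as grounded only if it carries an absolute http(s) URL.
function isSourceGrounded(claim: Claim): boolean {
  return typeof claim.sourceUrl === 'string'
    && /^https?:\/\/[^\s/]+\S*$/.test(claim.sourceUrl);
}

// Split claims so ungrounded ones can be rejected or sent back for sourcing.
function partitionClaims(claims: Claim[]): { grounded: Claim[]; ungrounded: Claim[] } {
  const grounded: Claim[] = [];
  const ungrounded: Claim[] = [];
  for (const claim of claims) {
    if (isSourceGrounded(claim)) {
      grounded.push(claim);
    } else {
      ungrounded.push(claim);
    }
  }
  return { grounded, ungrounded };
}

const sample = partitionClaims([
  {
    text: 'DeepResearcher trains an RL agent in real web environments.',
    sourceUrl: 'https://arxiv.org/abs/2504.03160',
  },
  { text: 'Everyone agrees this problem is solved.' }, // no source: ungrounded
]);
// sample.grounded has one claim, sample.ungrounded has one claim
```

Treating the URL as a required part of the data, rather than as formatting applied at the end, is what lets a later verification step operate on every claim uniformly.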

The Architecture, Unpacked

The key design decision is the parallel dispatch in Phase 2. Sub-agents are launched simultaneously rather than chained one after another, and each receives the plan but not the others' outputs. This is not just faster: it prevents the Researcher's findings from anchoring the Reviewer's critique. Synthesis (Phase 4) and citation verification (Phase 5) necessarily run afterward, because they consume what the earlier phases produce. Independence between sub-agents is the design choice that makes the multi-agent architecture more than a sequential chain with role labels.

The Code, Annotated

Snippet One: Research Brief Pipeline and Source-Grounded Output

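// Feynman research pipeline core (reconstructed from architecture docs)
// Source: companion-inc/feynman (MIT)
// The source-grounded constraint is enforced throughout, not just at the end
// Note: Reference, ResearchPlan, AgentResults, VerifiedBrief, researcherAgent,
// reviewerAgent, and resolveDOI are assumed to be defined elsewhere.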

interface ResearchClaim {
  text: string;
  source_url: string;   // ← MANDATORY: every claim must have a direct URL
  source_type: 'paper' | 'doc' | 'repo' | 'web';
  doi?: string;         // for academic papers
  verified: boolean;    // set by Verifier agent in Phase 5
}

interface ResearchBrief {
  summary: string;
  background: string;
  key_findings: ResearchClaim[];   // ← each finding links to its source
  open_questions: string[];
  references: Reference[];
  verification_status: 'verified' | 'partial' | 'failed';
  failed_citations: string[];      // URLs that did not resolve
}

// Phase 2: parallel agent dispatch (the architectural decision)
// ← THIS is the trick: dispatch the agents simultaneously
//   Not: researcher → reviewer → writer → verifier (sequential, anchoring risk)
//   But: agents run in parallel with independent data access
//   (this excerpt shows only the Researcher and Reviewer calls)
async function dispatchAgents(plan: ResearchPlan): Promise<AgentResults> {
  // ← Promise.all: parallel execution, not sequential
  // Each agent gets the plan but NOT the other agents' outputs
  // This prevents anchoring: Reviewer cannot be biased by Researcher's framing
  const [
    researcherOutput,
    reviewerOutput,
  ] = await Promise.all([
    // Researcher: gathers evidence from alphaXiv, web, GitHub
    researcherAgent.gather({
      query: plan.query,
      sources: ['alphaxiv', 'gemini', 'perplexity', 'github'],
      prioritize_surveys: plan.scope === 'landscape',
      date_filter: plan.date_range,   // e.g., "2022-2026"
    }),

    // Reviewer: independently assesses research landscape
    // ← Does NOT see Researcher output during execution
    // Receives only the plan, not the evidence yet
    reviewerAgent.assess({
      query: plan.query,
      focus: 'consensus_disagreements_gaps',
      // ← Identifies: where researchers agree, where they contradict
      //   where there are open questions vs settled questions
    }),
  ]);

  return { researcherOutput, reviewerOutput };
}

// Phase 5: citation verification (the quality gate)
// ← Runs AFTER synthesis, BEFORE output delivery
// This is the "verify first, summarize second" posture operationalized
async function verifyAllCitations(brief: ResearchBrief): Promise<VerifiedBrief> {
  const verificationResults = await Promise.all(
    brief.key_findings.map(async (finding) => {
      // Check every URL: HEAD request + DOI resolution
      const urlOk = await fetch(finding.source_url, { method: 'HEAD' })
        .then(r => r.ok)
        .catch(() => false);

      const doiOk = finding.doi
        ? await resolveDOI(finding.doi)
        : true;  // no DOI to check

      return {
        ...finding,
        verified: urlOk && doiOk,
        // ← Failed citations are NOT silently dropped
        //   They are explicitly flagged in the output via failed_citations
        //   (this excerpt records pass/fail only, not the HTTP status code)
        //   The user knows which citations could not be verified
      };
    })
  );

  const failed = verificationResults
    .filter(f => !f.verified)
    .map(f => f.source_url);

  return {
    ...brief,
    key_findings: verificationResults,
    verification_status: failed.length === 0 ? 'verified'
                       : failed.length < 3   ? 'partial'
                       : 'failed',
    failed_citations: failed,
  };
}

The Promise.all() parallel dispatch is the implementation of the independence guarantee. The Reviewer's assessment of the research landscape is computed without seeing the Researcher's evidence. This is not just an efficiency optimization: it is an epistemic choice that prevents any single agent's framing from anchoring the others.
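The independence property can be made concrete with a small, self-contained sketch. It assumes a simplified `Plan` type and two stand-in agents that each return a list of strings; `isolatedPlan` and `dispatchIndependently` are hypothetical names, not Feynman's API. Each agent gets its own frozen copy of the plan, and both are started before either result is awaited, so neither can observe the other's output.

```typescript
// Hypothetical sketch of input isolation; names are illustrative, not Feynman's API.
interface Plan {
  readonly query: string;
  readonly dateRange: string;
}

type Agent = (plan: Plan) => Promise<string[]>;

// Give each agent its own frozen copy of the plan, so no agent can pass
// information to another by mutating shared state.
function isolatedPlan(plan: Plan): Plan {
  return Object.freeze({ ...plan });
}

// Both agent calls are started before Promise.all awaits them:
// neither agent receives the other's output as input.
async function dispatchIndependently(
  plan: Plan,
  researcher: Agent,
  reviewer: Agent,
): Promise<{ evidence: string[]; critique: string[] }> {
  const [evidence, critique] = await Promise.all([
    researcher(isolatedPlan(plan)),
    reviewer(isolatedPlan(plan)),
  ]);
  return { evidence, critique };
}
```

A caller would pass two async functions, for example `dispatchIndependently(plan, gatherEvidence, assessLandscape)`, where both function names are placeholders. Results are combined only after both promises settle, which is where a synthesis step could take over.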

Snippet Two: Paper Audit Command (feynman audit) and Experiment Replication

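// feynman audit: compare paper claims against public codebase
// Source: companion-inc/feynman (MIT)
// This is the hardest and most valuable capability in the stack
// Note: AuditReport, ReplicationResult, alphaXiv, claimExtractor, findLinkedRepo,
// assessVerdict, DockerRunner, CloudRunner, and compareToClaimedResult are
// assumed to be defined elsewhere.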

interface PaperClaim {
  claim_text: string;
  section: string;       // e.g., "Section 3.2 Methodology"
  claim_type: 'algorithmic' | 'empirical' | 'theoretical';
  verifiable_in_code: boolean;
}

interface AuditResult {
  claim: PaperClaim;
  verdict: 'supported' | 'contradicted' | 'absent' | 'unverifiable';
  evidence: string;      // specific code location or explanation
  confidence: number;    // 0-1
}

// feynman audit 2401.12345
// ← Fetches paper from alphaXiv, finds the linked GitHub repo
//   For each verifiable claim, searches the codebase for evidence

async function auditPaperVsCode(arxivId: string): Promise<AuditReport> {
  // Step 1: Fetch paper from alphaXiv
  const paper = await alphaXiv.fetch(arxivId);

  // Step 2: Extract all verifiable claims
  // ← Not all claims are verifiable: theoretical claims, complexity bounds,
  //   claims about dataset statistics may not appear in code
  //   The agent explicitly marks unverifiable claims rather than skipping them
  const claims = await claimExtractor.extract(paper, {
    focus: ['algorithmic', 'empirical'],
    exclude_theoretical: false,   // still report, but mark unverifiable
  });

  // Step 3: Find the codebase (from paper's links, GitHub search)
  const repo = await findLinkedRepo(paper);

  // Step 4: For each claim, search the codebase
  // ← THIS is the key: each claim is independently verified against code
  //   "Our method uses 3 attention heads" → check model config
  //   "We train for 100k steps" → check training script
  //   "Architecture uses ReLU activations" → check model definition
  const results: AuditResult[] = await Promise.all(
    claims.map(async (claim): Promise<AuditResult> => {
      if (!claim.verifiable_in_code) {
        return {
          claim,
          verdict: 'unverifiable',
          evidence: 'Claim type cannot be verified from source code',
          confidence: 1.0,   // confident that it's unverifiable, not wrong
        };
      }

      // Search codebase for evidence of the claim
      const codeEvidence = await repo.search({
        query: claim.claim_text,
        strategy: 'semantic',  // finds conceptually similar code, not just text match
      });

      return {
        claim,
        verdict: assessVerdict(claim, codeEvidence),
        evidence: codeEvidence.best_match?.location ?? 'No matching code found',
        confidence: codeEvidence.confidence,
      };
    })
  );

  return {
    paper_id: arxivId,
    total_claims: claims.length,
    supported: results.filter(r => r.verdict === 'supported').length,
    contradicted: results.filter(r => r.verdict === 'contradicted').length,
    absent: results.filter(r => r.verdict === 'absent').length,
    // ← 'absent' is different from 'contradicted':
    //   absent = code doesn't say; contradicted = code says something different
    //   This distinction matters: absent may be implementation detail;
    //   contradicted may be a reproducibility issue
    unverifiable: results.filter(r => r.verdict === 'unverifiable').length,
    results,
  };
}

// feynman replicate: run the experiment
// ← Connects to Docker (local) or Modal/RunPod (cloud GPU)
async function replicateExperiment(
  claim: string,
  infrastructure: 'docker' | 'modal' | 'runpod' = 'docker',
): Promise<ReplicationResult> {
  const recipe = await feynman.recipe(claim);
  // recipe: ranked list of papers + code + datasets + instructions

  const runner = infrastructure === 'docker'
    ? new DockerRunner({ sandbox: true })      // ← sandboxed, no host access
    : new CloudRunner({ provider: infrastructure });

  const result = await runner.execute(recipe.top_recipe);
  return {
    claim,
    recipe_source: recipe.top_recipe.source,
    execution_log: result.log,
    result_value: result.output,
    matches_paper_claim: compareToClaimedResult(result.output, claim),
  };
}

The verdict: 'absent' versus verdict: 'contradicted' distinction is the most precise design choice in the audit command. Most paper review tools treat "not found in code" and "found to be different in code" as the same failure. Feynman's explicit distinction matters for reproducibility: an absent claim may simply rely on an unstated default, while a contradicted claim indicates a potential replication failure.
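The source does not show how a verdict is actually computed, so the following is only a minimal sketch of the taxonomy under a strong simplifying assumption: that both the paper's claim and the repository's configuration have already been reduced to key/value settings. `ClaimedSetting`, `CodeSettings`, and `assessSettingVerdict` are hypothetical names, and real claims would need the semantic matching described earlier.

```typescript
// Hypothetical sketch: Feynman's real verdict logic is not shown in the source.
type Verdict = 'supported' | 'contradicted' | 'absent';

interface ClaimedSetting {
  key: string;            // e.g. "num_attention_heads"
  value: string | number; // value stated in the paper
}

// Settings actually found in the repository (config files, training scripts).
type CodeSettings = Record<string, string | number>;

function assessSettingVerdict(claim: ClaimedSetting, code: CodeSettings): Verdict {
  if (!Object.prototype.hasOwnProperty.call(code, claim.key)) {
    return 'absent';        // the code says nothing about this setting
  }
  return String(code[claim.key]) === String(claim.value)
    ? 'supported'           // the code agrees with the paper
    : 'contradicted';       // the code says something different
}

const repoConfig: CodeSettings = { num_attention_heads: 4, activation: 'relu' };

const verdicts: Verdict[] = [
  assessSettingVerdict({ key: 'activation', value: 'relu' }, repoConfig),
  assessSettingVerdict({ key: 'num_attention_heads', value: 3 }, repoConfig),
  assessSettingVerdict({ key: 'train_steps', value: 100000 }, repoConfig),
];
// verdicts → ['supported', 'contradicted', 'absent']
```

The point of keeping three outcomes is that each suggests a different follow-up: a contradicted setting is worth raising with the authors, while an absent one may only need a check of library defaults.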

Feynman In Action: End-to-End Worked Example

Input: feynman deepresearch "mechanistic interpretability for LLMs"

Phase 1: Planning (surfaced to user for confirmation)

Research Plan:
  Topic: Mechanistic interpretability for large language models
  Scope: Survey of methods, key results, open problems
  Source priorities: arXiv papers (primary), GitHub repos (secondary), web (tertiary)
  Date filter: 2021-2026
  Estimated depth: 15-25 papers
  Sub-agents: Researcher (papers + repos), Reviewer (consensus/gaps), Writer, Verifier
  Estimated time: 2-4 minutes

Confirm this plan? [y/n/modify]:

Phase 2: Parallel dispatch (simultaneous)

[Researcher] → alphaXiv: searching "mechanistic interpretability" + "circuits" + "features"
[Researcher] → GitHub: searching for interpretability toolkits (TransformerLens, nnsight)
[Reviewer]   → Independently assessing: what is settled, what is contested?
[All agents running in parallel...]

Phase 3: Data extraction (~90 seconds)

Researcher found: 23 relevant papers
  High relevance: Mathematical Frameworks for Circuits, Toy Models of Superposition,
                  Scaling Monosemanticity, Interpretability in the Wild
  Key repos: TransformerLens (5.6k stars), nnsight, pyvene
  Reviewer identified: 3 areas of consensus, 2 major disagreements, 4 open questions

Phase 4: Synthesis (~30 seconds)

Writer synthesizing:
  Chronological: 2021 early circuits work → 2022 toy models → 2023 superposition →
                 2024 scaling monosemanticity → 2025 automated interpretability
  Thematic: circuits approach vs. feature-based approach vs. probing classifiers
  Conflicting work flagged: debate on whether circuits are the right unit of analysis

Phase 5: Citation verification (~20 seconds)

Verifier checking 23 citations:
  21/23 URLs resolve ✓
  1 URL returns 404 (paper retracted) → flagged
  1 DOI redirects to preprint version → flagged with note
  verification_status: 'partial'

Output (saved to outputs/deepresearch-mechanistic-interpretability-2026-06-01.md):

# Mechanistic Interpretability for LLMs: Research Brief
Verification: PARTIAL (21/23 citations verified, 2 flagged)

## Summary
Mechanistic interpretability research seeks to understand how transformer
models implement computations at the level of individual neurons, attention
heads, and circuits. The field has two dominant paradigms...
[Source: https://arxiv.org/abs/2211.09169]

## Key Findings
1. Superposition: Models represent more features than dimensions by storing
   multiple features in superposed form [https://arxiv.org/abs/2209.11895]
2. Circuits: Small subgraphs implement interpretable algorithms in some cases
   [https://arxiv.org/abs/2202.11809]
...

## Disagreements
- Whether circuits are the right unit of analysis vs. features
  [Position A: https://... | Position B: https://...]

## Open Questions
1. How to scale interpretability to 100B+ parameter models
...

## References
[23 cited works, 21 verified, 2 flagged]

Timing:

Phase 1 (Planning):          ~5 seconds
Phase 2 (Dispatch setup):    ~2 seconds
Phase 3 (Parallel extraction): ~90 seconds (bottleneck: paper fetching)
Phase 4 (Synthesis):         ~30 seconds
Phase 5 (Verification):      ~20 seconds
Total:                       ~2.5 minutes
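The total is just the sum of the phase durations, since the phases run back to back. The short sketch below reproduces that arithmetic and, using made-up per-agent durations that are not from the source, shows why running agents in parallel inside a phase costs the slowest agent's time rather than the sum.

```typescript
// Phase durations in seconds, copied from the timing breakdown above.
const phases: { name: string; seconds: number }[] = [
  { name: 'Planning', seconds: 5 },
  { name: 'Dispatch setup', seconds: 2 },
  { name: 'Parallel extraction', seconds: 90 },
  { name: 'Synthesis', seconds: 30 },
  { name: 'Verification', seconds: 20 },
];

// Phases are sequential, so the end-to-end time is a plain sum.
const totalSeconds = phases.reduce((sum, p) => sum + p.seconds, 0); // 147
const totalMinutes = totalSeconds / 60;                             // 2.45, i.e. ~2.5 minutes

// Hypothetical per-agent durations within one phase (illustrative numbers only).
const agentSeconds = [90, 60];
const parallelCost = Math.max(...agentSeconds);                    // 90: the slowest agent
const sequentialCost = agentSeconds.reduce((a, b) => a + b, 0);    // 150: one after another
```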

Why This Design Works, and What It Trades Away

The parallel dispatch architecture is the correct design for research tasks where grounding and coverage matter more than synthesis speed. A sequential pipeline (Researcher → Reviewer → Writer → Verifier) allows the Researcher's framing to anchor every downstream step: the Reviewer will focus on what the Researcher found, the Writer will synthesize what the Reviewer emphasized, and the Verifier will check what the Writer cited. Parallel dispatch breaks this anchoring by giving each agent access to the query and plan but not to other agents' live outputs.

The citation verification step is the architectural moat. Synthesizing from the web is cheap. Verifying that every synthesized citation resolves to an accessible, correct source is computationally expensive: it requires HTTP HEAD requests, DOI resolution, and semantic comparison between the claimed citation content and the actual document. Most research tools skip this step entirely. Feynman makes it mandatory and surfaces failures explicitly.
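Surfacing failures explicitly is easier when each check keeps the reason it failed. The sketch below assumes a hypothetical `CitationCheck` record that has already been filled in by the network step (HTTP status from a HEAD request, plus a DOI resolution flag); `failureReason` and `flagFailures` are illustrative names, and the semantic comparison between the citing claim and the cited document is deliberately left out.

```typescript
// Hypothetical sketch: the record shape and function names are assumptions.
interface CitationCheck {
  url: string;
  httpStatus: number | null; // null when the request itself failed (DNS, timeout)
  doi?: string;
  doiResolved?: boolean;     // only meaningful when a DOI is present
}

interface FlaggedCitation {
  url: string;
  reason: string;
}

// Turn a check into a human-readable reason, or null if the citation passed.
function failureReason(check: CitationCheck): string | null {
  if (check.httpStatus === null) {
    return 'request failed (no HTTP response)';
  }
  if (check.httpStatus < 200 || check.httpStatus >= 300) {
    return `HTTP ${check.httpStatus}`;
  }
  if (check.doi !== undefined && check.doiResolved === false) {
    return `DOI ${check.doi} did not resolve`;
  }
  return null;
}

// Keep every failed citation, together with why it failed.
function flagFailures(checks: CitationCheck[]): FlaggedCitation[] {
  const flagged: FlaggedCitation[] = [];
  for (const check of checks) {
    const reason = failureReason(check);
    if (reason !== null) {
      flagged.push({ url: check.url, reason });
    }
  }
  return flagged;
}

// Placeholder URLs and DOI, for illustration only.
const flagged = flagFailures([
  { url: 'https://example.org/paper-a', httpStatus: 200 },
  { url: 'https://example.org/paper-b', httpStatus: 404 },
  { url: 'https://example.org/paper-c', httpStatus: 200, doi: '10.1234/example', doiResolved: false },
  { url: 'https://example.org/paper-d', httpStatus: null },
]);
// flagged holds three entries: paper-b (HTTP 404), paper-c (DOI), paper-d (request failed)
```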

The feynman audit command is a direct response to the paper-code discrepancy problem. The Agent Laboratory paper (arXiv:2501.04227) documents this as a critical bottleneck in autonomous scientific research: agents must be able to compare paper claims to available implementations to determine whether claimed results are reproducible. Feynman operationalizes this as a single CLI command.

What Feynman trades away:

Speed on simple queries. For a simple factual question, feynman "what is RLHF" dispatches four agents, runs a planning phase, extracts from multiple sources, and verifies citations. A direct LLM call answers in 2 seconds. The Feynman pipeline takes 60-90 seconds minimum. The overhead is justified for complex literature synthesis; it is not justified for simple lookups.

Source coverage limitations. alphaXiv provides arXiv coverage. Gemini/Perplexity provide web coverage. But conference proceedings without arXiv preprints, paywalled journals, and private technical reports are not accessible. For ML research (heavily on arXiv), this is minor. For life sciences or clinical research (heavily paywalled), it is a significant limitation.

No formal privacy policy at feynman.is. As of April 2026, Companion, Inc. has not published a privacy policy covering what happens to queries sent through the service. For institutional or research use with sensitive queries, the CLI installation (self-hosted) is the appropriate path, not the feynman.is hosted service.

Technical Moats

The audit and replicate commands. Most research agents produce summaries. Feynman's feynman audit produces a claim-by-claim paper-versus-code comparison, and feynman replicate actually executes experiments. These commands are qualitatively harder to build than synthesis pipelines: they require structured claim extraction, semantic code search, sandboxed execution environments (Docker, Modal, RunPod), and comparison logic. The research value is also qualitatively higher: a synthesis tells you what papers claim; an audit tells you whether the code matches the claim; a replication tells you whether the claim holds empirically.

Behavioral contracts as Markdown. Like gstack (which Feynman implicitly resembles in design philosophy), Feynman's agent roles are defined as Markdown behavioral contracts rather than code-embedded prompts. These are inspectable, versionable, and auditable. The Verifier's behavior is readable in a file. The Researcher's source prioritization rules are readable in a file. For research contexts where the agent's reasoning process must itself be auditable, this is the correct design.
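To show why a Markdown contract is easy to inspect and audit, here is a minimal sketch that parses one into a structure a tool could diff or lint. The contract layout (a `#` heading for the role, `##` headings for rule groups, `-` bullets for rules) is invented for this example; the source does not document Feynman's actual contract format, and `parseContract` is a hypothetical helper.

```typescript
// Hypothetical sketch: the contract layout below is invented for illustration.
interface Contract {
  role: string;
  rules: Record<string, string[]>; // section heading → bullet items
}

function parseContract(markdown: string): Contract {
  const contract: Contract = { role: '', rules: {} };
  let current: string[] | null = null; // bullets of the section being read
  for (const rawLine of markdown.split('\n')) {
    const line = rawLine.trim();
    if (line.startsWith('## ')) {
      current = [];
      contract.rules[line.slice(3).trim()] = current;
    } else if (line.startsWith('# ')) {
      contract.role = line.slice(2).trim();
    } else if (line.startsWith('- ') && current !== null) {
      current.push(line.slice(2).trim());
    }
  }
  return contract;
}

const verifierContract = parseContract([
  '# Verifier',
  '## Must',
  '- Check every cited URL before output is finalized',
  '- Report citations that fail to resolve',
  '## Must Not',
  '- Drop failed citations silently',
].join('\n'));
// verifierContract.role → 'Verifier'
```

Because the contract is plain text under version control, a change to an agent's obligations shows up as an ordinary diff, which is the auditability argument made above.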

alphaXiv integration. Feynman uses alphaXiv (the enhanced arXiv interface with structured metadata and improved search) as its primary paper source rather than raw arXiv. This provides better semantic search, more structured extraction of claims and results, and better citation resolution. The integration is not trivial: alphaXiv's API access and query structure are specific to the platform.

Insights

Insight One: The "verify first, summarize second" posture is described as a design philosophy, but in pipeline terms it is implemented as "verify last, deliver only after verification." This is the correct sequence for a research tool, and it differs from how the major commercial research agents present results. Perplexity, OpenAI's Deep Research, and Gemini Deep Research generate the synthesis and present it to the user; citation verification, if it happens at all, is post-hoc rather than a gate on delivery. Feynman's sequential commitment (synthesize → verify → deliver) means users cannot see a result until verification completes. For research contexts where correctness matters more than speed, this is the right tradeoff. For consumer use cases where users want immediate results and are comfortable doing their own validation, it is friction.

Insight Two: The feynman audit command solves a problem that is vastly harder than it appears. Comparing a paper claim ("our method uses 3 attention heads") to a codebase requires: (1) extracting the claim precisely, (2) finding the right file and line in the codebase, (3) understanding whether the code implements the same concept under a different variable name, and (4) distinguishing between "not implemented" (may be a default) and "implemented differently" (potential discrepancy). The Agent Laboratory paper (arXiv:2501.04227) explicitly identifies this as a bottleneck in autonomous research replication. Feynman's explicit absent vs. contradicted verdict taxonomy is the correct engineering response: it forces the agent to make a specific claim about WHY the claim and code do not match, rather than reporting a generic failure.

Surprising Takeaway

Feynman's most valuable command is feynman recipe, not feynman deepresearch. The recipe command finds ranked, implementable ML training recipes from papers, datasets, documentation, and code for a stated task. This is the command that converts research knowledge into engineering action. "How do I fine-tune a small model for math reasoning?" is a question that requires synthesizing: which papers to read, which datasets to use, which training configurations work, which codebases to start from. feynman recipe produces a ranked list of implementable approaches with direct links to every component. This is the gap between "understand the research" and "run the experiment" that most research tools leave open. Feynman closes it as a first-class command.
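How the recipe command ranks its results is not described in the source. As a rough illustration of what "ranked, implementable" could mean, the sketch below scores each candidate by how many of its components (paper, dataset, code) come with a direct link. The `Recipe` shape, the `completeness` score, and `rankRecipes` are all assumptions made for this example, and the URLs are placeholders.

```typescript
// Hypothetical sketch: the scoring rule is an assumption, not Feynman's ranking.
interface Recipe {
  name: string;
  paperUrl?: string;
  datasetUrl?: string;
  codeUrl?: string;
}

// Score = number of linked components. A recipe with paper, dataset, and code
// is closer to runnable than one with only a paper.
function completeness(recipe: Recipe): number {
  return [recipe.paperUrl, recipe.datasetUrl, recipe.codeUrl]
    .filter((url) => url !== undefined).length;
}

// Sort a copy in descending order of completeness; the input is left untouched.
function rankRecipes(recipes: Recipe[]): Recipe[] {
  return [...recipes].sort((a, b) => completeness(b) - completeness(a));
}

const ranked = rankRecipes([
  { name: 'paper-only', paperUrl: 'https://example.org/paper-1' },
  {
    name: 'full-stack',
    paperUrl: 'https://example.org/paper-2',
    datasetUrl: 'https://example.org/dataset-2',
    codeUrl: 'https://example.org/code-2',
  },
  { name: 'paper-and-code', paperUrl: 'https://example.org/paper-3', codeUrl: 'https://example.org/code-3' },
]);
// ranked names → ['full-stack', 'paper-and-code', 'paper-only']
```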

TL;DR For Engineers

Feynman (companion-inc/feynman, MIT, 7k stars, v0.2.17, April 2026) is a CLI research agent built on the "verify first, summarize second" posture: four parallel sub-agents (Researcher, Reviewer, Writer, Verifier), five-phase pipeline (Plan → Dispatch → Extract → Synthesize → Verify), every claim source-grounded with a direct URL.
Six commands: feynman (research brief), deepresearch (multi-agent investigation), lit (literature review), audit (paper vs. codebase), replicate (experiment execution on Docker/Modal/RunPod), recipe (ranked implementable ML training recipes). Install: npm install -g @companion-inc/feynman.
The parallel dispatch design prevents anchoring: agents are dispatched simultaneously and receive the plan but not each other's outputs, with synthesis and verification following in later phases. The absent vs. contradicted audit taxonomy distinguishes "code doesn't say" from "code says something different."
Research context benchmark: DeepResearcher (arXiv:2504.03160) showed RL-trained research agents improve up to 28.9 points over prompt engineering baselines, validating the class of tools Feynman implements. Feynman is the open-source, deployable implementation, not the RL-trained model.
Best use case: ML researchers who need sourced literature reviews, paper-versus-code audits, or experiment replication workflows. Not optimized for: simple factual lookups (too much overhead), paywalled research (limited source access), or real-time information (arXiv coverage only for papers).

Source-Grounded or Nothing

Feynman's central claim is that research outputs should be traceable to their sources, every claim should link to an accessible document, and verification should happen before delivery rather than as an afterthought. This posture is correct. It is also the most demanding requirement in the space: it means every component of the pipeline must be designed around source provenance from the start, not added later.

The feynman audit and feynman replicate commands are where this philosophy produces its most distinct outputs: not "here is what the paper says" but "here is whether the code matches what the paper says" and "here is whether the claimed result replicates." These are different questions with different answers and different value for research workflows.

References

Feynman GitHub Repository, companion-inc, MIT, 7k stars
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments, arXiv:2504.03160 — RL research agent achieving +28.9 points over prompt baselines
Agent Laboratory: Using LLM Agents as Research Assistants, arXiv:2501.04227 — autonomous research pipeline context; paper-code comparison bottleneck
Deep Research Agent: A Systematic Examination and Roadmap, arXiv:2506.18096 — taxonomy of research agent architectures; Feynman's design fits the "grounded multi-agent" category
A Review of Prominent Paradigms for LLM-Based Agents: Tool Use, Planning, and Feedback Learning, arXiv:2406.05804 — survey of agent design patterns underlying Feynman's architecture
InfiAgent: An Infinite-Horizon Framework for General-Purpose Autonomous Agents, arXiv:2601.03204 — long-horizon agent execution patterns relevant to feynman replicate

