The technical result that makes this worth dissecting: the routing head in Trinity (arXiv:2512.04695, ICLR 2026), one of the two papers underlying Fugu, has approximately 10,000 parameters. A 10K-parameter head, trained via evolutionary strategy and NOT gradient descent, coordinates GPT-5, Gemini, and Claude. On Sakana's reported benchmarks: fugu-ultra scores 95.1 on GPQAD (vs Gemini 3.1 high 94.4), 93.2 on LCBv6 (vs Opus 4.6 max 92.4), and 54.2 on SWEPro (vs 53.4 for Opus 4.6 with Anthropic's own scaffold).

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 29, 2026

The implied theory of scaling in 2026 is that bigger models produce better outputs. Sakana Fugu challenges this at the coordination layer: the entity deciding how to use a pool of frontier models has approximately 10,000 parameters in its decision head, and it was not trained by backpropagation. It was evolved.

The Fugu product is the commercial instantiation of two ICLR 2026 papers. Trinity (arXiv:2512.04695) introduces an evolved lightweight coordinator that assigns Thinker/Worker/Verifier roles across frontier models. Conductor (arXiv:2512.04388) introduces a 7B model trained with reinforcement learning that learns to write custom communication topologies and targeted instructions in natural language. Fugu combines and extends both approaches into a product with an OpenAI-compatible API.

Important context before diving into the architecture: as of June 22, 2026, no independent third party has reproduced Fugu's benchmark results. Every figure in Sakana's table is vendor-reported, with competitor configurations chosen by Sakana. This is not unusual for a model on launch day. It is, however, worth noting that the orchestration design makes Fugu harder to independently benchmark than a single model: reproducing a Fugu result requires knowing which pool models were used, in what configuration, with what topology, at what compute budget. The claims are plausible given the research lineage. They are claims until someone else runs them.

Scope: Trinity's evolutionary coordinator architecture, Conductor's RL-trained natural language orchestration, the recursive self-calling test-time scaling mechanism, and Fugu's deployment model. Not covered: Sakana's earlier AB-MCTS work, or the ShinkaEvolve and AI Scientist product lines beyond brief mention.

What It Actually Does

Fugu sits between the user and a pool of frontier models. When a user submits a task, Fugu does not answer it directly. It decides which models to call, what instructions to give each one, what output from one model to show to another, and how many passes to run. The user gets a single answer from a single API. Behind that API, an orchestrated workflow ran with a topology Fugu designed from scratch for that specific input.

The two-model architecture:

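Component               Architecture                Training                                      Size
----------------------  --------------------------  --------------------------------------------  -------------------
Trinity coordinator     Compact LM + routing head   Evolutionary strategy (no gradient descent)   ~0.6B + ~10K params
Conductor orchestrator  Qwen2.5-7B fine-tuned       GRPO (RL, gradient-based)                     7B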

These are not competing architectures. Trinity learns to route roles across a fixed model pool using an evolved routing policy. Conductor learns to write custom instructions and communication topologies in natural language, adapting to arbitrary agent pools. Fugu as a product uses both.

Two product variants:

  • fugu-mini 🐟: latency-optimized, lighter orchestration

  • fugu-ultra 🐡: full orchestration system, all coordination strategies enabled

Integration: drop-in OpenAI-compatible endpoint. Existing GPT/Claude/Gemini integrations built on the OpenAI SDK swap the base URL, API key, and model name; the call structure stays the same.

The Architecture, Unpacked

Focus on the separation of concerns: the Conductor decides the workflow topology in natural language, the Trinity routing head decides which role goes to which model on each turn, and the frontier models execute the actual reasoning. The coordinator components never need to possess the domain skills they are orchestrating.

The Code, Annotated

Snippet One: Calling Fugu as an API Drop-In

# Sakana Fugu API: OpenAI-compatible endpoint
# Source: sakana.ai/fugu-beta documentation
# Design intent: zero integration cost for existing API users
# The entire orchestration is hidden behind a single endpoint call

from openai import OpenAI

# ─── BEFORE: direct frontier model call ─────────────────────────────────────
before_client = OpenAI(
    api_key="sk-anthropic-key",
    base_url="https://api.anthropic.com/v1",
)

before_response = before_client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "Solve the following optimization problem..."}],
)

# ─── AFTER: Fugu drops in with one endpoint change ────────────────────────
after_client = OpenAI(
    api_key="fugu-api-key",
    base_url="https://api.sakana.ai/v1",
    # ← THIS is the drop-in: same OpenAI SDK, different base_url
    # Everything else in your code is unchanged
)

# ← Same call structure as before, but now:
#   1. Fugu's Conductor reads the task and designs a workflow
#   2. Trinity's routing head assigns roles to frontier models
#   3. Multiple models may execute in sequence or parallel
#   4. Fugu synthesizes the final answer
#   5. You receive a standard OpenAI-format chat completion response
fugu_response = after_client.chat.completions.create(
    model="fugu-ultra",          # or "fugu-mini" for latency-sensitive tasks
    messages=[{"role": "user", "content": "Solve the following optimization problem..."}],
    # ← Optional: hint the orchestrator about compute budget
    # stream=True for streaming output (supported, same as standard OpenAI streaming)
)

print(fugu_response.choices[0].message.content)
# Output: synthesized answer from whatever topology Fugu chose internally
# May have involved: Gemini for initial planning, GPT-5 for execution,
# Claude for verification — user sees none of this routing

# ─── CHECKING WHAT HAPPENED (if Fugu exposes workflow metadata) ──────────────
# Fugu has been observed to return metadata about the coordination in some contexts:
if hasattr(fugu_response, 'usage') and fugu_response.usage:
    print(f"Total tokens (across all models): {fugu_response.usage.total_tokens}")
    # ← Token count includes ALL API calls Fugu made to worker models
    # ← This is important for cost estimation: a hard task may invoke 3-5 models
    # ← fugu-mini is more cost-efficient; fugu-ultra may call more models per task

The single endpoint change is the correct integration model for a coordinator. If adopting Fugu required changing the structure of API calls, wrapping the client, or managing separate calls to the Conductor and then to worker models, adoption would be far slower. Hiding the orchestration behind a standard interface means teams can A/B test Fugu against direct model calls without changing any surrounding code.
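One way to run such an A/B test is to hold the prompt set and the grader fixed and vary only the client. The harness below is a minimal sketch under stated assumptions: `ask_direct` and `ask_fugu` are hypothetical callables that would wrap the two clients from Snippet One, and `is_correct` is whatever grader fits the task. The stubs stand in for real API calls so the example runs offline.

```python
from typing import Callable

def ab_compare(
    cases: list[tuple[str, str]],            # (prompt, expected answer)
    arms: dict[str, Callable[[str], str]],   # arm name -> function returning an answer
    is_correct: Callable[[str, str], bool],  # grader: (answer, expected) -> bool
) -> dict[str, float]:
    """Return accuracy per arm over the same prompts and the same grader."""
    scores = {}
    for name, ask in arms.items():
        hits = sum(is_correct(ask(prompt), expected) for prompt, expected in cases)
        scores[name] = hits / len(cases)
    return scores

# Offline stand-ins (hypothetical). In practice each would call
# client.chat.completions.create(...) and return choices[0].message.content.
def ask_direct(prompt: str) -> str:
    return "4" if prompt == "2+2?" else "unknown"

def ask_fugu(prompt: str) -> str:
    return {"2+2?": "4", "3*3?": "9"}.get(prompt, "unknown")

cases = [("2+2?", "4"), ("3*3?", "9")]
result = ab_compare(
    cases,
    {"direct": ask_direct, "fugu": ask_fugu},
    is_correct=lambda answer, expected: answer.strip() == expected,
)
print(result)
```

Because both arms see identical prompts and an identical grader, any accuracy gap comes from the endpoint rather than from the harness. Recording cost and latency per arm alongside accuracy would make the comparison more useful.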

Snippet Two: Trinity's Evolutionary Coordinator Design and the Thinker/Worker/Verifier Loop

# Trinity coordinator: evolutionary training + Thinker/Worker/Verifier role system
# Reconstructed from arXiv:2512.04695 architecture description
# Design intent: separate role routing from skill execution

import torch
import torch.nn as nn
from typing import Literal

# ─── THE TRINITY COORDINATOR COMPONENTS ──────────────────────────────────────
# Two separate components with very different sizes and training methods:
#
# 1. Compact LM (~0.6B): processes the query context and prior outputs
#    to produce hidden states used for role assignment
#
# 2. Routing head (~10K params): takes the LM's hidden state and decides
#    which role to assign to which model in the pool
#    ← THIS is the surprising part: 10K parameters makes the routing decision
#    ← Trained by EVOLUTIONARY STRATEGY, not gradient descent

RoleType = Literal["THINKER", "WORKER", "VERIFIER"]

class TrinityRoutingHead(nn.Module):
    """
    The Trinity routing head: ~10K parameters, evolved (not gradient-trained).
    Takes compressed query representation → outputs (role, model_index).
    
    ← WHY evolutionary strategy and not backpropagation?
      The routing decision is discrete (which model? which role?), and
      gradient descent through a discrete selection is not straightforward.
      Evolutionary strategies optimize over the discrete role-assignment policy
      without needing differentiable relaxations.
      
    ← WHY so small (~10K parameters)?
      The coordinator only needs to learn: "given this query pattern, which
      combination of Thinker/Worker/Verifier roles produces the best outcome?"
      The domain expertise is in the worker models. The coordinator only routes.
      You don't need 70B parameters to learn: "coding tasks need Thinker→Worker→Verifier;
      factual questions need just Worker."
    """
    def __init__(self, hidden_dim: int = 128, n_models: int = 6, n_roles: int = 3):
        super().__init__()
        # Stored so forward() can reshape the logits into (model, role) pairs
        self.n_models = n_models
        self.n_roles = n_roles
        # The "head" that makes the actual routing decision
        # Parameter count with the defaults (weights + biases):
        #   Linear(128, 64): 128*64 + 64 = 8,256
        #   Linear(64, 18):   64*18 + 18 = 1,170
        #   Total:                         9,426
        self.role_router = nn.Sequential(
            nn.Linear(hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_models * n_roles),  # ← logits over (model, role) pairs
        )
        # ← At ~10K total params, this fits the paper's description
        # ← Evolved via CMA-ES or similar derivative-free optimizer, not Adam/SGD

    def forward(self, context_hidden: torch.Tensor) -> tuple[int, RoleType]:
        """
        Given compressed context representation, output (model_index, role).
        Evolutionary training optimizes the weights to maximize task success rate
        across the training distribution.
        Expects a single context of shape [1, hidden_dim].
        """
        logits = self.role_router(context_hidden)              # [1, n_models * n_roles]
        logits = logits.view(-1, self.n_models, self.n_roles)  # [1, n_models, n_roles]
        # In practice: greedy selection during inference
        model_idx = logits.max(dim=-1).values.argmax(dim=-1).item()
        role_idx = logits[0, model_idx].argmax().item()
        role = ["THINKER", "WORKER", "VERIFIER"][role_idx]
        return model_idx, role


# ─── THE THINKER/WORKER/VERIFIER TURN LOOP ──────────────────────────────────
async def trinity_turn_loop(
    query: str,
    routing_head: TrinityRoutingHead,
    context_encoder,         # the ~0.6B compact LM
    model_pool: list[dict],  # [{"name": "gpt-5.4", "client": ...}, ...]
    max_turns: int = 5,
) -> str:
    """
    Trinity's multi-turn orchestration loop.
    
    Each turn:
    1. Context encoder reads query + all prior outputs → hidden state
    2. Routing head assigns (model, role) for this turn
    3. Assigned model runs with role-specific prompt
    4. Output appended to context for next turn
    
    ← The loop terminates when the Verifier approves, or max_turns is reached.
    ← No explicit stopping rule is learned: VERIFIER role acceptance IS the stopping signal.
    """
    context = [{"role": "user", "content": query}]
    outputs_by_role = {r: [] for r in ["THINKER", "WORKER", "VERIFIER"]}
    final_answer = None

    for turn in range(max_turns):
        # Step 1: encode all accumulated context
        context_hidden = context_encoder.encode(context)   # [1, hidden_dim]

        # Step 2: routing head decides who does what
        model_idx, role = routing_head(context_hidden)
        selected_model = model_pool[model_idx]

        # Step 3: construct role-specific instruction
        # ← These templates are fixed and illustrative. In this sketch, evolution tunes
        #   only which (model, role) pair the routing head selects on each turn.
        role_instructions = {
            "THINKER": f"You are a THINKER. Analyze the problem and develop a step-by-step plan. Do NOT implement yet.\nQuery: {query}\n",
            "WORKER": f"You are a WORKER. Implement the following plan:\n{outputs_by_role['THINKER'][-1] if outputs_by_role['THINKER'] else 'No plan available. Solve directly.'}\n",
            "VERIFIER": f"You are a VERIFIER. Validate this solution:\n{outputs_by_role['WORKER'][-1] if outputs_by_role['WORKER'] else 'No solution to verify.'}\nRespond with ACCEPT or REJECT and explain why.",
        }
        prompt = role_instructions[role]

        # Step 4: call the assigned model
        response = await selected_model["client"].chat.completions.create(
            model=selected_model["name"],
            messages=[{"role": "user", "content": prompt}],
        )
        output_text = response.choices[0].message.content

        # Step 5: record output and update context
        outputs_by_role[role].append(output_text)
        context.append({"role": "assistant", "content": f"[{role} - {selected_model['name']}]: {output_text}"})

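        # ← Termination condition: Verifier accepts.
        #   Check the leading verdict: a bare substring test would also match
        #   a rejection that merely mentions the word "accept".
        if role == "VERIFIER" and output_text.strip().upper().startswith("ACCEPT"):
            final_answer = outputs_by_role["WORKER"][-1] if outputs_by_role["WORKER"] else output_text
            break

    if final_answer is not None:
        return final_answer
    # No acceptance within max_turns: fall back to the latest Worker output,
    # or to the most recent context entry if no Worker ever ran.
    if outputs_by_role["WORKER"]:
        return outputs_by_role["WORKER"][-1]
    return context[-1]["content"]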

The role_instructions dictionary shows why the evolutionary training is doing useful work. The coordinator is not just picking a model; it is picking a model and a role, and the role determines how the task is framed and what that model gets to see. The Thinker is told not to implement; the Worker gets the Thinker's plan; the Verifier gets the Worker's implementation. This structured information flow is what the evolved routing policy learns to optimize.

Snippet Three: Conductor's Natural Language Topology and Recursive Self-Calling

# Conductor: RL-trained orchestrator generating natural language topologies
# Reconstructed from arXiv:2512.04388 architecture and methodology
# Base: Qwen2.5-7B, trained with GRPO

# ─── WHAT CONDUCTOR OUTPUTS (natural language workflow) ───────────────────────
# For a hard coding problem, Conductor might generate a plan like:
example_conductor_output = """
STEP 1: [Gemini-2.5-Pro | PLANNER]
Instructions: "Analyze the problem and identify the key algorithmic challenges.
List the data structures and algorithmic approach needed. Do not write code."
Input: [original query]

STEP 2: [GPT-5.4 | CODER]
Instructions: "Implement a solution for the following problem.
Use efficient algorithms as outlined in the plan. Include error handling."
Input: [original query] + [STEP 1 output]

STEP 3: [DeepSeek-R1-Distill-Qwen-32B | VERIFIER]
Instructions: "Critically review this code for correctness, edge cases, and efficiency.
Identify any bugs. Run through at least 3 test cases mentally."
Input: [STEP 2 output]

STEP 4: [GPT-5.4 | REVISER]
Instructions: "Fix the identified bugs. Apply the suggested improvements."
Input: [STEP 2 output] + [STEP 3 output]
"""

# ← THIS is what makes Conductor different from Trinity:
#   Trinity's routing head assigns roles via evolved discrete policy
#   Conductor writes the FULL workflow as text, including targeted custom instructions
#   per model, information routing (what each model sees), and step dependencies

# ─── CONDUCTOR TRAINING: GRPO ON 960 PROBLEMS ─────────────────────────────────
# Training setup (from paper):
# - Base model: Qwen2.5-7B
# - Training method: GRPO (Group Relative Policy Optimization), a form of RL
# - Training set: 960 problems across math, coding, reasoning
# - Randomized agent pools: each training problem uses a different subset of models
#   ← THIS is the key to pool-agnostic generalization:
#     training with random pools forces the Conductor to learn model capabilities
#     in the abstract, not overfit to one specific set of frontier models

# ─── RECURSIVE SELF-CALLING ─────────────────────────────────────────────────
async def conductor_with_recursion(
    query: str,
    conductor_model,      # the 7B Qwen2.5-GRPO model
    model_pool: list[str],
    max_depth: int = 3,   # recursion depth = tunable compute budget
    depth: int = 0,
) -> str:
    """
    When Conductor selects itself as a worker, it enters a new orchestration turn.
    
    ← WHY this enables test-time scaling without retraining:
      At depth=0, Conductor plans the workflow.
      If it selects itself as a worker, it re-reads its own prior output
      and decides: "Was my first coordination strategy good enough?"
      If not, it spins up a DIFFERENT coordination strategy.
      
      This is fundamentally different from simple reflection:
      The model is not just refining its own answer.
      It is reconsidering the COORDINATION STRATEGY itself.
      
      Compute budget at inference = recursion depth × cost per orchestration cycle.
      You tune depth per task difficulty with no retraining.
    """
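    if depth >= max_depth:
        # Base case: forced termination, fall back to a single direct model call.
        # single_model_fallback (here) and execute_plan (below) are placeholder
        # helpers that this sketch does not define.
        return await single_model_fallback(query, model_pool)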

    # Step 1: Conductor generates a coordination plan
    plan_response = await conductor_model.generate(
        prompt=f"Task: {query}\nDesign an optimal coordination workflow for the following agent pool: {model_pool}",
        context=f"[Depth {depth}] Prior coordination attempts: {depth} rounds completed."
    )
    plan = plan_response.text

    # Step 2: Execute the plan against the model pool
    # ← Conductor can include "fugu-coordinator" in the plan as a worker
    intermediate_result = await execute_plan(plan, model_pool, query)

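    # Step 3: Self-evaluation: was this coordination strategy good enough?
    # The prompt asks for the verdict as the first word so it can be parsed reliably.
    eval_response = await conductor_model.generate(
        prompt=f"Original task: {query}\nCoordination plan used: {plan}\nResult: {intermediate_result}\n\nEvaluate: Is this result satisfactory? Begin your reply with SATISFACTORY or UNSATISFACTORY. If not, what coordination strategy would improve it?",
    )

    # startswith avoids the substring trap: "UNSATISFACTORY" contains "SATISFACTORY"
    if eval_response.text.strip().upper().startswith("SATISFACTORY"):
        return intermediate_result  # ← exits recursion at this depth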
    else:
        # ← THIS is recursive self-calling: Conductor re-plans with new strategy
        # Depth counter is the inference-time compute knob
        return await conductor_with_recursion(
            query=query,
            conductor_model=conductor_model,
            model_pool=model_pool,
            max_depth=max_depth,
            depth=depth + 1,  # ← tunable: higher max_depth = more compute, better quality
        )

The recursive self-calling mechanism (depth as a compute axis) is the most significant inference-time scaling insight in Fugu. Standard test-time compute scaling (chain-of-thought, reflection, sampling) applies within a single model's generation. Fugu's recursion applies at the orchestration level: the coordinator reconsidering its coordination strategy. This is scaling over the meta-reasoning about which models to use and how, not just the reasoning within any single model.
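Snippet Three leaves `execute_plan` undefined. A first step toward it would be turning the Conductor's text plan into structured steps. The parser below is a sketch that assumes the `STEP n: [Model | ROLE]` layout of the illustrative plan in Snippet Three. Nothing in this article specifies an output grammar for Conductor, so the regex and the `PlanStep` type are assumptions.

```python
import re
from dataclasses import dataclass

@dataclass
class PlanStep:
    index: int
    model: str
    role: str
    instructions: str
    inputs: str

# Header such as: STEP 2: [GPT-5.4 | CODER]
STEP_HEADER = re.compile(r"^STEP (\d+): \[(.+?) \| (\w+)\]\s*$", re.MULTILINE)

def parse_plan(plan_text: str) -> list[PlanStep]:
    """Split a plan into steps, one per 'STEP n: [Model | ROLE]' header."""
    headers = list(STEP_HEADER.finditer(plan_text))
    steps = []
    for i, header in enumerate(headers):
        # The body runs from this header to the next header (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(plan_text)
        body = plan_text[header.end():end]
        instr = re.search(r'Instructions:\s*"(.*?)"', body, re.DOTALL)
        inputs = re.search(r"Input:\s*(.+)", body)
        steps.append(PlanStep(
            index=int(header.group(1)),
            model=header.group(2),
            role=header.group(3),
            # Collapse line breaks inside the quoted instructions.
            instructions=" ".join(instr.group(1).split()) if instr else "",
            inputs=inputs.group(1).strip() if inputs else "",
        ))
    return steps

plan = '''
STEP 1: [Gemini-2.5-Pro | PLANNER]
Instructions: "Analyze the problem.
Do not write code."
Input: [original query]

STEP 2: [GPT-5.4 | CODER]
Instructions: "Implement the plan."
Input: [original query] + [STEP 1 output]
'''
steps = parse_plan(plan)
for step in steps:
    print(step.index, step.model, step.role, "<-", step.inputs)
```

The `inputs` field is what encodes information routing: an executor would resolve `[STEP 1 output]` against earlier results before calling the next model. A plan that fails to parse is a natural trigger for re-planning.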

It In Action: End-to-End Worked Example

Task: "Implement and verify a Rust solution for the following competitive programming problem: given N cities and M roads, find the minimum spanning tree weight using Kruskal's algorithm."

Step 1: Conductor reads the task and designs a workflow

Conductor (Qwen2.5-7B, GRPO-trained) analyzes:
  - Problem type: competitive programming, systems implementation
  - Language requirement: Rust (specificity suggests need for expert)
  - Algorithmic task: Kruskal's (graph algorithm, requires correctness)
  - Verification importance: high (competitive programming = exact answers)

Conductor output (natural language plan):
  STEP 1: [Gemini-2.5-Pro | THINKER]
    "Outline the Kruskal's algorithm approach for MST in Rust.
     Identify: Union-Find data structure requirements, edge sorting,
     and Rust-specific considerations (no GC, ownership model)."
     
  STEP 2: [GPT-5.4 | CODER]
    "Implement a complete, idiomatic Rust solution for Kruskal's MST.
     Use the algorithmic outline provided. Handle N up to 10^5, M up to 2×10^5.
     Include: proper stdin reading, Union-Find with path compression + rank."
     
  STEP 3: [DeepSeek-R1-Distill-Qwen-32B | VERIFIER]
    "Verify this Rust Kruskal implementation:
     1. Trace through example: N=4, M=5 edges
     2. Check Union-Find correctness
     3. Verify edge case: disconnected graph
     4. Verify time complexity is O(M log M)"

Step 2: Thinker (Gemini-2.5-Pro) runs

Gemini output: 
  "Kruskal's requires:
   1. Sort edges by weight O(M log M)
   2. Union-Find with path compression and union by rank for near-O(α(N))
   3. In Rust: Vec<(usize, usize, i64)> for edges, Vec<usize> parent and rank
   4. Rust-specific: use iter().position() carefully, avoid unnecessary clones..."
   
Tokens used: ~380 (targeted planning prompt, not a generic request)

Step 3: Worker (GPT-5.4) runs with Thinker's context

// GPT-5.4 output: complete Rust implementation
fn find(parent: &mut Vec<usize>, x: usize) -> usize {
    if parent[x] != x { parent[x] = find(parent, parent[x]); }
    parent[x]
}
fn union(parent: &mut Vec<usize>, rank: &mut Vec<usize>, x: usize, y: usize) -> bool {
    let (px, py) = (find(parent, x), find(parent, y));
    if px == py { return false; }
    if rank[px] < rank[py] { parent[px] = py; }
    else if rank[px] > rank[py] { parent[py] = px; }
    else { parent[py] = px; rank[px] += 1; }
    true
}
fn main() {
    // ... stdin reading, edge sorting, Kruskal loop ...
}

Tokens used: ~720 (implementation + algorithmic context injected)

Step 4: Verifier (DeepSeek-R1-Distill-Qwen-32B) runs with Worker's code

DeepSeek output:
  "ACCEPT: Implementation is correct.
   Trace N=4, edges [(1,2,1),(2,3,2),(1,3,3),(3,4,4),(2,4,5)]:
   Sort: (1,2,1),(2,3,2),(1,3,3),(3,4,4),(2,4,5)
   Add (1,2,1): components=3, mst=1
   Add (2,3,2): components=2, mst=3
   Skip (1,3,3): cycle
   Add (3,4,4): components=1, mst=7
   MST weight: 7 ✓
   Time complexity: O(M log M) confirmed ✓
   Disconnected graph: returns partial MST with fewer than N-1 edges ✓"
   
VERIFIER accepted. Trinity loop terminates at turn 3.
Tokens used: ~590 (targeted verification context)
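The verifier's trace can be checked mechanically rather than trusted. The Python sketch below reruns Kruskal's algorithm on the same five edges from the trace. It is an independent check written for this walkthrough, not the elided Rust `main` from the Worker's output.

```python
def kruskal_mst_weight(n: int, edges: list[tuple[int, int, int]]) -> tuple[int, int]:
    """Return (total weight, edges used) for vertices numbered 1..n."""
    parent = list(range(n + 1))
    rank = [0] * (n + 1)

    def find(x: int) -> int:
        # Path halving: point each visited node at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x: int, y: int) -> bool:
        px, py = find(x), find(y)
        if px == py:
            return False            # same component: the edge would form a cycle
        if rank[px] < rank[py]:
            px, py = py, px         # attach the shallower tree under the deeper one
        parent[py] = px
        if rank[px] == rank[py]:
            rank[px] += 1
        return True

    total, used = 0, 0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if union(u, v):
            total += w
            used += 1
    return total, used

edges = [(1, 2, 1), (2, 3, 2), (1, 3, 3), (3, 4, 4), (2, 4, 5)]
weight, used = kruskal_mst_weight(4, edges)
print(weight, used)
```

This prints `7 3`, matching the verifier's MST weight of 7 with N-1 = 3 edges. A result with fewer than N-1 edges used is how a caller would detect the disconnected-graph case the verifier mentions.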

Final output (single synthesized response to user):

The Rust Kruskal's MST implementation (working code above).
Total API calls: 3 (Thinker + Worker + Verifier)
Total tokens across all models: ~1,690
Total wall time: ~8-12 seconds (3 sequential model calls)

Comparison if Opus 4.6 handled this directly:
  Single model, no role separation
  Typically: plan-in-one-pass + implement + self-check in a single response
  Risk: Opus might implement without verifying, or verify incorrectly
  SWEPro benchmark: Opus 4.6 max = 53.4 vs fugu-ultra = 54.2

Why This Design Works, and What It Trades Away

The Trinity evolutionary training result is the most counterintuitive finding: a 10K-parameter head, optimized by derivative-free evolution (no backpropagation through the role-assignment decisions), can learn to route queries across frontier models effectively. The reason evolutionary methods work here is that the objective, aggregate task success rate, is evaluable but not differentiable. You cannot backpropagate through "did the correct answer emerge from the multi-turn exchange?" but you can evaluate it for thousands of training examples and apply a population-based optimizer to the routing policy weights.
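The "evaluable but not differentiable" point can be made concrete with a toy. The sketch below is a simple elitist evolution strategy over a nine-weight routing table, written in pure Python. It is not CMA-ES and not Trinity's training code; the `SUCCESS` table, the query types, and every function name are invented for illustration.

```python
import random

# Hypothetical success probabilities: rows are query types, columns are pool models.
SUCCESS = [
    [0.9, 0.4, 0.5],   # e.g. coding queries
    [0.3, 0.8, 0.6],   # e.g. math queries
    [0.5, 0.5, 0.9],   # e.g. factual queries
]
N_TYPES, N_MODELS = 3, 3

def route(weights: list[float], query_type: int) -> int:
    """Discrete decision: pick the model with the highest score for this query type."""
    row = weights[query_type * N_MODELS:(query_type + 1) * N_MODELS]
    return max(range(N_MODELS), key=lambda m: row[m])

def fitness(weights: list[float]) -> float:
    """Expected task success under the routing policy. The argmax in route() makes
    this piecewise constant in the weights, so it has no useful gradient."""
    return sum(SUCCESS[t][route(weights, t)] for t in range(N_TYPES)) / N_TYPES

def evolve(generations: int = 40, children: int = 16, sigma: float = 0.5,
           seed: int = 0) -> tuple[list[float], list[float]]:
    rng = random.Random(seed)
    parent = [0.0] * (N_TYPES * N_MODELS)
    history = [fitness(parent)]
    for _ in range(generations):
        # Mutate the parent, score each child by evaluation only, keep the best.
        population = [[w + rng.gauss(0.0, sigma) for w in parent] for _ in range(children)]
        best = max(population, key=fitness)
        if fitness(best) >= fitness(parent):
            parent = best
        history.append(fitness(parent))
    return parent, history

weights, history = evolve()
print(f"fitness: {history[0]:.3f} -> {history[-1]:.3f}")
print("routing:", [route(weights, t) for t in range(N_TYPES)])
```

The optimizer never sees a gradient: it only asks "how often did this routing policy succeed?" and keeps the variant that scored highest. A search space of roughly 10K weights is small enough for population-based methods of this kind to remain practical.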

The Conductor's GRPO training on randomized agent pools is the correct architectural choice for building a system that generalizes across model configurations. A Conductor trained on one fixed set of models would learn routing policies specific to GPT-4 + Claude-3 + Gemini-1.5, and would fail when any of those models is swapped out. Training with randomized pools forces the Conductor to model agent capabilities abstractly rather than memorizing which specific model is best for which query type. This is why Conductor claims to work with "arbitrary sets of open- and closed-source agents."
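Pool randomization of this kind could look like the sketch below: each training problem is paired with a freshly sampled subset of agents, and the pool is written into the prompt so the orchestrator has to read it rather than memorize it. The agent names, pool-size bounds, and prompt wording are placeholders, not details from the paper.

```python
import random

# Hypothetical agent names; the real training pools are not listed in this article.
ALL_AGENTS = ["agent-a", "agent-b", "agent-c", "agent-d", "agent-e", "agent-f"]

def sample_pool(rng: random.Random, min_size: int = 2, max_size: int = 4) -> list[str]:
    """Draw a random subset of agents, in random order, for one training problem."""
    size = rng.randint(min_size, max_size)
    return rng.sample(ALL_AGENTS, size)

def build_training_prompts(problems: list[str], seed: int = 0) -> list[dict]:
    """Pair every problem with its own freshly sampled pool."""
    rng = random.Random(seed)
    episodes = []
    for problem in problems:
        pool = sample_pool(rng)
        episodes.append({
            "pool": pool,
            "prompt": f"Task: {problem}\nAvailable agents: {', '.join(pool)}",
        })
    return episodes

# 960 matches the training-set size quoted in Snippet Three.
episodes = build_training_prompts([f"problem-{i}" for i in range(960)])
distinct_pools = {tuple(sorted(e["pool"])) for e in episodes}
print(len(episodes), "episodes,", len(distinct_pools), "distinct pools")
```

Because no single pool dominates the episodes, a policy that hard-codes "always send coding to model X" gets penalized whenever X is absent, which is the pressure toward abstract capability modeling that the paragraph above describes.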

The Thinker/Worker/Verifier role structure is the correct decomposition for tasks where planning, execution, and validation require different capabilities. A single model that tries to do all three in one pass has no mechanism to "change character" between reasoning about an approach and implementing it. The role structure provides explicit information routing: the Worker sees the Thinker's plan but not the user's raw query history; the Verifier sees the Worker's output with explicit verification instructions. This controlled information flow is what the evolutionary strategy learns to optimize.

What Fugu trades away:

Latency. Three sequential frontier model calls with orchestration overhead means hard tasks take 8-20 seconds minimum, versus 1-3 seconds for a single frontier model on the same query. For applications where latency matters more than a few percentage points of accuracy, fugu-mini reduces this, but the architectural overhead of orchestration is non-zero.

Cost opacity. When Fugu routes a query to GPT-5.4 + Gemini-2.5-Pro + DeepSeek, the user is paying for all three, plus the Conductor and Trinity coordinator calls. The total token spend for a hard task is significantly higher than a single frontier model call. The benchmarks compare accuracy, not cost-adjusted accuracy. On a cost-per-correct-answer metric, the comparison changes substantially.
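A cost-per-correct-answer calculation shows how quickly the picture shifts. The accuracies below are the vendor-reported SWEPro figures quoted in this article; the per-task costs are placeholders invented for illustration, since no pricing or measured spend appears in the source.

```python
def cost_per_correct(accuracy_pct: float, cost_per_task_usd: float) -> float:
    """Expected spend to obtain one correct answer: cost per attempt / success rate."""
    return cost_per_task_usd / (accuracy_pct / 100.0)

# Accuracies: vendor-reported SWEPro scores. Costs: hypothetical placeholders.
single = cost_per_correct(accuracy_pct=53.4, cost_per_task_usd=1.00)
orchestrated = cost_per_correct(accuracy_pct=54.2, cost_per_task_usd=3.00)

print(f"single model : ${single:.2f} per correct answer")
print(f"orchestrated : ${orchestrated:.2f} per correct answer")

# Cost multiple at which the two break even on this metric.
break_even = 54.2 / 53.4
print(f"break-even cost multiple: {break_even:.3f}x")
```

With a 0.8-point accuracy gap, the orchestrated system breaks even on this metric only if it costs at most about 1.5% more per task than the single model. Any multi-call workflow will exceed that, so the case for orchestration has to rest on the value of the marginal correct answers rather than on cost efficiency.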

Independent reproducibility. Fugu's benchmarks are vendor-reported with no harness released. The orchestration makes independent benchmarking harder, not easier: to reproduce a Fugu result, you need the specific workflow topology Fugu chose, the same model pool with the same configurations, and the same task setup. None of these have been published.

Technical Moats

Two ICLR 2026 papers as the research foundation. Trinity and Conductor are peer-reviewed, with OpenReview links (5HaRjXai12 and U23A2BUKYt respectively). This is not marketing. The evolutionary training of a 10K-parameter routing head and the GRPO training of a 7B orchestrator are specific, replicable (in principle) contributions. Competing products built on simpler routing logic (LLM router, single-model selection, fixed workflow templates) do not have this theoretical grounding.

Randomized pool training → arbitrary model generalization. The Conductor's ability to work with any agent pool is not incidental. It is a direct consequence of training with randomized pools. A team trying to replicate this benefit must run the full GRPO training pipeline over 960 curated problems with varying pool compositions. This is a non-trivial training cost, not just a prompt engineering exercise.

Recursive self-calling as a tunable compute axis. The insight that Fugu can call itself recursively, with recursion depth as an inference-time parameter requiring no retraining, provides a clean interface for trading cost for quality. Competing approaches to test-time scaling (larger models, more samples, longer chains of thought) require either a different model or longer generation, not a new coordination strategy. Fugu's recursion scales at the orchestration meta-level.
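The cost side of that dial can be bounded with simple arithmetic. The function below counts worst-case model calls for the control flow in Snippet Three, assuming one call per plan step, one planning call and one evaluation call per cycle, and a single fallback call when the depth limit is hit. Those per-step counts are assumptions about helpers the snippet leaves undefined, and nested self-calls inside a plan are not counted.

```python
def worst_case_calls(max_depth: int, steps_per_plan: int) -> int:
    """Upper bound on model calls when every cycle is judged unsatisfactory."""
    per_cycle = 1 + steps_per_plan + 1   # plan + worker steps + self-evaluation
    fallback = 1                         # single direct call at the depth limit
    return max_depth * per_cycle + fallback

for depth in range(1, 5):
    print(f"max_depth={depth}: up to {worst_case_calls(depth, steps_per_plan=4)} calls")
```

With four-step plans, a depth limit of 3 allows up to 19 model calls for one user request, so the bound grows linearly in depth. That linearity is what makes depth usable as a budget knob.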

Insights

Insight One: The Fugu benchmark table contains a comparison that should trigger skepticism, not celebration. Fugu-ultra scores 54.2 on SWEPro versus 53.4 for Opus 4.6 max. The Opus figure is Anthropic's self-reported score, obtained with "a custom scaffold"; Sakana notes that "frequent timeouts during our evaluation trials" led it to use Anthropic's number instead of its own measurement. This is not a Sakana-specific problem. Every comparison table for a multi-model orchestrator is a mixed-methods comparison: Fugu is a routing system that calls Opus internally, while the "Opus score" it is compared against represents Opus running alone. A 0.8-point gain on SWEPro from Fugu calling Opus as one of its workers, versus Opus running directly, is a very different claim from "our system outperforms Opus." Until independent reproductions with a unified evaluation harness exist, these margins should be read as signals, not measurements.

Insight Two: Trinity's evolutionary training choice is the most publishable part of the architecture, but it may not be the most important for Fugu's real-world performance. The 10K-parameter head trained by derivative-free optimization is an intellectually elegant result: small evolved coordinators can route large frontier models effectively. But Conductor's RL-trained natural language topology generation is operationally richer: it can generate arbitrary communication structures, write targeted per-model instructions, and adapt to any pool composition. The paper split between these two methods suggests they are complementary, not competing. Fugu's production system almost certainly relies more heavily on Conductor's flexible natural language workflow generation than on Trinity's fixed Thinker/Worker/Verifier template for hard tasks. The Trinity result is the publishable contribution; the Conductor is the production engine.

Surprising Takeaway

The recursive self-calling behavior described in Sakana's footnote, where Fugu reads its own prior output and decides whether to revise its coordination strategy, is a qualitatively different kind of test-time compute scaling than anything currently in production. Standard inference-time scaling scales the amount of thinking within one model's generation (more tokens, more reasoning steps, more samples, best-of-N). Fugu's recursion scales the meta-reasoning about which model coordination strategy to use. When Fugu calls itself at depth 1, it is not generating more reasoning; it is generating a different coordination plan. At depth 2, it is asking: "Given that my first coordination strategy and my revised strategy both failed, what entirely different workflow should I try?" This creates a hierarchy of meta-reasoning levels that has no analogue in any current production system, and it requires no additional training, just additional inference budget. The architectural implication: Fugu's compute scaling is not bounded by any single model's context window or generation length. It is bounded by how many orchestration cycles are economically justifiable per task.

TL;DR For Engineers

  • Sakana Fugu (GA June 22, 2026, OpenAI-compatible API, fugu-mini and fugu-ultra) is a trained multi-model coordinator based on two ICLR 2026 papers: Trinity (arXiv:2512.04695, ~0.6B compact LM + ~10K routing head trained by evolutionary strategy, Thinker/Worker/Verifier roles) and Conductor (arXiv:2512.04388, 7B Qwen2.5-GRPO, generates natural language communication topologies and custom per-model instructions, pool-agnostic via randomized pool training).

  • Benchmarks (vendor-reported, no independent reproduction): fugu-ultra 95.1 GPQAD / 93.2 LCBv6 / 54.2 SWEPro vs Opus 4.6 max 92.7 / 92.4 / 53.4. The Opus SWEPro figure is Anthropic's self-reported score with its own scaffold, used because Sakana hit frequent timeouts in its own trials. That makes it an apples-to-oranges comparison for a system that routes to Opus internally.

  • Integration: base_url="https://api.sakana.ai/v1", model "fugu-ultra" or "fugu-mini". Standard OpenAI SDK, same call structure; the orchestration is hidden from the caller.

  • Recursive self-calling: Fugu can invoke itself as a worker, re-reading its coordination output and reconsidering the strategy. Recursion depth is a tunable inference-time compute parameter requiring no retraining. This is test-time scaling at the orchestration meta-level, not within any single model.

  • Critical caveat: as of June 22, 2026, no third party has reproduced any Fugu benchmark. The orchestration design makes independent validation harder. Treat all figures as vendor claims until external reproduction exists.

The Model That Learns to Manage Models

Sakana Fugu's core insight, that a small trained coordinator can outperform any single large model by orchestrating a pool of them, has been demonstrated at the research level in Trinity and Conductor. Whether the benchmark margins hold under independent evaluation is a different question. The architecture is sound: evolutionary routing of roles, RL-trained natural language topology generation, pool-agnostic generalization from randomized training, and recursive self-calling as a test-time compute dial. The product is real. The numbers need external verification.

The more important implication: if coordination improves with training, every frontier model provider now faces competition not from a larger model but from a better-trained coordinator that uses their model as a commodity resource. That is a structurally different competitive dynamic.

Summary

Sakana Fugu (GA June 22, 2026, OpenAI-compatible) is a trained multi-model coordinator grounded in two ICLR 2026 papers: Trinity (arXiv:2512.04695), a ~0.6B compact LM + ~10K-parameter routing head trained by evolutionary strategy (no backpropagation) that assigns Thinker/Worker/Verifier roles to frontier models across multi-turn orchestration; and Conductor (arXiv:2512.04388), a 7B Qwen2.5-GRPO model that generates natural-language communication topologies and per-model targeted instructions for arbitrary agent pools. Vendor-reported benchmarks show fugu-ultra at 95.1 GPQAD / 93.2 LCBv6 / 54.2 SWEPro versus frontier model competitors, with a recursive self-calling mechanism enabling test-time scaling at the orchestration meta-level (recursion depth as a tunable compute axis requiring no retraining), though no independent reproduction exists as of launch.
