vLLM Semantic Router: The Infrastructure Layer That Decides Which Model Should Handle Your Request Before the Model Sees It

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 6, 2026

Most teams building multi-model LLM deployments solve the routing problem the same way: a cascade of if-else statements that grows organically as requirements accumulate. First you add a keyword check for safety. Then a cost check for expensive models. Then a privacy check for regulated data. Then a latency check for real-time requests. Each check is added independently, conflicts between them go undetected, and the routing logic becomes the undocumented critical path that nobody wants to touch.

vLLM Semantic Router (arXiv:2603.04444, Liu, Chen, et al., 30+ authors, Feb 2026) is a systematic replacement for this ad-hoc accumulation. The key insight: request routing is a signal extraction and composition problem, not a model quality problem. Prior work (RouteLLM, RouterDC, AutoMix) addressed model selection in isolation. The Semantic Router unifies signal extraction, safety enforcement, multi-provider backend management, and plugin extensibility into a single composable framework.

Two versions have shipped under the vllm-project organization: v0.1 "Iris" (January 5, 2026) and v0.2 "Athena" (March 10, 2026). The repo has 4.3k stars, 698 forks, and 1,493 commits, and an AMD collaboration was announced in December 2025.

Scope: the Signal-Decision-Selection-Plugin architecture, the two signal tiers (heuristic and neural), the HaluGate hallucination pipeline, the composable Boolean policy DSL, and deployment across local vLLM and six cloud providers. Not covered: the OATS tool selection optimization or the full Workload-Router-Pool architecture vision paper (arXiv:2603.21354) beyond brief mention.

What It Actually Does

vLLM Semantic Router is a gateway layer that sits in front of a heterogeneous fleet of models and routes each incoming request to the optimal backend based on composable signal-driven decisions. It is not a model. It is infrastructure.

What it routes across:

Backend type	Examples
Local vLLM instances	On-premise Llama, Gemma, Mistral deployments
Cloud frontier providers	OpenAI, Anthropic, Azure OpenAI, Amazon Bedrock, Gemini, Vertex AI
Mixed fleets	Any combination: local + cloud, multiple providers

What it enforces at routing time (not model time):

Cost constraints (route budget-sensitive traffic to cheaper models)
Privacy constraints (route regulated data to local, no-cloud models)
Latency constraints (route real-time traffic to fastest available backend)
Safety constraints (jailbreak detection, PII filtering, hallucination detection before and after generation)

Installation:

git clone https://github.com/vllm-project/semantic-router
cd semantic-router

# Development:
pip install -e ".[dev]"

# Production (Docker Compose):
docker compose -f deploy/docker-compose.yml up -d
# → starts router API, dashboard, backend registry

The Architecture, Unpacked

The architecture has four layers: Signal, Decision, Selection, and Plugin. The part to focus on is the Decision Engine's Boolean composability. The same signal extraction layer serves every deployment scenario. Different deployments are expressed as different Boolean rule sets compiled at load time, not as different code paths at runtime. Conflict detection at load time (rather than at production runtime) prevents the common failure mode in which two conflicting routing rules are discovered only when a production request triggers both.

The Code, Annotated

Snippet One: Signal Extraction and Decision Composition

# vLLM Semantic Router: signal extraction and Boolean decision composition
# Source: vllm-project/semantic-router, config/ + router/ directories
# Reconstructed from arXiv:2603.04444 architecture description

from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional
import time

class SignalTier(Enum):
    HEURISTIC = "heuristic"   # sub-millisecond, no model inference
    NEURAL    = "neural"      # milliseconds, requires embedding/classifier

@dataclass
class Signal:
    name: str
    tier: SignalTier
    value: Any              # bool, str, float, or categorical
    latency_ms: float       # how long extraction took

@dataclass
class SignalVector:
    """All signals extracted from a single request."""
    # Tier 1: heuristic (always extracted, sub-millisecond)
    keyword_match: Optional[list[str]] = None  # matched blocklist terms
    language:      Optional[str]       = None  # ISO 639-1 code
    context_length: Optional[int]      = None  # token count
    role:          Optional[str]       = None  # "public", "premium", "admin"

    # Tier 2: neural (extracted conditionally)
    domain:        Optional[str]       = None  # "medical", "coding", etc.
    embed_sim:     Optional[float]     = None  # similarity to route exemplars
    modality:      Optional[str]       = None  # "text", "image", "audio"
    factual_density: Optional[float]  = None  # 0-1, triggers HaluGate if high

    extraction_total_ms: float         = 0.0


class SignalExtractor:
    """
    Extracts all signals from an incoming request.

    Two-tier design rationale:
    ← Tier 1 (heuristic) runs on EVERY request with no GPU cost.
      If heuristic signals are sufficient to make the routing decision, stop here.
      No embedding model is invoked, no classifier runs.

    ← Tier 2 (neural) runs CONDITIONALLY: only when heuristic signals
      are insufficient for a definitive decision.
      This is the Shannon-inspired "information maximization" view:
      extract the minimum signal needed for the routing decision.
    """

    def __init__(self, embedding_model, domain_classifier):
        self.embed_model = embedding_model   # mmBERT-embed-32k-2d-matryoshka
        self.domain_clf  = domain_classifier

    def extract(self, request: dict, run_neural: bool = True) -> SignalVector:
        t0 = time.monotonic()
        sv = SignalVector()

        # ── TIER 1: HEURISTIC (sub-millisecond) ──────────────────────────────
        # Keyword matching: scan prompt text for blocklist patterns
        # ← No tokenization needed: simple string scan
        sv.keyword_match = self._keyword_scan(request["messages"])

        # Language detection: FastText-based, ~0.1ms per request
        sv.language = self._detect_language(request["messages"])

        # Context length: count tokens (or estimate from char count)
        sv.context_length = self._count_tokens(request["messages"])

        # Role-based authorization from request headers / API key scope
        sv.role = request.get("x_caller_role", "public")

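        # ← Decision: if heuristic signals alone determine routing, skip neural
        # Example: PII keyword match from a non-premium caller → ROUTE_LOCAL_ONLY
        # immediately. Premium callers continue to the neural tier (assumption
        # drawn from the worked example, where the plugin chain needs domain
        # and factual_density even though PII was detected).
        if sv.keyword_match and "PII" in sv.keyword_match and sv.role != "premium":
            sv.extraction_total_ms = (time.monotonic() - t0) * 1000
            return sv  # ← skip neural tier entirely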

        # ── TIER 2: NEURAL (conditional) ─────────────────────────────────────
        if run_neural:
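            # Embed the request for domain and similarity routing
            # (assumes the encoder takes a single string, so join message contents)
            text = " ".join(m["content"] for m in request["messages"])
            embedding = self.embed_model.encode(text)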
            # ← 2D Matryoshka: embedding is usable at any prefix length
            #   [0:128] for fast routing, [0:768] for precision routing
            sv.embed_sim = self._compute_similarity(embedding)
            sv.domain    = self.domain_clf.predict(embedding)
            sv.modality  = self._detect_modality(request)
            sv.factual_density = self._estimate_factuality(embedding)

        sv.extraction_total_ms = (time.monotonic() - t0) * 1000
        return sv


class DecisionEngine:
    """
    Composes signals into routing decisions via Boolean rules.

    ← The rules are compiled at LOAD TIME (not request time).
      Conflict detection happens during compilation:
      if rule A says "route to local" and rule B says "route to cloud"
      and both can fire on the same request, the engine flags it.
      No surprises in production.

    ← The same engine with different rule configs produces different
      deployment behaviors: cost-optimized, privacy-regulated, etc.
    """

    def __init__(self, rules: list[dict]):
        self.compiled_rules = self._compile(rules)
        # ← Conflicts detected here, not at request time
        self._check_conflicts(self.compiled_rules)

    def decide(self, sv: SignalVector) -> list[str]:
        """Evaluate all rules and return list of triggered decisions."""
        decisions = []
        for rule in self.compiled_rules:
            if rule.evaluate(sv):
                decisions.append(rule.decision)
        return decisions


# ── EXAMPLE CONFIGS FOR TWO DIFFERENT DEPLOYMENTS ────────────────────────────
# Same architecture, different rules = different behavior

COST_OPTIMIZED_RULES = [
    # Route to cheap model unless domain is medical/legal (risk-sensitive)
    {"if": "domain NOT IN ['medical', 'legal'] AND context_length < 8000",
     "then": "ROUTE_EFFICIENT_MODEL"},
    {"if": "context_length >= 8000",
     "then": "ROUTE_LONG_CONTEXT_POOL"},
]

PRIVACY_REGULATED_RULES = [
    # ← ALL requests stay on-premise: no cloud routing allowed
    {"if": "keyword_match CONTAINS 'PII' OR role != 'premium'",
     "then": "ROUTE_LOCAL_ONLY"},
    {"if": "modality == 'text'",   "then": "ROUTE_LOCAL_VLLM"},
    {"if": "modality == 'image'",  "then": "ROUTE_LOCAL_VISION"},
]
# ← Same signal extraction, different Boolean rules, no code changes

The _check_conflicts(self.compiled_rules) call at load time is the most important line in the Decision Engine. Most routing systems discover conflicting rules at production runtime when an edge-case request fires two contradicting rules simultaneously. The Semantic Router makes this a load-time error. Conflicting policies are caught before a single production request is processed.
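To make the load-time check concrete, here is a minimal sketch of one way a conflict detector could work. It is not the project's implementation: it brute-forces a small, hypothetical signal domain instead of doing the static analysis the DSL paper describes, and the names `Rule`, `SIGNAL_DOMAIN`, and `find_conflicts` are invented for illustration. The two rules mirror the `route_medical_premium` and `route_local_privacy` conditions used later in this article.

```python
from dataclasses import dataclass
from itertools import combinations, product
from typing import Callable

@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[dict], bool]  # predicate over a dict of signal values
    target: str                        # pool the rule routes to

# Hypothetical finite signal domain, used only for the exhaustive check.
SIGNAL_DOMAIN = {
    "domain": ["medical", "coding", "general"],
    "role": ["public", "premium"],
    "has_pii": [False, True],
}

def find_conflicts(rules, domain=SIGNAL_DOMAIN):
    """Map each conflicting rule pair to one signal assignment that fires both."""
    keys = list(domain)
    conflicts = {}
    for values in product(*(domain[k] for k in keys)):
        signals = dict(zip(keys, values))
        fired = [r for r in rules if r.condition(signals)]
        for a, b in combinations(fired, 2):
            if a.target != b.target:
                # Keep the first witness found for this pair.
                conflicts.setdefault((a.name, b.name), signals)
    return conflicts

RULES = [
    Rule("route_medical_premium",
         lambda s: s["domain"] == "medical" and s["role"] == "premium",
         "frontier_pool"),
    Rule("route_local_privacy",
         lambda s: s["has_pii"] or s["role"] == "public",
         "local_vllm_pool"),
]

conflicts = find_conflicts(RULES)
for pair, witness in conflicts.items():
    print(pair, "can both fire on", witness)
```

Exhaustive enumeration only works when every signal has a small discrete domain. Continuous signals such as `embed_sim` would need thresholds bucketed into intervals first, which is one reason a formal policy language is attractive here.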

Snippet Two: HaluGate and Plugin Chain Configuration

# vLLM Semantic Router: deployment configuration
# Source: vllm-project/semantic-router/config/ directory structure
# Shows how signal → decision → selection → plugin chain composes in YAML

# ─── Signal configuration ───────────────────────────────────────────────────
signals:
  heuristic:
    keyword_match:
      enabled: true
      blocklist_path: ./config/blocklists/pii_terms.txt
      categories: ["PII", "JAILBREAK", "COMPETITOR"]
    language_detect:
      enabled: true
      backend: fasttext
      min_confidence: 0.85
    context_length:
      enabled: true
      estimator: tiktoken
      model: gpt-4o  # use for consistent token counting across backends

  neural:
    domain_classifier:
      enabled: true
      # ← Only runs when heuristic signals don't resolve the decision
      model: ./models/domain-classifier-v2
      threshold: 0.7
    embedding_similarity:
      enabled: true
      # 2D Matryoshka model: use prefix of 128 dims for fast routing
      model: llm-semantic-router/mmbert-embed-32k-2d-matryoshka
      dim: 128        # ← use 128-dim prefix for routing, not full 768
      # ← 98x speed improvement (arXiv:2603.12646) via Flash Attention
      #   + prompt compression + near-streaming embedding

# ─── Decision rules ─────────────────────────────────────────────────────────
decisions:
  - name: route_medical_premium
    condition: "domain == 'medical' AND role == 'premium'"
    target_pool: frontier_pool
    plugins: [pii_filter, halugate_medical]

  - name: route_local_privacy
    # ← Privacy: PII or public role → never leave local infrastructure
    condition: "keyword_match CONTAINS 'PII' OR role == 'public'"
    target_pool: local_vllm_pool
    plugins: [pii_redact, pii_restore]

  - name: route_long_context
    condition: "context_length >= 32000"
    target_pool: long_context_pool
    selection_algorithm: latency_optimal  # ← override default for long context

# ─── Backend pools ───────────────────────────────────────────────────────────
pools:
  local_vllm_pool:
    backends:
      - type: vllm
        url: http://localhost:8000
        model: Llama-3.1-70B-Instruct
      - type: vllm
        url: http://localhost:8001
        model: gemma-4-12b-it

  frontier_pool:
    backends:
      - type: openai
        model: gpt-4o
      - type: anthropic
        model: claude-opus-4-6
      - type: bedrock
        model: us.amazon.nova-pro-v1

  long_context_pool:
    backends:
      - type: openai
        model: gpt-4o-128k
      - type: gemini
        model: gemini-2.0-flash

# ─── Plugin definitions ──────────────────────────────────────────────────────
plugins:
  halugate_medical:
    # 3-stage hallucination detection for medical domain
    # ← Medical + factual claims → run full HaluGate pipeline
    stages:
      - pre_check:    {method: factual_density, threshold: 0.6}
      - consistency:  {method: source_grounding, retrieval: pubmed_rag}
      - post_check:   {method: factuality_verify, backend: gpt-4o-mini}
    # ← If any stage fails: return uncertainty flag in response metadata
    on_failure: flag_response  # not block: medical decisions need human review

  pii_redact:
    method: presidio           # Microsoft Presidio for PII detection
    entities: [PERSON, EMAIL, PHONE, SSN, CREDIT_CARD]
    action: pseudonymize       # replace with consistent placeholder
    # ← pii_restore plugin maps placeholders back in the response

The dim: 128 embedding configuration for routing (versus full 768 dims for retrieval) is the practical application of the 2D Matryoshka model. The paper's companion work (arXiv:2603.12646) documents 98x routing speedup from Flash Attention, prompt compression, and near-streaming at reduced embedding dimension. Full-precision embeddings are reserved for the cases where 128-dim similarity is insufficient.
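The prefix idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the router's code: the vectors are toy 8-dimensional values, `dim=4` stands in for the 128-dim prefix, and the helper names are hypothetical. It also only makes sense for a model trained with a Matryoshka objective, since arbitrary embeddings do not keep their meaning when truncated.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("prefix has zero norm")
    return [x / norm for x in prefix]

def cosine(a, b):
    """Cosine similarity of two unit-length vectors of equal size."""
    return sum(x * y for x, y in zip(a, b))

def best_route(query_emb, exemplars, dim=128):
    """Return the exemplar route most similar to the query on the prefix."""
    q = truncate_and_normalize(query_emb, dim)
    scores = {
        name: cosine(q, truncate_and_normalize(vec, dim))
        for name, vec in exemplars.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy data: only the first four components are compared when dim=4.
query = [1.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.5, 0.5]
exemplars = {
    "medical": [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "coding":  [0.0, 1.0, 0.0, 0.0, 0.5, 0.5, 0.5, 0.5],
}
route, score = best_route(query, exemplars, dim=4)
print(route, round(score, 3))
```

Re-normalizing after truncation matters: a prefix of a unit vector is generally not unit length, so skipping that step would turn the dot product into something other than cosine similarity.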

It In Action: End-to-End Worked Example

Setting: Multi-cloud enterprise deployment, mixed workload of medical queries, coding requests, and general chat. 3 local vLLM backends + 3 cloud providers.

Incoming request:

{
  "messages": [
    {"role": "user", "content": "What is the recommended dosage of metformin for a patient with John Smith (DOB 1965-03-12) and mild renal impairment?"}
  ],
  "x_caller_role": "premium",
  "x_request_id": "req_7f3a91b"
}

Signal extraction (~4.2ms total):

Tier 1 (heuristic): 0.8ms
  keyword_match:    ["PII"]   ← "John Smith" + DOB detected as PII
  language:         "en"
  context_length:   48 tokens (short)
  role:             "premium"

Tier 2 (neural): 3.4ms  ← runs because role=premium, not blocked by PII alone
  domain:           "medical" (confidence: 0.94)
  factual_density:  0.87      ← high: dosage + disease = factual claims
  modality:         "text"

Decision evaluation:

Rule: route_medical_premium
  Condition: domain == 'medical' AND role == 'premium'
  Evaluation: TRUE
  Result: → frontier_pool + [pii_filter, halugate_medical]

Rule: route_local_privacy
  Condition: keyword_match CONTAINS 'PII' OR role == 'public'
  Evaluation: TRUE (PII detected)

⚠ CONFLICT DETECTED at load time for requests matching BOTH rules:
  premium medical request with PII → fires both route_medical_premium AND route_local_privacy
  Resolution: priority ordering in config (privacy > frontier for PII + premium medical)
  → ROUTE_LOCAL_VLLM + [pii_redact, halugate_medical, pii_restore]
  ← Conflict caught at config load, not at runtime. PII wins over frontier routing.

Plugin chain execution:

1. pii_redact (pre-generate):
   "John Smith" → "PERSON_1"
   "DOB 1965-03-12" → "DOB_1"
   Redacted request forwarded to local vLLM

2. Local vLLM inference (Llama-3.1-70B-Instruct):
   Input: "...dosage of metformin for PERSON_1 (DOB_1) and mild renal impairment?"
   Output: "For patients with mild renal impairment, metformin 500mg twice daily..."
   Inference time: ~1.8s

3. halugate_medical (post-generate):
   Stage 1 (density): factual_density=0.87 → proceed to stages 2+3
   Stage 2 (consistency): checks "500mg twice daily" against PubMed RAG
     → confirms: consistent with KDIGO 2022 guidelines for eGFR 45-60
   Stage 3 (factuality): gpt-4o-mini verification
     → confidence: 0.91  → PASS
   Metadata added: {halugate_verified: true, sources: ["KDIGO2022", "FDA_label_metformin"]}

4. pii_restore:
   "PERSON_1" → "John Smith"
   "DOB_1" → "DOB 1965-03-12"

Total routing overhead: 4.2ms (signal) + 12ms (plugins)
Model inference: 1.8s
Total request latency: ~1.82s
Backend used: local_vllm_pool (not frontier, due to PII policy)
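The pii_redact and pii_restore steps above amount to a reversible substitution with a per-request mapping. The sketch below shows that round trip in plain Python. It is a stand-in, not Presidio: entity detection is assumed to have already happened, and `pseudonymize` and `restore` are hypothetical helper names.

```python
def pseudonymize(text, entities):
    """Replace each detected entity with a stable placeholder.

    `entities` is a list of (label, value) pairs, assumed to come from an
    upstream PII detector. Returns the redacted text and the reverse map.
    """
    mapping = {}
    counters = {}
    for label, value in entities:
        counters[label] = counters.get(label, 0) + 1
        placeholder = f"{label}_{counters[label]}"
        mapping[placeholder] = value
        text = text.replace(value, placeholder)
    return text, mapping

def restore(text, mapping):
    """Map placeholders in a model response back to the original values."""
    # Longest placeholders first, so PERSON_10 is not clobbered by PERSON_1.
    for placeholder in sorted(mapping, key=len, reverse=True):
        text = text.replace(placeholder, mapping[placeholder])
    return text

prompt = "Dosage of metformin for John Smith (DOB 1965-03-12)?"
detected = [("PERSON", "John Smith"), ("DOB", "DOB 1965-03-12")]

redacted, mapping = pseudonymize(prompt, detected)
print(redacted)

response = "For PERSON_1 (DOB_1), start low."
print(restore(response, mapping))
```

The mapping has to live only for the lifetime of the request and stay on the local side of the boundary, otherwise the placeholder table itself becomes the PII leak.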

What changed vs. a naive implementation:

Without Semantic Router:
  PII data sent to cloud endpoint (policy violation)
  No hallucination detection on medical dosage advice
  Routing logic: if-else in application code

With Semantic Router:
  PII stays local (policy enforced at routing time)
  Medical claim verified against clinical literature (3-stage HaluGate)
  Policy conflict (PII + medical-premium) detected at config load
  Routing logic: declarative YAML, auditable, conflict-free

Why This Design Works, and What It Trades Away

The two-tier signal extraction design is the correct architecture for a production router. Heuristic signals (keyword match, language detection, context length, RBAC) are extractable in under 1ms on CPU. Neural signals (domain classification, embedding similarity, modality detection) require milliseconds and compute. Running neural signals on every request that could be resolved by heuristics is unnecessary overhead. Running only heuristics misses the nuanced routing decisions that require semantic understanding of the request content. The conditional escalation from heuristic to neural is the efficiency mechanism that makes the router deployable without adding significant latency to every request.

The Boolean decision DSL with load-time conflict detection is the correct approach to policy management. Production routing systems that accumulate rules without conflict detection eventually produce silent failures: a request matches two contradicting rules, one fires, the other is silently ignored, and the outcome depends on incidental rule ordering rather than on stated policy. The Semantic Router's load-time compilation catches this explicitly. Deployment proceeds only with a conflict-free policy set.

The plugin chain design separates routing logic from safety enforcement. Jailbreak detection, PII filtering, and hallucination detection are not routing decisions. They are policies that apply per-decision. A request routed to a local model for privacy reasons may still need PII filtering if the local model's output is logged. Attaching plugins to decisions rather than to backends keeps the safety enforcement model-agnostic.

What the Semantic Router trades away:

Routing latency overhead. The 4.2ms signal extraction in the worked example is fast but not zero. Applications where the model inference itself is under 100ms (very small models, cached responses) will find the routing overhead proportionally significant. For short-context requests to fast models, the router's contribution to end-to-end latency is measurable.

Multi-turn consistency complexity. The paper explicitly identifies multi-turn statefulness as an open challenge: routing decisions must be consistent across conversation turns, requiring session management. If turn 1 routes to the local model and turn 2 routes to a cloud model due to a different signal reading, conversation context breaks. The current system requires explicit configuration of session pinning to prevent this.
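Session pinning can be sketched as a small wrapper around the decision function. This is an illustrative design, not the router's API: `SessionPinnedRouter`, the `overrides` set, and the toy `decide` function are all assumptions. The one design choice shown is that a pin can be replaced when a later turn demands a stricter pool, such as a local-only pool for PII.

```python
from typing import Callable, Dict, Optional, Set

class SessionPinnedRouter:
    """Keeps a conversation on one pool unless a stricter pool is required."""

    def __init__(self, decide: Callable[[dict], str],
                 overrides: Optional[Set[str]] = None):
        self._decide = decide                # signals -> pool name
        self._overrides = overrides or set() # pools allowed to replace a pin
        self._pins: Dict[str, str] = {}      # session_id -> pinned pool

    def route(self, session_id: str, signals: dict) -> str:
        pool = self._decide(signals)
        pinned = self._pins.get(session_id)
        if pinned is None or pool in self._overrides:
            # First turn, or a stricter pool takes over for the whole session.
            self._pins[session_id] = pool
            return pool
        return pinned

def decide(signals):
    # Toy policy: PII forces the local pool, everything else goes to frontier.
    return "local_vllm_pool" if signals.get("has_pii") else "frontier_pool"

router = SessionPinnedRouter(decide, overrides={"local_vllm_pool"})
turns = [
    router.route("s1", {"has_pii": False}),
    router.route("s1", {"has_pii": True}),
    router.route("s1", {"has_pii": False}),
]
print(turns)
```

The override still switches backends once, at the turn where PII first appears, so the earlier conversation history would need to be replayed to the local model. After that, the session stays local even when later turns carry no PII.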

Plugin chain ordering is a manual concern. The YAML configuration determines plugin execution order. A misconfigured plugin chain (e.g., pii_restore placed before the HaluGate post-check) produces incorrect behavior: the hallucination detector then sees restored real identifiers rather than pseudonymized text, so PII can reach the verification backend (gpt-4o-mini in the configuration above). The system does not automatically enforce semantically correct plugin ordering.
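Because the system does not check ordering, a team could add its own lint step before deploying a config. The following is a minimal sketch of such a check, with hypothetical names (`order_violations`, `MUST_PRECEDE`) and constraints inferred from the worked example, where pii_redact runs first and pii_restore runs last.

```python
def order_violations(chain, must_precede):
    """Return the (first, second) constraints that the chain violates.

    A constraint is only checked when both plugins appear in the chain.
    """
    position = {name: i for i, name in enumerate(chain)}
    violations = []
    for first, second in must_precede:
        if first in position and second in position:
            if position[first] > position[second]:
                violations.append((first, second))
    return violations

# Assumed constraints, taken from the plugin order in the worked example.
MUST_PRECEDE = [
    ("pii_redact", "halugate_medical"),
    ("pii_redact", "pii_restore"),
    ("halugate_medical", "pii_restore"),
]

good_chain = ["pii_redact", "halugate_medical", "pii_restore"]
bad_chain = ["pii_redact", "pii_restore", "halugate_medical"]

print(order_violations(good_chain, MUST_PRECEDE))
print(order_violations(bad_chain, MUST_PRECEDE))
```

Running a check like this in CI against every decision's plugin list would turn an ordering mistake into a failed build rather than a production incident.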

Technical Moats

The mmBERT-embed-32k-2d-matryoshka embedding model. The router ships with a custom multilingual embedding model trained with 2D Matryoshka representation learning: the embedding is useful at any prefix dimension (128 for fast routing, 768 for precision routing). Training a model specifically for routing signal extraction, rather than using a general-purpose embedding model, captures the routing-relevant features of request text more accurately. The 98x speedup from the companion paper (arXiv:2603.12646, Flash Attention + prompt compression + near-streaming) makes low-latency neural routing viable even for high-traffic deployments; the neural tier in the worked example takes 3.4ms.

The HaluGate three-stage pipeline. Most LLM gateway products offer hallucination detection as a binary pass/fail check using a single classifier. HaluGate's three-stage design (pre-generate density assessment, consistency checking against source context, post-generate factuality verification) catches different failure modes at each stage. Pre-generate density assessment avoids running the expensive post-generate stage on requests that are unlikely to hallucinate. The multi-stage design is harder to replicate than a single classifier because it requires the retrieval pipeline and the factuality verification model to be integrated, not just added.
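The gating logic of a three-stage pipeline like this can be sketched as follows. This is an assumption-laden illustration rather than HaluGate's code: `run_halugate` is a hypothetical function, the 0.6 density threshold comes from the YAML earlier in this article, and the 0.8 verification threshold is invented. The later stages are passed as callables so they only run when the density gate lets the request through, and a failure flags the response instead of blocking it, matching `on_failure: flag_response`.

```python
from typing import Callable

def run_halugate(factual_density: float,
                 check_grounding: Callable[[], bool],
                 verify_factuality: Callable[[], float],
                 density_threshold: float = 0.6,
                 verify_threshold: float = 0.8) -> dict:
    """Run the three stages in order, stopping at the first that settles it."""
    # Stage 1: skip the expensive stages for low-density requests.
    if factual_density < density_threshold:
        return {"checked": False, "flagged": False, "stage": "pre_check"}
    # Stage 2: is the answer consistent with retrieved sources?
    if not check_grounding():
        return {"checked": True, "flagged": True, "stage": "consistency"}
    # Stage 3: verifier confidence on the generated answer.
    if verify_factuality() < verify_threshold:
        return {"checked": True, "flagged": True, "stage": "post_check"}
    return {"checked": True, "flagged": False, "stage": "post_check"}

calls = []

def grounding():
    calls.append("grounding")
    return True   # stand-in for a retrieval-backed consistency check

def verify():
    calls.append("verify")
    return 0.91   # stand-in for a verifier model's confidence

high = run_halugate(0.87, grounding, verify)   # values from the worked example
low = run_halugate(0.20, grounding, verify)    # below the density gate
print(high, low, calls)
```

Passing the later stages as callables is what makes the density gate a cost control: for the low-density request, neither the retrieval step nor the verifier model is ever invoked.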

The declarative policy DSL with conflict detection. The DSL paper (arXiv:2603.18174) formalizes conflict-free policy language for probabilistic ML predicates, grounding the routing rules in a formal semantics that allows static analysis. This is not a feature that can be added to an existing if-else routing system without restructuring the entire policy representation. The DSL is the architectural commitment that makes load-time conflict detection possible.

Insights

Insight One: The Semantic Router is not primarily a model quality optimizer. It is a deployment policy enforcement system. RouteLLM, RouterDC, and AutoMix focus on routing to the better model for a given query. The Semantic Router adds the constraint that routing decisions must simultaneously satisfy cost budgets, privacy policies, latency requirements, and safety standards. A system that routes to the highest-quality model without enforcing these constraints is not deployable in regulated industries. The Semantic Router's value proposition is compliance at routing time, not quality maximization. These are different problems.

Insight Two: The 98x routing speedup paper (arXiv:2603.12646) reveals a specific and important fact about production routing systems: the routing overhead itself must be nearly free relative to the inference cost. At 98x speedup, the routing computation is negligible compared to even a 100ms model inference. Without this speedup, a routing layer that adds 200ms overhead to a 400ms inference would be unacceptable in real-time applications. The companion paper that achieves this speedup is not a research novelty: it is the engineering prerequisite that makes the architectural design deployable. The main paper presents the architecture; the speedup paper makes it production-viable.

Takeaway

The Semantic Router's architecture explicitly models the Shannon channel view of routing: the request is an information source, the router is a channel, and the optimal routing decision minimizes information loss about the request's true characteristics. This framing explains why the system uses composable signals rather than a single end-to-end classifier. An end-to-end classifier trained to route directly from request text to model backend compresses all signal types into a single prediction, losing the fine-grained interpretability needed for policy enforcement. The signal-decision decomposition preserves interpretability at each stage: you can inspect which signals fired, which rules triggered, and which decision was made, without reverse-engineering a black box classifier. In regulated deployments where routing decisions must be auditable, this interpretability is not a design aesthetic: it is a compliance requirement.

TL;DR For Engineers

vLLM Semantic Router (arXiv:2603.04444, vllm-project/semantic-router, 4.3k stars, v0.2 "Athena" March 10 2026) is a signal-driven routing gateway for heterogeneous LLM fleets: local vLLM + OpenAI, Anthropic, Azure, Bedrock, Gemini, Vertex AI. Routes on cost, privacy, latency, safety, modality, and domain simultaneously.
Signal extraction: Tier 1 heuristics (sub-ms: keyword, language, context length, RBAC) → Tier 2 neural conditionally (domain classifier, embedding similarity, modality detection) using mmBERT-embed-32k-2d-matryoshka. 98x routing speedup documented (arXiv:2603.12646).
Decision engine: composable Boolean rules compiled at load time with conflict detection. Same architecture, different YAML rules → cost-optimized vs. privacy-regulated deployment without code changes. Policy conflicts caught at config load, not runtime.
Plugin chains per decision: jailbreak detection, PII filtering (Presidio), HaluGate 3-stage hallucination detection (pre-generate density, consistency, post-generate factuality). Plugin order is manual, semantic correctness not enforced by the system.
Key limitation: multi-turn statefulness (session pinning) requires explicit configuration. If turn 1 routes to local and turn 2 routes to cloud, conversation context breaks without pinning.

The Router Is the Policy

vLLM Semantic Router's correct framing is: inference-time policy enforcement with model selection as one of the enforced dimensions. The routing layer is where compliance happens, where cost is controlled, where safety is enforced, and where the request characteristics are understood in enough detail to make all of these decisions simultaneously. Moving any of these functions into the application layer, into individual backend configurations, or into post-hoc monitoring is architecturally incorrect because none of those locations have the complete picture of the request, the available backends, and the deployment constraints simultaneously. The router does.

The load-time conflict detection, the two-tier signal extraction, and the declarative policy DSL are the three design decisions that distinguish this system from ad-hoc if-else routing logic. The research program around it (WRP architecture, OATS, HaluGate, the speedup paper) is systematic: each paper removes one bottleneck that would otherwise limit the architecture's deployability.

References

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models, arXiv:2603.04444, Liu, Chen, et al., Feb 2026
vllm-project/semantic-router GitHub, 4.3k stars
98x Faster LLM Routing without a Dedicated GPU, arXiv:2603.12646 — Flash Attention + prompt compression + near-streaming speedup
Conflict-Free Policy Languages for Probabilistic ML Predicates, arXiv:2603.18174 — the formal DSL for routing policies
The Workload-Router-Pool Architecture, arXiv:2603.21354 — WRP vision paper for full-stack inference optimization
RouteLLM: Learning to Route LLMs with Preference Data, Ong et al. 2024 — prior work on model quality routing; Semantic Router extends with constraint enforcement
vLLM Semantic Router website — publications and blog posts

vLLM Semantic Router (arXiv:2603.04444, vllm-project/semantic-router, 4.3k stars, v0.2 "Athena" March 10 2026) is a signal-driven routing gateway for heterogeneous LLM fleets that extracts two-tier signals (sub-ms heuristics: keyword, language, context length, RBAC; neural conditionally: domain, embedding similarity, modality) and composes them through Boolean decision rules into deployment-specific routing policies with load-time conflict detection. Each decision triggers a model selection algorithm (12+ options) and a plugin chain (jailbreak detection, PII filtering via Presidio, HaluGate 3-stage hallucination pipeline) before and after generation. The same architecture expresses cost-optimized, privacy-regulated, and latency-sensitive deployments as different YAML configurations without code changes. Companion paper (arXiv:2603.12646) documents 98x routing speedup via Flash Attention, prompt compression, and 2D Matryoshka embeddings at 128-dim prefix.
