SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | April 20, 2026

LLM evaluation benchmarks are lying to you. MMLU scores, HumanEval pass rates, perplexity on WikiText, none of them tell you which model a real human would actually prefer in a conversation. FastChat and Chatbot Arena are the infrastructure built to measure the thing everyone else is measuring wrong.

This newsletter dissects FastChat not as a deployment tool, but as a complete research platform: a distributed multi-model serving system, an evaluation framework that invented LLM-as-a-judge, and the production system running the only benchmark in the field grounded in 1.5 million real human votes.

What It Actually Does

FastChat from the LMSYS group at UC Berkeley is three things packaged together: a distributed serving system for running multiple LLMs simultaneously with an OpenAI-compatible API, a fine-tuning pipeline that produced Vicuna (fine-tuned from LLaMA-2 on 125K ShareGPT conversations), and the evaluation infrastructure powering Chatbot Arena (lmarena.ai), which has served over 10 million chat requests across 70+ LLMs and accumulated over 1.5 million human preference votes.

The serving system is the engineering core. The evaluation system is the research contribution. The Vicuna model is what got the community's attention in 2023. The interaction between all three is what makes FastChat worth dissecting.

FastChat is not a production-grade high-throughput serving system. vLLM with PagedAttention is the correct tool for maximizing tokens per second at scale. FastChat's serving system is designed for a different problem: running many different models simultaneously, routing requests between them, and collecting human preference data across model pairs. That problem requires a different architecture.

The Architecture, Unpacked

FastChat's serving architecture has three components, each running as a separate process, communicating over HTTP.

Caption: Focus on the controller's worker registry. It is a simple HTTP dispatch table, not a sophisticated scheduler. Each worker registers itself on startup and sends periodic heartbeats. The controller routes requests to workers by model name with speed-weighted load balancing. This simplicity is intentional.

The controller is deliberately simple. It does not do continuous batching, PagedAttention, or KV cache management. Those are the model worker's responsibility (or vLLM's, when running the vLLM worker backend). The controller's job is routing: which worker hosts the requested model, and which of potentially multiple workers for that model is least loaded.

The conversation template system is FastChat's underappreciated contribution. Every LLM has its own prompt format: Llama-2-chat uses [INST]...[/INST] markers, Vicuna uses USER: and ASSISTANT: prefixes, ChatML uses <|im_start|> tokens. FastChat's Conversation class standardizes these behind a single interface. When you register a new model, you implement one conversation template and one model adapter — and the model works with the entire FastChat ecosystem (CLI, web UI, OpenAI API, Arena) automatically.
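The pattern can be sketched in miniature. This is a simplified illustration of the template idea, not FastChat's actual Conversation class (the real one also handles system prompts, per-model separator styles, and special tokens), but it shows why clients never need to know a model's wire format:

```python
from dataclasses import dataclass, field

@dataclass
class ConvTemplate:
    name: str
    roles: tuple          # (user tag, assistant tag)
    sep: str              # separator appended after each completed message
    system: str = ""
    messages: list = field(default_factory=list)

    def append_message(self, role: str, message) -> None:
        # message=None marks the slot where the model should generate
        self.messages.append((role, message))

    def get_prompt(self) -> str:
        out = self.system + self.sep if self.system else ""
        for role, msg in self.messages:
            out += f"{role}: {msg}{self.sep}" if msg else f"{role}:"
        return out

# Vicuna-style rendering; a Llama-2-style template would register
# different role tags and separators behind the same interface
vicuna = ConvTemplate("vicuna-sketch", ("USER", "ASSISTANT"), sep=" ")
vicuna.append_message(vicuna.roles[0], "Explain attention briefly.")
vicuna.append_message(vicuna.roles[1], None)
print(vicuna.get_prompt())  # USER: Explain attention briefly. ASSISTANT:
```

Clients append role/message pairs; the worker renders the model-specific string internally. Registering a new model means registering one such template.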

The Code

Snippet One: Launching a complete multi-model serving stack

# Terminal 1: Start the controller (worker registry and router)
# Port 21001 is the default; workers register here on startup
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001

# Terminal 2: Launch first model worker
# ← Each worker is a separate process; GPU isolation via CUDA_VISIBLE_DEVICES
# ← Worker registers itself to the controller automatically on startup
# ← controller-address must match Terminal 1
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001 \
    --port 21002 \
    --worker-address http://localhost:21002

# Terminal 3: Launch second model worker (different model, different GPU)
# ← Multiple workers can serve the same model (for throughput)
# ← Or different models (for multi-model routing)
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.model_worker \
    --model-path lmsys/longchat-7b-32k-v1.5 \
    --controller-address http://localhost:21001 \
    --port 21003 \
    --worker-address http://localhost:21003

# Terminal 4: Launch OpenAI-compatible API server
# ← This is the single endpoint clients talk to
# ← It proxies to whichever worker has the requested model
# ← Drop-in replacement for https://api.openai.com/v1
python3 -m fastchat.serve.openai_api_server \
    --host 0.0.0.0 \
    --controller-address http://localhost:21001 \
    --port 8000

# Test with the standard OpenAI Python SDK — no code changes needed
# ← THIS is the key design decision: zero client-side migration cost
python3 -c "
import openai
client = openai.OpenAI(
    api_key='EMPTY',
    base_url='http://localhost:8000/v1'  # ← Just change the base URL
)
resp = client.chat.completions.create(
    model='vicuna-7b-v1.5',  # ← Model name maps to the registered worker
    messages=[{'role': 'user', 'content': 'Explain attention in 2 sentences.'}],
    stream=True
)
for chunk in resp:
    print(chunk.choices[0].delta.content or '', end='', flush=True)
"

Caption: The four-process architecture is explicit by design. Each component can be scaled, replaced, or upgraded independently. Swap the model worker for a vLLM worker to get PagedAttention throughput. Swap the API server for a Gradio web server to get the Arena UI. The controller stays the same.

Snippet Two: Running MT-Bench evaluation (LLM-as-a-judge pipeline)

# Step 1: Install the judge dependencies
pip install -e ".[model_worker,llm_judge]"

# Step 2: Generate model answers on MT-Bench questions
# MT-Bench has 80 multi-turn questions across 8 categories:
# Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities
# ← --num-gpus-per-model controls model parallelism for large models
python3 fastchat/llm_judge/gen_model_answer.py \
    --model-path lmsys/vicuna-13b-v1.5 \
    --model-id vicuna-13b-v1.5 \
    --num-gpus-per-model 1
# Output: data/mt_bench/model_answer/vicuna-13b-v1.5.jsonl

# Step 3: Use GPT-4 as judge to score each answer
# ← THIS is the key insight: GPT-4 achieves >80% agreement with human judges
# ← Much cheaper than human annotation at scale
# ← --parallel 2 runs 2 concurrent GPT-4 API calls to speed up evaluation
python3 fastchat/llm_judge/gen_judgment.py \
    --model-list vicuna-13b-v1.5 gpt-3.5-turbo \
    --judge-model gpt-4 \
    --parallel 2
# Output: data/mt_bench/model_judgment/gpt-4_single.jsonl

# Step 4: Show results with per-category breakdown
python3 fastchat/llm_judge/show_result.py \
    --model-list vicuna-13b-v1.5 gpt-3.5-turbo
# Sample output:
# Model           Score   Writing  Roleplay  Reasoning  Math   Coding
# gpt-3.5-turbo   7.94    8.88     7.65      6.50       7.25   7.93
# vicuna-13b-v1.5 6.57    7.63     7.20      5.75       4.00   5.25
# ← Math and Coding gap between open and proprietary models is stark

Caption: MT-Bench's two-turn structure is the key design decision. First turn asks the question. Second turn is a follow-up that tests whether the model can maintain context and reasoning. This exposes failure modes that single-turn benchmarks miss entirely, particularly for instruction following and multi-step reasoning.
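For reference, the MT-Bench questions ship as JSONL records along roughly these lines (field names follow the llm_judge data format; the question text and id here are illustrative):

```json
{"question_id": 101,
 "category": "reasoning",
 "turns": ["First turn: pose the problem.",
           "Second turn: a follow-up that depends on the first answer."]}
```

The second entry in `turns` is only asked after the model has answered the first, which is what forces context maintenance into the benchmark.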

In Action: End-to-End Worked Example

Scenario: Research team wants to compare three models on code generation tasks before committing to one for production. They need: side-by-side outputs, human preference collection, automated scoring.

Step 1: Configure the Arena with local and API-based models

{
    "vicuna-13b": {
        "model_name": "vicuna-13b-v1.5",
        "api_type": "worker",
        "api_base": "http://localhost:21001"
    },
    "gpt-3.5-turbo": {
        "model_name": "gpt-3.5-turbo",
        "api_type": "openai",
        "api_base": "https://api.openai.com/v1",
        "api_key": "sk-..."
    },
    "claude-haiku": {
        "model_name": "claude-3-haiku-20240307",
        "api_type": "anthropic_message",
        "api_key": "sk-ant-..."
    }
}

Step 2: Launch Chatbot Arena UI

python3 -m fastchat.serve.gradio_web_server_multi \
    --register-api-endpoint-file api_endpoints.json

The user sees two anonymous model outputs side by side. Model names are hidden until after the vote. This is the critical design choice: anonymity prevents brand bias from corrupting preference data.

Step 3: User submits a coding prompt

Input: "Write a Python function that takes a list of integers and returns all pairs that sum to a target value. Include error handling and a docstring."

Model A output (vicuna-13b-v1.5, ~3.2 seconds on single A100):

def find_pairs(numbers: list[int], target: int) -> list[tuple[int, int]]:
    """
    Find all pairs of integers in a list that sum to target.
    Args:
        numbers: List of integers to search
        target: Target sum value
    Returns:
        List of tuples containing pairs that sum to target
    """
    if not isinstance(numbers, list):
        raise TypeError("Input must be a list")
    seen = set()
    pairs = []
    for num in numbers:
        complement = target - num
        if complement in seen:
            pairs.append((complement, num))
        seen.add(num)
    return pairs

Model B output (gpt-3.5-turbo, ~1.8 seconds via API):

def find_pairs(lst, target):
    result = []
    for i in range(len(lst)):
        for j in range(i + 1, len(lst)):
            if lst[i] + lst[j] == target:
                result.append((lst[i], lst[j]))
    return result

Step 4: Vote recorded, Elo updated

User votes Model A better (better docstring, O(n) vs O(n²) complexity, type hints, error handling).

The Elo update is computed as:

Expected score of A = 1 / (1 + 10^((EloB - EloA) / 400))
New EloA = EloA + K * (actual_score - expected_score)

Where K = 4. The small K factor dampens the rating swing from any single vote, so a model's rating stabilizes as it accumulates battles rather than whipsawing on individual preferences.
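The update rule fits in a few lines (a minimal sketch of the per-battle arithmetic; the published leaderboard computes ratings offline over the full battle history, with bootstrapped confidence intervals):

```python
def elo_update(elo_a: float, elo_b: float, score_a: float, k: float = 4.0):
    """One battle. score_a: 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))
    new_a = elo_a + k * (score_a - expected_a)
    new_b = elo_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated models: expected score is 0.5, so a win moves
# the winner up by K * 0.5 = 2 points and the loser down by the same
print(elo_update(1000.0, 1000.0, 1.0))  # (1002.0, 998.0)
```

Note the zero-sum property: whatever A gains, B loses, so the average rating across the pool is invariant under any sequence of battles.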

Step 5: MT-Bench automated scoring for the same prompt category

GPT-4 judge score for Model A (Coding category): 7/10. Reasoning: "Correct algorithm, good O(n) complexity, proper docstring and type hints. Deducted points for not handling duplicate pairs and not sorting output."

GPT-4 judge score for Model B (Coding category): 5/10. Reasoning: "O(n²) brute force approach, no type hints, no error handling, no docstring."

Real throughput numbers (from community benchmarks):

  • FastChat default model worker (HuggingFace transformers): ~15-25 tok/s on A100 for 7B models

  • FastChat with vLLM worker backend: ~200-400 tok/s on A100 for 7B models (8-16x improvement)

  • Latency difference for a 200-token response: ~8-13 seconds (default) vs ~0.5-1 second (vLLM)

The default worker is not competitive for production throughput. It is designed for correctness, compatibility, and simplicity.

Why This Design Works (and What It Trades Away)

The three-process architecture (controller, worker, API server) is not the most efficient design for serving a single model at high throughput. It is the correct design for a platform that needs to serve 70+ models simultaneously, route requests between them, swap models in and out without downtime, and collect human preferences across model pairs.

The controller's speed-weighted load balancing routes new requests to the worker that is currently generating tokens fastest. This is measured in tokens per second from the worker's heartbeat. A worker that is currently processing a long context will report lower speed and receive fewer new requests. This is a practical approximation to optimal scheduling, not a theoretically optimal solution.
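A minimal sketch of that dispatch policy (illustrative, not the controller's actual code): treat each heartbeat-reported speed as a lottery weight and sample a worker proportionally.

```python
import random

def pick_worker(workers: dict) -> str:
    """workers maps worker address -> reported tokens/sec from its heartbeat."""
    names = list(workers)
    speeds = [workers[n] for n in names]
    # Speed-weighted lottery: faster workers win proportionally more often
    return random.choices(names, weights=speeds, k=1)[0]

workers = {
    "http://localhost:21002": 24.0,  # mostly idle, reports high speed
    "http://localhost:21003": 6.0,   # mid-way through a long context, slower
}
# The idle worker receives roughly 80% of new requests: 24 / (24 + 6)
print(pick_worker(workers))
```

Because the weights come from live heartbeats, a worker bogged down by a long generation naturally sheds new traffic without any explicit queue accounting.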

The Conversation template system is the correct abstraction for a multi-model platform. Every new model has an idiosyncratic prompt format. Vicuna uses USER: prefixes. Llama-2-chat uses special tokens. ChatML uses <|im_start|> markers. Without a unified template layer, every client would need to know which format each model expects. FastChat's adapter pattern hides this complexity behind a single interface, enabling the Arena's anonymous model comparison: both models receive the same raw user input, each worker applies its own conversation template internally.

What FastChat trades away:

Throughput. The default HuggingFace transformers backend does not implement continuous batching, PagedAttention, or flash attention. It processes requests one at a time. For the Arena use case (many different models, low sustained concurrency per model), this is acceptable. For production serving of a single model under high load, the vLLM worker backend is required.

Active development. The last FastChat release was v0.2.36 in February 2024. The project is effectively in maintenance mode. Chatbot Arena (lmarena.ai) continues to run and collect votes, but the open-source codebase itself is not receiving new features. Teams building on FastChat should expect to maintain forks.

Technical Moats

What makes FastChat hard to replicate:

The 1.5 million human vote dataset is the real moat. The Elo leaderboard is only meaningful because the underlying preference data is large, diverse, and collected under controlled conditions (anonymous models, side-by-side comparison, no prompt cherry-picking). Collecting that data required building and operating a platform at scale for years. No paper, no code release, and no fine-tuned model replicates that dataset.

The LLM-as-a-judge methodology is the research contribution with the longest tail. The "Judging LLM-as-a-Judge" paper (NeurIPS 2023) established that GPT-4 agrees with human preferences more than 80% of the time, matching the rate at which humans agree with each other. This validated using LLMs to automate evaluation at scale. The paper also documented the failure modes: position bias (favoring the first answer in a pairwise comparison), verbosity bias (favoring longer answers), and self-enhancement bias (models preferring their own outputs). These biases are now standard knowledge in LLM evaluation, but FastChat's paper is where they were rigorously measured.
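The standard mitigation for position bias, judging each pair twice with the answer order swapped, can be sketched as follows (an illustrative helper, not FastChat's actual judge code):

```python
def aggregate_verdicts(verdict_ab: str, verdict_ba: str) -> str:
    """Combine two pairwise judgments of the same answer pair.

    verdict_ab: judge's pick with A shown first; verdict_ba: judge's pick
    with B shown first. Both are 'A', 'B', or 'tie', expressed relative to
    the ORIGINAL labels (i.e. already un-swapped for the second run).
    """
    if verdict_ab == verdict_ba:
        return verdict_ab  # consistent across both orders: accept the winner
    return "tie"           # inconsistent: position bias suspected, call a tie

print(aggregate_verdicts("A", "A"))  # consistent win for A
print(aggregate_verdicts("A", "B"))  # disagreement collapses to a tie
```

Collapsing inconsistent verdicts to ties trades a little signal for robustness: a judge that only prefers whichever answer appears first contributes nothing but ties.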

The conversation template adapter system covers more model-specific prompt formats than any competing serving framework. Supporting a new model in FastChat requires implementing a Conversation template and a BaseModelAdapter subclass, both well-defined interfaces. This is why FastChat supported over 50 model architectures before any other serving framework had half that many.

Insights

Insight One: Chatbot Arena's Elo ratings are a measure of user preference, not model quality, and conflating the two has caused serious benchmark misuse.

The Arena Elo system measures which model users prefer when given two anonymous responses side by side. This is a real and useful signal. But it is not a measure of factual accuracy, code correctness, or safety. A model that writes confident, fluent, well-formatted incorrect answers will consistently beat a model that writes hesitant but accurate ones. Verbosity bias is documented in the FastChat paper: both GPT-4 as judge and human raters prefer longer responses, independent of accuracy. The Chatbot Arena leaderboard tells you which model is more persuasive. That is not the same as which model is more correct. Teams using Arena Elo as a proxy for coding benchmark performance or factual accuracy are using the metric for something it was not designed to measure.

Insight Two: FastChat's default model worker is deliberately slow, and this is not a bug. It is the correct architectural decision for its actual use case, and most engineers misunderstand why.

The FastChat model worker uses HuggingFace transformers with standard autoregressive generation. It does not implement continuous batching, speculative decoding, or PagedAttention. A 7B model on an A100 generates roughly 15-25 tokens per second. A vLLM backend on the same hardware generates 200-400 tokens per second. Engineers who discover this and complain about FastChat being slow are correct in their observation but wrong in their conclusion. FastChat's default worker is slow because it prioritizes broad model compatibility over throughput. It supports quantization formats (GPTQ, AWQ, ExLlama V2), unusual architectures (RWKV, T5 encoder-decoder), CPU offloading, and CPU-only inference that vLLM does not support. The correct conclusion is: use the default worker for compatibility and evaluation, switch to the vLLM worker for throughput in production.

Takeaway

Vicuna-13B, which in 2023 achieved a claimed "90% ChatGPT quality," was fine-tuned from the original LLaMA on roughly 70K conversations scraped from ShareGPT.com, for approximately $300 in A100 cloud compute (the later v1.5 releases moved to Llama 2 and 125K conversations). The evaluation that established the claim used GPT-4 as judge, meaning the benchmark that validated Vicuna's quality was itself an LLM making pairwise judgments. This circular dependency, using one LLM to benchmark another, is now the standard methodology in the field, and almost nobody talks about the epistemic consequences.

The Vicuna fine-tuning used a learning rate of 2e-5, cosine scheduler, 3 epochs, global batch size 128, max context 2048. Standard hyperparameters, nothing unusual. The data quality work was the real effort: filtering ShareGPT HTML to markdown, removing low-quality samples, splitting long conversations to fit context windows. The "90% ChatGPT quality" claim came from GPT-4 scoring both models' answers on 80 questions: GPT-4's total score for Vicuna-13B came to roughly 90% of its total score for ChatGPT-3.5. This is not the same as "Vicuna is 90% as capable as ChatGPT." It is "on these 80 questions, GPT-4 rated Vicuna's answers at about 90% of ChatGPT's total score." The subsequent MT-Bench paper correctly noted that Vicuna-13B lags significantly in Coding and Math relative to GPT-3.5-turbo, which the original 80-question set did not fully capture.
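A toy calculation makes the distinction concrete (hypothetical scores, not the actual eval data): a model can reach 90% of the judge's total score while winning zero head-to-head comparisons.

```python
# Hypothetical per-question GPT-4 scores on a 1-10 scale
vicuna_scores  = [8, 7, 9, 6, 8]
chatgpt_scores = [9, 8, 9, 8, 8]

# The "90%" figure is a ratio of total scores...
ratio = sum(vicuna_scores) / sum(chatgpt_scores)
print(f"quality ratio: {ratio:.0%}")  # 38/42, about 90%

# ...which is a different statistic from a head-to-head win rate
wins = sum(v > c for v, c in zip(vicuna_scores, chatgpt_scores))
print(f"win rate: {wins}/{len(vicuna_scores)}")  # 0/5 on these numbers
```

Here the weaker model scores 90% of the total while losing or tying every single question, which is exactly why "90% quality" should not be read as "preferred 90% of the time."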

TL;DR For Engineers

  • FastChat is a three-process serving architecture (controller, model worker, API server) plus an evaluation pipeline (MT-Bench with LLM-as-a-judge) plus Chatbot Arena (1.5M human votes, Elo leaderboard for 70+ models). All three are tightly coupled and designed together

  • The default model worker uses HuggingFace transformers and is intentionally slow (~15-25 tok/s on A100 for 7B models), prioritizing model compatibility over throughput. Switch to the vLLM worker backend for production: 8-16x throughput improvement

  • MT-Bench's two-turn question structure is the key evaluation design: first turn asks, second turn follows up. Single-turn benchmarks miss instruction following and context maintenance failures

  • GPT-4 as judge achieves over 80% agreement with human preference, matching inter-human agreement rates. Documented failure modes: position bias, verbosity bias, self-enhancement bias. Swap answer positions (judge both orders) to mitigate position bias

  • FastChat is in maintenance mode as of early 2024 (last release v0.2.36, February 2024). The Chatbot Arena platform continues operating, but the open-source codebase is not receiving new features

The Benchmark That Measured What Others Couldn't

FastChat is not the fastest LLM serving system. vLLM is. It is not the most actively developed open-source platform. Ollama is. What FastChat built that nothing else has replicated is a ground truth dataset: 1.5 million human preferences, collected blind, across 70+ models, over two years of production operation. Every Elo score on lmarena.ai is backed by real humans who chose one response over another without knowing which model produced it. The MT-Bench methodology, the LLM-as-a-judge validation, the bias documentation, the Conversation template abstraction, these are engineering and research contributions that ripple through the entire field. The correct frame for FastChat is not "a serving system that got slow." It is "the research infrastructure that established how to measure LLM quality at scale." That problem was genuinely unsolved before FastChat. It is now the field standard because FastChat solved it.

Summary

FastChat is a three-component distributed LLM serving system (controller routing, model workers, OpenAI-compatible API server) coupled with the MT-Bench evaluation framework (80 two-turn questions, GPT-4 as judge with documented >80% human agreement) and Chatbot Arena (lmarena.ai), which has collected 1.5 million human preferences across 70+ LLMs to produce the only leaderboard grounded in real human blind preference voting. The default model worker prioritizes compatibility over throughput (~15-25 tok/s on A100) and should be replaced with the vLLM worker backend for production. FastChat's lasting contributions are the LLM-as-a-judge methodology (with documented failure modes: position, verbosity, and self-enhancement bias), the Conversation template abstraction enabling multi-model platforms, and the Elo rating dataset that established how to measure human preference for open-ended chat at scale.

Sponsored Ad

If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀

The AI notepad for back-to-back meetings

Most AI note-takers just record your call and send a summary after.

Granola is different. It’s an AI notepad. You jot down what matters during the meeting, and Granola transcribes everything in the background.

When the call ends, it combines your notes with the full transcript to create summaries, action items, and next steps, all from your point of view.

Then the powerful part: chat with your notes. Write follow-up emails, pull out decisions, or prep for your next call, in seconds.

Think of it as a super-smart notes app that actually understands your meetings.
