
TL;DR for Engineers

  • LMCache is a multi-tier KV cache layer (GPU → CPU → Redis/disk) that plugs into vLLM/SGLang via a standardized connector; no inference engine fork required.

  • SGLang's RadixAttention manages KV cache as a radix tree, enabling O(n) prefix lookup and automatic cross-request reuse within a single runtime.

  • Combined benchmark gain: up to 15× throughput vs. vLLM native on multi-turn and document-analysis workloads. TTFT drops from 11s → 1.5s at 128K context (VAST Data production benchmark).

  • The real unlock is prefill-decode (PD) disaggregation: route the compute-heavy prefill to one GPU cluster and the memory-bound decode to another. LMCache is the KV transfer bus between them.

  • Cache hit rates of 50–80% in production are common once you account for dynamically reusable patterns: chat history, RAG chunks, agent observations. Most teams assume much lower and leave massive compute savings on the table.

KV Caching

"KV caching" is one of the most overused terms in LLM infrastructure, and one of the least understood. Most teams think it means "same system prompt, skip recompute." The real opportunity is an order of magnitude larger, and almost nobody is harvesting it.

Every transformer inference call produces K and V tensors for each attention head at each layer. In a 70B model with 80 layers and a 128K context, that's hundreds of gigabytes of intermediate state generated, used once, then discarded. The entire industry spent years treating each request as a hermetically sealed computation.
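That "hundreds of gigabytes" figure falls out of simple arithmetic. A back-of-the-envelope sketch (80 layers and a 128-dim head are representative of a 70B-class model; exact head counts vary by architecture, and grouped-query attention shrinks the total):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for ONE sequence: K and V tensors (factor of 2)
    per layer, per KV head, per token, in fp16/bf16 (2 bytes/element)."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem

# Full multi-head attention: 80 layers, 64 KV heads, head_dim 128, 128K context
mha = kv_cache_bytes(80, 64, 128, 128 * 1024)
print(f"MHA: {mha / 2**30:.0f} GiB")   # 320 GiB -- "hundreds of gigabytes"

# Grouped-query attention with 8 KV heads (as in Llama-3-70B) cuts it 8x
gqa = kv_cache_bytes(80, 8, 128, 128 * 1024)
print(f"GQA: {gqa / 2**30:.0f} GiB")   # 40 GiB -- still far too big to ignore
```

Either way, that is tens to hundreds of gigabytes of state per long-context request, computed once and discarded.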

LMCache and SGLang are the first open-source systems to change that framing. The insight is not just "reuse prefixes." It's deeper: the KV cache is a compressed representation of knowledge, and we've been throwing it away after every query.

What They Actually Do

LMCache is a KV cache storage and transfer library. It sits between your application and inference engine (vLLM or SGLang), intercepts KV tensors after each prefill, stores them in a tiered hierarchy (GPU VRAM → CPU DRAM → Redis/NFS/S3), and injects them back on cache hits. It also handles cross-engine KV transfer, making it the transport layer for prefill-decode disaggregation.

SGLang (Structured Generation Language, NeurIPS 2024) is simultaneously a Python DSL for writing multi-step LLM programs and a serving runtime. Its core runtime innovation, RadixAttention, manages KV cache as a radix tree with LRU eviction, enabling automatic prefix sharing across concurrent requests. Its HiCache extension (2025) adds a three-tier storage hierarchy natively inside the runtime.

The distinction matters: SGLang solves the within-engine KV reuse problem. LMCache solves the cross-engine, cross-request, persistent KV reuse problem. They are complements, not competitors.

By the numbers

  • Throughput vs. vLLM native: 15× (LMCache tech report)

  • TTFT at 128K context: 11 s → 1.5 s, a 7.3× reduction (VAST Data production)

  • Throughput, SGLang vs. vLLM/LMQL: 6.4× (NeurIPS 2024)

  • TTFT on cache hit (DeepSeek-R1-671B): 84% reduction (SGLang HiCache blog)

The Architecture, Unpacked

LMCache: Three-Tier KV Storage

┌──────────────────────────────────────────────────────────────┐
│                     APPLICATION LAYER                        │
│        (RAG pipeline / agentic loop / multi-turn chat)       │
└───────────────────────────┬──────────────────────────────────┘
                            │ HTTP / Python
┌───────────────────────────▼──────────────────────────────────┐
│                    LMCACHE MIDDLEWARE                         │
│  ┌─────────────┐  ┌──────────────────┐  ┌────────────────┐  │
│  │ Prefix Hash │─▶│  Cache Lookup    │─▶│Eviction Policy │  │
│  │ (token seq) │  │  L1 → L2 → L3   │  │(LRU/TTL/size)  │  │
│  └─────────────┘  └────────┬─────────┘  └────────────────┘  │
│                            │                                  │
│          ┌─────────────────▼──────────────────────┐          │
│          │      KV Cache Connector Interface       │          │
│          │  (standardized adapter, engine-agnostic)│          │
│          └──────┬─────────────────────┬────────────┘          │
└─────────────────┼─────────────────────┼──────────────────────┘
                  │                     │
     ┌────────────▼──────────┐  ┌───────▼──────────────────┐
     │   INFERENCE ENGINE    │  │    STORAGE HIERARCHY      │
     │   vLLM  |  SGLang     │  │                           │
     │                       │  │  L1: GPU VRAM  (~80 GB)   │
     │  Prefill → KV ────────┼─▶│  L2: CPU DRAM  (~1–2 TB) │
     │           extracted   │  │  L3: Redis/NFS/S3/3FS     │
     │  Decode  ← KV ────────┼──│      (petabyte-scale)     │
     │           injected    │  │                           │
     └───────────────────────┘  │  Transfer: gRPC/RDMA/NIXL │
                                └───────────────────────────┘

Key insight: The KV Connector Interface is the central abstraction. It decouples LMCache entirely from inference engine internals. When vLLM changes its paged-attention memory layout (which happens constantly as new model architectures ship), only the thin connector adapter changes, not the cache layer above it.
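To make the decoupling concrete, here is a minimal sketch of what such a connector boundary can look like in Python. The names (`KVConnector`, `store`, `retrieve`, `InMemoryConnector`) are illustrative, not LMCache's actual API:

```python
from typing import Any, Optional, Protocol

class KVConnector(Protocol):
    """The boundary the cache layer programs against. An engine-side
    adapter translates between these hooks and the engine's paged KV
    layout, so only the adapter changes when that layout does."""

    def store(self, chunk_hash: str, kv: Any) -> None:
        """Called after prefill, once KV tensors are extracted."""
        ...

    def retrieve(self, chunk_hash: str) -> Optional[Any]:
        """Called before prefill; None means cache miss, compute as usual."""
        ...

class InMemoryConnector:
    """Trivial dict-backed backend standing in for the tiered store."""
    def __init__(self) -> None:
        self._chunks: dict[str, Any] = {}

    def store(self, chunk_hash: str, kv: Any) -> None:
        self._chunks[chunk_hash] = kv

    def retrieve(self, chunk_hash: str) -> Optional[Any]:
        return self._chunks.get(chunk_hash)   # None signals a miss
```

Any backend (CPU DRAM pool, Redis client, S3 wrapper) that satisfies the protocol can slot in without the cache layer knowing the difference.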

SGLang: RadixAttention + HiCache

SGLang Runtime
┌──────────────────────────────────────────────────────────────┐
│                   FRONTEND (Python DSL)                      │
│  @function → fork() → gen() → select() → join()             │
│  Compiles to execution graph with shared-prefix detection    │
└──────────────────────────┬───────────────────────────────────┘
                           │ execution graph
┌──────────────────────────▼───────────────────────────────────┐
│                  RUNTIME SCHEDULER                           │
│                                                              │
│  Requests sharing the same prefix:                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                  │
│  │ system:  │  │ system:  │  │ system:  │ ← shared prefix  │
│  │ You are  │  │ You are  │  │ You are  │   computed ONCE  │
│  │ a coder  │  │ a coder  │  │ a coder  │                  │
│  │ Q: Fix   │  │ Q: Write │  │ Q: Debug │ ← unique suffix  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘                  │
│       └─────────────┴─────────────┘                        │
│                     │ one prefill for the shared prefix     │
│  ┌──────────────────▼────────────────────────────────────┐  │
│  │         RADIX TREE (HiRadixTree in HiCache)            │  │
│  │                                                        │  │
│  │  root                                                  │  │
│  │   └─[system: You are a coder] → KV ptr (GPU L1)       │  │
│  │        ├─[Q: Fix the bug]     → KV ptr (GPU L1)       │  │
│  │        ├─[Q: Write a class]   → KV ptr (CPU L2)       │  │
│  │        └─[Q: Debug this]      → KV ptr (Disk L3)      │  │
│  │                                                        │  │
│  │  LRU eviction: cold blocks migrate GPU → CPU → Disk    │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  GPU-assisted I/O kernels: 3× faster CPU→GPU than           │
│  cudaMemcpy. Layer-pipelined prefetch: load layer N+1 KV    │
│  while computing layer N.                                    │
└──────────────────────────────────────────────────────────────┘

Key insight: Prefix lookup is O(token sequence length), not O(number of requests). The HiCache extension turns every tree node into a pointer to one of three storage tiers. When a request hits L2, the system prefetches to GPU L1 while decode starts; the I/O latency is hidden behind computation.

The Code, Annotated

Snippet 1 — Enabling LMCache with vLLM (the connector pattern)

# lmcache_config.yaml
# This config is the entire "database schema" for your KV cache layer.

chunk_size: 256          # CRITICAL: chunk >> vLLM page size (16 tokens)
                         # Larger chunks = fewer CUDA kernel launches
                         # = 3× higher effective I/O bandwidth

local_cpu:
  enabled: true
  max_cache_size: 50     # GB — CPU DRAM as L2 tier

redis:
  enabled: true
  url: "redis://cache-node:6379"  # L3 remote store, shared across nodes

eviction_policy: "lru"

enable_pipelining: true  # THIS IS THE TRICK: compute-I/O overlap.
                         # While GPU computes layer N attention,
                         # LMCache DMA-copies layer N+1 KV from CPU.
                         # Net effect: KV load cost ≈ 0 on cache hit.

# Python: launch vLLM with the LMCache connector
import subprocess
subprocess.run([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Llama-3-70B-Instruct",
    "--kv-transfer-config",
    # The connector is a JSON config, not a vLLM source fork.
    # This single arg makes KV cache a first-class I/O object.
    '{"kv_connector":"LMCacheConnectorV2",'
    ' "kv_connector_extra_config":{"lmcache_config_file":"lmcache_config.yaml"}}'
])

Why this matters: The kv_connector argument is the standardized interface boundary. LMCache doesn't patch vLLM internals; it only hooks into the KV tensor handoff points. When vLLM releases a new memory layout for a new model architecture, only the thin connector adapter changes.

Snippet 2 — SGLang's fork/join pattern (where RadixAttention earns its keep)

import sglang as sgl

@sgl.function
def analyze_document(s, document: str, questions: list[str]):
    # Shared prefix: system prompt + document.
    # SGLang computes the KV cache for this ONCE,
    # regardless of how many questions follow.
    s += sgl.system("You are a precise document analyst.")
    s += sgl.user(f"Document:\n{document}\n\nAnswer questions about it.")
    s += sgl.assistant("Understood. Ready.")

    # fork() creates N branches — each gets a POINTER to the shared
    # KV cache node, not a copy of the KV tensors.
    # O(1) memory regardless of how many branches you create.
    forks = s.fork(len(questions))   # ← THIS IS THE TRICK

    for f, q in zip(forks, questions):
        f += sgl.user(q)
        # Each branch runs decode-only — shared prefix already cached.
        # 4K-token doc, 8 questions:
        #   Naive:  8 prefills + 8 decodes
        #   SGLang: 1 prefill  + 8 parallel decodes
        f += sgl.assistant(sgl.gen("answer", max_new_tokens=200))

    return [f["answer"] for f in forks]

runtime = sgl.Runtime(model_path="meta-llama/Llama-3-70B-Instruct")
sgl.set_default_backend(runtime)
state = analyze_document.run(
    document=open("contract.txt").read(),
    questions=[
        "What is the payment term?",
        "Who are the parties?",
        "What are the termination clauses?",
        "What is the jurisdiction?"
    ]
)

Why this matters: fork() is a zero-copy operation, a pointer branch in the radix tree, not a tensor copy. On a 4K-token contract with 4 questions, the prefix is prefilled once. A naive API loop would prefill the same 4K tokens 4 separate times.
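The arithmetic behind that claim, with illustrative token counts (a 4,096-token document prefix and 20-token questions):

```python
def prefill_tokens(prefix: int, question: int, n_questions: int,
                   shared: bool) -> int:
    """Total tokens prefilled to answer n questions about one document."""
    if shared:
        return prefix + n_questions * question    # prefix prefilled once
    return n_questions * (prefix + question)      # naive: re-prefill each time

naive  = prefill_tokens(4096, 20, 4, shared=False)
forked = prefill_tokens(4096, 20, 4, shared=True)
print(naive, forked, f"{naive / forked:.1f}x")    # 16464 4176 3.9x
```

The savings grow linearly with the number of branches and with prefix length, which is why fork-heavy workloads (batch QA, self-consistency sampling, tree search) see the largest gains.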

The Stack in Action: An End-to-End Worked Example

Scenario: A coding assistant running Qwen3-Coder-480B. Users average 8 turns per session, 25K tokens of accumulated history by turn 8.

Input: User sends turn 8. Prompt = [25,000 token history] + [120 token new question]. Without caching: the engine prefills all 25,120 tokens from scratch.

Step 1 — Cache lookup: SGLang's HiRadixTree hashes the 25K prefix. Finds a hit in CPU L2 (host memory). Triggers async prefetch to GPU L1 immediately.

Step 2 — Prefill (cache hit path): Only 120 new tokens are prefilled on GPU. The 25K KV tensors load via GPU-assisted I/O kernels (3× faster than cudaMemcpyAsync). I/O runs concurrently with GPU computation — latency is hidden.

Step 3 — Decode: Autoregressive generation proceeds normally. KV tensors for the 120 new tokens are appended to the radix tree node for this session.

Step 4 — Write-back: LMCache asynchronously writes new KV tensors to CPU L2, then L3 — without blocking the response. The next request can hit L2 immediately.
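The four steps above can be sketched as a single control flow. Everything here is a toy stand-in (a dict instead of a tiered store, token counts instead of KV tensors); none of it is the real LMCache/SGLang API:

```python
class TieredCache:
    """Dict-backed stand-in for the GPU -> CPU -> remote hierarchy."""
    def __init__(self) -> None:
        self._kv: dict[tuple, int] = {}

    def lookup(self, tokens: list[int]) -> bool:          # step 1
        return tuple(tokens) in self._kv

    def write_back(self, tokens: list[int]) -> None:      # step 4
        # The real system does this asynchronously, off the response path.
        self._kv[tuple(tokens)] = len(tokens)

def serve_turn(history: list[int], new: list[int], cache: TieredCache) -> int:
    """Returns how many tokens this turn must prefill (steps 2-3 elided)."""
    prefilled = len(new) if cache.lookup(history) else len(history) + len(new)
    cache.write_back(history + new)    # the next turn can hit immediately
    return prefilled

cache = TieredCache()
history = list(range(25_000))
turn7_q = list(range(25_000, 25_120))
print(serve_turn(history, turn7_q, cache))             # cold miss: 25120
turn8_q = list(range(25_120, 25_240))
print(serve_turn(history + turn7_q, turn8_q, cache))   # warm hit: 120
```

The two printed numbers are the whole story: the warm turn prefills 0.5% of the tokens the cold turn did.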

Real numbers from this exact scenario (SGLang HiCache + 3FS, production):

  • Cache hit rate: 40% → 80% after deploying hierarchical storage

  • Session average TTFT: 56% reduction

  • Inference throughput: 2× improvement

  • DeepSeek-R1-671B with Mooncake: cache hits achieved 84% TTFT reduction vs. full recompute

  • At 128K context with NFS storage: TTFT from 11s → 1.5s (7.3× improvement)

Why This Design Works, And What It Trades Away

Why it works: The core insight is that LLM inference has a massive, systematic redundancy problem that existing engines accepted as a given. Token sequences are not truly independent: system prompts, RAG chunks, conversation history, and few-shot examples repeat across queries. Treating KV cache as an addressable, storable, transferable data structure instead of ephemeral GPU state is what unlocks the gains.

The compute-I/O pipelining is the hidden gem. On a 70B model with 80 layers, each layer's KV cache can be DMA-transferred independently. With PCIe 5.0 CPU-GPU bandwidth (~64 GB/s), this nearly eliminates KV load overhead on cache hits.
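A rough model of that overlap, with assumed numbers: 40 GiB of KV (a GQA 70B model at 128K context) spread over 80 layers is 0.5 GiB per layer, roughly 8 ms of DMA at ~64 GB/s; the 10 ms of per-layer compute is a hypothetical figure for illustration:

```python
def serial_ms(layers: int, xfer_ms: float, compute_ms: float) -> float:
    """No overlap: each layer waits for its KV transfer, then computes."""
    return layers * (xfer_ms + compute_ms)

def pipelined_ms(layers: int, xfer_ms: float, compute_ms: float) -> float:
    """Transfer layer N+1's KV while computing layer N: after one fill
    step, only the slower of the two stages paces the pipeline."""
    return xfer_ms + layers * max(xfer_ms, compute_ms)

print(serial_ms(80, 8.0, 10.0))      # 1440.0 ms -- KV load fully exposed
print(pipelined_ms(80, 8.0, 10.0))   # 808.0 ms -- KV load nearly free
```

As long as per-layer transfer time stays below per-layer compute time, the KV load cost collapses to a single pipeline-fill step.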

What it trades away:

  • Cache coherence complexity. Multi-tenant deployments must namespace KV entries by model version and session. A model update invalidates the entire cache, so cold-start latency spikes until it rewarms.

  • Short-prompt penalty. Cache lookup overhead can exceed prefill time for prompts under ~500 tokens. LMCache's own paper acknowledges this directly. Don't deploy this for chatbots with one-sentence prompts.

  • Memory pressure. CPU DRAM is now a hot resource. A production node holding 50 GB of CPU-tier KV cache is memory-constrained for other workloads. Plan your NUMA topology accordingly.

  • SGLang's DSL learning curve. The fork()/join() model is powerful but unfamiliar. Teams trained on vanilla OpenAI-style API calls will need to restructure application logic to exploit prefix sharing.

Technical Moats

  • KV Connector Interface: survives engine churn. vLLM and SGLang change paged-attention internals constantly (15–20 new open-weight models ship per week), and most caching approaches break on engine updates. This connector abstraction took months of co-design with both engine teams.

  • Compute-I/O pipeline: off-the-shelf cudaMemcpyAsync calls are 3× slower than the custom GPU-assisted I/O kernels in LMCache and SGLang HiCache. Both projects built these independently, confirming the approach, but the kernel engineering is months of non-trivial systems work.

  • PD disaggregation: cross-engine KV transfer at RDMA/NVLink speeds requires tight scheduler integration. LMCache supports NIXL (NVIDIA Inference Xfer Library), enabling near-NVLink bandwidth for cross-node KV movement. This is infrastructure-level work, not application-level config.

  • Ecosystem breadth: LMCache now supports 10 storage backends (Redis, NFS, WEKA, S3, InfiniStore, Mooncake, 3FS, Valkey, GPU-Direct, NIXL), 4 hardware types (NVIDIA, AMD, Ascend, TPU), and 2 inference engines, all contributed by enterprise partners. This network effect is the real moat.

Insights

SGLang's real innovation is not RadixAttention

RadixAttention alone could be ported to vLLM, and partially has been. What's genuinely novel about SGLang is the co-designed frontend and runtime. The programming model exposes prefix structure to the runtime before execution starts: the KV radix tree knows about fork() branches before they execute. This gives the scheduler information that a general-purpose REST API can never provide. No other open-source inference system has this. When people say "SGLang is fast because of RadixAttention," they're explaining roughly half the story.

The 15× headline is real, but workload-specific

The 15× throughput gain is an honest benchmark for high-reuse, long-context, multi-turn workloads. For workloads with diverse, non-repeating prompts (creative generation, one-shot summarization, novel tasks), LMCache adds operational overhead with near-zero benefit. The actual value depends entirely on your cache hit rate. Measure it before deploying. If your hit rate is under ~20%, the operational complexity of a tiered KV cache layer may not be worth it.
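A simple expected-value model shows why the hit rate dominates. All latency figures here are purely illustrative (a 2,000 ms full prefill, a 300 ms hit path, a 20 ms lookup tax paid by every request):

```python
def expected_ttft_ms(hit_rate: float, full_prefill_ms: float,
                     hit_path_ms: float, lookup_ms: float) -> float:
    """Every request pays the lookup; hits take the fast path,
    misses still pay the full prefill on top of it."""
    return lookup_ms + hit_rate * hit_path_ms + (1 - hit_rate) * full_prefill_ms

for rate in (0.1, 0.2, 0.5, 0.8):
    print(f"hit rate {rate:.0%}: {expected_ttft_ms(rate, 2000, 300, 20):.0f} ms")
# ~1850, ~1680, ~1170, ~660 ms: below ~20% the cache barely moves TTFT;
# at 80% it cuts TTFT roughly 3x against the 2000 ms baseline.
```

Plug in your own measured prefill time and hit rate before committing to the operational overhead.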

Surprising Takeaway

Enterprise deployments are seeing 50–80% cache hit rates for workloads that conventional wisdom assumed were uncacheable: coding assistants, RAG pipelines, agentic loops.

The reason: modern LLM applications have dynamically reusable contexts. Conversation history, retrieved chunks, and chain-of-thought reasoning steps repeat across users and sessions in patterns that look structurally identical to fixed system prompts. LMCache's own enterprise customers were surprised by their production hit rates; one company reported a 50% hit rate where it had expected nearly zero.

The implication: almost every LLM serving stack is currently burning compute on redundant prefill that a KV caching layer would eliminate. This is not a niche optimization. It's a structural inefficiency running in production, at scale, right now in most of the LLM-powered products you use every day.

The Verdict: Infrastructure That Earns Its Complexity

LMCache and SGLang are not clever hacks bolted on top of existing inference engines. They are a coherent rethinking of what an LLM inference stack should look like when KV cache is a first-class resource rather than an ephemeral byproduct of attention computation.

The benchmarks are real. The engineering is deep. And the adoption signal (10 storage backends, 4 hardware targets, production deployments across multiple enterprises) confirms this is not a research prototype.

The threshold question is simple: does your workload have repeating prefixes? If yes (and the enterprise data strongly suggests most do), this stack will cut your TTFT and GPU spend in ways that scaling hardware cannot match. If your workload is pure creative diversity with zero prefix overlap, skip it and save the operational overhead.

For agentic systems, RAG pipelines, multi-turn products, and any LLM workflow with shared context: running without a KV caching layer in 2026 is the infrastructure equivalent of running a database with no indexes. The data is there. You're just not reusing it.

References

  1. Cheng Y. et al. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. TensorMesh / University of Chicago, 2025. lmcache.ai/tech_report.pdf

  2. Zheng L. et al. SGLang: Efficient Execution of Structured Language Model Programs. NeurIPS 2024. proceedings.neurips.cc

  3. LMSYS Org. SGLang HiCache: Fast Hierarchical KV Caching. Sept 2025. lmsys.org/blog/2025-09-10-sglang-hicache

  4. VAST Data. Accelerating Inference: LMCache + vLLM on VAST AI OS. 2025. vastdata.com/blog/accelerating-inference

  5. Redis Blog. Get Faster LLM Inference with LMCache and Redis. redis.io

  6. LMCache Blog. LMCache on GKE: KV Cache on Tiered Storage. Oct 2025. blog.lmcache.ai

  7. Clarifai. Comparing SGLang, vLLM, and TensorRT-LLM. Jan 2026. clarifai.com

LMCache and SGLang solve LLM inference's biggest hidden cost: redundant recomputation. LMCache stores KV cache across a GPU → CPU → Redis hierarchy and reuses it across requests, while SGLang's RadixAttention shares it automatically within a runtime; together they cut TTFT from 11 s to 1.5 s at 128K context and lift throughput by up to 15×. The core insight is simple: your inference stack is throwing away computed knowledge after every query, and these two tools are the first open-source systems built to stop that.

Sponsored Ad
If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad—it helps us keep building and delivering value 🚀

88% resolved. 22% stayed loyal. What went wrong?

That's the AI paradox hiding in your CX stack. Tickets close. Customers leave. And most teams don't see it coming because they're measuring the wrong things.

Efficiency metrics look great on paper. Handle time down. Containment rate up. But customer loyalty? That's a different story — and it's one your current dashboards probably aren't telling you.

Gladly's 2026 Customer Expectations Report surveyed thousands of real consumers to find out exactly where AI-powered service breaks trust, and what separates the platforms that drive retention from the ones that quietly erode it.

If you're architecting the CX stack, this is the data you need to build it right. Not just fast. Not just cheap. Built to last.
