Prefill-as-a-Service (Moonshot AI, Tsinghua University, arXiv:2604.15039) shows that hybrid-attention models shrink the KVCache enough to send it across commodity Ethernet between datacenters, and the resulting architecture delivers 54% higher throughput than a same-datacenter baseline. The catch: shrinking the cache alone is not sufficient. You still need selective offloading and bandwidth-aware scheduling, or the system falls apart under real traffic.
SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 16, 2026
Prefill-decode (PD) disaggregation became the standard architecture for large-scale LLM serving because the two phases have fundamentally different resource profiles. Prefill is compute-bound: it processes the entire input prompt in parallel and is limited by raw FLOPS. Decode is memory-bandwidth-bound: it generates tokens one at a time and is limited by how fast you can move the KVCache through GPU memory. Splitting these phases onto different hardware lets you specialize: compute-dense accelerators for prefill, bandwidth-optimized accelerators for decode.
The problem is what happens between the phases. When prefill finishes, it has produced a KVCache for the request, and that cache has to move to wherever decode is running. For a 32K-token request on a dense-attention model like MiniMax-M2.5, a single prefill instance produces KVCache at roughly 60 Gbps. That number forces prefill and decode onto the same RDMA-class fabric inside one datacenter. Heterogeneous serving, the genuinely useful idea of running prefill on one chip architecture and decode on another, has been theoretically attractive and practically blocked by this bandwidth wall for years.
Hybrid-attention architectures change the math. Models like Kimi Linear, which interleave a small number of full-attention layers with a larger number of linear-complexity layers (Kimi Delta Attention, or KDA), reduce KVCache size by roughly an order of magnitude compared to dense attention. That reduction is what makes cross-datacenter KVCache transport newly plausible. Prefill-as-a-Service (PrfaaS) is the system that turns "plausible" into "practical."
Scope: the bandwidth wall that PD disaggregation has always faced, the KV throughput metric that determines whether cross-datacenter transport is viable, PrfaaS's three system-level mechanisms (selective offloading, bandwidth-aware scheduling, hybrid prefix-cache pooling), and the case study results on a 1T-parameter model. Also covered: the relationship to Mooncake (arXiv:2407.00079), the predecessor KVCache-centric architecture from the same research lineage. Not covered: the full mathematical derivation of the throughput-optimal configuration, or hardware-specific cost modeling beyond the paper's H200/H20 case study.
What It Actually Does
PrfaaS is a serving architecture, not a model. It sits on top of any hybrid-attention model and decides, per request, whether to run prefill locally (in the same datacenter as decode) or remotely (in a standalone, compute-dense prefill cluster located elsewhere), then transfers the resulting KVCache over commodity Ethernet rather than RDMA.
The bottleneck formula that determines feasibility:
Φ_kv(l) = S_kv(l) / T_prefill(l)
Where S_kv(l) is the KVCache size for a request of length l, and T_prefill(l) is the prefill latency. This is the KV throughput: how fast a prefill instance generates cache data that needs to move across the network. For dense-attention models, this number is large enough (tens of Gbps at moderate context lengths) that it exceeds typical cross-datacenter Ethernet capacity. For hybrid-attention models, the order-of-magnitude KVCache reduction brings this number down to something a standard inter-datacenter link can sustain.
Model architectures referenced in the paper (A:B = linear-layer to full-attention-layer ratio):
| Model | Linear Mechanism | Full Attention | A:B Ratio | Params |
|---|---|---|---|---|
| Kimi Linear | KDA | MLA | 3:1 | 48B |
| MiMo-V2-Flash | SWA | GQA | 5:1 | 309B |
| Qwen3.5-397B | GDN | GQA | 3:1 | 397B |
| Ring-2.5-1T | Lightning Attention | MLA | 7:1 | 1T |
The Architecture, Unpacked

Focus on the length-based threshold routing applied to each incoming request. This single design decision is what separates PrfaaS from a naive "externalize everything" approach. The paper is explicit that smaller KVCache alone does not make cross-datacenter PD practical: bursty traffic, skewed request lengths, and uneven prefix-cache distribution still break a system that sends every request across the wire indiscriminately.
The Code, Annotated
Snippet One: KV Throughput Feasibility Check and Length-Based Routing
# PrfaaS-style feasibility check and request routing
# Reconstructed from arXiv:2604.15039 Section 2.1 and Section 3
# Determines whether cross-datacenter prefill offload is viable for a request
from dataclasses import dataclass
from enum import Enum


class RouteDecision(Enum):
    LOCAL_PD = "local_pd"            # short request: stays in local PD cluster
    REMOTE_PRFAAS = "remote_prfaas"  # long request: offload to remote cluster


@dataclass
class ModelKVProfile:
    """
    KV throughput profile for a specific model architecture.
    ← This is computed once per model, not per request: it's a property
      of the architecture (dense vs hybrid attention), not the workload.
    """
    kv_bytes_per_token: float       # KVCache bytes generated per input token
    prefill_flops_per_token: float  # compute cost per input token

    def kv_throughput_gbps(self, seq_len: int, prefill_latency_s: float) -> float:
        """
        Φ_kv(l) = S_kv(l) / T_prefill(l)
        ← THIS is the core feasibility metric from the paper (Equation 1).
          It tells you: how many bits of KVCache does this model generate
          per second of prefill, for a request of this length?
          If this number exceeds your cross-datacenter link capacity,
          offloading this request will stall the link, not speed things up.
        """
        kv_cache_bytes = self.kv_bytes_per_token * seq_len
        kv_cache_bits = kv_cache_bytes * 8
        return (kv_cache_bits / prefill_latency_s) / 1e9  # Gbps


# ── DENSE MODEL: the bandwidth wall ────────────────────────────────────────────
# MiniMax-M2.5-style dense GQA model, no hybrid attention layers
dense_model = ModelKVProfile(
    kv_bytes_per_token=128 * 1024,  # illustrative value: full KV per layer, every layer
    prefill_flops_per_token=2e9,
)
# At 32K tokens and a prefill latency of roughly 0.57 s, these illustrative
# values work out to ~60 Gbps (the paper's reported figure for this model class)
# ← A typical cross-datacenter Ethernet link is nowhere near sufficient
#   This is WHY dense models are stuck inside one datacenter's RDMA fabric

# ── HYBRID MODEL: the unlock ───────────────────────────────────────────────────
# Kimi Linear-style hybrid model: KDA:MLA at 3:1 ratio
# ← Only 1 in 4 layers produces full KVCache; the other 3 use bounded-state KDA
hybrid_model = ModelKVProfile(
    kv_bytes_per_token=128 * 1024 / 8,  # ~8x reduction from hybrid layer mix
    prefill_flops_per_token=2e9,        # compute cost is roughly unchanged
)
# ← THIS 8x reduction in KV bytes/token is what makes Φ_kv low enough
#   to fit within commodity Ethernet bandwidth budgets


def route_request(
    seq_len: int,
    uncached_len: int,  # tokens NOT covered by an existing prefix cache
    threshold_t: int,   # length-based routing threshold (tunable)
    available_bandwidth_gbps: float,
    model_profile: ModelKVProfile,
    estimated_prefill_latency_s: float,
) -> RouteDecision:
    """
    Length-based threshold routing (paper Section 1, Section 3.4.3).
    ← THIS is the trick: route by UNCACHED length, not total length.
      A request with a long prompt but a long matching prefix cache
      effectively has a SHORT uncached prefill workload.
      Routing it remotely would waste cross-cluster bandwidth for no gain.
    """
    if uncached_len <= threshold_t:
        # Short uncached prefill: cheap enough to run locally
        # Avoids the round-trip latency and bandwidth cost of remote offload
        return RouteDecision.LOCAL_PD

    # Long uncached prefill: check if remote offload is bandwidth-feasible
    required_gbps = model_profile.kv_throughput_gbps(
        seq_len, estimated_prefill_latency_s
    )
    if required_gbps > available_bandwidth_gbps:
        # ← Bandwidth-aware fallback: even a "long" request stays local
        #   if the cross-datacenter link can't sustain its KV throughput
        #   right now. This is the dynamic, real-time half of dual-timescale
        #   scheduling: react to fluctuating conditions before congestion hits.
        return RouteDecision.LOCAL_PD
    return RouteDecision.REMOTE_PRFAAS


# ── EXAMPLE: routing decisions for a mixed workload ───────────────────────────
short_request = route_request(
    seq_len=512, uncached_len=480, threshold_t=4096,
    available_bandwidth_gbps=10, model_profile=hybrid_model,
    estimated_prefill_latency_s=0.05,
)
print(short_request)  # RouteDecision.LOCAL_PD (below threshold)

long_request_feasible = route_request(
    seq_len=64000, uncached_len=60000, threshold_t=4096,
    available_bandwidth_gbps=10, model_profile=hybrid_model,
    estimated_prefill_latency_s=2.1,
)
print(long_request_feasible)  # RouteDecision.REMOTE_PRFAAS (long + bandwidth OK)
The route_request() function's use of uncached_len rather than seq_len is the detail that separates a paper-correct implementation from a naive one. A 100K-token request with a 95K-token cache hit only needs to prefill 5K tokens. Routing the full 100K-token request to a remote cluster based on total length would be wrong; the actual compute and KVCache transfer cost is determined by what still needs computing, not what the prompt contains in total.
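How `uncached_len` gets computed is left open above. The following is a minimal sketch of one common pattern, assuming a block-granular prefix cache keyed by chained hashes. The names `BLOCK_SIZE`, `block_hashes`, `matched_prefix_len`, and `uncached_len` are hypothetical, and nothing here is taken from the paper's implementation.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 4  # hypothetical block size in tokens; real systems use larger blocks


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Chained hash per full block: each hash covers the whole prefix up to that block."""
    hashes = []
    parent = ""
    for i in range(len(tokens) // block_size):
        block = tokens[i * block_size:(i + 1) * block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def matched_prefix_len(tokens: List[int], cached: Set[str],
                       block_size: int = BLOCK_SIZE) -> int:
    """Number of leading tokens covered by consecutive cached blocks."""
    matched_blocks = 0
    for h in block_hashes(tokens, block_size):
        if h not in cached:
            break  # a prefix match must be contiguous from the start
        matched_blocks += 1
    return matched_blocks * block_size


def uncached_len(tokens: List[int], cached: Set[str],
                 block_size: int = BLOCK_SIZE) -> int:
    """Tokens that still need prefill; this is what the router should threshold on."""
    return len(tokens) - matched_prefix_len(tokens, cached, block_size)


# Usage: an 8-token prompt is already cached; the new prompt extends it by 3 tokens.
cached_prompt = list(range(8))
cache = set(block_hashes(cached_prompt))
new_prompt = list(range(8)) + [100, 101, 102]
print(uncached_len(new_prompt, cache))  # 3
```

Because each block hash folds in its parent, a block only matches when everything before it matches too, which is what makes the matched length a true prefix length rather than a count of coincidentally identical blocks.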
Snippet Two: Dual-Timescale Scheduling and Throughput-Optimal Configuration
# PrfaaS dual-timescale scheduling: short-term routing + long-term allocation
# Reconstructed from arXiv:2604.15039 Section 3.4.3
# The mechanism that keeps the system stable under bursty, skewed real traffic
# Reuses RouteDecision and route_request from Snippet One
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Live state tracked for bandwidth- and cache-aware routing."""
    available_bandwidth_gbps: float
    prefill_queue_depth: int
    decode_queue_depth: int
    prefix_cache_locations: dict  # request_hash -> cluster_id


class DualTimescaleScheduler:
    """
    Two scheduling loops at different timescales:
    SHORT-TERM: per-request routing, reacts in milliseconds
    LONG-TERM: capacity re-allocation, reacts over minutes/hours
    ← WHY two timescales? Per-request decisions need to be fast (can't
      wait for a global re-optimization before routing each request).
      But the OPTIMAL prefill/decode instance split (how many machines
      dedicated to each role) only needs to change as traffic patterns
      shift over longer windows. Mixing these into one loop either makes
      routing too slow or allocation too reactive to noise.
    """

    def __init__(self, initial_np: int, initial_nd: int):
        self.n_prefill = initial_np  # number of prefill instances (PrfaaS side)
        self.n_decode = initial_nd   # number of decode instances (local PD side)
        self.cluster_state = ClusterState(10.0, 0, 0, {})
        self.traffic_history = []

    def short_term_route(self, request) -> RouteDecision:
        """
        Fast path: called on every incoming request.
        Bandwidth- and cache-aware: checks live link state, not historical avg.
        ← Must react to fluctuating conditions BEFORE congestion accumulates.
          A scheduler that only checks bandwidth every 10 seconds will
          route requests into an already-congested link and make it worse.
        """
        cache_hit = self._check_prefix_cache(request)
        uncached_len = request.seq_len - cache_hit.matched_length
        return route_request(
            seq_len=request.seq_len,
            uncached_len=uncached_len,
            threshold_t=self._current_threshold(),
            available_bandwidth_gbps=self.cluster_state.available_bandwidth_gbps,
            model_profile=request.model_profile,
            estimated_prefill_latency_s=request.estimated_latency,
        )

    def long_term_reallocate(self) -> tuple[int, int]:
        """
        Slow path: periodic re-optimization of prefill/decode instance split.
        ← Grid search over (threshold t, N_prefill, N_decode) jointly.
          Paper's case study found optimum at N_p=3, N_d=5 for their workload,
          but THIS RATIO IS WORKLOAD-DEPENDENT, not a universal constant.
          The grid search fixes one variable and sweeps the other:
          (a) fix t at optimum → search instance split
          (b) fix N_p=3, N_d=5 → search t
        """
        recent_traffic = self.traffic_history[-1000:]  # sliding window
        if not recent_traffic:
            # No traffic observed yet: keep the current split
            # (also avoids dividing by zero below)
            return self.n_prefill, self.n_decode

        threshold = self._current_threshold()
        long_request_fraction = sum(
            1 for r in recent_traffic if r.seq_len > threshold
        ) / len(recent_traffic)

        # ← Simplified allocation heuristic: more long requests → more
        #   prefill capacity needed on the remote PrfaaS side.
        #   The actual paper uses a throughput model (Section 3.4.1) that
        #   solves for the configuration maximizing aggregate throughput
        #   subject to the bandwidth constraint.
        if long_request_fraction > 0.4:
            self.n_prefill = min(self.n_prefill + 1, 8)
        elif long_request_fraction < 0.2:
            self.n_prefill = max(self.n_prefill - 1, 1)
        return self.n_prefill, self.n_decode

    def _check_prefix_cache(self, request):
        # Hybrid prefix-cache pool lookup: accounts for cache LOCATION
        # ← Prefix caches are unevenly distributed; this is NOT a simple
        #   "is it cached anywhere" check, it's "is it cached HERE,
        #   and is that worth the locality benefit vs available bandwidth"
        raise NotImplementedError("Cache pool lookup logic")

    def _current_threshold(self) -> int:
        raise NotImplementedError("Threshold value from long-term optimizer")


# ── CASE STUDY CONFIGURATION (from paper Section 4) ───────────────────────────
# Baseline: homogeneous PD cluster of 96 H20 GPUs
# PrfaaS: standalone prefill cluster (8×H200) + local PD cluster
# Optimal split found via grid search: N_p=3, N_d=5
scheduler = DualTimescaleScheduler(initial_np=3, initial_nd=5)
# Results (paper Section 4.3):
# vs homogeneous PD baseline: +54% throughput
# vs naive heterogeneous baseline: +32% throughput
# Cross-datacenter bandwidth used: "modest" per machine (not saturating link)
The distinction between short-term routing (milliseconds, per-request) and long-term reallocation (minutes to hours, capacity planning) is the systems insight that makes PrfaaS deployable. A scheduler that tries to do both at the same timescale either makes routing decisions too slow to be useful, or makes capacity allocation decisions so reactive to short-term noise that the system never stabilizes.
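To make the two timescales concrete, here is a minimal sketch of a driver that runs a fast path on every request and a slow path at most once per interval. `TwoTimescaleLoop`, its parameters, and the 300-second interval are hypothetical illustrations, not the paper's design. A production system would more likely run the slow path on a background worker than inline with a request.

```python
import time
from typing import Any, Callable


class TwoTimescaleLoop:
    """Fast path on every request; slow path at most once per interval."""

    def __init__(
        self,
        fast_path: Callable[[Any], Any],
        slow_path: Callable[[], None],
        realloc_interval_s: float,
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ):
        self._fast = fast_path
        self._slow = slow_path
        self._interval = realloc_interval_s
        self._clock = clock
        self._last_realloc = clock()

    def on_request(self, request: Any) -> Any:
        # Routing never waits on re-optimization: decide first.
        decision = self._fast(request)
        now = self._clock()
        if now - self._last_realloc >= self._interval:
            self._slow()
            self._last_realloc = now
        return decision


# Usage with a fake clock so the example is deterministic.
fake_time = [0.0]
calls = []
loop = TwoTimescaleLoop(
    fast_path=lambda r: "local",
    slow_path=lambda: calls.append("realloc"),
    realloc_interval_s=300.0,
    clock=lambda: fake_time[0],
)
loop.on_request("r1")   # t=0: interval not elapsed, no reallocation
fake_time[0] = 301.0
loop.on_request("r2")   # t=301: slow path fires once
print(calls)  # ['realloc']
```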
It In Action: End-to-End Worked Example
Task: Serve a 1T-parameter hybrid model (Kimi Linear architecture, interleaved KDA:MLA) under realistic mixed-length traffic
Setup (from paper Section 4.1):
Baseline: homogeneous PD cluster, 96 H20 GPUs, single datacenter
PrfaaS deployment: standalone PrfaaS cluster for long-context prefill (8×H200)
+ conventional PD cluster for decode and short prefills
Prefill latency benchmarked on: 8×H200 with in-house vLLM
Model architecture: interleaved KDA:MLA layers (Kimi Linear style)
Step 1: Incoming request stream (mixed lengths, realistic skew)
Request batch (simulated production traffic):
60% requests: < 4K tokens (chat, short Q&A)
25% requests: 4K-32K tokens (document analysis, medium context)
15% requests: 32K-128K tokens (long-document, RAG-heavy, codebase analysis)
Length-based threshold (t) tuned via grid search: optimal value found
empirically for this workload distribution
Step 2: Routing decisions
Short requests (60%): uncached_len < t
→ LOCAL_PD cluster
→ No cross-datacenter transfer
→ Handled with standard same-cluster PD disaggregation
Medium/long requests (40%): uncached_len >= t
→ Check bandwidth feasibility (Φ_kv vs available cross-DC link capacity)
→ Hybrid attention reduces Φ_kv by ~8x vs dense equivalent
→ Most pass the bandwidth check → REMOTE_PRFAAS cluster
→ KVCache transferred over commodity Ethernet (not RDMA)
Step 3: Throughput comparison
Homogeneous PD baseline (96 H20, all local):
Baseline throughput: X tokens/sec (reference point)
Naive heterogeneous (full externalization, no selective offloading):
Throughput: ~1.17X vs homogeneous baseline (implied: 1.54 / 1.32); ~0.76X relative to PrfaaS
← Naive design suffers from congestion, unstable queueing,
poor utilization, exactly as the paper predicts
PrfaaS (selective offload + bandwidth-aware scheduling + N_p=3, N_d=5):
Throughput: 1.54X vs homogeneous baseline (+54%)
Throughput: 1.32X vs naive heterogeneous baseline (+32%)
Cross-datacenter bandwidth consumed: modest per machine
← The selective routing (60% stays local) is what keeps bandwidth
consumption "modest" rather than saturating the inter-DC link
Step 4: Why this works
The 8×H200 prefill cluster is compute-dense: optimized for the
arithmetic-heavy prefill phase, not memory bandwidth for decode.
This hardware specialization is exactly what PD disaggregation promised
but couldn't deliver across datacenters before hybrid attention.
The local PD cluster (handling decode + short prefills) doesn't need
to share an RDMA fabric with the remote prefill cluster.
Independent scaling: add more H200 prefill capacity in one region,
more H20 decode capacity in another, without coordination overhead
beyond the bandwidth-aware scheduler.
Why This Design Works, and What It Trades Away
Selective offloading is the right response to traffic conditions that the paper's authors modeled honestly. A simpler design, "always send long-context prefill remotely," would be elegant but fragile: real traffic is bursty, request lengths are skewed, and naive full externalization suffers from congestion and unstable queueing exactly as a queueing-theory model would predict. By routing only requests above a length threshold, and only when the bandwidth check passes, PrfaaS keeps the common case (short requests) cheap and fast while reserving the cross-datacenter path for requests where the latency and compute benefit justifies the network cost.
The bandwidth-aware scheduler reacting before congestion accumulates is the detail that distinguishes a production-grade design from an academic one. Inter-cluster bandwidth fluctuates in practice; a scheduler that routes based on a static or slowly-updated bandwidth estimate will periodically overcommit the link and stall in-flight transfers. The dual-timescale design, fast per-request routing plus slower periodic capacity reallocation, matches the actual rate of change in the two signals it depends on: bandwidth conditions change in seconds, optimal instance allocation changes over much longer windows.
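One way to react before congestion builds is to account for bandwidth already committed to in-flight transfers, rather than consulting only a measured average. The sketch below assumes a simple reservation ledger with a smoothed capacity estimate. `LinkBudget`, `safety_factor`, and `ewma_alpha` are hypothetical names and values, not the paper's mechanism.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LinkBudget:
    capacity_gbps: float        # current estimate of usable cross-DC bandwidth
    safety_factor: float = 0.8  # hypothetical: keep 20% headroom for bursts
    ewma_alpha: float = 0.3     # hypothetical weight given to each new measurement
    _reserved: Dict[str, float] = field(default_factory=dict, init=False, repr=False)

    @property
    def committed_gbps(self) -> float:
        """Bandwidth promised to transfers that have not finished yet."""
        return sum(self._reserved.values())

    def observe(self, measured_gbps: float) -> None:
        """Fold a fresh link measurement into the smoothed capacity estimate."""
        self.capacity_gbps = (
            self.ewma_alpha * measured_gbps
            + (1 - self.ewma_alpha) * self.capacity_gbps
        )

    def try_admit(self, request_id: str, required_gbps: float) -> bool:
        """Reserve bandwidth only if the link stays under its headroom budget."""
        budget = self.capacity_gbps * self.safety_factor
        if self.committed_gbps + required_gbps > budget:
            return False  # caller falls back to local prefill
        self._reserved[request_id] = required_gbps
        return True

    def release(self, request_id: str) -> None:
        """Call when the KVCache transfer for this request completes."""
        self._reserved.pop(request_id, None)


link = LinkBudget(capacity_gbps=10.0)    # budget = 8 Gbps with the default headroom
print(link.try_admit("a", 4.0))  # True  (4 of 8 committed)
print(link.try_admit("b", 3.0))  # True  (7 of 8 committed)
print(link.try_admit("c", 2.0))  # False (9 would exceed 8)
link.release("a")
print(link.try_admit("c", 2.0))  # True  (5 of 8 committed)
```

The point of the ledger is that admission is refused at the moment the link would be overcommitted, instead of after a later measurement shows it already is.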
The hybrid prefix-cache pool accounting jointly for length, cache location, and bandwidth is the mechanism that prevents a subtle failure mode: routing a request to the remote prefill cluster when the relevant prefix cache is actually sitting in the local cluster. Without joint accounting, the system would either recompute cache unnecessarily (wasting compute) or transfer cache across the network unnecessarily (wasting bandwidth) whenever cache locality and length-based routing disagree.
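The failure mode above can be illustrated with a toy cost comparison. This sketch assumes a deliberately simplified model: estimated time is prefill of the uncached tokens plus transfer of the full KVCache to the decode cluster, with KVCache size growing linearly in tokens. That linearity is a simplification for hybrid models, and all rates, names, and sizes are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    cached_prefix_tokens: int    # tokens of this prompt already cached at this cluster
    prefill_tokens_per_s: float  # hypothetical sustained prefill rate
    link_to_decode_gbps: float   # float("inf") when co-located with decode


def estimated_cost_s(c: Candidate, prompt_tokens: int, kv_bytes_per_token: float) -> float:
    """Prefill time for uncached tokens plus time to ship the KVCache to decode."""
    uncached = max(prompt_tokens - c.cached_prefix_tokens, 0)
    compute_s = uncached / c.prefill_tokens_per_s
    kv_bits = prompt_tokens * kv_bytes_per_token * 8
    transfer_s = kv_bits / (c.link_to_decode_gbps * 1e9)  # 0.0 for an infinite link
    return compute_s + transfer_s


def pick_cluster(cands: List[Candidate], prompt_tokens: int, kv_bytes_per_token: float) -> str:
    return min(cands, key=lambda c: estimated_cost_s(c, prompt_tokens, kv_bytes_per_token)).name


KV_BYTES = 16 * 1024  # hypothetical per-token KVCache size
remote = Candidate("remote", 0, 80_000.0, 10.0)

# 100K-token prompt, 95K of it cached locally: local wins despite slower prefill.
local_warm = Candidate("local", 95_000, 20_000.0, float("inf"))
print(pick_cluster([local_warm, remote], 100_000, KV_BYTES))  # local

# Same prompt with no cache anywhere: the faster remote cluster wins.
local_cold = Candidate("local", 0, 20_000.0, float("inf"))
print(pick_cluster([local_cold, remote], 100_000, KV_BYTES))  # remote
```

A router that looked only at total prompt length would send both requests remote, paying for the transfer and the recomputation in the first case.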
What PrfaaS trades away:
Latency for long-context requests increases relative to a fully local, infinite-bandwidth ideal. Cross-datacenter network round-trip time is real, even on a well-provisioned commodity Ethernet link. The paper's throughput gains come from better aggregate resource utilization, not from making any individual long-context request faster than it would be on dedicated local hardware with unlimited bandwidth. For latency-critical single-request SLOs, this tradeoff needs explicit evaluation against your specific requirements.
The case study uses an internal, unreleased 1T-parameter model. The 54% and 32% figures are specific to that model's KDA:MLA ratio, the 8×H200/96×H20 hardware mix, and the traffic distribution used in the case study. The paper does not claim these exact numbers generalize to every hybrid-attention model or every hardware combination; the throughput model (Section 3.4.1) is the generalizable contribution, not the specific percentages.
The approach is fundamentally gated on model architecture. Dense-attention models cannot use PrfaaS effectively: their KV throughput is too high for any commodity link. This is explicitly model-dependent infrastructure. Teams running only dense-attention models gain nothing from this architecture until they migrate to a hybrid-attention model family.
Technical Moats
The KV throughput metric as a design primitive. Φ_kv(l) = S_kv(l)/T_prefill(l) is a simple formula, but recognizing it as THE binding constraint for cross-datacenter feasibility, rather than focusing on KVCache size alone, is the conceptual contribution. A model could have a small absolute KVCache size but a very fast prefill latency, producing a high KV throughput that still exceeds bandwidth limits. The ratio, not either number alone, determines feasibility. This framing generalizes to evaluating any future model architecture for cross-datacenter serving viability.
The length-based threshold combined with cache-aware routing. Building a router that correctly accounts for uncached length (not total prompt length) requires integrating prefix-cache lookup into the routing decision in real time, at the latency budget of individual request dispatch. This is the same class of engineering challenge that Mooncake's Conductor scheduler solved for single-datacenter prefix-cache-aware routing (arXiv:2407.00079), extended to a setting where cache locations span multiple clusters and the network cost of fetching a remote cache is no longer negligible.
Dual-timescale scheduling as a systems pattern. Separating fast per-request decisions from slow capacity reallocation is a general systems design pattern (similar to how autoscalers separate request routing from instance provisioning), but correctly tuning the two timescales to match the actual rate of change in bandwidth conditions versus traffic patterns requires production traffic data that most teams building a first version of this kind of system will not have until they deploy and measure.
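The first point above, that the ratio rather than cache size alone determines feasibility, is easy to check numerically. The two profiles below are invented for illustration and do not correspond to any model in the paper.

```python
def kv_throughput_gbps(kv_cache_bytes: float, prefill_latency_s: float) -> float:
    """Φ_kv = S_kv / T_prefill, expressed in gigabits per second."""
    return kv_cache_bytes * 8 / prefill_latency_s / 1e9


LINK_GBPS = 10.0  # hypothetical cross-datacenter budget

# Hypothetical model A: large cache, slow prefill.
phi_a = kv_throughput_gbps(kv_cache_bytes=4e9, prefill_latency_s=4.0)
# Hypothetical model B: cache 4x smaller, but prefill 8x faster.
phi_b = kv_throughput_gbps(kv_cache_bytes=1e9, prefill_latency_s=0.5)

print(phi_a, phi_a <= LINK_GBPS)  # 8.0 True
print(phi_b, phi_b <= LINK_GBPS)  # 16.0 False
```

Model B has the smaller cache yet is the one that does not fit the link, because its prefill finishes fast enough to produce cache data at twice the rate.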
Insights
Insight One: Hybrid-attention architectures were not designed with cross-datacenter serving in mind, but they are the unlock that makes it newly possible. Kimi Linear, MiMo-V2-Flash, Qwen3.5, and Ring-2.5-1T all adopted hybrid attention for inference efficiency reasons unrelated to multi-datacenter deployment: smaller KVCache means more concurrent requests per GPU, faster decode, lower memory pressure. The cross-datacenter serving opportunity is a second-order consequence of a first-order optimization. This is a recurring pattern in systems research: an architecture change made for one reason (memory efficiency) creates an opportunity in an entirely different layer of the stack (network topology) that the original designers were not optimizing for.
Insight Two: The paper's most important sentence is buried in the case study summary, not the abstract: "KVCache-efficient model architectures are necessary but not sufficient for cross-datacenter heterogeneous serving." This directly contradicts a natural but wrong inference from the headline result. It would be easy to read "hybrid attention reduces KVCache by 10x, therefore cross-datacenter serving works" as the takeaway. The paper's own data shows that a naive heterogeneous deployment using the same hybrid model still underperforms the homogeneous baseline in degraded conditions; PrfaaS's 32% improvement over naive heterogeneous deployment is entirely attributable to the system-side mechanisms (selective offloading, bandwidth-aware scheduling, cache-aware placement), not to the model architecture. Teams that adopt hybrid attention models and assume cross-datacenter serving will just work are missing the actual contribution of this paper.
Surprising Takeaway
The optimal prefill-to-decode instance ratio found in the case study, 3 prefill instances to 5 decode instances, is the inverse of what intuition suggests for a 1T-parameter model with a 3:1 KDA-to-MLA layer ratio. You might expect prefill capacity to scale with the fraction of full-attention layers, since those are the expensive ones. Instead, the optimal allocation favors decode capacity, because decode for a 1T-parameter model at scale is memory-bandwidth-bound across far more concurrent requests than the offloaded long-context prefill workload represents. The lesson generalizes: optimal hardware allocation in a disaggregated system is a function of the traffic distribution and the decode-side concurrency requirements, not a simple ratio derived from the model's internal attention-layer composition. Teams should not assume their instance split should mirror their model's architectural ratios; it should mirror their measured workload.
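A toy bottleneck model shows why the split follows measured per-instance rates rather than layer ratios. This is not the paper's throughput model (Section 3.4.1): it assumes throughput is simply the slower of the two stages, and the per-instance rates are invented. The first set of rates was chosen so the toy happens to land on a 3:5 split, purely for illustration.

```python
from typing import Tuple


def best_split(
    total_instances: int,
    prefill_rate: float,  # hypothetical requests/s one prefill instance sustains
    decode_rate: float,   # hypothetical requests/s one decode instance sustains
) -> Tuple[int, int, float]:
    """Sweep every (N_p, N_d) split and keep the one with the highest bottleneck rate."""
    best = (0, 0, 0.0)
    for n_p in range(1, total_instances):
        n_d = total_instances - n_p
        throughput = min(n_p * prefill_rate, n_d * decode_rate)
        if throughput > best[2]:
            best = (n_p, n_d, throughput)
    return best


# Workload 1: decode is the slower stage per instance, so it gets more instances.
print(best_split(8, prefill_rate=5.0, decode_rate=3.0))  # (3, 5, 15.0)

# Workload 2: prefill-heavy traffic flips the split, same model, same hardware count.
print(best_split(8, prefill_rate=2.0, decode_rate=6.0))  # (6, 2, 12.0)
```

Nothing about the model's attention-layer composition appears in the inputs. Only the measured per-instance rates, which depend on the traffic mix, move the optimum.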
TL;DR For Engineers
PrfaaS (arXiv:2604.15039, Moonshot AI + Tsinghua, April 2026) lets prefill and decode run in different datacenters by exploiting the KVCache size reduction from hybrid-attention models (Kimi Linear, MiMo-V2-Flash, Qwen3.5, Ring-2.5-1T), transferring KVCache over commodity Ethernet instead of RDMA.
Feasibility is governed by KV throughput: Φ_kv(l) = S_kv(l)/T_prefill(l). Dense models produce ~60 Gbps at 32K tokens (infeasible for cross-DC links). Hybrid attention cuts this by roughly 8-10x, making it feasible.
Three system mechanisms make it work, not just smaller cache: length-based threshold routing (route by uncached length, not total length), bandwidth-aware scheduling (react before congestion, not after), and a hybrid prefix-cache pool (joint accounting of length, cache location, and live bandwidth).
Case study on an internal 1T-parameter Kimi Linear-style model: 54% higher throughput vs homogeneous 96×H20 PD baseline, 32% higher vs naive full-externalization heterogeneous baseline, modest cross-datacenter bandwidth per machine. Optimal split found via grid search: N_prefill=3, N_decode=5.
The paper's central warning: "KVCache-efficient model architectures are necessary but not sufficient." Reduced KVCache alone does not make cross-datacenter PD work; the 32% gain over naive heterogeneous deployment comes entirely from the scheduling and routing mechanisms.
The Network Topology Constraint on LLM Serving Just Loosened
PrfaaS demonstrates that the long-assumed requirement, prefill and decode must share a single high-bandwidth network domain, was a consequence of dense-attention KVCache size, not a fundamental property of disaggregated serving. As hybrid-attention architectures become more common across frontier model families, the design space for LLM serving infrastructure expands to include genuinely independent scaling of compute-dense and bandwidth-dense resources across regions, clouds, and hardware generations.
The paper's discipline in stating that model-side efficiency is necessary but not sufficient is the right caution for anyone building on this idea. The 54% throughput improvement is a system-level result, achieved through selective offloading and adaptive scheduling on top of a KVCache-efficient model, not a property that falls out of hybrid attention alone.
References
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter, arXiv:2604.15039, Qin et al., Moonshot AI and Tsinghua University, April 16 2026
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, arXiv:2407.00079, Qin et al., the predecessor architecture from the same research lineage powering Kimi
Kimi Linear: An Expressive, Efficient Attention Architecture — the hybrid-attention model family the case study is based on
NVIDIA Rubin CPX: Accelerating Inference Performance for 1M+ Token Context Workloads — the hardware roadmap context the paper cites for compute-dense prefill chips
Mooncake open-source repository — Transfer Engine and Mooncake Store implementations
Summary
Prefill-as-a-Service (PrfaaS, arXiv:2604.15039, Moonshot AI and Tsinghua University, April 2026) is a cross-datacenter LLM serving architecture that exploits the order-of-magnitude KVCache reduction in hybrid-attention models (Kimi Linear, MiMo-V2-Flash, Qwen3.5, Ring-2.5-1T) to transfer prefill-generated KVCache over commodity Ethernet rather than requiring a shared RDMA fabric. Feasibility is governed by the KV throughput metric Φ_kv(l) = S_kv(l)/T_prefill(l); hybrid attention brings this below typical cross-datacenter bandwidth budgets where dense attention cannot. Three system mechanisms, length-based threshold routing on uncached prompt length, a bandwidth-aware scheduler reacting before congestion, and a hybrid prefix-cache pool with joint length/location/bandwidth accounting, are what make the architecture viable under real bursty, skewed traffic. A case study on an internal 1T-parameter hybrid model achieved 54% higher throughput than a homogeneous 96-GPU PD baseline and 32% higher than a naive full-externalization heterogeneous baseline, with the explicit finding that KVCache-efficient model architecture is necessary but not sufficient: the system-side scheduling mechanisms account for the gain over naive heterogeneous deployment.