SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 28, 2026
The standard quantization narrative goes: compress the weights or activations to fewer bits, lose a little quality, ship a smaller model. The hidden cost in this narrative is quantization overhead. Most methods store a quantization constant (a scale factor or zero-point) per block of compressed data. These constants are stored in full precision, adding 1-2 extra bits per number to the effective bit-rate even after compression. For a method targeting 2-bit or 3-bit quantization, a 1-2 bit overhead is a 33-100% increase in actual storage cost.
TurboQuant (arXiv:2504.19874, Zandieh and Mirrokni, Google Research, ICLR 2026) eliminates this overhead with a two-stage pipeline: PolarQuant for the main compression and QJL (Quantized Johnson-Lindenstrauss) for residual correction. The shared random rotation matrix is the key: it applies the same geometric transformation to every vector without storing per-vector constants, keeping the bit-rate exactly at the target with no hidden overhead.
The practical target is the KV cache, the memory structure that stores key and value vectors from all previous tokens during LLM inference. At long context (100K+ tokens), the KV cache becomes the dominant memory cost, often exceeding the model weights themselves. TurboQuant is optimized for exactly this: 3-bit keys, 2-bit values, online processing (no need to buffer the full sequence), and a provable distortion guarantee.
The community implementation (0xSero/turboquant) adds Triton kernels and vLLM integration, making the algorithm deployable in production inference stacks.
Scope: TurboQuant's two-stage architecture (PolarQuant + QJL), the random rotation geometry, the zero-overhead property, KV cache compression results, and comparison with SmoothQuant and SpinQuant. Not covered: PolarQuant or QJL as standalone algorithms beyond their role in TurboQuant.
What It Actually Does
TurboQuant is an online vector quantization algorithm that compresses high-dimensional vectors (KV cache entries, embedding vectors) to a target bit-rate with near-optimal distortion, zero per-vector overhead, and a provable theoretical guarantee.
Two applications:
Application | Input | Output | Bit-rate |
|---|---|---|---|
KV cache compression | Key/value vectors during LLM inference | Compressed KV cache entries | Keys: 3-bit, Values: 2-bit |
Vector search | Database embedding vectors | Compressed index | Configurable |
What "online" means: TurboQuant processes vectors one at a time as they arrive (each new KV cache entry as the token is generated) without needing to buffer the full sequence. This is critical for streaming inference.
What "zero overhead" means: No per-vector quantization constants (scale, zero-point) are stored. The shared rotation matrix is computed once and applied to all vectors. The actual storage is exactly the target bit-rate with no hidden extra bits.
The theoretical guarantee: Near-optimal distortion rate, meaning the compression error is within a constant factor of what is mathematically provable to be the minimum possible for any algorithm at this bit-rate. This is not a benchmark claim. It is a proven bound.
The Architecture, Unpacked

Focus on Stage 1: the random rotation. This is the architectural insight that eliminates the need for per-vector quantization constants. By rotating the vector into an approximately isotropic space (uniform distribution of component magnitudes), standard uniform quantization becomes near-optimal, and the single shared step-size applies to all vectors without calibration.
The Code, Annotated
Snippet One: PolarQuant Stage (Random Rotation + Quantization)
# TurboQuant PolarQuant stage: random rotation then quantization
# Source: 0xSero/turboquant + arXiv:2504.19874 algorithm description
import torch
import torch.nn.functional as F
def hadamard_transform(x: torch.Tensor, normalize: bool = True) -> torch.Tensor:
"""
Fast Walsh-Hadamard Transform (FWHT) in O(d log d) time.
Acts as the rotation matrix H in PolarQuant.
← WHY HADAMARD and not a random Gaussian rotation?
A random Gaussian rotation H ∈ R^{d×d} would require O(d²) storage
and O(d²) compute per vector.
Hadamard is a structured rotation: no storage needed (implicit), O(d log d) compute.
The randomized Hadamard (H · D where D is random ±1 diagonal) achieves
the same isotropization property as a random Gaussian rotation.
← Structured randomness: better than Gaussian rotation in practice.
"""
d = x.shape[-1]
assert d & (d - 1) == 0, f"dim must be power of 2, got {d}"
# Butterfly operations: O(d log d) instead of O(d²) matrix multiply
h = x.clone()
length = 1
while length < d:
# Pair up adjacent blocks and apply Walsh butterfly
h = h.view(*h.shape[:-1], -1, 2 * length)
a, b = h[..., :length], h[..., length:]
h = torch.cat([a + b, a - b], dim=-1)
h = h.view(*h.shape[:-1], -1)
length *= 2
if normalize:
h = h / (d ** 0.5) # preserve vector norm
return h
class PolarQuant:
"""
Stage 1 of TurboQuant: random rotation → uniform quantization.
The key property: after random rotation, the vector's components
are approximately uniformly distributed in magnitude.
Uniform distribution → uniform quantizer is near-optimal.
Near-optimal → no need for per-vector scale factors.
← THIS is the trick: the rotation converts a hard quantization
problem (non-uniform distribution) into a simple one (uniform).
"""
def __init__(
self,
dim: int,
n_bits: int = 3, # 3 bits for keys, 2 for values (from GitHub config)
seed: int = 42,
):
self.dim = dim
self.n_bits = n_bits
self.n_levels = 2 ** n_bits
# Random diagonal ±1 matrix: shared across ALL vectors (zero overhead)
# ← Setting the seed makes D reproducible at both encode and decode
# without storing it per-vector
torch.manual_seed(seed)
self.D = torch.randint(0, 2, (dim,)) * 2 - 1 # uniform {-1, +1}
# Shared step size: set based on expected magnitude after rotation
# In practice: calibrated on a small sample at initialization
# ← This is the ONE constant shared across all vectors
# Not per-vector: per-model (stored once, not in the KV cache)
self.step_size = None # calibrated in self.calibrate()
def calibrate(self, sample_vectors: torch.Tensor):
"""
Compute the shared step size from a sample of vectors.
Called ONCE at model initialization, not per-vector.
← This is what makes quantization overhead zero:
step_size is not stored in the compressed KV cache,
only in the model config (constant memory cost).
"""
# Apply rotation to sample
rotated = hadamard_transform(sample_vectors * self.D)
# RMS of rotated components → set step size for near-zero clipping
rms = rotated.pow(2).mean().sqrt()
# Scale step_size so that max range covers ~3 sigma
self.step_size = (6 * rms) / self.n_levels
return self.step_size
def quantize(self, v: torch.Tensor) -> torch.Tensor:
"""
Compress a KV cache vector to n_bits per dimension.
Returns: integer codes in [0, n_levels - 1].
"""
# Step 1: randomized Hadamard rotation
v_rotated = hadamard_transform(v * self.D) # [seq_len, dim]
# Step 2: uniform quantization
# ← Clamp before quantizing to handle outliers post-rotation
v_clipped = v_rotated.clamp(-3 * self.step_size * self.n_levels / 6,
3 * self.step_size * self.n_levels / 6)
codes = ((v_clipped / self.step_size) + self.n_levels / 2)
codes = codes.round().clamp(0, self.n_levels - 1).to(torch.uint8)
return codes
def dequantize(self, codes: torch.Tensor) -> torch.Tensor:
"""Reconstruct approximate vector from quantized codes."""
v_approx = (codes.float() - self.n_levels / 2) * self.step_size
# Inverse rotation: Hadamard is its own inverse (up to scale)
return hadamard_transform(v_approx) * self.D # undo D since D² = I
The self.D random diagonal matrix is the only per-model constant. It is shared across all vectors in all KV cache entries for all tokens. The zero-overhead property comes from this: the quantization parameters are model-level constants stored once in model config, never in the compressed KV cache entries themselves.
Snippet Two: QJL Residual Correction and Attention Score Computation
# TurboQuant QJL stage: Johnson-Lindenstrauss residual correction
# Source: arXiv:2504.19874 + arXiv:2406.03482 (QJL original paper)
import torch
import math
class QJLCorrection:
"""
Stage 2 of TurboQuant: 1-bit JL measurement of quantization residual.
After PolarQuant, the residual e = v - dequant(q) still contains signal.
Standard approaches: throw it away (lossy) or store it in float16 (defeats compression).
QJL approach: compress residual to k bits using JL transform.
Why JL works for attention:
Attention scores = Q · K^T / sqrt(d)
TurboQuant compresses K into codes q_k and 1-bit measurements s_k.
At query time, the attention score for query q is computed as:
q · dequant(q_k) / sqrt(d) (main term from PolarQuant)
+ correction from JL measurements s_k
The JL measurements give an UNBIASED estimate of q · e_k:
E[(J^T s_k)_i] = e_k_i (JL property: unbiased inner product estimation)
← Zero bias is the critical property: biased estimates distort attention patterns
"""
def __init__(
self,
dim: int,
n_measurements: int = None, # k (number of 1-bit JL measurements per vector)
seed: int = 43,
):
self.dim = dim
# Default: k = d / 4 gives 0.25 bits/dim overhead but strong correction
self.k = n_measurements or max(8, dim // 4)
# JL matrix J ∈ R^{k×d}: SAME for all vectors (zero overhead)
# ← Gaussian JL: E[J_ij] = 0, Var[J_ij] = 1/k
# ← At decode time, J is reconstructed from seed: no storage needed
torch.manual_seed(seed)
self.J = torch.randn(self.k, dim) / math.sqrt(self.k)
def encode_residual(self, residual: torch.Tensor) -> torch.Tensor:
"""
Compress residual vector e to k 1-bit measurements.
Storage: k bits per KV entry.
"""
# JL projection then sign quantization to 1 bit
# ← sign(J·e) preserves angle information between residuals
# This is the "1-bit JL sketch" — loses magnitude, preserves direction
projections = self.J @ residual.T # [k, seq_len]
# ← THIS is the trick: 1-bit measurement preserves enough information
# for UNBIASED inner product estimation via E[sign(j·e) * (j·q)] ∝ cos(angle)
bits = (projections > 0).to(torch.bool) # 1 bit per measurement
return bits # [k, seq_len] stored as packed bits
def compute_attention_correction(
self,
query: torch.Tensor, # [d] current query vector
kv_bits: torch.Tensor, # [k, seq_len] stored 1-bit JL measurements
) -> torch.Tensor:
"""
Compute the attention score correction from residual JL measurements.
Returns an additive correction to attention scores:
attention_score += correction (unbiased improvement)
"""
# JL inner product estimation:
# E[sign(J·e)_i * (J·q)_i] ≈ const * (e · q)
query_projections = self.J @ query # [k]: project query
bits_float = kv_bits.float() * 2 - 1 # {0,1} → {-1,+1} ∈ {-1,+1}^k
# Dot product: measures alignment between query and KV residuals
# ← The expectation of this estimator equals e·q up to a π/2 constant
correction = (bits_float.T @ query_projections) # [seq_len]
# Scale factor: (2/π) from the sign-cosine relationship
correction = correction * (2 / math.pi) / math.sqrt(self.k)
return correction # additive correction to attention scores
def total_bits_per_dim(self, n_bits_main: int) -> float:
"""Effective bits per dimension: main + JL overhead."""
jl_overhead = self.k / self.dim # k bits for k measurements, dim dimensions
return n_bits_main + jl_overhead
# Example: 3-bit main + k=32, d=128 → 3 + 32/128 = 3.25 bits/dim
# ← Much less than the 1-2 bit overhead of traditional quantization constants
The correction = bits_float.T @ query_projections line is the complete QJL attention correction. It turns k 1-bit measurements into an unbiased estimate of the inner product between the query and the residual error. The guarantee is in expectation: averaged over the random JL matrix J, this estimator has zero bias and variance that decreases with k.
It In Action: End-to-End Worked Example
Setting: Llama-3-8B inference at 32K context, attention head dimension d=128
Without KV cache compression:
KV cache size at 32K context:
32 layers × 8 heads × 32,000 tokens × 128 dim × 2 (K and V) × 2 bytes (fp16)
= 32 × 8 × 32,000 × 128 × 2 × 2 = 33.6 GB
← Exceeds single A100 80GB VRAM when combined with model weights (14GB)
← Must either truncate context or use CPU offload (latency penalty)
With TurboQuant (3-bit keys, 2-bit values):
Step 1: PolarQuant encoding of one key vector
# Input: key vector at token position t, head h
key = torch.randn(128) # fp16, 128 dimensions = 256 bytes before compression
polar = PolarQuant(dim=128, n_bits=3)
polar.calibrate(reference_keys) # one-time calibration
# Apply randomized Hadamard rotation
key_rotated = hadamard_transform(key * polar.D)
# Distribution of key_rotated: approximately uniform in [-c, c]
# Quantize to 3 bits (8 levels)
key_codes = polar.quantize(key) # shape: [128], dtype: uint8
# Storage: 128 dims × 3 bits = 384 bits = 48 bytes
# Before: 128 dims × 16 bits = 2,048 bits = 256 bytes
# Compression: 256 / 48 = 5.3× reduction
Step 2: QJL encoding of residual
# Compute residual between original and dequantized
key_reconstructed = polar.dequantize(key_codes)
residual = key - key_reconstructed # fp32 error vector
qjl = QJLCorrection(dim=128, n_measurements=32) # k=32
key_bits = qjl.encode_residual(residual)
# shape: [32], dtype: bool → 32 bits = 4 bytes additional storage
# Total storage per key vector: 48 + 4 = 52 bytes
# Effective bits/dim: (384 + 32) / 128 = 3.25 bits/dim
# vs traditional 3-bit with overhead: 3 + 1.5 (scale+zero-point) = 4.5 bits/dim
# TurboQuant: 28% less overhead than traditional 3-bit quantization
Step 3: Attention computation with correction
# At query time:
query = torch.randn(128) # current token's query vector
# Main attention term (from PolarQuant):
key_approx = polar.dequantize(key_codes)
main_score = torch.dot(query, key_approx) / (128 ** 0.5)
# Residual correction (from QJL):
correction = qjl.compute_attention_correction(query, key_bits.unsqueeze(1))
corrected_score = main_score + correction.squeeze()
# Expected: corrected_score is an unbiased estimate of the true attention score
# Error bound (from theory): O(2^{-2b} + 1/k) where b=3, k=32
Full KV cache savings at 32K context:
With TurboQuant (3-bit keys, 2-bit values):
Keys: 32 × 8 × 32,000 × 3.25 bits/dim × 128 dims / 8 = 5.2 GB
Values: 32 × 8 × 32,000 × 2.25 bits/dim × 128 dims / 8 = 3.6 GB
Total KV cache: 8.8 GB
Without compression: 33.6 GB
Compression ratio: 33.6 / 8.8 = 3.8×
Memory freed: 24.8 GB → enables much longer context on same hardware
Perplexity impact (from paper, approximate):
FP16 baseline: 3.2 perplexity (Llama-3 on standard benchmark)
TurboQuant 3+2 bit: 3.3 perplexity (1-3% degradation)
Traditional 4-bit (with overhead): 3.4 perplexity (similar degradation, more bits)
Why This Design Works, and What It Trades Away
The random rotation is the correct preprocessing step for vector quantization because it converts an arbitrary distribution into an approximately isotropic one (uniform in all directions). The mathematical fact underlying PolarQuant is that after a random orthogonal rotation, the maximum magnitude component of any fixed vector grows at most logarithmically in d with high probability. This bounds the "outlier" problem that plagues fixed-grid quantizers on raw attention vectors: large outlier components cause massive quantization error in exchange for minimal quantization error on normal components. After rotation, the worst-case component magnitude is controlled, making uniform quantization near-optimal.
The QJL correction addresses the remaining systematic error from Stage 1. The Johnson-Lindenstrauss Transform has a well-known property: a random projection preserves inner products in expectation. The sign-quantized version (1-bit JL) preserves the angle between vectors. This means that 1-bit measurements of the residual contain enough information to make an unbiased correction to the attention score, even though they do not contain enough information to reconstruct the residual vector itself. The correction is not exact, but it is unbiased, which is the property that matters for attention: biased errors distort attention patterns systematically, while unbiased errors average out across the sequence.
What TurboQuant trades away:
Decode-time computation overhead. Standard KV cache lookup retrieves a vector and computes a dot product. TurboQuant requires: dequantize the PolarQuant codes (inverse Hadamard, multiply by D), add the JL correction (matrix-vector product J^T s). This is additional compute at decode time. The Triton kernels in the community implementation make this practical but not free.
Calibration dependency. The shared step size requires a calibration sample. Getting the calibration distribution wrong (e.g., calibrating on a different domain than deployment) shifts the quantization range and increases distortion. This is a standard quantization issue, but TurboQuant's calibration is more sensitive because the step size is shared across all vectors.
Fixed rotation per model. The random rotation D is fixed at initialization and shared across all KV cache vectors. If the actual key/value distribution has persistent non-uniformity (certain dimensions consistently larger), the Hadamard rotation may not fully isotropize it, limiting the optimality guarantee to the approximately isotropic case.
Technical Moats
The theoretical distortion bound. TurboQuant's near-optimal distortion rate is not a benchmark result; it is a proof. The paper establishes that the per-vector distortion is within O(1) of the Shannon rate-distortion lower bound for the class of quantization problems it addresses. Prior methods achieve good empirical results without this guarantee. TurboQuant achieves comparable empirical results with a guarantee that transfers across models, tasks, and distributions.
The zero-overhead architecture. Most quantization methods that eliminate per-vector scale factors either require offline calibration on the full dataset (GPTQ, AWQ) or store scale factors at the group level (reducing but not eliminating overhead). TurboQuant eliminates per-vector overhead architecturally: the shared rotation ensures uniformity without needing per-vector constants. Replicating this requires the random rotation preprocessing to be part of the quantizer design from the start, not an add-on.
Online compatibility. TurboQuant processes each vector independently as it arrives. This is essential for KV cache compression in autoregressive inference, where keys and values are computed one token at a time. Methods that require the full sequence to select quantization parameters cannot work online. TurboQuant's online property is a consequence of the zero-overhead design: if you need no per-vector calibration, you can process vectors one at a time.
Insights
Insight One: The quantization overhead problem is not a minor inconvenience. It is the reason why many "low-bit" quantization methods do not actually compress models as much as advertised, and TurboQuant is one of the few to address this at the architecture level rather than with a footnote.
When a paper reports "4-bit quantization," the effective bit-rate often includes group-level scale factors at fp16 (16 bits per 128-element group = +0.125 bits/element), zero-points (another +0.125 bits/element), and sometimes second-order correction terms. The actual storage can be 4.5-5.5 bits/element for a "4-bit" method. TurboQuant's 3.25 bits/element for a "3-bit" method is directly comparable to these effective rates and is often competitive with prior 4-bit methods in practice. The comparison that matters is effective bits/element, not the nominal bit-width, and most method comparisons in the literature use nominal numbers.
Insight Two: TurboQuant's two-stage pipeline (most bits for main compression, 1 bit for residual correction) is the correct decomposition of the quantization problem, but the community has not adopted it broadly because it requires understanding Johnson-Lindenstrauss theory to implement correctly. The Triton kernel implementation in 0xSero/turboquant is the practical contribution that makes this accessible.
The insight that the residual from a good first-stage quantizer is approximately random and low-norm (and therefore well-suited to random projection) is not obvious. Most quantization implementations either ignore the residual entirely (round-to-nearest) or try to encode it explicitly (residual quantization, additive quantization). QJL's approach, encode the residual direction with 1-bit JL measurements and use this to correct the inner product at query time without reconstructing the residual, requires understanding why this works theoretically. This is the moat that separates the approach from straightforward engineering.
Takeaway
TurboQuant's QJL stage corrects attention scores, not reconstructed vectors. This is a fundamentally different approach from all prior quantization methods: instead of minimizing reconstruction error (how accurately can we recover the original vector?), TurboQuant minimizes attention score error (how accurately can we compute Q·K^T?). For LLM inference, attention scores are the only thing that matters: the exact key vectors are never used directly. Optimizing for the downstream task (attention) rather than the proxy task (reconstruction) is why TurboQuant achieves better perplexity at the same bit-rate than methods optimized for reconstruction.
This is the insight from the JL literature applied to LLM inference: you do not need to reconstruct the key to compute the attention score. You only need to estimate the inner product. A random 1-bit sketch is sufficient for an unbiased inner product estimate, and a biased reconstruction (which is what all prior methods produce) is not sufficient for unbiased attention scores. The target matters.
TL;DR For Engineers
TurboQuant (arXiv:2504.19874, Google Research, ICLR 2026) is a two-stage KV cache quantization algorithm: PolarQuant (randomized Hadamard rotation + uniform quantization) for main compression, QJL (1-bit Johnson-Lindenstrauss sketch of residual) for unbiased attention score correction. Zero per-vector overhead, online processing, near-optimal distortion guarantee.
The zero-overhead property: no per-vector scale factors or zero-points stored in the KV cache. The shared rotation matrix (random Hadamard + diagonal) is calibrated once at model initialization. Traditional 3-bit quantization with overhead ≈ 4.5 bits/element; TurboQuant 3-bit = 3.25 bits/element.
3.8× KV cache compression at Llama-3-8B 32K context: 33.6 GB → 8.8 GB. Perplexity impact: ~1-3% degradation. Community implementation (0xSero/turboquant) adds Triton kernels and vLLM integration for production use.
QJL corrects attention SCORES not vectors: 1-bit JL measurements of the reconstruction residual give an unbiased inner product estimate with the query. This beats methods that minimize reconstruction error because attention score bias is what directly hurts perplexity.
Practical deployment: calibrate step_size on a small sample at initialization, set D from a fixed seed, store PolarQuant codes (3 bits/dim for keys, 2 for values) and QJL bits (k bits per vector) in the KV cache. Triton kernels handle the decode-time Hadamard and JL correction efficiently.
The Bit Is the Budget
TurboQuant's central contribution is the elimination of hidden overhead through architectural design: rotate to uniformity, then quantize uniformly, and correct with a theoretically grounded 1-bit residual sketch. The result is that the stated bit-rate is the actual bit-rate, the distortion bound is a proof not a benchmark, and the method works online without buffering. For KV cache compression at long context, these three properties together define the correct algorithm.
The quantization community has optimized empirically for years. TurboQuant optimizes theoretically and then validates empirically. The gap between the two approaches is exactly the QJL correction: a 1-bit sketch that the empirical community would not add because it seems like too little to matter, but that the theoretical community knows gives an unbiased inner product estimate that changes the perplexity result.
References
TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate, arXiv:2504.19874, Zandieh and Mirrokni, Google Research, ICLR 2026
Quantized Johnson-Lindenstrauss, arXiv:2406.03482 — the QJL algorithm, AAAI 2025
PolarQuant, arXiv:2502.02617 — the PolarQuant algorithm, AISTATS 2026
SpinQuant: LLM Quantization with Learned Rotations, arXiv:2405.16406 — Meta's learned rotation approach; TurboQuant uses random rotations (theory-grounded vs. learned)
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, arXiv:2211.10438 — the activation migration approach; different problem framing from TurboQuant
TurboQuant (arXiv:2504.19874, Google Research, ICLR 2026) is a two-stage online vector quantization algorithm achieving near-optimal distortion with zero per-vector overhead: Stage 1 (PolarQuant) applies a randomized Hadamard rotation to isotropize the vector distribution, then uniform-quantizes each dimension at b bits (3 for keys, 2 for values); Stage 2 (QJL) compresses the reconstruction residual to k 1-bit Johnson-Lindenstrauss measurements that provide an unbiased correction to attention scores at query time. At Llama-3-8B with 32K context, this reduces KV cache from 33.6 GB to 8.8 GB (3.8×) with 1-3% perplexity degradation. Community implementation at 0xSero/turboquant includes Triton kernels and vLLM integration.
Sponsored Ad
If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀
Catch Bad Actors. Let Good Users Flow.
When your goal is to increase the difficulty of online attacks, the advanced features of hCaptcha Enterprise is the most robust solution.
Take it from one of our customers:
“Compared to last year [when using competitor], we had a 96% reduction in bot throughput.” - Top 10 Gaming Company
Category leaders in every industry have been switching to hCaptcha because of the robustness and durability of our detection and deterrence solutions.
Virtually all companies that book a demo decide to move forward.


