SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | May 6, 2026
Every AI product that does semantic search, RAG, recommendation, or embedding-based retrieval eventually hits a wall: how do you run approximate nearest neighbor search at scale without building a dedicated GPU cluster or paying per-query API costs that compound at production volume? Turbopuffer's ANN v3 answers this at an engineering level that most of the vector database industry has not reached: 200ms p99 latency across 100 billion vectors (200 TiB of dense data), served from a stateless layer over S3, using CPU-only computation with binary quantization and hierarchical clustering.
This newsletter dissects ANN v3 as a systems engineering document: why vector search is memory-bandwidth-bound not compute-bound, how the RaBitQ binary quantization changes the arithmetic intensity enough to flip the bottleneck, what the hierarchical SPFresh clustering does for cold query tail latency, and how a two-stage approximate-then-refine architecture makes the memory hierarchy work in turbopuffer's favor at every level.
Scope: turbopuffer ANN v3 architecture (turbopuffer.com/blog/ann-v3), RaBitQ binary quantization, SPFresh hierarchical clustering, arithmetic intensity analysis, AVX-512 SIMD optimization, the AQR-HNSW multi-stage re-ranking approach (arXiv:2602.21600), and the B+ANN disk-based index (arXiv:2511.15557). Not covered: turbopuffer's metadata filtering stack, write path, or cost model beyond the bandwidth analysis.
What It Actually Does
Turbopuffer is a vector database that stores data durably on object storage (S3/GCS/Azure Blob) and serves queries from a stateless compute tier with a memory/SSD cache layer. ANN v3 (January 2026, Nathan VanBenschoten, Chief Architect) extends this architecture to support 100 billion vectors in a single search index with 200ms p99 latency at greater than 1,000 QPS.
The scale: 100 billion vectors at 1,024 dimensions per vector at 2 bytes per dimension (f16). That is 200 TiB of dense vector data. The goal: sub-200ms p99, greater than 1k QPS, from a stateless cache-over-S3 architecture, on CPUs only.
The key performance benchmarks from the ANN v3 blog:
p99 latency: 200ms at 100 billion vectors, 1,024 dimensions, k=10, 92% recall
Unfiltered search benchmark (1,024D) available at the turbopuffer leaderboard
Binary quantization provides a 16x compression ratio (f16 to 1-bit), making 12.5 TiB of quantized vectors resident in DRAM for 100B vectors
RaBitQ reuses each bit 4x during distance estimation, yielding 64x higher arithmetic intensity than raw f16 dot products
The Architecture, Unpacked
The foundational insight is that vector search is not compute-bound. It is memory-bandwidth-bound.
Why vector search is bandwidth-bound, not compute-bound:
The kernel of vector search is a dot product: multiply each dimension of a data vector against the corresponding dimension of a query vector, then sum the results. Each element in the data vector is used exactly once. This is the definition of low arithmetic intensity (FLOPs per byte transferred). Contrast this with matrix-matrix multiplication, where each element is reused across many rows. The roofline model predicts that low arithmetic intensity workloads saturate memory bandwidth before they saturate peak compute throughput. For vector search at scale: every data vector needs to be fetched from somewhere in the memory hierarchy, and the rate at which vectors can be compared is limited by how fast bytes can move up the hierarchy, not by how fast the CPU can multiply them.
This has a critical implication: optimizing the distance kernel itself is the wrong optimization target. Making the per-vector computation 2x faster does nothing if the memory bus is the binding constraint.
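To make the roofline argument concrete, here is a back-of-the-envelope check. The peak-FLOPs and bandwidth figures below are illustrative assumptions, not measured numbers from the ANN v3 blog:

    # Roofline back-of-the-envelope: is an f16 dot-product scan bandwidth- or compute-bound?
    # All hardware numbers below are illustrative assumptions, not measurements.
    DIMS = 1024
    BYTES_PER_DIM = 2                       # f16
    ops_per_vector = DIMS                   # one fused multiply-add per dimension (as counted in the article)
    bytes_per_vector = DIMS * BYTES_PER_DIM

    arithmetic_intensity = ops_per_vector / bytes_per_vector   # ~0.5 ops/byte

    peak_flops = 2e12        # assumed: ~2 TFLOP/s of CPU vector throughput
    dram_bandwidth = 200e9   # assumed: ~200 GB/s of DRAM bandwidth
    machine_balance = peak_flops / dram_bandwidth  # ops/byte needed to keep the ALUs busy

    print(f"dot-product arithmetic intensity: {arithmetic_intensity:.2f} ops/byte")
    print(f"machine balance point:            {machine_balance:.1f} ops/byte")
    # arithmetic intensity << machine balance  =>  the scan is bandwidth-bound:
    # throughput is capped at dram_bandwidth / bytes_per_vector comparisons per second,
    # no matter how fast the multiply-accumulate kernel itself is.
    print(f"bandwidth-bound ceiling: {dram_bandwidth / bytes_per_vector:,.0f} vectors/s per node")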

Focus on the two-stage design: binary quantization produces coarse rankings in DRAM at compute-bound speed, then full-precision reranking on a small candidate set fetches only a tiny fraction of the full data from NVMe. Data compression works doubly: it reduces bandwidth requirements AND allows more data to live in faster cache tiers simultaneously.
The memory hierarchy design is the core of ANN v3. Each level of the hierarchy has a different size-to-bandwidth ratio:
CPU L1/L2/L3: kilobytes to megabytes, 1-10+ TiB/s bandwidth
DRAM: gigabytes to terabytes, 100-500 GB/s bandwidth
NVMe SSD: terabytes, 1-30 GB/s bandwidth
Object storage (S3): petabytes, 1-10 GB/s bandwidth
Binary quantization makes the quantized vectors 16x smaller than f16. At 100B vectors, f16 storage is 200 TiB, but quantized storage is 12.5 TiB. On a cluster of large-memory instances, 12.5 TiB fits in DRAM. The binary search pass runs entirely from DRAM at DRAM bandwidth. Full-precision reranking fetches a tiny number of candidates from NVMe at full precision, exploiting NVMe's high parallel I/O capability.
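The capacity arithmetic is easy to reproduce. A minimal sketch; the 100-node fleet size is an illustrative assumption, not a turbopuffer figure:

    # Per-vector and fleet-level footprint (a sketch; the 100-node fleet size is an assumption).
    DIMS = 1024
    N = 100_000_000_000                      # 100B vectors

    f16_per_vec = DIMS * 2                   # 2,048 bytes per vector
    bin_per_vec = DIMS // 8                  # 128 bytes per vector (16x smaller)

    fleet_nodes = 100                        # assumed fleet size, purely for illustration
    f16_per_node = N * f16_per_vec / fleet_nodes / 1e9   # GB per node
    bin_per_node = N * bin_per_vec / fleet_nodes / 1e9   # GB per node

    print(f"f16:    {f16_per_node:,.0f} GB per node")    # ~2,048 GB: does not fit in DRAM
    print(f"binary: {bin_per_node:,.0f} GB per node")    # ~128 GB: fits in DRAM; f16 stays on NVMe/S3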
The Code, Annotated
Snippet One: Binary Quantization and Hamming Distance Kernel
import numpy as np

# Standard f16 vector storage: 2 bytes per dimension, 1024D = 2048 bytes per vector
# Binary quantization: 1 bit per dimension, 1024D = 128 bytes per vector
# 16x compression ratio

def quantize_to_binary(vectors_f16: np.ndarray) -> np.ndarray:
    """
    Convert f16 vectors to a binary representation for RaBitQ-style quantization.
    vectors_f16: shape (N, D), dtype float16
    returns: shape (N, D//8), dtype uint8
    """
    N, D = vectors_f16.shape
    assert D % 8 == 0, "D must be a multiple of 8 for clean bit packing"
    # ← THIS is the trick: convert to binary by thresholding at 0.
    # RaBitQ uses random rotation + thresholding, but the core idea is:
    # each float becomes one bit, 1 if positive, 0 if not.
    # This 16x compression is what fits 12.5 TiB into DRAM at 100B scale.
    binary = (vectors_f16 > 0).astype(np.uint8)
    # Pack 8 bits into each byte (bit packing).
    # Input: (N, D) array of 0/1 uint8. Output: (N, D//8) array of packed uint8.
    packed = np.packbits(binary, axis=1)  # shape (N, D//8)
    return packed  # at D=1024: (N, 128) bytes, 16x smaller than f16


def hamming_distance_batch(
    query_binary: np.ndarray,       # shape (D//8,), one packed query
    candidates_binary: np.ndarray,  # shape (N, D//8), N packed candidates
) -> np.ndarray:
    """
    Compute the Hamming distance from the query to all N candidates.
    Returns shape (N,) of int32 distances.

    On modern CPUs with AVX-512, this maps to:
      XOR:      compare 512 bits (64 bytes) per instruction
      POPCOUNT: count differing bits per 64-byte block
    This is the exact operation turbopuffer's kernel runs per clock cycle.

    ← The key: each byte fetched is used for 8 bit-comparisons, not 1 float op.
    RaBitQ reuses each bit 4x in its distance estimator → 64x higher arithmetic
    intensity vs. an f16 dot product. This tips the system from bandwidth-bound
    to compute-bound, which is why turbopuffer then needed to optimize the kernel.
    """
    # XOR reveals the differing bits between the query and each candidate.
    xor_result = np.bitwise_xor(candidates_binary, query_binary)  # (N, D//8)
    # POPCOUNT counts the number of 1-bits in each XOR result.
    # In pure numpy this is slow; in AVX-512 it is extremely fast.
    # np.unpackbits followed by .sum() is a reasonable Python approximation:
    bit_differences = np.unpackbits(xor_result, axis=1).sum(axis=1)  # (N,)
    return bit_differences.astype(np.int32)

# Performance comparison:
#   f16 dot product (SDOT):   1 float op per 2 bytes → arithmetic intensity ~0.5 FLOPs/byte
#   Binary Hamming (AVX-512): ~8 bit ops per byte + RaBitQ 4x reuse
#                             → arithmetic intensity ~32 ops/byte (64x higher)
# This difference is why the binary search pass is compute-bound while f16 search is bandwidth-bound.
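For readers who want to run the kernel above, here is a quick smoke test on synthetic data. It is purely illustrative; the vectors are random Gaussians, not embeddings:

    # Smoke test for quantize_to_binary / hamming_distance_batch on synthetic data.
    rng = np.random.default_rng(0)
    data = rng.standard_normal((10_000, 1024)).astype(np.float16)   # 10k fake "embeddings"
    query = rng.standard_normal(1024).astype(np.float16)

    packed = quantize_to_binary(data)                     # (10000, 128) uint8
    q_packed = quantize_to_binary(query.reshape(1, -1))[0]

    dists = hamming_distance_batch(q_packed, packed)      # (10000,) int32, values in [0, 1024]
    print(packed.shape, int(dists.min()), int(dists.max()))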
The 64x jump in arithmetic intensity is not an optimization. It is a phase transition. The system's behavior changes qualitatively, from "get bytes faster" to "process bytes faster," and those two regimes require completely different optimization strategies.
Snippet Two: Hierarchical Clustering, Cold Query Bound, and Reranking
import heapq
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CentroidNode:
    """One node in the SPFresh hierarchical centroid tree."""
    centroid: np.ndarray                            # the mean vector of this cluster, f16
    children: list = field(default_factory=list)    # child CentroidNodes
    vector_ids: Optional[list] = None               # only populated at leaf nodes


class HierarchicalANNIndex:
    """
    Simplified representation of turbopuffer's SPFresh-based index.
    Key insight: hierarchical clustering bounds cold-query S3 round-trips.
    """

    def query(
        self,
        query_f16: np.ndarray,   # the query vector in full f16 precision
        k: int,
        recall_target: float = 0.92,
    ) -> list[int]:
        # STEP 1: Navigate the centroid tree (hierarchical narrowing).
        # ← Each level does: compare the query to all centroids at this level
        #   and keep the top-branch candidates.
        # Critical for cold queries: each tree level = one S3 round-trip, maximum.
        # Without the hierarchy, a graph-based index needs sequential graph hops,
        # each potentially requiring a separate S3 read → unbounded tail latency.
        # With the hierarchy: tail latency is bounded by tree height = O(log N) round-trips.
        promising_clusters = self._navigate_tree(query_f16, n_clusters=3)

        # STEP 2: Binary quantization pass (coarse ranking).
        # All cluster data is stored in DRAM as binary vectors (16x compressed),
        # so this entire pass runs from DRAM and never touches NVMe.
        query_binary = quantize_to_binary(query_f16.reshape(1, -1))[0]
        candidates = []
        for cluster in promising_clusters:
            binary_vectors = self._load_binary_cluster(cluster)  # from DRAM
            distances = hamming_distance_batch(query_binary, binary_vectors)
            # Keep the top-M candidates from this cluster (M >> k but << cluster_size).
            m = min(100, len(distances))
            top_m_indices = np.argpartition(distances, m - 1)[:m]
            candidates.extend(
                (distances[i], cluster.vector_ids[i]) for i in top_m_indices
            )

        # STEP 3: Full-precision reranking (NVMe scatter-gather).
        # ← THIS is the key design decision for accuracy:
        #   we don't trust binary quantization for the final ranking.
        # Binary gives a "good enough" shortlist; full precision gives the correct
        # final order. The shortlist is small (hundreds of vectors), so the NVMe
        # reads are a tiny fraction of total cluster size.
        # NVMe random read latency: 50-100 μs per I/O, highly parallelizable.
        top_candidates = heapq.nsmallest(200, candidates, key=lambda x: x[0])
        candidate_ids = [vid for _, vid in top_candidates]
        # Fetch full f16 vectors for only the shortlisted candidates (scatter-gather).
        # This is where NVMe bandwidth is used: parallel I/O over many small reads.
        full_vectors = self._scatter_gather_f16(candidate_ids)  # from NVMe SSD
        # Compute exact similarities at full f16 precision.
        exact_scores = full_vectors @ query_f16
        final_ranking = np.argsort(-exact_scores)[:k]
        return [candidate_ids[i] for i in final_ranking]
The two-stage design is not a compromise. It is the correct decomposition: use cheap, high-compression binary comparison to eliminate 99%+ of candidates from DRAM, then use expensive, accurate f16 comparison on the tiny survivor set from NVMe. The ranking error introduced by binary quantization is corrected by the reranking step; the only loss is any true neighbor that fails to make the shortlist, which is exactly what the recall target controls.
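The snippet leans on three helpers the blog does not spell out (_navigate_tree, _load_binary_cluster, _scatter_gather_f16). Below is a minimal, in-memory sketch of what they might look like, purely to make the data flow concrete. These are illustrative assumptions, not turbopuffer's implementation: the real versions sit on top of the DRAM/NVMe cache and S3, which this toy version ignores.

    # Toy, in-memory stand-ins for the three storage-layer helpers used above.
    class InMemoryHierarchicalANNIndex(HierarchicalANNIndex):
        def __init__(self, root: CentroidNode, binary_store: dict, f16_store: dict):
            self.root = root                   # root of the centroid tree
            self.binary_store = binary_store   # id(leaf node) -> (n, D//8) packed uint8 array
            self.f16_store = f16_store         # vector id -> (D,) f16 array

        def _navigate_tree(self, query_f16, n_clusters: int) -> list[CentroidNode]:
            """Beam search down the centroid tree, keeping the n_clusters best nodes per level.
            Assumes a balanced tree (all leaves at the same depth)."""
            frontier = [self.root]
            while any(node.children for node in frontier):
                children = [c for node in frontier for c in node.children]
                sims = np.array([float(c.centroid @ query_f16) for c in children])
                keep = np.argsort(-sims)[:n_clusters]
                frontier = [children[i] for i in keep]
            return frontier                    # leaf clusters

        def _load_binary_cluster(self, cluster: CentroidNode) -> np.ndarray:
            # Real system: DRAM cache falling back to NVMe/S3. Here: a dict lookup
            # keyed by object identity (a toy keying scheme for the sketch).
            return self.binary_store[id(cluster)]

        def _scatter_gather_f16(self, candidate_ids: list[int]) -> np.ndarray:
            # Real system: parallel NVMe reads. Here: stack rows out of a dict.
            return np.stack([self.f16_store[vid] for vid in candidate_ids])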
In Action: End-to-End Worked Example
Input: Query a 100-billion-vector namespace with a 1,024-dimensional f16 query vector. Target: k=10 nearest neighbors, 92% recall, p99 ≤ 200ms.
Scale assumptions:
100B vectors × 1024 dims × 2 bytes = 200 TiB f16 data on S3
Binary quantized: 100B × 128 bytes = 12.5 TiB in DRAM
Cluster size: ~500 vectors per leaf (example parameterization)
Top-3 clusters selected at query time
Reranking candidate count: 200 vectors
Step 1: Centroid tree navigation (1-50ms)
Query vector q (1024D f16) hits the index.
Root level: 1,000 centroids. Compare q against all 1,000 centroids.
→ AVX-512: processes 16 f16 values per instruction
→ ~1,024 ÷ 16 = 64 instructions per centroid comparison
→ 1,000 comparisons: ~64,000 instructions, completes in microseconds from L2 cache
→ Top-3 centroids selected by cosine similarity
L1 level: 3 sub-trees explored, each with ~300 centroids.
→ Same process: 3 × 300 × 64 instructions, still in L2/L3 cache
→ Top-3 L1 centroids selected
Leaf level: 3 leaf clusters identified.
→ Each cluster file: ~1 MB in f16 (500 vectors × 1,024D × 2 bytes per vector); ~62.5 KB in binary
→ Cache status check:
DRAM hit: ~0ms additional latency
NVMe hit: ~5ms (parallel reads, 3 clusters simultaneously)
S3 cold: ~50ms (parallel GETs, 3 clusters simultaneously, bounded by tree height)
Worst case (S3 cold): ~50ms for cluster fetch
Best case (DRAM hot): ~1ms total for step 1
Step 2: Binary quantization search (80 microseconds)
3 clusters, ~500 vectors each = 1,500 candidates
Each binary vector: 128 bytes (vs 2,048 bytes for f16)
Total binary data size: 1,500 × 128 = 192 KB (fits comfortably in L3 cache)
For each of 1,500 candidates:
XOR(query_binary, candidate_binary): 128 bytes ÷ 64 bytes per AVX-512 op = 2 ops
POPCOUNT(XOR_result): 2 more ops
Total: ~4 AVX-512 ops per candidate
1,500 candidates × 4 ops = 6,000 AVX-512 ops
At ~3 GHz clock, ~2 billion AVX-512 ops/second: 6,000 ÷ 2B ≈ 3 μs
Including memory fetches and overhead: ~80 μs measured
Output: top-200 candidates ranked by Hamming distance
Binary distance is approximate: some true nearest neighbors ranked lower,
but recall-optimized parameters ensure ≥92% of true top-10 are in the top-200.
Step 3: Full-precision reranking (5-10ms)
200 candidates from step 2 need exact distance recomputation.
Each full f16 vector: 2,048 bytes
Total data: 200 × 2,048 = 409.6 KB from NVMe (scatter-gather)
NVMe random read: 50-100 μs per I/O operation
200 parallel I/O requests (scatter-gather) in ~5ms total
For each of 200 candidates:
Full dot product at f16: 1,024 multiply-accumulate ops
AVX-512 processes 16 f16 values per instruction: 64 instructions per dot product
200 dot products × 64 instructions = 12,800 instructions: negligible time
Output: exact top-10 by cosine similarity, returned to client
Total latency breakdown (hot cache):
Centroid navigation: ~1ms
Binary search pass: ~0.08ms
NVMe scatter-gather: ~5ms
Exact reranking compute: ~0.5ms
Overhead: ~1ms
Total: ~7-8ms (median hot latency)
Cold query (S3 fetch):
S3 cluster fetch: ~50ms
+ hot path above: ~7ms
Total: ~57ms (cold median)
p99 target met at 200ms with buffer for variance.
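As a sanity check, the budget above can be tallied directly. The numbers are copied from the breakdown and are estimates, not measurements:

    # Latency budget tally mirroring the breakdown above (illustrative estimates).
    hot_path_ms = {
        "centroid navigation": 1.0,
        "binary search pass": 0.08,
        "NVMe scatter-gather": 5.0,
        "exact reranking compute": 0.5,
        "overhead": 1.0,
    }
    hot_total = sum(hot_path_ms.values())
    cold_total = hot_total + 50.0            # + S3 cluster fetch on a cold namespace

    print(f"hot-path median:  ~{hot_total:.1f} ms")    # ~7.6 ms
    print(f"cold-path median: ~{cold_total:.1f} ms")   # ~57.6 ms, well under the 200 ms p99 target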
Why This Design Works, and What It Trades Away
The binary quantization is the correct quantization choice for the high-bandwidth-to-cache-efficiency problem at this scale. Product quantization (PQ) achieves higher accuracy at lower compression ratios, but PQ codebook lookups require more complex SIMD patterns (table lookups, not pure XOR/POPCOUNT). Binary quantization maps directly to AVX-512 VPXORQ and VPOPCNTQ instructions that process 512 bits per cycle. The AQR-HNSW paper (arXiv:2602.21600) demonstrates a complementary approach: density-aware quantization combined with multi-stage re-ranking achieves better recall at moderate compression ratios. At 100B-scale where the goal is fitting quantized data in DRAM, binary's extreme compression ratio (16x vs PQ's typically 4-8x) is the decisive factor.
The SPFresh hierarchical clustering is the correct index structure for the cold-query tail latency problem. Graph-based indexes (HNSW) traverse edges sequentially: each edge traversal could require a separate S3 fetch. At 100B scale with cold namespaces, this produces unbounded tail latency. The hierarchical tree bounds cold fetches to tree height (logarithmic). SPFresh also supports incremental updates without full index rebuilds, which is essential for a live production index at this scale.
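The round-trip bound is easy to quantify. A small sketch, assuming the ~1,000-way fan-out and ~500-vector leaves used in the worked example above (the exact fan-out is an assumption, not a published parameter):

    import math

    # Cold-query S3 round-trip bound: at most one fetch per tree level.
    N = 100e9        # vectors
    fanout = 1_000   # centroids per node (assumption, taken from the worked example)
    leaf_size = 500  # vectors per leaf cluster

    levels = math.ceil(math.log(N / leaf_size, fanout))
    print(f"tree height ≈ {levels} levels")   # ≈ 3, so the cold tail is bounded by a handful of S3 round-trips
    # Compare: an HNSW graph walk can take tens to hundreds of sequential hops,
    # each a potential S3 read on a cold namespace, i.e. unbounded tail latency.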
The stateless architecture over S3 is the correct durability model for this scale. Storing data durably on object storage with a warm-cache compute layer means the storage cost is object storage pricing (cheap) rather than provisioned block storage pricing (expensive). The cache layer absorbs hot working sets in DRAM and NVMe. The tradeoff: cold queries pay the S3 latency penalty, bounded but real.
What ANN v3 trades away:
AVX2 machines. ANN v3's binary quantization kernel requires AVX-512 for an efficient vectorized POPCOUNT. The ANN v3 blog explicitly notes: "we were recently surprised to find that AVX2 (256-bit x86 SIMD) does not have an accelerated popcount instruction." On AVX2-only machines, the binary kernel falls back to scalar POPCNT or in-register emulation, which is substantially slower. Any production deployment targeting binary quantization at this scale needs to verify AVX-512 support in the instance family.
Extreme recall requirements. At 92% recall, some true nearest neighbors are missed. For applications requiring 99%+ recall (certain safety-critical retrieval tasks), the binary quantization error budget needs to be tightened by increasing the candidate count N in the reranking step, which increases NVMe I/O and compute cost. The tradeoff between recall and latency is tunable but exists.
Homogeneous vector dimensions. The analysis is calibrated for 1,024D f16 vectors. Significantly shorter vectors (e.g., 128D) change the bandwidth calculations and may not benefit as dramatically from binary quantization because the cluster files become small enough that even f16 vectors fit in DRAM.
Technical Moats
The arithmetic intensity flip. Most vector database engineers think about "making distance computation faster." ANN v3 reframes this: the bottleneck is not computation, it is data movement. Binary quantization's 16x compression plus 64x arithmetic intensity increase changes which resource is the bottleneck, and that flip enables a qualitatively different optimization strategy (compute-side AVX-512 tuning vs. bandwidth-side compression). Recognizing that your system is bandwidth-bound before optimizing is the insight. Most teams optimize the compute kernel first and wonder why it does not scale.
The two-stage approximate-then-refine pattern at every level. ANN v3 applies the same "coarse approximation followed by precise refinement" at three distinct scales: the centroid tree narrows billions of vectors to thousands, binary quantization narrows thousands to hundreds, and full-precision reranking narrows hundreds to k. Each stage uses the appropriate accuracy for its scale. This compositional correctness design is what makes the accuracy/speed tradeoff tunable without architectural changes.
The SPFresh cold-query bound. Bounding object storage round-trips to tree height (logarithmic) rather than allowing unbounded graph traversal eliminates the class of tail latency failures that afflict graph-based indexes at this scale. This is an architectural guarantee, not a tuning parameter.
Insights
Insight One: The vector database community benchmarks throughput-per-dollar at million-scale. ANN v3 changes the competitive frame to billion-scale on commodity hardware, where the bottleneck economics are completely different.
At million-scale, HNSW with in-memory f16 vectors on a single machine is perfectly adequate. The competitive differentiation between vector databases at this scale is largely marginal: recall tuning, filter performance, and API ergonomics. At 100-billion-scale, the bottleneck changes from "which index structure is faster" to "can you even fit the data in a way that allows queries." ANN v3's insight is that the path to 100-billion-scale is not more RAM or more GPUs. It is finding the right place in the memory hierarchy for each phase of the computation, which requires a fundamentally different index architecture than what scales to a billion.
Insight Two: Binary quantization does not reduce search quality. It shifts where quality loss occurs, and full-precision reranking fully corrects it, making the system's accuracy determined entirely by reranking candidate count, not by quantization error.
The common understanding of quantization is: compressed vectors → worse distances → lower recall. This is correct for systems that use quantized distances as final distances. ANN v3 uses binary distances only as a filter. The top-N binary-ranked candidates include the true top-k at 92%+ recall even though individual binary distances are imprecise. The full-precision reranking then computes exact distances for this shortlist. The final recall is determined entirely by whether the true top-k survived the binary filter, not by quantization error in the reranking step itself. This is the insight that makes two-stage retrieval systems work: you do not need the approximate phase to be accurate, you need it to include the right candidates in its output set.
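A toy simulation makes this concrete: with an exact reranking stage, final recall equals the fraction of the true top-k that survives the binary filter, no matter how noisy the individual Hamming distances are. This sketch uses random Gaussian vectors rather than real embeddings, so the absolute recall number is illustrative only:

    import numpy as np

    # Toy experiment: final recall == filter recall when reranking is exact (up to ties).
    rng = np.random.default_rng(42)
    N, D, K, SHORTLIST = 50_000, 256, 10, 200

    data = rng.standard_normal((N, D)).astype(np.float16)
    query = rng.standard_normal(D).astype(np.float16)

    # Ground truth: exact top-K by dot product at full precision.
    exact_scores = data.astype(np.float32) @ query.astype(np.float32)
    true_topk = set(np.argsort(-exact_scores)[:K])

    # Stage 1: sign-bit quantization + Hamming distance (coarse filter).
    data_bits = np.packbits(data > 0, axis=1)
    query_bits = np.packbits(query > 0)
    hamming = np.unpackbits(np.bitwise_xor(data_bits, query_bits), axis=1).sum(axis=1)
    shortlist = np.argsort(hamming)[:SHORTLIST]
    filter_recall = len(true_topk & set(shortlist)) / K

    # Stage 2: exact rerank of the shortlist only.
    rerank_scores = data[shortlist].astype(np.float32) @ query.astype(np.float32)
    final_topk = set(shortlist[np.argsort(-rerank_scores)[:K]])
    final_recall = len(true_topk & set(final_topk)) / K

    print(f"filter recall: {filter_recall:.2f}, final recall: {final_recall:.2f}")  # equal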
Takeaway
AVX2 (256-bit SIMD, which is far more common than AVX-512) has no vectorized popcount instruction. The binary quantization approach that enables ANN v3 fundamentally requires AVX-512, ideally with the VPOPCNTDQ extension. AVX2-only machines must fall back to scalar POPCNT or emulated vector popcount, which is substantially slower and undercuts the compute-bound arithmetic intensity argument.
This is not a theoretical limitation. Turbopuffer explicitly encountered this during ANN v3 development and noted the surprise in the blog post. The implication for AI infrastructure teams building binary quantization into their vector search stacks: verify AVX-512 availability, and specifically the vectorized popcount extension (the avx512_vpopcntdq CPU flag), in your instance family before adopting binary quantization as a primary optimization. On AWS, for example, Ice Lake-based families such as i4i expose it, while Skylake-era families such as r5 support AVX-512 but lack the vectorized popcount; check the exact CPU generation rather than assuming AVX-512 alone is enough. This single hardware detail separates whether binary quantization achieves its theoretical throughput gains or not.
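One practical way to check before committing to an instance family is to read the CPU flags directly. A Linux-only sketch; the flag spellings are the standard /proc/cpuinfo names:

    # Linux-only sketch: check for AVX-512 and the vectorized popcount extension
    # (VPOPCNTDQ) via /proc/cpuinfo.
    def has_cpu_flags(*flags: str) -> bool:
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
        flag_line = next(line for line in cpuinfo.splitlines() if line.startswith("flags"))
        present = set(flag_line.split(":")[1].split())
        return all(flag in present for flag in flags)

    print("AVX-512F:         ", has_cpu_flags("avx512f"))
    print("AVX-512 VPOPCNTDQ:", has_cpu_flags("avx512_vpopcntdq"))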
TL;DR For Engineers
Vector search is memory-bandwidth-bound, not compute-bound. Each vector element is used once (SDOT arithmetic intensity ~0.5 FLOPs/byte). Optimizing the distance kernel without reducing bandwidth requirements does not help at scale.
Binary quantization (16x compression, f16 → 1-bit) changes arithmetic intensity to 32 FLOPs/byte (64x higher via RaBitQ 4x bit reuse), making the system compute-bound. This allows 100B vectors' quantized representations (12.5 TiB) to fit in DRAM, changing which hardware resource limits throughput.
Two-stage architecture: (1) binary quantization over DRAM to get top-N candidates (~80 μs), (2) full-precision f16 reranking over NVMe scatter-gather for top-k final results. Binary error is fully corrected by reranking. Final recall is determined by binary filter's recall, not quantization accuracy.
SPFresh hierarchical clustering bounds cold-query S3 round-trips to tree height. No graph traversal = no unbounded tail latency on cold namespaces.
AVX-512 (with vectorized popcount) is required for an efficient binary quantization kernel. AVX2 has no native vector popcount instruction. Verify AVX-512 availability in your target instance family before adopting binary quantization at production scale.
Bandwidth Is the Problem. Compression Is the Architecture.
ANN v3's insight is not "use binary quantization to go faster." It is "recognize that vector search is fundamentally bandwidth-bound, identify every level of the memory hierarchy where that bandwidth constraint appears, and design a quantization and caching strategy that pushes the bottleneck down to the fastest available level at each scale." Binary quantization accomplishes two things simultaneously: it reduces the bandwidth requirement (16x smaller data) and raises the arithmetic intensity enough to flip the binding constraint from bandwidth to compute, enabling a completely different class of optimization. This is the correct systems engineering frame. The specific algorithm (RaBitQ), the index structure (SPFresh), and the hardware choices (AVX-512) all follow from the initial bottleneck analysis.
References
ANN v3: 200ms p99 query latency over 100 billion vectors, Nathan VanBenschoten, turbopuffer, January 21, 2026
Quantization to Speedup Approximate Nearest Neighbor Search, Neural Computing and Applications, 2023
Roofline Model and Arithmetic Intensity, Wikipedia
How turbopuffer Serves 2.5 Trillion Vectors on S3, Ajay Edupuganti, April 2026
Case Study: turbopuffer ANN v3, Terence Z. Liu, February 2026
Turbopuffer ANN v3 (January 2026) achieves 200ms p99 latency across 100 billion vectors (200 TiB f16) by recognizing that vector search is memory-bandwidth-bound rather than compute-bound, then applying binary quantization (RaBitQ, 16x compression) to raise arithmetic intensity 64x and fit the quantized dataset (12.5 TiB) in DRAM, hierarchical SPFresh clustering to bound cold-query S3 round-trips to tree height, and full-precision reranking on a small NVMe-fetched shortlist to correct binary approximation error. With binary quantization the system transitions from bandwidth-bound to compute-bound and relies on AVX-512 vectorized popcount for efficient distance computation; AVX2 machines lack a native vector popcount instruction and must fall back to slower emulation.