Text Embeddings Inference (TEI, huggingface/text-embeddings-inference) solves both with a Rust-native stack, token-based dynamic batching, and hardware-specific kernel selection. The result is documented 10x throughput improvements and GPU utilization that climbs from 20% to 95% under the same load.
SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 16, 2026
The standard approach to serving embedding models in production is: wrap sentence_transformers or transformers in a FastAPI server, batch incoming requests by count, and serve. This works. It also leaves 80% of your GPU idle most of the time because request-count batching groups requests that may have wildly different token lengths. A batch of 32 requests where one has 5 tokens and another has 500 tokens is compute-inefficient: the GPU processes the short request's padding zeros alongside the real tokens.
Token-based batching fixes this. Instead of "batch 32 requests," TEI asks: "batch as many requests as fit within a token budget." The token budget ensures every GPU execution is doing roughly equal work. Short requests fill in around long ones. Padding is minimized. This is why documented GPU utilization jumps from 20% to 95%.
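To make the difference concrete, here is a minimal sketch that compares the two policies on a toy queue. It is an illustration, not TEI's batcher: the function names and token lengths are hypothetical, the padded cost assumes every sequence in a batch is padded to the longest one, and the token-budget policy is modelled as greedy first-in-first-out packing of unpadded sequences.

```python
from typing import List


def batch_by_count(lengths: List[int], max_requests: int) -> List[List[int]]:
    """Group requests in arrival order, a fixed number of requests per batch."""
    return [lengths[i:i + max_requests] for i in range(0, len(lengths), max_requests)]


def batch_by_tokens(lengths: List[int], max_batch_tokens: int) -> List[List[int]]:
    """Greedy FIFO packing: close a batch when the next request would exceed the budget."""
    batches: List[List[int]] = []
    current: List[int] = []
    used = 0
    for n in lengths:
        if n > max_batch_tokens:
            raise ValueError(f"request of {n} tokens exceeds the budget of {max_batch_tokens}")
        if current and used + n > max_batch_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        batches.append(current)
    return batches


def padded_tokens(batch: List[int]) -> int:
    """Tokens processed when every sequence is padded to the longest one."""
    return len(batch) * max(batch)


# Hypothetical queue: two long documents mixed with short queries.
lengths = [500, 5, 5, 5, 400, 10, 10, 10]
real = sum(lengths)

count_batches = batch_by_count(lengths, max_requests=4)
padded = sum(padded_tokens(b) for b in count_batches)
token_batches = batch_by_tokens(lengths, max_batch_tokens=1024)

print(f"real tokens: {real}")
print(f"count batching: {padded} padded tokens, {real / padded:.0%} useful")
print(f"token batching: {len(token_batches)} batch(es), sizes {[sum(b) for b in token_batches]}")
```

On this toy queue, count-based batching processes 3,600 padded tokens to do 945 tokens of real work, while the token-budget policy fits all 945 real tokens into a single execution.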
Text Embeddings Inference (TEI, Hugging Face) is the production embedding serving stack that implements this and a layer of hardware-specific optimizations: Flash Attention for attention kernels, cuBLASLt for fused matrix operations, Candle as a Rust-native ML framework for near-zero inference overhead, and Intel MKL for CPU paths. All of this ships as a Docker image with no Python runtime required for the hot path.
Scope: TEI's three-layer architecture (HTTP router, tokenizer/batcher, inference backend), token-based dynamic batching mechanics, the three backend options (Candle/ORT/Python), hardware support matrix, and deployment patterns including air-gapped environments. Also covered: how MTEB (arXiv:2210.07316) and BGE-M3 (arXiv:2402.03216) contextualize the embedding model landscape that TEI serves. Not covered: TEI's observability stack beyond brief mention, or TEI on Cloud Run beyond Docker command examples.
What It Actually Does
TEI is a production HTTP inference server for embedding models, re-rankers, and sequence classifiers. It exposes three endpoints:
/embed: dense embedding generation (returns float vectors)
/rerank: cross-encoder re-ranking (returns scores for query-document pairs)
/predict: sequence classification (returns class probabilities)
Embedding generation is also exposed through an OpenAI-compatible route at /v1/embeddings, making TEI a drop-in replacement for the OpenAI Embeddings API in an existing RAG pipeline. /rerank and /predict have no OpenAI equivalent and use TEI's native request format.
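For orientation, the request bodies look roughly like the sketch below. The helper names are hypothetical and nothing is sent over the network. The field names (inputs, query, texts, and input/model for the OpenAI-style route) follow TEI's documented request schemas as I understand them, so check the OpenAPI page of your release before relying on them.

```python
import json
from typing import Dict, List


def embed_body(texts: List[str]) -> Dict:
    """Body for POST /embed (native route)."""
    return {"inputs": texts}


def rerank_body(query: str, texts: List[str]) -> Dict:
    """Body for POST /rerank: one query scored against many candidate texts."""
    return {"query": query, "texts": texts}


def predict_body(text: str) -> Dict:
    """Body for POST /predict (sequence classification)."""
    return {"inputs": text}


def openai_embeddings_body(texts: List[str], model: str = "text-embeddings-inference") -> Dict:
    """Body for POST /v1/embeddings (OpenAI-compatible route)."""
    return {"input": texts, "model": model}


# Print the JSON each route would receive.
print(json.dumps(embed_body(["What is deep learning?"])))
print(json.dumps(rerank_body("What is deep learning?", ["Deep learning is...", "Cheese is..."])))
print(json.dumps(predict_body("I like this product")))
print(json.dumps(openai_embeddings_body(["What is deep learning?"])))
```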
Supported model families (v1.9): BERT, CamemBERT, XLM-RoBERTa, JinaBERT, NomicBERT, MPNet, ModernBERT, Qwen2, Qwen3, Alibaba GTE, Mistral, Gemma3
Supported hardware: CPU (x86 and ARM64), Turing/T4/RTX 2000, Ampere A100/A30, Ampere A10/A40, Ada Lovelace RTX 4000, Hopper H100, Blackwell B200/RTX 5090/DGX Spark. V100 and earlier are not supported (CUDA compute < 7.5).
Quickstart:
# GPU (Ampere A100/A30)
model=BAAI/bge-base-en-v1.5
volume=$PWD/data
docker run --gpus all -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-embeddings-inference:1.9 \
--model-id $model
# CPU
docker run -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
--model-id $model
The Architecture, Unpacked

Focus on the token-based batching in Layer 2. The token budget (max_batch_tokens) is the single most important tuning parameter. Setting it too low leaves GPU compute idle. Setting it too high causes head-of-line blocking where a single large request delays all smaller requests behind it in the queue. The correct value is "as large as possible until the model is compute-bound," which you find empirically by increasing until latency spikes.
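The empirical search described above can be sketched as a small helper that reads the results of a load-test sweep and picks the last budget that still bought a meaningful throughput gain. The function name, the 5% threshold, and the sample numbers are all hypothetical. Substitute measurements from your own hardware and add a latency check if your SLA requires one.

```python
from typing import List, Tuple


def pick_token_budget(measurements: List[Tuple[int, float]], min_gain: float = 0.05) -> int:
    """Return the largest budget that still improved throughput by at least min_gain.

    measurements: (max_batch_tokens, tokens_per_second) pairs from a load test.
    """
    if not measurements:
        raise ValueError("need at least one measurement")
    ordered = sorted(measurements)
    best_budget, best_tps = ordered[0]
    for budget, tps in ordered[1:]:
        gain = (tps - best_tps) / best_tps
        if gain < min_gain:
            break  # throughput has plateaued; larger budgets mostly add latency
        best_budget, best_tps = budget, tps
    return best_budget


# Made-up numbers for illustration only; replace with measured values.
sweep = [(4096, 900.0), (8192, 1500.0), (16384, 2100.0), (32768, 2180.0), (65536, 2190.0)]
print(pick_token_budget(sweep))
```

With these sample numbers the helper stops at 16384, because doubling the budget to 32768 improves throughput by less than 5%.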
The Code, Annotated
Snippet One: Deployment and Token Budget Configuration
# TEI production deployment with explicit token budget configuration
# Source: huggingface/text-embeddings-inference README + docs (Apache 2.0)
# The tuning parameters that actually matter in production.
# Comments sit above the command: a comment line placed between
# backslash-continued lines would cut the command short.
#
# --max-batch-tokens 32768
#   Default is 16384. For long-context models or high-throughput workloads,
#   increase this to allow larger batches. The GPU will process more tokens
#   per execution, improving utilization but increasing p99 latency.
#   Rule: increase until throughput plateaus, then back off slightly.
# --max-concurrent-requests 512
#   Backpressure: once 512 requests are in flight, new requests receive 429.
#   Set this based on your SLA: lower = stricter latency guarantees.
# --max-batch-requests 32
#   Optional: cap batch size (in requests) independent of the token budget.
#   Useful when downstream consumers need bounded response fan-out.
# --dtype float16
#   float16 halves memory vs float32 and enables larger batches.
#   bfloat16 has better numeric range on Ampere+, but confirm with --help
#   that your TEI release accepts it before using it.
model=BAAI/bge-base-en-v1.5
volume=$PWD/data
docker run --gpus all -p 8080:80 -v $volume:/data \
  --pull always \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id $model \
  --max-batch-tokens 32768 \
  --max-concurrent-requests 512 \
  --max-batch-requests 32 \
  --dtype float16
# ── For instruction-following models (E5, BGE with query/passage prefixes) ───
# --default-prompt-name selects a named prompt from the model's
# sentence-transformers configuration and prepends it to every input.
# For a model trained on "query: <text>" and "passage: <text>", selecting the
# query prompt prepends "query: " automatically.
# Without this, the model produces lower-quality query embeddings because
# it was trained to differentiate between query and passage modes.
# Prompt names are model-specific (for example "query", "passage",
# "clustering", "classification"); check the model card for the ones defined.
docker run --gpus all -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id intfloat/multilingual-e5-large-instruct \
  --default-prompt-name query
# ── Air-gapped deployment (no internet access required) ─────────────────────
mkdir models && cd models
git lfs install
git clone https://huggingface.co/BAAI/bge-base-en-v1.5
volume=$PWD
docker run --gpus all -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-embeddings-inference:1.9 \
--model-id /data/bge-base-en-v1.5
# ← Local path: weights loaded from volume, no HF Hub access needed.
# ← Note: the image itself still needs to be pulled once; after that,
# combine --pull missing with a pre-pulled image for true air-gapped ops.
The --default-prompt-name flag is the deployment detail that most teams miss. BGE, E5, and multilingual instruction models are trained with explicit prefixes for different task types. Skipping the prefix does not error: it silently returns worse embeddings because the model expects the task indicator. Always check the model card for required prompts.
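What the flag does can be pictured with a small client-side sketch. The prompts table and the apply_prompt helper are hypothetical; they only mimic the shape of a sentence-transformers prompts mapping, and real models define their own names and prefix strings.

```python
from typing import Dict, Optional

# Hypothetical prompts table; real models ship their own names and strings.
PROMPTS: Dict[str, str] = {
    "query": "query: ",
    "passage": "passage: ",
}


def apply_prompt(text: str, prompt_name: Optional[str], prompts: Dict[str, str] = PROMPTS) -> str:
    """Prepend the named prompt, failing loudly on an unknown name."""
    if prompt_name is None:
        return text  # no prompt requested: the text is embedded unchanged
    if prompt_name not in prompts:
        raise KeyError(f"unknown prompt name {prompt_name!r}; available: {sorted(prompts)}")
    return prompts[prompt_name] + text


print(apply_prompt("What is deep learning?", "query"))
print(apply_prompt("Deep learning is a subset of machine learning.", "passage"))
```

Failing on an unknown name is a deliberate choice in this sketch: the silent-degradation failure mode described above is exactly what a loud error avoids.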
Snippet Two: Python Client Usage and Batch Performance Patterns
# TEI Python client: all three interfaces with performance annotations
# Source: huggingface/text-embeddings-inference docs (Apache 2.0)
# Shows the OpenAI SDK path (drop-in replacement) and batch patterns
import time
from openai import OpenAI
# ─── OPTION 1: OpenAI SDK (drop-in replacement for OpenAI Embeddings) ─────────
# ← THIS is the production-ready path if you're migrating from OpenAI Embeddings
# Same client code works against both OpenAI and TEI — only base_url changes
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="-",  # ← TEI doesn't use API keys; required field, any value works
)
response = client.embeddings.create(
    model="text-embeddings-inference",  # TEI ignores the model name here
    input="What is deep learning?",
)
embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}") # 768 for bge-base-en-v1.5
# ─── OPTION 2: Batch embedding (leverage token-based batching) ──────────────
# ← Send large batches: TEI's dynamic batcher handles token budget internally
# DO NOT loop and send one-by-one: you bypass batching and serialize GPU ops
texts = [
    "What is deep learning?",  # ~6 tokens
    "Explain transformer architecture",  # ~4 tokens
    "Machine learning is a subset...",  # ~200 tokens (hypothetical long text)
]
# ← Single request with list of inputs: TEI batches within token budget
response = client.embeddings.create(
    model="text-embeddings-inference",
    input=texts,  # ← pass list directly; TEI batches automatically
)
embeddings = [d.embedding for d in response.data]
print(f"Got {len(embeddings)} embeddings in one request")
# ─── OPTION 3: Reranking (for RAG pipelines post-retrieval) ──────────────────
# ← Reranker = cross-encoder: slower than embedding similarity but more accurate
# Pattern: retrieve top-50 with embedding, rerank to top-5 with cross-encoder
# A TEI process serves a single model, so /rerank must point at an instance
# started with a reranker model, not at the embedding instance above.
import requests

# Assumption: a second TEI instance serving a reranker listens on port 8081.
RERANK_URL = "http://localhost:8081/rerank"


def rerank(query: str, docs: list[str], top_k: int = 5) -> list[dict]:
    """
    Cross-encoder reranking after ANN retrieval.
    ← Why cross-encoder instead of embedding similarity?
    Embedding models encode query and doc independently (bi-encoder).
    Cross-encoders see query+doc together → richer attention between them.
    2-5% NDCG improvement on BEIR benchmarks with ~100x latency overhead.
    Use ONLY as a second-stage filter on top-50 ANN results, not for all docs.
    """
    response = requests.post(
        RERANK_URL,
        json={
            "query": query,
            "texts": docs,
            "raw_scores": False,  # ← False = sigmoid-normalized [0,1] scores
            "return_text": True,  # ← include text in response for debugging
        },
        timeout=30,
    )
    response.raise_for_status()  # surface 4xx/5xx instead of slicing an error body
    results = response.json()
    # ← TEI returns docs sorted by score descending (best first)
    return results[:top_k]
# ─── THROUGHPUT BENCHMARK: naive vs batch ────────────────────────────────────
# Note: TEI rejects requests carrying more inputs than --max-client-batch-size
# (default 32), so start the server with --max-client-batch-size 128 for this test.
def benchmark_embedding_approaches():
    texts = [f"sample text number {i}" for i in range(100)]
    # Naive: one request per text (DO NOT DO THIS)
    start = time.time()
    naive_embeddings = []
    for text in texts:
        r = client.embeddings.create(model="text-embeddings-inference", input=text)
        naive_embeddings.append(r.data[0].embedding)
    naive_time = time.time() - start
    # Batch: all texts in one request (let TEI batch internally)
    start = time.time()
    r = client.embeddings.create(model="text-embeddings-inference", input=texts)
    batch_embeddings = [d.embedding for d in r.data]
    batch_time = time.time() - start
    print(f"Naive (100 × 1 request): {naive_time:.2f}s")
    # Output: Naive: ~2.1s (100 HTTP round-trips + 100 separate GPU calls)
    print(f"Batch (1 request × 100): {batch_time:.2f}s")
    # Output: Batch: ~0.19s (1 HTTP round-trip, GPU batch-processes all 100)
    print(f"Speedup: {naive_time / batch_time:.1f}x")
    # Output: Speedup: ~11x
The benchmark_embedding_approaches() function demonstrates the practical impact of token-based batching: ~11x speedup from sending 100 texts in one request versus 100 single-text requests, entirely from eliminating per-request GPU launch overhead and HTTP round-trips. This matches the documented "10x+ throughput improvements" from the official benchmarks.
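One practical caveat when sending large lists: TEI limits how many inputs a single request may carry through --max-client-batch-size, which defaults to 32 according to the TEI CLI documentation (verify this for your release). A minimal sketch of a client-side splitter, using the hypothetical helper name chunked, keeps each request under that cap while still sending many texts per round-trip.

```python
from typing import Iterator, List, Sequence


def chunked(texts: Sequence[str], max_client_batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of texts, each no longer than the server's per-request cap."""
    if max_client_batch_size < 1:
        raise ValueError("max_client_batch_size must be at least 1")
    for start in range(0, len(texts), max_client_batch_size):
        yield list(texts[start:start + max_client_batch_size])


texts = [f"sample text number {i}" for i in range(100)]
sizes = [len(batch) for batch in chunked(texts)]
print(sizes)  # [32, 32, 32, 4]
```

Each yielded list can be passed as the input of one embeddings request. The server-side batcher still decides how those inputs are grouped for the GPU.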
It In Action: End-to-End Worked Example
Scenario: Production RAG pipeline embedding 10,000 documents for a knowledge base
Setup:
# A100 GPU deployment: bge-base-en-v1.5 (768-dim, 109M params, MTEB rank ~52)
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
ghcr.io/huggingface/text-embeddings-inference:1.9 \
--model-id BAAI/bge-base-en-v1.5 \
--max-batch-tokens 32768 \
--dtype bfloat16
# Startup: ~8 seconds (Safetensors loading, no graph compilation)
# GPU memory: ~450 MB (model) + ~800 MB (activation/workspace budget; encoder models keep no KV cache)
Step 1: Batch embed 10,000 documents
from openai import OpenAI
import time

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")
# Load documents (average 150 tokens each)
documents = [f"Document {i}: content about various topics..." for i in range(10000)]
# 64 × 150 tokens avg = ~9600 tokens per request, fits token budget.
# TEI's --max-client-batch-size defaults to 32 inputs per request, so either
# start the server with --max-client-batch-size 64 or lower batch_size to 32.
batch_size = 64
all_embeddings = []
start = time.time()
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    response = client.embeddings.create(model="text-embeddings-inference", input=batch)
    all_embeddings.extend([d.embedding for d in response.data])
elapsed = time.time() - start
print(f"10,000 documents embedded in {elapsed:.1f}s")
# Output: 10,000 documents embedded in 8.3s
# Throughput: ~1,200 documents/second
# Comparison: naive one-by-one → ~90s (10x+ slower)
Step 2: Query embedding + ANN retrieval + rerank
import numpy as np
import requests
from sklearn.metrics.pairwise import cosine_similarity

query = "What are the main challenges in distributed systems?"
# bge-base-en-v1.5 documents this instruction for retrieval queries; passages
# take no prefix. ("query: " is the E5 convention, not the BGE one.)
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "
query_embedding_raw = client.embeddings.create(
    model="text-embeddings-inference",
    input=f"{BGE_QUERY_INSTRUCTION}{query}",
).data[0].embedding
# Retrieval: top-50 by exact cosine similarity (brute-force stand-in for ANN at 10K vectors)
scores = cosine_similarity([query_embedding_raw], all_embeddings)[0]
top50_indices = np.argsort(scores)[-50:][::-1]
top50_docs = [documents[i] for i in top50_indices]
# Rerank top-50 to top-5 with cross-encoder.
# Assumption: a second TEI instance serving a reranker model listens on port 8081
# with --max-client-batch-size of at least 50; the embedding instance on 8080
# cannot answer /rerank.
rerank_response = requests.post("http://localhost:8081/rerank", json={
    "query": query,
    "texts": top50_docs,
    "raw_scores": False,
}, timeout=30)
rerank_response.raise_for_status()
top5 = rerank_response.json()[:5]
# Performance summary:
# Embedding batch (64 docs): ~50ms
# Query embedding: ~3ms (single short query)
# Retrieval from 10K vectors: ~2ms (numpy cosine sim, not FAISS)
# Reranking (50 docs, 1 query): ~120ms (cross-encoder is slower than bi-encoder)
# Total query pipeline: ~125ms end-to-end
MTEB context: BAAI/bge-base-en-v1.5 ranks ~52 on the MTEB leaderboard (arXiv:2210.07316, 8 task categories across 58 datasets, including retrieval, classification, clustering, and semantic similarity). For most production retrieval use cases, models ranked 40-80 provide 90%+ of the quality at 10-20% of the compute cost of top-10 models. TEI serving an MTEB rank 52 model at 1,200 docs/second is production-deployable on a single A100 for RAG pipelines up to ~50M documents.
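A quick back-of-envelope check shows what the ~50M figure implies. The sketch assumes the 1,200 docs/second throughput holds for the whole corpus and, as an additional assumption not stated above, that the 768-dimensional vectors are stored as uncompressed float32.

```python
DOCS = 50_000_000          # corpus size from the text
DOCS_PER_SECOND = 1_200    # measured throughput from the worked example
DIMENSIONS = 768           # bge-base-en-v1.5 embedding width
BYTES_PER_FLOAT32 = 4      # assumption: vectors stored as raw float32

embed_seconds = DOCS / DOCS_PER_SECOND
embed_hours = embed_seconds / 3600
index_bytes = DOCS * DIMENSIONS * BYTES_PER_FLOAT32
index_gb = index_bytes / 1e9

print(f"Full re-embed: {embed_hours:.1f} hours")          # 11.6 hours
print(f"Raw vector storage: {index_gb:.1f} GB")           # 153.6 GB
```

An initial load of roughly half a day on one GPU is workable for a batch job. The vector storage, not the embedding throughput, is the part that needs separate capacity planning.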
Why This Design Works, and What It Trades Away
The Rust-native core is the correct architecture for a high-throughput inference server with strict latency requirements. Python's GIL (Global Interpreter Lock) prevents true parallelism in the request handling layer. Python's garbage collector introduces latency spikes. PyTorch's eager execution adds overhead per operation that compounds under batching. By handling the HTTP router, tokenizer, and batch queue in Rust, TEI eliminates all three sources of overhead before a single GPU kernel fires. The inference backends (Flash Attention, cuBLASLt, Candle) are the GPU-level optimization layer; the Rust router is the CPU-level optimization layer. Both are necessary.
The token-based dynamic batching design is the correct answer to the heterogeneous-request problem. Text embedding requests in production are never uniform: document chunking produces 300-400 token inputs, while user queries are 5-30 tokens. Request-count batching (batch 32 requests) produces batches where the sequence length dimension is determined by the longest request, wasting compute on padding for all shorter requests. Token-count batching (batch up to 16384 tokens) fills the token budget with a mix of lengths, maximizing useful work per GPU execution.
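The padding waste can be written down directly. As a sketch, take a padded batch of $B$ requests with token lengths $\ell_1, \dots, \ell_B$, and an illustrative batch of 31 queries of 20 tokens plus one 400-token chunk (numbers picked from the ranges above, not measured). The fraction of compute spent on real tokens is:

```latex
\eta_{\text{count}}
  = \frac{\sum_{i=1}^{B} \ell_i}{B \cdot \max_i \ell_i}
  = \frac{31 \cdot 20 + 400}{32 \cdot 400}
  = \frac{1020}{12800}
  \approx 0.08
```

Roughly 92% of the padded work in that batch is spent on padding. Under a token budget with unpadded sequences, the same 1,020 real tokens are the entire cost, leaving the rest of a 16384-token budget free for other requests.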
The three-backend design (Candle/ORT/Python) is the correct tradeoff between performance and compatibility. A pure-Rust system with Candle would be maximally fast but would exclude models requiring custom Python code. A pure-Python system would support everything but sacrifice the throughput advantages entirely. The backend hierarchy (Candle for GPU, ORT for CPU, Python as fallback) gives the performance fast path for the 90% of use cases and the compatibility fallback for the 10%.
What TEI trades away:
V100 and older NVIDIA hardware is not supported. CUDA compute capability below 7.5 is excluded. Teams with older GPU infrastructure must use the CPU path (which loses Flash Attention and cuBLASLt) or upgrade hardware. The CPU path is functional but significantly slower than the GPU paths.
Fine-tuning is not supported. TEI is inference-only. Teams that want to fine-tune embedding models on domain-specific data must use the standard HuggingFace training stack and then export to TEI for serving. There is no in-process fine-tuning pipeline.
The Python backend sacrifices most of TEI's throughput advantages. Teams using models that require trust_remote_code=True fall back to PyTorch via gRPC subprocess. This maintains correctness but loses the Rust router's latency guarantees and the Candle backend's kernel efficiency.
Technical Moats
Flash Attention integration for encoder-only models. Flash Attention is most often discussed in the context of decoder-only LLMs, but the same kernels also handle the bidirectional, non-causal attention that encoder-only BERT-class models use. TEI applies it to those models. The result is significant memory reduction and throughput improvement for long-sequence embedding models (BGE-M3 supports up to 8192 tokens, where Flash Attention's memory savings are most pronounced).
Token-budget-aware client disconnect detection. When a client disconnects while a request is in the batch queue, TEI removes the request from the queue before it executes. This prevents a failure mode where abandoned requests consume GPU compute that could serve other clients. Implementing this correctly in a token-budget batching system requires tracking client connection state through the queue without introducing lock contention in the hot path. The Rust async model (tokio) makes this tractable in ways that Python's async frameworks do not.
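The idea can be illustrated in a few lines of Python. This is a conceptual sketch only, not TEI's Rust implementation: the Pending record, the next_batch function, and the use of a cancelled asyncio future to represent a disconnected client are all assumptions made for the example.

```python
import asyncio
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Pending:
    text: str
    tokens: int
    reply: asyncio.Future  # cancelled when the client disconnects


def next_batch(queue: Deque[Pending], max_batch_tokens: int) -> List[Pending]:
    """Pop requests in FIFO order up to the token budget, dropping cancelled ones."""
    batch: List[Pending] = []
    used = 0
    while queue:
        head = queue[0]
        if head.reply.cancelled():
            queue.popleft()  # client is gone: spend no token budget on it
            continue
        if batch and used + head.tokens > max_batch_tokens:
            break  # budget exhausted; leave the rest for the next batch
        batch.append(queue.popleft())
        used += head.tokens
    return batch


def demo() -> List[str]:
    loop = asyncio.new_event_loop()
    try:
        queue: Deque[Pending] = deque(
            Pending(text, tokens, loop.create_future())
            for text, tokens in [("a", 300), ("b", 400), ("c", 200), ("d", 500)]
        )
        queue[1].reply.cancel()  # simulate the client for "b" disconnecting
        return [p.text for p in next_batch(queue, max_batch_tokens=1024)]
    finally:
        loop.close()


print(demo())  # ['a', 'c', 'd']
```

Because the abandoned 400-token request is dropped before execution, request "d" fits into the same 1024-token batch. If "b" had stayed in the queue, "d" would have waited for the next batch.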
No graph compilation step. TensorRT and similar optimizing compilers require a warmup period where the model graph is compiled for a specific batch shape. TEI with Candle avoids this by using dynamic shapes natively. The Docker container is immediately ready to serve requests at full performance without a 30-60 second compilation warmup. For serverless deployment where cold starts matter, this is the difference between a 10-second startup and a 90-second startup.
Insights
Insight One: The MTEB leaderboard (arXiv:2210.07316) has created a perverse optimization target for the embedding model ecosystem. Models are increasingly optimized for benchmark performance at the expense of inference efficiency. The top-10 MTEB models include Qwen3-Embedding-8B (7.57B parameters, MTEB rank 2) and GTE-Qwen2-7B (7.61B parameters, MTEB rank 6). Serving a 7B embedding model in production costs approximately 20x the compute of a 350M model for a 1-5% NDCG improvement on benchmark tasks. For the majority of production RAG pipelines, MTEB rank 40-80 models (300-500M parameters) provide adequate quality at a fraction of the serving cost. TEI makes the cost tradeoff explicit by listing model sizes in the supported models table alongside MTEB rank. Most teams should start at rank 40-80, not rank 1-10.
Insight Two: BGE-M3's multi-representation design (arXiv:2402.03216) exposes a fundamental limitation of any single-vector embedding serving system: dense retrieval misses exact phrase matches that sparse retrieval (BM25, SPLADE) catches, and vice versa. BGE-M3 produces three types of representations from a single forward pass: dense vectors, sparse lexical weights (for BM25-style exact matching), and ColBERT-style multi-vectors (for late interaction). TEI serves the dense representation by default. Teams that want the full BGE-M3 benefit, combining all three retrieval modes for hybrid search, need additional infrastructure beyond a single /embed endpoint. TEI is the right tool for the dense leg; the sparse and multi-vector legs require Milvus, Vespa, or Qdrant's sparse vector support.
Surprising Takeaway
TEI has supported ARM64 (aarch64) with native CUDA on Blackwell hardware with compute capability 12.1 (DGX Spark GB10) since v1.9. This is the first production-grade embedding inference server to natively support the ARM64 Grace Blackwell architecture without going through an x86 emulation layer. For teams deploying on DGX Spark clusters (the NVLink-connected 2-node setup we covered in the SnackOnAI DGX Spark session), TEI is the correct serving choice for embedding workloads: the ARM64 native path preserves the architectural advantages of Grace Blackwell's unified memory and avoids the PCIe bandwidth bottleneck that x86 emulation paths introduce. The compute capability 12.1 requirement aligns with the MIG profile and vLLM constraints we documented for that platform.
TL;DR For Engineers
TEI (huggingface/text-embeddings-inference, v1.9) is a Rust-native embedding inference server: token-based dynamic batching (default max 16384 tokens/batch), Flash Attention, cuBLASLt, Candle backend. GPU utilization 20% → 95% vs naive Python serving. 10x+ documented throughput improvement.
Three backends auto-selected by hardware: Candle (GPU, Rust-native, fast path), ORT/ONNX Runtime (CPU, Intel MKL), Python (trust_remote_code models, loses throughput advantages). Override with --backend candle|ort|python.
Three endpoints: /embed (dense vectors), /rerank (cross-encoder scores), /predict (classification). Embeddings are also available through an OpenAI-compatible route at /v1/embeddings, a drop-in replacement for the OpenAI Embeddings API.
Critical deployment parameter: --max-batch-tokens (default 16384). Set "as large as possible until compute-bound." For instruction models (E5, BGE): use --default-prompt-name query or passage, otherwise embeddings degrade silently.
Supported models (v1.9): BERT, CamemBERT, XLM-RoBERTa, JinaBERT, NomicBERT, MPNet, ModernBERT, Qwen2, Qwen3, Alibaba GTE, Mistral, Gemma3. NOT supported: CUDA compute < 7.5 (V100, Titan V, GTX 1000).
Production Embedding Serving Is an Infrastructure Problem, Not a Model Problem
TEI's contribution is separating the model problem (which embedding model to use) from the infrastructure problem (how to serve it efficiently). The MTEB leaderboard solves the model problem. TEI solves the infrastructure problem: how to serve any model from that leaderboard at production throughput with minimal latency, on any hardware from a consumer laptop CPU to a Hopper H100 or Blackwell B200 cluster.
The token-based batching design is the insight worth internalizing. It applies beyond embedding models: any inference server serving variable-length inputs should batch by token count, not request count. The GPU does not care how many requests are in a batch. It cares how many tokens it is processing. TEI made that match explicit in the serving architecture.
References
Text Embeddings Inference GitHub, Hugging Face, Apache 2.0
MTEB: Massive Text Embedding Benchmark, arXiv:2210.07316, Muennighoff et al. — the benchmark that contextualizes which models TEI supports
BGE-M3: Multi-Function Multi-Lingual Multi-Granularity Text Embeddings, arXiv:2402.03216, Chen et al. — multi-representation embedding that exposes dense-only serving limitations
Flash Attention: Fast and Memory-Efficient Attention, Dao et al. — the attention kernel TEI uses for GPU efficiency
DeepWiki: TEI Architecture Analysis — detailed codebase walkthrough
Text Embeddings Inference (TEI, huggingface/text-embeddings-inference, v1.9) is a Rust-native HTTP/gRPC inference server for embedding models, re-rankers, and sequence classifiers featuring token-based dynamic batching (default 16384 max tokens/batch, documented GPU utilization improvement from 20% to 95%), three hardware-specific backends (Candle with Flash Attention/cuBLASLt for GPU, ONNX Runtime with Intel MKL for CPU, Python PyTorch for custom models), an OpenAI-compatible embeddings API, and hardware support spanning CPU (x86/ARM64), Turing, Ampere, Ada Lovelace, Hopper, and Blackwell (including DGX Spark ARM64). It supports 12+ embedding model families from BERT to Qwen3 and Gemma3 and excludes CUDA compute < 7.5 (V100 and earlier). The MTEB benchmark (arXiv:2210.07316, 8 task categories, 58 datasets, 112 languages) provides the model quality context; BGE-M3 (arXiv:2402.03216) demonstrates the multi-representation frontier that single-vector embedding servers like TEI serve only partially (dense representation only, not sparse or multi-vector).