SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | April 19, 2026
You don't need a data center to run a trillion-parameter model. You need four Mac Studios, a Thunderbolt cable, and exo.
This is not a toy demo. Jeff Geerling's December 2025 benchmark of a 4-node M3 Ultra Mac Studio cluster running exo over RDMA delivered 31.9 tokens per second on Qwen3-235B and 32.5 tokens per second on DeepSeek V3.1-671B. The 1-trillion-parameter Kimi K2 Thinking model, which cannot run on a single node at all, reached 28.3 tokens per second across four nodes. Total system cost: just under $40,000. Power draw: under 250 watts per node. For comparison, matching the memory capacity would take 26 NVIDIA H100s, at $780,000+ before networking and datacenter infrastructure.
exo is a distributed LLM inference engine that turns any collection of consumer devices into a unified AI cluster. No master-worker hierarchy. No manual configuration. No shared VRAM requirement. Just peer-to-peer layer splitting, topology-aware placement, and an OpenAI-compatible HTTP API on every node.
This newsletter dissects exactly how it works, where the design decisions live, and where the real tradeoffs bite.
What It Actually Does
exo by exo labs (43.2k GitHub stars, Apache-2.0) solves one hard problem: LLM inference across heterogeneous consumer hardware with no manual configuration.
The canonical use case is a model too large to fit on any single device. A Llama 3.1-70B in 8-bit requires roughly 70GB of memory. A single M3 Ultra with 192GB handles it easily. A MacBook Pro with 16GB does not. But two MacBook Pros connected over Thunderbolt, each holding half the model's layers, can handle it together. exo manages this split automatically, streaming activations between devices, and exposing a single unified API to the client.
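The memory arithmetic here reduces to one line. A back-of-envelope estimator (the overhead factor is an assumption for KV cache and runtime buffers, not exo's actual accounting):

```python
def model_memory_gb(params_billions: float, bits_per_weight: int,
                    overhead: float = 1.1) -> float:
    """Rough memory estimate: parameter count times bytes per weight,
    plus an assumed ~10% for KV cache, activations, and runtime buffers."""
    weight_bytes = params_billions * 1e9 * (bits_per_weight / 8)
    return weight_bytes * overhead / 1e9

# Weights alone for a 70B model at 8-bit: roughly 70GB, as stated above
print(round(model_memory_gb(70, 8, overhead=1.0)))  # → 70
```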
What exo is not: a high-throughput production inference system optimized for server-grade batch workloads. vLLM with PagedAttention dominates there. exo is optimized for the scenario where you have multiple consumer devices and want to run models that exceed any one device's memory, with minimal setup.
Supported backends as of April 2026: MLX for Apple Silicon (primary), TinyGrad (secondary for heterogeneous setups). Linux GPU support is under active development, currently CPU-only.
The Architecture, Unpacked
exo uses a flat peer-to-peer topology. No node is designated master at setup time. Leader election happens dynamically after peer discovery. Every node runs an identical daemon that can function as coordinator, worker, or both.

Caption: Focus on the placement engine. It builds a topology graph from discovered peers, finds viable closed cycles, filters by per-node memory availability, and selects the optimal placement. The key insight: activation tensors transferred between layers are tiny (~6KB per token for Llama 3.2-3B), so network latency, not bandwidth, is the bottleneck.
Peer Discovery
Every exo node broadcasts UDP discovery packets every 2.5 seconds. Nodes that hear each other exchange capability profiles: device ID, GPU memory, CPU cores, RAM, and active interfaces. The result is a live topology graph. No DNS, no central registry, no Kubernetes. Just UDP multicast and a continuously updated neighbor table.
Cluster isolation uses the EXO_LIBP2P_NAMESPACE environment variable. Nodes only join clusters with matching namespaces, enabling multiple independent clusters on the same physical network.
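The broadcast-and-filter pattern can be sketched in a few lines. This is illustrative only: the packet schema and field names are assumptions, not exo's actual wire format.

```python
import json

NAMESPACE = "my-cluster"  # stands in for EXO_LIBP2P_NAMESPACE

def make_discovery_packet(node_id: str, ram_gb: int, interfaces: list) -> bytes:
    # Capability profile: fields are illustrative, not exo's real schema
    return json.dumps({
        "namespace": NAMESPACE,
        "node_id": node_id,
        "ram_gb": ram_gb,
        "interfaces": interfaces,
    }).encode()

def handle_packet(raw: bytes, neighbors: dict) -> None:
    msg = json.loads(raw)
    # Namespace mismatch means a different cluster on the same LAN: ignore
    if msg.get("namespace") != NAMESPACE:
        return
    neighbors[msg["node_id"]] = msg  # continuously updated neighbor table

neighbors = {}
handle_packet(make_discovery_packet("mac-studio-a", 192, ["en0"]), neighbors)
print(sorted(neighbors))  # → ['mac-studio-a']
```

In the real system these packets ride on UDP broadcast every 2.5 seconds; the sketch only shows the filter-and-record logic.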
exo supports two distinct sharding modes with fundamentally different tradeoffs:
Pipeline Parallelism: Contiguous layer ranges are assigned to each node. Node A runs layers 0-39, Node B runs layers 40-79. Activations flow sequentially from A to B. Single-request latency is slightly worse than running on one node (network overhead added). Multi-request throughput scales nearly linearly with node count. With 3 nodes, exo measured 108.8 TPS versus 49.3 TPS on a single device, 2.2x throughput for 3x nodes.
Tensor Parallelism: All nodes run all layers, but weight matrices are horizontally partitioned. For a model with hidden_size=4096 and world_size=4, each node holds a 1024-column slice of every weight matrix. Requires world_size AllReduce operations per layer, but single-request throughput improves. exo benchmarks show 1.8x speedup on 2 devices and 3.2x speedup on 4 devices via tensor parallel, matching Megatron-LM's theoretical predictions for small world sizes.
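The two shard shapes are easy to make concrete. A sketch assuming even divisibility (exo handles uneven splits; this does not), with half-open ranges:

```python
def pipeline_shards(total_layers: int, world_size: int) -> list:
    """Contiguous [start, end) layer ranges, one per node (equal split)."""
    per = total_layers // world_size
    return [(i * per, (i + 1) * per) for i in range(world_size)]

def tensor_shards(hidden_size: int, world_size: int) -> list:
    """Column slices of every weight matrix, one per node."""
    cols = hidden_size // world_size
    return [(i * cols, (i + 1) * cols) for i in range(world_size)]

# Matches the text: Node A gets layers 0-39, Node B gets layers 40-79
print(pipeline_shards(80, 2))  # → [(0, 40), (40, 80)]
# hidden_size=4096, world_size=4: each node holds a 1024-column slice
print(tensor_shards(4096, 4))  # → [(0, 1024), (1024, 2048), (2048, 3072), (3072, 4096)]
```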
The Placement Engine
The placement engine is the most novel component. Given a topology graph (nodes as vertices, communication links as weighted edges), it:
Enumerates viable cycles (closed paths visiting all participating nodes)
Filters cycles by memory: each node must hold its assigned shard in GPU memory
Selects the cycle that minimizes maximum link latency (prefers Thunderbolt over Ethernet over WiFi)
Generates backend-specific configs: host lists for Ring (TCP/IP), RDMA interface matrices for Jaccl
The placement decision is deterministic and reproducible given the same inputs, which is critical for debugging distributed inference failures.
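Step 3, minimizing the maximum link latency, is a min-over-max selection. A sketch with assumed helper names and illustrative latencies, not exo's implementation:

```python
def bottleneck_latency(cycle: list, latency: dict) -> float:
    """Worst (highest-latency) link on a closed cycle of node names."""
    links = zip(cycle, cycle[1:] + cycle[:1])  # close the loop
    return max(latency[frozenset(pair)] for pair in links)

def select_cycle(viable: list, latency: dict) -> list:
    # Prefer the cycle whose slowest link is fastest (TB > Ethernet > WiFi)
    return min(viable, key=lambda c: bottleneck_latency(c, latency))

# Illustrative one-way latencies in microseconds
latency_us = {
    frozenset({"a", "b"}): 30,    # Thunderbolt
    frozenset({"b", "c"}): 30,    # Thunderbolt
    frozenset({"a", "c"}): 2000,  # WiFi
}
viable = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
print(select_cycle(viable, latency_us))  # → ['a', 'b']
```

The three-node cycle loses here because it is forced through the WiFi hop; a real placement run would also constrain candidates by memory fit and required world size.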
Communication Backends
Ring (TCP/IP): Default fallback. Each node connects to left and right neighbors in a logical ring. Suitable for WiFi or Ethernet connected devices. Self-position binds to 0.0.0.0 (listen on all interfaces). Non-neighbors get RFC 5737 TEST-NET-2 placeholder addresses.
Jaccl (RDMA via Thunderbolt 5): The high-performance path. Requires macOS 26.2 or later with RDMA enabled in recovery mode. Uses a N×N RDMA interface matrix: matrix[i][j] specifies the RDMA interface on device i that connects to device j. Diagonal is always None. Latency drops from ~300 microseconds (TCP) to 5-50 microseconds (RDMA). This is the path that makes distributed inference feel like a single machine.
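The interface matrix described above is straightforward to construct. A sketch with assumed interface names (macOS Thunderbolt bridges typically appear as bridgeN, but the mapping here is hypothetical):

```python
def build_jaccl_matrix(nodes: list, iface: dict) -> list:
    """matrix[i][j] = RDMA interface name on node i that reaches node j.
    The diagonal stays None, matching the spec described above."""
    n = len(nodes)
    return [[None if i == j else iface[(nodes[i], nodes[j])]
             for j in range(n)] for i in range(n)]

# Two Macs linked over a Thunderbolt bridge (interface names assumed)
iface = {("a", "b"): "bridge0", ("b", "a"): "bridge0"}
m = build_jaccl_matrix(["a", "b"], iface)
print(m)  # → [[None, 'bridge0'], ['bridge0', None]]
```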
The Code, Annotated
Snippet One: API flow and shard routing
# From src/exo/master/api.py (simplified for clarity)
async def handle_post_chat_completions(request: web.Request):
    data = await request.json()
    chat_request = ChatCompletionRequest(**data)

    # 1. Resolve tokenizer from model registry
    # ← The model_id maps to a HuggingFace repo; tokenizer downloaded on first use
    tokenizer = await resolve_tokenizer(chat_request.model)

    # 2. Apply chat template to raw messages
    # ← This is where system prompts, user turns, and assistant turns get formatted
    prompt = tokenizer.apply_chat_template(
        chat_request.messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # 3. Generate a unique request ID for tracking across nodes
    request_id = str(uuid.uuid4())

    # 4. Route to the inference orchestrator
    # ← This is the key dispatch: if this node owns the first shard, run locally
    # ← If not, forward via gRPC to the node that does
    async for token in node.process_prompt(
        shard=model_shard,  # which layers this model instance covers
        prompt=prompt,
        request_id=request_id,
        inference_state=None,
    ):
        # Stream tokens back as SSE (Server-Sent Events)
        # ← Each token arrives as it's sampled, not all at once
        yield format_chat_completion_chunk(token, chat_request.model)
Caption: The routing decision in process_prompt is the architectural heart: local execution if this node owns the first shard, gRPC forwarding otherwise. This flat dispatch means any node can receive the API request regardless of which node holds the first layer.
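On the client side, any OpenAI-style tooling works against this endpoint. A minimal sketch of building the request body and parsing one streamed SSE chunk, assuming the standard OpenAI streaming shape:

```python
import json

API_URL = "http://localhost:52415/v1/chat/completions"  # any node can serve this

def chat_payload(model: str, user_msg: str, stream: bool = True) -> dict:
    # Standard OpenAI-style chat body; exo accepts the same shape
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "stream": stream,
    }

def parse_sse_chunk(line: str):
    """Extract token text from one Server-Sent Events line, assuming the
    OpenAI streaming format: 'data: {json}' lines plus a 'data: [DONE]' sentinel."""
    if not line.startswith("data: ") or line == "data: [DONE]":
        return None
    delta = json.loads(line[len("data: "):])["choices"][0]["delta"]
    return delta.get("content")

print(parse_sse_chunk('data: {"choices": [{"delta": {"content": "Hello"}}]}'))  # → Hello
```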
Snippet Two: Placement engine, topology to shard assignment
# From src/exo/master/placement.py (simplified)
def place_instance(
    model_id: str,
    topology: DeviceTopology,    # live graph of discovered peers + their resources
    strategy: str = "pipeline",  # "pipeline" | "tensor"
) -> Instance:
    # Step 1: Find all closed cycles in the topology graph
    # ← A "cycle" is any closed path through a subset of nodes
    # ← exo prefers cycles that traverse high-bandwidth links (Thunderbolt > Ethernet > WiFi)
    cycles = find_viable_cycles(topology)

    # Step 2: Filter by memory constraint
    # ← Each node in the cycle must have enough free GPU RAM for its shard
    # ← Pipeline: each node holds (total_layers / world_size) layers
    # ← Tensor: each node holds (hidden_size / world_size) columns of every layer
    viable = [c for c in cycles if memory_fits(c, model_id, strategy)]

    # Step 3: Select optimal cycle
    # ← Prefer cycles where the bottleneck link has lowest latency
    # ← "Leaf node preference": prefer cycles where the most memory-constrained
    #    node is at the end of the pipeline (minimizes stall waiting)
    optimal = select_cycle(viable, topology)  # ← THIS is the topology-aware scheduler

    # Step 4: Generate shard assignments
    shards = assign_shards(optimal, model_id, strategy)

    # Step 5: Generate backend config
    if strategy == "tensor" and all_rdma_capable(optimal, topology):
        # ← Build N×N RDMA matrix: matrix[i][j] = interface name on node i → node j
        backend_config = build_jaccl_config(optimal, topology)
    else:
        # ← Build neighbor list for ring TCP communication
        backend_config = build_ring_config(optimal, topology)

    return Instance(shards=shards, backend=backend_config)
Caption: The cycle-finding step is where exo's topology-awareness lives. Rather than assuming a flat mesh, it models the actual communication graph and finds cycles that minimize the worst-case inter-node link. This is why Thunderbolt links get preferred over WiFi when both are available.
exo in Action: End-to-End Worked Example
Setup: Two devices, one MacBook Pro (M3 Pro, 36GB), one Mac Studio (M3 Ultra, 192GB). Connected via Thunderbolt 5. Running DeepSeek-R1-Distill-Qwen-32B (8-bit, ~34GB).
Step 1: Node startup and discovery
# On Mac Studio (node A):
uv run exo
# Output:
# [exo] Starting exo v1.0.69
# [exo] Listening on 0.0.0.0:52415
# [exo] Broadcasting discovery on UDP 52415
# [exo] No peers discovered yet, running standalone
# On MacBook Pro (node B), 8 seconds later:
uv run exo
# Output:
# [exo] Starting exo v1.0.69
# [exo] Peer discovered: mac-studio-a (M3 Ultra, 192GB, TB5)
# [exo] Elected coordinator: mac-studio-a
# [exo] Cluster ready: 2 nodes, 228GB combined memory
Step 2: Preview placement for the model
curl "http://localhost:52415/instance/previews?model_id=deepseek-r1-distill-qwen-32b"
Response (simplified):
{
  "previews": [
    {
      "model_id": "mlx-community/DeepSeek-R1-Distill-Qwen-32B-8bit",
      "sharding": "Pipeline",
      "instance_meta": "MlxRing",
      "memory_delta_by_node": {
        "mac-studio-a": 22548480000,
        "macbook-pro-b": 12482048000
      }
    }
  ]
}
The placement engine assigned 64 layers to the Mac Studio (layers 0-63, ~21GB) and 36 layers to the MacBook Pro (layers 64-99, ~12GB). Memory allocation is proportional to available RAM, not equal.
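A proportional split can be sketched with largest-remainder rounding. The rounding scheme and the free-memory inputs are assumptions, not exo's exact algorithm, but with inputs close to the preview's memory deltas it reproduces the 64/36 assignment:

```python
def proportional_split(total_layers: int, free_mem_gb: list) -> list:
    """Layer counts proportional to each node's usable memory,
    rounded with the largest-remainder method (assumed scheme)."""
    total = sum(free_mem_gb)
    raw = [total_layers * m / total for m in free_mem_gb]
    counts = [int(x) for x in raw]
    # Hand leftover layers to the nodes with the largest fractional parts
    leftovers = total_layers - sum(counts)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return counts

# Usable-memory figures chosen to mirror the preview's ~21GB/~12GB deltas
print(proportional_split(100, [22.5, 12.5]))  # → [64, 36]
```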
Step 3: Create instance and send request
# Create instance
curl -X POST http://localhost:52415/instance \
-H 'Content-Type: application/json' \
-d '{"instance": {...placement from step 2...}}'
# Query the model
curl -N -X POST http://localhost:52415/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "mlx-community/DeepSeek-R1-Distill-Qwen-32B-8bit",
"messages": [{"role": "user", "content": "Explain quantum entanglement in 50 words."}],
"stream": true
}'
Step 4: Token flow through the pipeline
Input prompt: "Explain quantum entanglement in 50 words."
Tokens: [128000, 849, 23648, 31874, 76861, 27, 220, 1135, 4339, 13]
(10 prompt tokens)
Node A (Mac Studio) execution:
- Embedding lookup: 10 tokens → hidden states (10 × 4096 float16)
- Forward pass layers 0-63: ~340ms (MLX on M3 Ultra GPU)
- Output activation: shape (10, 4096) = 40,960 values × 2 bytes = 80KB
- Thunderbolt 5 transfer to Node B: <1ms (RDMA) or ~15ms (TCP fallback)
Node B (MacBook Pro) execution:
- Receive activation: (10, 4096) hidden states
- Forward pass layers 64-99: ~190ms (MLX on M3 Pro GPU)
- Final layer norm + LM head: logits shape (10, 65536 vocab)
- Sampling (temperature=0.6): next token = 51272 ("Quantum")
- Broadcast sampled token to Node A
- Repeat for next token
Total first-token latency: ~545ms (dominated by forward pass compute)
Subsequent tokens: ~110ms each (~9 tokens/sec)
Single-node reference (Mac Studio only, model fits): ~14 tokens/sec
Two-node cluster via TCP: ~7.5 tokens/sec (network overhead penalty on single requests)
Two-node cluster via RDMA: ~11 tokens/sec (approaching single-node, worth it for larger models)
Key observation: For this model size (32B), a single Mac Studio is faster on single requests. The two-node setup only wins when the model doesn't fit on one device, or when handling multiple concurrent requests where pipeline parallelism utilizes both nodes simultaneously.
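The timing above decomposes into per-stage compute plus inter-stage hops. A toy model using the worked-example numbers; the ~15ms gap to the measured 545ms would be embedding, sampling, and token-broadcast overhead, which this sketch ignores:

```python
def first_token_latency_ms(stage_compute_ms: list, hop_ms: float) -> float:
    """Prefill: activations traverse every pipeline stage once, in order,
    paying one inter-node hop per stage boundary."""
    n_hops = len(stage_compute_ms) - 1
    return sum(stage_compute_ms) + n_hops * hop_ms

def decode_tps(per_token_ms: float) -> float:
    """Steady-state tokens per second from per-token decode latency."""
    return 1000 / per_token_ms

# 340ms on Node A, 190ms on Node B, ~1ms RDMA hop (values from the example)
print(round(first_token_latency_ms([340, 190], hop_ms=1)))  # → 531
print(round(decode_tps(110), 1))                            # → 9.1
```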
Why This Design Works (and What It Trades Away)
The flat P2P architecture eliminates the single point of failure of a central coordinator. Any node can receive API requests. Any node can be elected leader. The system degrades gracefully if a non-leader node leaves: the remaining cluster re-runs placement and adjusts shards. If the leader leaves, a new leader is elected.
The topology-aware placement is the genuine systems contribution. Most distributed inference frameworks assume homogeneous hardware and uniform network connections. exo models the actual graph: Thunderbolt 5 links (~80 Gb/s, ~5-50µs RDMA latency), 10GbE (~10 Gb/s, ~100µs), WiFi (~1 Gb/s, ~1-5ms). The placement engine finds the highest-bandwidth closed cycle and assigns larger shards to nodes with faster inter-node links. This matters enormously in heterogeneous setups where the bottleneck shifts to the weakest link.
The activation transfer insight is underappreciated: for Llama-3.2-3B, activations between pipeline stages are roughly 6KB per token. Network bandwidth is almost never the bottleneck. Latency is. This is why RDMA over Thunderbolt (5-50µs) unlocks scaling that TCP (300µs) cannot achieve, even though both have ample bandwidth for payloads this small.
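The latency-versus-bandwidth claim is checkable with one formula: transfer time equals propagation latency plus payload divided by bandwidth. Illustrative numbers, roughly matching the link classes discussed above:

```python
def transfer_time_us(payload_bytes: int, bandwidth_gbps: float,
                     latency_us: float) -> float:
    """One-way transfer time: fixed latency + serialization time.
    Gb/s equals bits per microsecond times 1000."""
    serialize_us = payload_bytes * 8 / (bandwidth_gbps * 1000)
    return latency_us + serialize_us

act = 4 * 1024  # a small activation payload, a few KB
# WiFi: modest bandwidth, millisecond-class latency → latency dominates
print(round(transfer_time_us(act, 1.0, 2000)))  # → 2033
# TCP over Thunderbolt: huge bandwidth, ~300µs latency → still latency-bound
print(round(transfer_time_us(act, 80.0, 300)))  # → 300
# RDMA over Thunderbolt: same bandwidth, ~20µs latency
print(round(transfer_time_us(act, 80.0, 20)))   # → 20
```

Serialization of a few kilobytes is sub-microsecond on Thunderbolt and tens of microseconds even on WiFi; the fixed latency term dwarfs it in every case.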
What exo trades away:
Single-request latency on distributed setups. Pipeline parallelism is inherently sequential. When a request flows through Node A then Node B, each node waits for the previous to finish. For single-user interactive inference where the model fits on one device, running locally is faster. The exo team measured 49.3 TPS single-node versus 44.4 TPS on 2 nodes and 39.7 TPS on 3 nodes for Llama-3.2-3B on M4 Pro hardware.
Production readiness. Linux GPU support is under development. The RDMA stack requires macOS 26.2 or later. There is no equivalent to vLLM's PagedAttention for KV cache management, no speculative decoding, no continuous batching. exo is production-ready for Mac-based workloads at moderate concurrency, not enterprise multi-tenant serving.
Fault tolerance. If a node holding a shard disconnects mid-inference, the request fails. There is no redundant shard execution or checkpoint/restart for inference sessions. This is a research and development tool, not a five-nines infrastructure component.
Technical Moats
What makes exo hard to replicate:
The RDMA over Thunderbolt integration is the sharpest moat. Apple's RDMA support in macOS 26.2 is new (December 2025), documented only in TN3205. Integrating it into a distributed inference engine requires kernel-level interface management, RDMA interface matrix construction, and careful coordination of memory-mapped buffers across device boundaries. exo was the first open-source inference engine to ship day-0 RDMA support, before any competing tool had RDMA integration.
The topology-aware placement engine requires modeling heterogeneous network graphs with real latency measurements, not estimated values. The cycle-finding algorithm must be efficient enough to run in real time as the topology changes (nodes join or leave). Most academic distributed systems papers assume static topologies. exo handles dynamic membership.
The MLX integration is deep. MLX's distributed communication primitives (mlx.distributed.all_reduce, mlx.distributed.all_gather) are used for tensor parallelism AllReduce operations. This requires tight coupling between exo's sharding logic and MLX's internal graph execution model. Building the same depth of integration with CUDA would require significant CUDA-aware MPI work, which is why the Linux CUDA backend is still under development.
Insights
Insight One: Adding more devices to exo makes single-request performance worse, and the documentation buries this critical fact.
The benchmark table in exo's own blog post shows it clearly: single-request TPS on Llama-3.2-3B drops from 49.3 TPS on one M4 Pro to 39.7 TPS on three M4 Pros. Every added node adds a network round-trip latency to the autoregressive decode loop. For single-user interactive inference where the model fits on one device, the correct configuration is one device. The multi-device setup only wins in two scenarios: the model is too large for any single device, or you are processing multiple concurrent requests where pipeline parallelism utilizes different stages simultaneously. The community's enthusiasm for "more devices = faster AI" is directly contradicted by the project's own benchmarks.
Insight Two: The exo architecture's real bottleneck is not network bandwidth. It is the autoregressive decode loop, and no amount of hardware upgrades fixes this without algorithm changes.
exo ships real-time RDMA over Thunderbolt 5 with 80 Gb/s bidirectional bandwidth. The activation tensors transferred between pipeline stages are roughly 6KB per token for a 3B model, scaling to roughly 16KB per token for a 70B model (hidden size 8192, float16). Even at 1 Gb/s WiFi, 16KB takes about 0.13ms. At 80 Gb/s Thunderbolt, under 2µs. The difference is real but not decisive. What actually limits tokens per second is the sequential nature of autoregressive decoding: every token requires one complete forward pass through all layers, in order, one layer at a time. No parallelism strategy eliminates this serial dependency. Tensor parallelism reduces per-layer compute time by distributing weight matrix multiplications, which is why it improves single-request latency (1.8x on 2 devices, 3.2x on 4 devices) in ways that pipeline parallelism cannot. The community focus on "better networking" as the path to faster local inference is partly misdirected. The real path is speculative decoding with a small draft model, which exo does not yet implement.
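The argument can be made concrete with a toy per-token time model. All constants here are assumed for illustration, not measured:

```python
def pipeline_token_ms(layer_ms: float, n_layers: int,
                      n_nodes: int, hop_ms: float) -> float:
    """Pipeline parallel decode: full layer compute is still serial for one
    request, plus one network hop per stage boundary per token."""
    return n_layers * layer_ms + (n_nodes - 1) * hop_ms

def tensor_token_ms(layer_ms: float, n_layers: int,
                    n_nodes: int, allreduce_ms: float) -> float:
    """Tensor parallel decode: per-layer compute divided across nodes,
    but an AllReduce is paid on every layer."""
    return n_layers * (layer_ms / n_nodes + allreduce_ms)

# Toy constants: 40 layers, 2ms/layer, 0.3ms TCP hop, 0.05ms RDMA AllReduce
single_node_ms = 40 * 2.0                              # 80ms baseline
print(round(pipeline_token_ms(2.0, 40, 4, 0.3), 1))    # → 80.9 (no faster than 1 node)
print(round(tensor_token_ms(2.0, 40, 4, 0.05), 1))     # → 22.0 (~3.6x faster)
```

Pipeline parallelism adds hops without shrinking the serial compute; tensor parallelism shrinks it, which is exactly the asymmetry the benchmarks show.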
Takeaway
The activation tensor transferred between pipeline stages for Llama-3.2-3B is smaller than a typical JPEG thumbnail: roughly 6KB per decoded token.
This is the number that breaks every intuition about distributed inference. Engineers assume that splitting a neural network across machines requires moving huge amounts of data between nodes. The reality: a transformer's intermediate activation tensor at the layer boundary has shape (sequence_length, hidden_size). For Llama-3.2-3B with sequence length 8 and hidden size 3072, that is 8 × 3072 × 2 bytes (float16) = 49,152 bytes, roughly 48KB for an 8-token sequence. For single-token autoregressive decode, it is 1 × 3072 × 2 = 6,144 bytes. 6KB. This is why exo can work over WiFi at all. The network latency of sending 6KB is negligible compared to the GPU compute time of running 40 transformer layers. The bottleneck is always the compute, never the transfer, for any reasonable inter-device connection faster than 10 Mb/s.
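The byte arithmetic above generalizes to a one-liner worth keeping around, using only the figures given in the text:

```python
def activation_bytes(seq_len: int, hidden_size: int,
                     bytes_per_elem: int = 2) -> int:
    """Size of the (seq_len, hidden_size) tensor crossing a pipeline
    stage boundary; bytes_per_elem=2 for float16."""
    return seq_len * hidden_size * bytes_per_elem

# Llama-3.2-3B, hidden size 3072, float16
print(activation_bytes(8, 3072))  # → 49152 (8-token prefill, ~48KB)
print(activation_bytes(1, 3072))  # → 6144  (single-token decode, 6KB)
```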
TL;DR For Engineers
exo is a flat P2P distributed inference engine that splits transformer layers across consumer devices, using topology-aware placement to optimize for actual network graph topology, not assumed uniform connectivity
The activation tensors transferred between pipeline stages are tiny, 6KB per token for a 3B model, which is why network bandwidth rarely matters and latency is the only metric that does
Single-request performance degrades with more nodes in pipeline parallel mode (49.3 TPS on 1 node vs 39.7 on 3 nodes for Llama-3.2-3B). Multi-request throughput scales to 2.2x on 3 nodes. Use exo for models that exceed single-device memory, or for batch/parallel workloads
Tensor parallelism (all nodes run all layers, weights partitioned) achieves 1.8x speedup on 2 devices and 3.2x on 4 devices for single-request latency, at the cost of AllReduce communication overhead per layer
RDMA over Thunderbolt 5 (macOS 26.2) drops inter-node latency from 300µs to 5-50µs, enabling a 4-node Mac Studio cluster to run Qwen3-235B at 31.9 tok/s vs 15.2 tok/s for llama.cpp (TCP) on the same hardware
The Cluster Is Already in Your Rack
exo does not solve the hard problem of production-grade distributed LLM serving. vLLM solves that. What exo solves is the adjacent problem that nobody else had properly addressed: taking the devices a small team or research lab already owns, wiring them together with whatever cables are at hand, and making a model run that would be physically impossible on any single device. The 1-trillion-parameter Kimi K2 Thinking model running at 28 tokens per second on four Mac Studios connected with Thunderbolt cables is not a datacenter achievement. It is a dorm-room achievement scaled by a few orders of magnitude and a kernel-level RDMA driver. The engineering discipline is real. The tradeoffs are honest. And the fact that a single terminal command causes devices to discover each other, elect a leader, partition a model, and begin serving an OpenAI-compatible API with zero manual configuration is, by any reasonable measure, the kind of systems software that does not get built often.
References
exo GitHub Repository — 43.2k stars, Apache-2.0, v1.0.69 latest
exo Labs Blog: Transparent Benchmarks (Day 1) — pipeline parallelism benchmarks, M4 Pro cluster results
Jeff Geerling: 1.5TB VRAM on Mac Studio, RDMA over Thunderbolt 5 — real-world 4-node cluster benchmarks
Apple TN3205: RDMA over Thunderbolt — macOS 26.2 RDMA documentation
exo Placement Engine, DeepWiki — placement algorithm and tensor/pipeline shard types
GPipe: Efficient Training via Pipeline Parallelism, arXiv 1811.06965 — foundational pipeline parallelism paper
Megatron-LM: Tensor Parallelism, arXiv 1909.08053 — tensor parallel AllReduce communication model
Orca: Distributed Transformer Serving, arXiv 2308.16369 — production distributed inference for comparison
PagedAttention (vLLM), arXiv 2309.06180 — KV cache management absent from exo
Collaborative Edge Inference for LLMs, arXiv 2405.14371 — academic foundation for edge inference
Alpa: Inter and Intra-Operator Parallelism, arXiv 2201.12023 — automated distributed parallelism strategy search
Neurosurgeon: Cloud and Mobile Edge Co-Inference, arXiv 1703.05460 — model partitioning across heterogeneous devices
exo is a flat peer-to-peer distributed inference engine that partitions transformer model layers across consumer devices using topology-aware placement, automatic peer discovery via UDP broadcast, and RDMA over Thunderbolt 5 for sub-50µs inter-node latency on macOS 26.2. A 4-node M3 Ultra Mac Studio cluster running exo with RDMA achieves 31.9 tok/s on Qwen3-235B and 32.5 tok/s on DeepSeek V3.1-671B versus 15.2 and 14.6 tok/s for llama.cpp (TCP) on the same hardware. The key engineering insight is that activation tensors at pipeline stage boundaries are roughly 6KB per token, making network latency, not bandwidth, the binding constraint, and explaining why RDMA's 6x latency reduction (300µs to 50µs) unlocks scaling that higher-bandwidth TCP connections cannot. Critical tradeoffs: single-request performance degrades with more pipeline nodes; Linux GPU support is still CPU-only; no speculative decoding; no PagedAttention-style KV cache management.
Sponsored Ad
If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀
Accio Work: Your Business, On Autopilot
Meet Accio Work, the agentic workspace designed to run your business operations end to end. From sourcing products and negotiating with suppliers to managing your store and launching marketing campaigns, Accio Work handles the execution so you don’t have to.
Powered by verified capabilities and deep integrations with business tools, it doesn’t just generate ideas, it takes action. Backed by Alibaba.com’s global supplier network and over 1B products, it seamlessly connects strategy to execution.
Stay in control while everything runs on autopilot.


