SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | April 26, 2026
Every developer who has tried to run a local LLM from scratch has hit the same wall: GGUF files to find, quantization formats to decode, GPU offloading flags to tune, chat templates to configure manually, and a llama.cpp binary that requires cmake to compile with CUDA support. The first three hours are not inference. They are archaeology. Ollama exists to eliminate that wall entirely. One command installs it. One command pulls a model. One command starts a local API server that your existing OpenAI-compatible code can call with a URL change.
This newsletter dissects Ollama not as a beginner tool but as an engineered system: what the Go server layer actually does, how the GGUF-based layer offloading scheduler works, what the OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS environment variables expose about its concurrency model, and where the benchmarks from arXiv:2511.05502 and arXiv:2511.07425 show it trading throughput for ergonomics.
Scope: Ollama's architecture, GGUF format, quantization selection (informed by arXiv:2601.14277), performance on Apple Silicon and single-board computers, and its real tradeoffs vs. llama.cpp directly. Not covered: fine-tuning, vLLM, or multimodal pipelines beyond Ollama's built-in vision support.
What It Actually Does
Ollama is a local LLM runtime with 162,000 GitHub stars and 14,600 forks. Written in Go, it wraps llama.cpp's inference engine behind a model management layer modeled explicitly on Docker's workflow: ollama pull, ollama run, ollama ps, ollama rm. Every model is a versioned artifact with a content-addressed blob store. The server exposes an OpenAI-compatible REST API on port 11434 by default.
What Ollama is not: it is not a high-throughput serving engine. Under its Go HTTP layer, every inference request passes to a llama.cpp subprocess (or the newer Ollama engine subprocess when OLLAMA_NEW_ENGINE=1 is set). That subprocess does the actual tensor computation via GGML on whatever hardware is available: CUDA, Metal, Vulkan, ROCm, or CPU. The Go layer handles scheduling, model lifecycle, GPU memory estimation, and the OpenAI API translation.
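To make the layering concrete, here is a minimal sketch that talks to the native API surface over plain HTTP, assuming a local server on the default port and an already-pulled llama3.2 tag (field names follow Ollama's documented native API; verify against your installed version):
# Minimal sketch: Ollama's native API, assuming a local server on port 11434
# and a pulled llama3.2 model. Field names per Ollama's documented API;
# verify against your installed version.
import requests

BASE = "http://localhost:11434"

# List locally available models (the API equivalent of `ollama list`).
for m in requests.get(f"{BASE}/api/tags").json().get("models", []):
    print(m["name"], m.get("size"))

# One-shot generation against the native endpoint (not the OpenAI-compatible one).
resp = requests.post(
    f"{BASE}/api/generate",
    json={"model": "llama3.2", "prompt": "One sentence on GGUF.", "stream": False},
).json()
print(resp.get("response"))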
A 2025 benchmark on Apple Silicon (arXiv:2511.05502) across MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS using Qwen-2.5 across prompts from a few hundred to 100K tokens found:
MLX achieves the highest sustained generation throughput
MLC-LLM delivers the lowest time-to-first-token for moderate prompt sizes
llama.cpp is the most efficient for single-stream use
Ollama "emphasizes developer ergonomics but lags in throughput and TTFT"
This is the honest benchmark. Ollama is not the fastest. It is the most deployable. These are different optimizations, and the community regularly conflates them.
The Architecture, Unpacked

Focus on the scheduler layer. This is where Ollama's ergonomics are implemented: automatic GPU memory estimation, layer distribution across CPU and GPU, and model keep-alive. Everything below the scheduler is llama.cpp. Everything above it is Docker-style UX for LLMs.
The key architectural choice is the subprocess model. Ollama's Go server launches the runner as a separate process and communicates with it via HTTP on a local port. This provides isolation: a segfaulting model inference does not crash the management server. It also adds latency: every inference request passes through a Go HTTP handler, serialization, subprocess HTTP call, deserialization, and response path before a token appears.
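That request path shows up directly as time-to-first-token. A rough way to observe it, sketched with the OpenAI SDK against the local endpoint (the model tag and prompt are placeholders, and chunk counts are only a proxy for tokens):
# Sketch: measure TTFT and a rough decode rate through the OpenAI-compatible
# streaming endpoint. Chunk count approximates token count; numbers vary
# with hardware and model.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

start = time.perf_counter()
first = None
chunks = 0
stream = client.chat.completions.create(
    model="llama3.2",  # placeholder: any pulled model tag
    messages=[{"role": "user", "content": "Explain KV caches in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()
        chunks += 1
elapsed = time.perf_counter() - start
if first is not None:
    print(f"TTFT ~{(first - start) * 1e3:.0f} ms, ~{chunks / elapsed:.1f} chunks/s")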
The Code
Snippet One: CLI to REST API to Modelfile (complete workflow)
# Install: single binary, no Python, no conda, no cmake
# macOS: brew install ollama
# Linux: curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from ollama.com
# ← No GPU driver manual setup required. Ollama auto-detects CUDA, ROCm, Metal.
# Pull a model: Docker-style, versioned, content-addressed
# ← Behind the scenes: Ollama resolves the manifest from ollama.com registry,
# downloads GGUF blob shards to ~/.ollama/models/blobs/ (SHA256-addressed),
# writes manifest to ~/.ollama/models/manifests/registry.ollama.ai/library/llama3/
ollama pull llama3.2
# Run interactively (foreground inference, streaming output)
ollama run llama3.2
# ← Checks if model is loaded: if yes, reuses (OLLAMA_KEEP_ALIVE=5m by default)
# ← If not loaded: estimates VRAM, distributes layers, loads into memory
# Run as API server (background, keeps running)
ollama serve
# ← Server at http://127.0.0.1:11434 — REST API for all models
# Query via OpenAI-compatible API (zero code change from OpenAI SDK usage)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Explain GGUF quantization in one paragraph."}],
        "stream": true
      }'
# ← OpenAI-compatible: change base_url in your existing openai Python client, done.
# ← Returns SSE stream of tokens identical to OpenAI's streaming response format
# Inspect what is loaded and on which hardware
ollama ps
# Output:
# NAME               ID              SIZE     PROCESSOR   UNTIL
# llama3.2:latest    a80c4f17acd5    4.0 GB   100% GPU    4 minutes from now
# ← "100% GPU" means all transformer layers fit in VRAM
# ← "X% GPU / Y% CPU" means split: X% of layers on GPU, rest on CPU
# Check GPU utilization to verify actual GPU usage
nvidia-smi # or: watch -n 1 nvidia-smi
# ← GPU passthrough silently falls back to CPU if toolkit is missing
# ← Always verify with ollama ps before trusting GPU acceleration
The Docker analogy is not decorative. Content-addressed blob storage, manifests, versioned tags, pull/run/ps/rm commands: this is a deliberate UX choice that makes LLM management familiar to any developer who has used Docker.
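You can see the content addressing directly on disk. A sketch that walks the default store layout described above (the OCI-style manifest fields shown are an assumption and may shift between Ollama versions):
# Sketch: walk the local Ollama store to see content addressing in action.
# Assumes the default layout under ~/.ollama/models; manifest fields follow
# an OCI-style schema and may differ between Ollama versions.
import json
from pathlib import Path

store = Path.home() / ".ollama" / "models"
manifests = store / "manifests"

if manifests.exists():
    for manifest in manifests.rglob("*"):
        if not manifest.is_file():
            continue
        data = json.loads(manifest.read_text())
        print(manifest.relative_to(manifests))
        for layer in data.get("layers", []):
            # Each layer is a SHA256-addressed blob: GGUF weights, template, params, ...
            print("  ", layer.get("mediaType"), layer.get("digest"), layer.get("size"))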
Snippet Two: Modelfile and Environment Configuration (production tuning)
# Modelfile: the Dockerfile equivalent for Ollama models
# Defines the model, parameters, system prompt, and chat template
FROM llama3.2
# ← num_ctx: KV cache size in tokens. Default is model-defined (often 2048-4096).
# Larger context = more VRAM used for KV cache. 8192 uses ~1GB extra on a 7B model.
PARAMETER num_ctx 8192
# ← temperature: 0.0 = deterministic, 1.0 = default, >1.0 = more random
PARAMETER temperature 0.7
# ← num_predict: max tokens to generate per response. -1 = unlimited.
PARAMETER num_predict 512
# ← num_gpu: number of MODEL LAYERS to run on GPU (NOT number of GPUs)
# Llama-3.2-3B has 28 layers. num_gpu 28 = all on GPU. num_gpu 20 = 20 on GPU, 8 on CPU.
# ← THIS is the most important performance parameter for VRAM-constrained systems
PARAMETER num_gpu 28
# System prompt embedded in every request
SYSTEM """
You are a helpful coding assistant. Be concise and precise.
"""
# Build and run the custom model
# ollama create coding-assistant -f ./Modelfile
# ollama run coding-assistant
# Key environment variables for production Ollama deployments:
# ← OLLAMA_NUM_PARALLEL: concurrent inference requests per model instance
# Default is 1 (sequential). Set higher for multi-user serving.
# Warning from arXiv:2511.05502: higher parallelism reduces per-request throughput
# on Apple Silicon due to memory bandwidth saturation.
export OLLAMA_NUM_PARALLEL=4
# ← OLLAMA_MAX_LOADED_MODELS: how many models stay in memory simultaneously
# Default is 1 on GPU (to avoid VRAM fragmentation), 3 on CPU
# Each loaded model consumes its full VRAM footprint while hot.
export OLLAMA_MAX_LOADED_MODELS=2
# ← OLLAMA_KEEP_ALIVE: how long an idle model stays loaded before eviction
# Default: 5m. Set to "0" to unload immediately after each request.
# Set to "-1" to never unload (maximizes reuse, consumes VRAM permanently).
export OLLAMA_KEEP_ALIVE=10m
# ← OLLAMA_FLASH_ATTENTION: enable FlashAttention for longer context windows
# Disabled by default. Enable for contexts > 4K tokens for memory savings.
export OLLAMA_FLASH_ATTENTION=1
# ← OLLAMA_NEW_ENGINE: use Ollama's Go-native runner instead of llama.cpp subprocess
# Newer architecture, better overlap of batch prep and GPU execution.
# Use for supported model architectures; falls back to llama.cpp otherwise.
export OLLAMA_NEW_ENGINE=1
# Docker deployment (GPU access requires nvidia-container-toolkit)
# ← --gpus all is REQUIRED for GPU passthrough. Silently falls back to CPU if missing.
# ← -v ollama:/root/.ollama persists models across container restarts.
docker run -d \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS are the two variables that determine Ollama's behavior in multi-user scenarios. Both default to conservative values. Both need explicit tuning for anything resembling production multi-user load.
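A quick way to see what the parallelism setting buys (or costs) is to fire the same model with a handful of concurrent requests and compare wall-clock time against the sequential case. A sketch, with the model tag and request count as placeholders:
# Sketch: send N concurrent requests to one model and compare wall-clock time
# against the sequential case. With OLLAMA_NUM_PARALLEL=1 requests queue;
# with a higher value they share the batch (and per-request throughput drops).
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def ask(i: int) -> float:
    t0 = time.perf_counter()
    client.chat.completions.create(
        model="llama3.2",  # placeholder model tag
        messages=[{"role": "user", "content": f"Give me fact #{i} about GGUF."}],
        max_tokens=128,
    )
    return time.perf_counter() - t0

with ThreadPoolExecutor(max_workers=4) as pool:
    t0 = time.perf_counter()
    latencies = list(pool.map(ask, range(4)))
print(f"wall clock {time.perf_counter() - t0:.1f}s, per-request {latencies}")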
It In Action: End-to-End Worked Example
Scenario: Deploy Qwen-2.5-7B-Instruct locally on three hardware profiles and reproduce the quantization tradeoff data from arXiv:2601.14277.
Input: Qwen-2.5-7B-Instruct. Deployment targets: RTX 4090 (24GB VRAM), M2 MacBook Pro (16GB unified memory), Raspberry Pi 5 (8GB RAM).
Step 1: Check available hardware and choose quantization
# On GPU machine: check VRAM available
nvidia-smi --query-gpu=memory.free,memory.total --format=csv,noheader
# Example: 22050 MiB free of 24576 MiB total
# Quantization decision table (from arXiv:2601.14277, Llama-3.1-8B-Instruct):
# Format    Size     Perplexity   MMLU    Memory   Throughput (CPU)
# Q8_0      8.5GB    6.14         0.683   8.5GB    ~15 tok/s
# Q6_K      6.6GB    6.18         0.681   6.6GB    ~18 tok/s
# Q5_K_M    5.7GB    6.20         0.680   5.7GB    ~20 tok/s   ← sweet spot
# Q4_K_M    4.7GB    6.26         0.676   4.7GB    ~24 tok/s   ← default Ollama choice
# Q3_K_M    3.9GB    6.54         0.663   3.9GB    ~28 tok/s
# Q2_K      2.9GB    7.35         0.614   2.9GB    ~35 tok/s   ← noticeable quality drop
# ← THE KEY FINDING from arXiv:2601.14277:
# Q4_K_M → Q5_K_M: +1GB memory, <0.5% perplexity improvement. Usually worth it.
# Q5_K_M → Q6_K: +0.9GB memory, <0.3% perplexity improvement. Marginal.
# Q4_K_M is the correct default for VRAM-constrained deployment.
# Q3_K_M and below: noticeable quality degradation. Avoid unless hardware forces it.
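That decision table collapses into a small selection rule: take the highest-fidelity quantization whose footprint, plus KV-cache headroom, fits free memory. A sketch using the table's Llama-3.1-8B figures (the sizes are illustrative rather than general constants, and the headroom value is an assumption):
# Sketch: pick the best quantization that fits free memory, using the table
# above (Llama-3.1-8B-Instruct figures; sizes are illustrative, not constants).
QUANTS = [  # (tag, size_gb, mmlu), ordered best-quality first
    ("q8_0", 8.5, 0.683),
    ("q6_K", 6.6, 0.681),
    ("q5_K_M", 5.7, 0.680),
    ("q4_K_M", 4.7, 0.676),
    ("q3_K_M", 3.9, 0.663),
]

def pick_quant(free_gb: float, kv_headroom_gb: float = 1.5) -> str:
    """Return the highest-quality quant that leaves room for the KV cache."""
    for tag, size_gb, _ in QUANTS:
        if size_gb + kv_headroom_gb <= free_gb:
            return tag
    return "q3_K_M"  # last resort; expect a noticeable quality drop

print(pick_quant(22.0))  # RTX 4090 with ~22 GB free -> q8_0
print(pick_quant(9.0))   # ~9 GB free -> q6_K
print(pick_quant(6.5))   # ~6.5 GB free -> q4_K_M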
Step 2: Pull and run on each hardware profile
# RTX 4090 (24GB VRAM): run Q8_0 — full fidelity fits comfortably
ollama pull qwen2.5:7b-instruct-q8_0
ollama run qwen2.5:7b-instruct-q8_0
# Result: all 28 layers on GPU, ~50-65 tok/s decode, 8.5GB VRAM used
# TTFT (time to first token): ~120ms for 512-token prompt
# M2 MacBook Pro (16GB unified): Q4_K_M fits with room for KV cache
ollama pull qwen2.5:7b-instruct-q4_K_M
ollama run qwen2.5:7b-instruct-q4_K_M
# Result (Apple Silicon, from arXiv:2511.05502 benchmarks):
# Ollama: ~15-25 tok/s decode on M2 Pro
# vs. MLX: ~35-45 tok/s on same hardware (MLX wins on throughput)
# vs. llama.cpp direct: ~20-30 tok/s
# TTFT with Ollama on 512-token prompt: ~400-600ms (worse than MLC-LLM's ~200ms)
# Raspberry Pi 5 (8GB RAM, CPU-only): Q3_K_M or Q4_K_M, CPU inference only
# Results from arXiv:2511.07425 (SBC eval, Ollama vs Llamafile):
# Ollama on Raspberry Pi 5: ~1.5-2.5 tok/s on 1.5B models
# Llamafile on same hardware: up to 4x higher throughput, 30-40% lower power
# ← IMPORTANT: On single-board computers (SBCs), Llamafile significantly outperforms Ollama.
# Ollama is not the right tool for Raspberry Pi deployment.
Step 3: Verify GPU utilization and layer distribution
ollama ps
# NAME SIZE PROCESSOR UNTIL
# qwen2.5:7b-instruct-q4_K_M 4.7 GB 100% GPU 9 minutes from now
# If VRAM is insufficient, Ollama splits layers automatically:
# qwen2.5:7b-instruct-q8_0 8.5 GB 52% GPU / 48% CPU 9 minutes from now
# This means 14 of 28 layers run on GPU, 14 on CPU.
# Mixed-device inference is slower than pure GPU but faster than pure CPU.
# Token generation requires full layer traversal — every CPU layer adds latency.
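The same split information is available programmatically from the /api/ps endpoint, which is handy for monitoring scripts. A sketch (the size_vram field name matches current Ollama responses, but verify against your version):
# Sketch: query the running-model list and derive the GPU/CPU split from
# size vs. size_vram (field names per current Ollama responses; verify
# against your installed version).
import requests

ps = requests.get("http://localhost:11434/api/ps").json()
for m in ps.get("models", []):
    size = m.get("size", 0)
    vram = m.get("size_vram", 0)
    gpu_pct = 100 * vram / size if size else 0
    print(f'{m["name"]}: {gpu_pct:.0f}% of weights in VRAM, expires {m.get("expires_at")}')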
Step 4: Python client integration
from openai import OpenAI
# ← Zero code change from OpenAI SDK usage except base_url and api_key
# THIS is why Ollama's OpenAI-compatible API is its single most important feature
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # ← Ollama ignores this but the SDK requires it
)
response = client.chat.completions.create(
    model="qwen2.5:7b-instruct-q4_K_M",
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a Python function to compute rolling averages."},
    ],
    stream=True,
    temperature=0.7,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
# Output: token-by-token streaming, identical behavior to OpenAI SDK against cloud API
# Measured decode latency (RTX 4090, Q4_K_M, qwen2.5-7b): ~50-65 tok/s
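The OpenAI-compatible route covers most integrations; Ollama's native /api/chat endpoint additionally takes per-request options that mirror the Modelfile PARAMETER names, plus timing counters you can turn into a decode rate. A sketch (option names assume parity with the Modelfile parameters shown earlier):
# Sketch: the native chat endpoint, with per-request options mirroring
# Modelfile PARAMETER names (num_ctx, temperature, num_predict, ...).
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5:7b-instruct-q4_K_M",
        "messages": [{"role": "user", "content": "Summarize GGUF in one sentence."}],
        "options": {"num_ctx": 8192, "temperature": 0.7, "num_predict": 256},
        "stream": False,
    },
).json()
print(resp["message"]["content"])
# eval_count / eval_duration (nanoseconds) give a decode-rate estimate:
if "eval_count" in resp and "eval_duration" in resp:
    print(f'{resp["eval_count"] / (resp["eval_duration"] / 1e9):.1f} tok/s')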
Why This Design Works, and What It Trades Away
The Go server plus subprocess architecture is the correct design for Ollama's actual use case: a single-developer or small-team local inference server where ergonomics and model lifecycle management matter more than throughput. The subprocess isolation means a failing inference does not crash the management layer. The content-addressed blob storage means pulling the same model on multiple machines produces identical artifacts. The OpenAI-compatible API means any application built for cloud LLM APIs runs locally with one URL change. These are the correct design decisions for the stated goal.
The GGUF format is the correct storage layer for local deployment. A single binary file bundles weights, tokenizer, model metadata, and quantization format. No Python, no HuggingFace transformers, no safetensors conversion. Ollama, LM Studio, GPT4All, Jan, koboldcpp, and llama.cpp all read GGUF. The format is the de facto standard for local inference precisely because of this portability. The quantization research (arXiv:2601.14277) shows Q4_K_M as the correct default: it delivers 71% memory reduction from FP16 with less than 1% quality degradation on standard benchmarks for Llama-3.1-8B-Instruct.
What Ollama trades away:
Raw throughput. Benchmarks consistently show Ollama running 15-30% slower than llama.cpp directly on the same hardware, due to the HTTP subprocess boundary, the Go serialization layer, and the absence of advanced batching features like continuous batching or speculative decoding. vLLM with PagedAttention is 2-3x more throughput-efficient for multi-user server workloads. Ollama does not compete in that category and should not be used for it.
Multi-GPU tensor parallelism. Ollama supports multiple GPUs for running different model instances or multiple instances of the same model for load balancing. It does not support splitting a single model's weight matrices across multiple GPUs in tensor-parallel fashion. A 70B model that requires two GPUs to fit in VRAM is not supported by Ollama in that configuration. Use vLLM or llama.cpp directly with --split-mode layer for that.
Upstream llama.cpp feature lag. Ollama ships its own fork of the GGML inference engine. New llama.cpp features (new quantization formats, speculative decoding improvements, new model architecture support) arrive in Ollama after a delay. Power users who need the latest capabilities immediately run llama.cpp directly.
Technical Moats
162,000 GitHub stars and a model registry with thousands of pre-quantized models. The ollama.com registry provides pre-quantized GGUF models (Llama-3, Qwen-2.5, Mistral, Gemma, DeepSeek, Phi, and many more) with verified Modelfiles including correct chat templates. For any model in the registry, ollama pull handles quantization selection, download, and correct template configuration automatically. For any model outside the registry, the user must find the GGUF, determine the quantization, and write the Modelfile manually. The registry is the moat.
The Docker UX for LLM management. ollama pull, ollama run, ollama ps, ollama rm. Developers who know Docker need no other documentation to manage Ollama models. This is not accidental. The explicit Docker analogy is the key to Ollama's adoption. No competing tool surfaces this UX as clearly.
OpenAI-compatible API as default. Every LangChain component, every LlamaIndex loader, every OpenAI client library, every tool that targets OpenAI's API works with Ollama by changing one URL. The switching cost is one line of configuration. This is the primary adoption mechanism for production integrations.
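As a concrete example of that switching cost, an existing LangChain pipeline moves to Ollama by overriding the base URL, sketched here with the langchain-openai package (the dedicated langchain-ollama integration is another route):
# Sketch: an existing LangChain pipeline pointed at the local endpoint by
# overriding base_url (assumes the langchain-openai package is installed).
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # ignored by Ollama, required by the client
)
print(llm.invoke("Name one tradeoff of Q4_K_M quantization.").content)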
Insights
Insight One: Ollama is not a local inference engine. It is a DevEx layer on top of llama.cpp, and the community's insistence on benchmarking it against MLX, vLLM, and TGI reveals a fundamental misunderstanding of what it is optimized for.
The arXiv:2511.05502 benchmark on Apple Silicon explicitly notes that Ollama "emphasizes developer ergonomics but lags in throughput and TTFT." This is not a bug. It is the design goal. Ollama is optimized for the "pull a model and have it work in your existing code" use case, not for maximum tokens-per-second. Developers who reach for Ollama as a production serving engine, then are disappointed by its throughput, have selected the wrong tool for the job. The correct frame is: Ollama for development and single-user local inference, vLLM for GPU server workloads, MLX for Apple Silicon maximum throughput, Llamafile for edge devices.
Insight Two: The quantization format matters more than the model choice for most local deployments, and Ollama's default choice (Q4_K_M) is correct, but the community does not know why, and routinely chooses formats that are either wasteful or degraded.
The arXiv:2601.14277 unified evaluation of llama.cpp quantization on Llama-3.1-8B-Instruct shows that Q4_K_M reduces model size by 71% from FP16 while losing less than 1% on MMLU (0.683 → 0.676). Moving to Q3_K_M saves another 17% memory but degrades MMLU by 1.9% (0.676 → 0.663) and perplexity by 4.4%. Q2_K degrades perplexity by 19.6%. Most users who choose Q3_K_M or Q2_K to "fit the model" are trading significant quality for modest memory savings. Q4_K_M is correct as the default. Q5_K_M is correct when 1GB of extra memory is available and quality matters. Q6_K and Q8_0 deliver minimal additional quality improvement and should be used only when memory is genuinely unconstrained.
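The percentages in that paragraph fall straight out of the table's raw figures; a few lines of arithmetic make the tradeoff explicit (the FP16 size of ~16.1 GB for 8B parameters at two bytes each is an assumption, the rest are the cited values):
# Sketch: the arithmetic behind the percentages quoted above (raw values are
# the cited Llama-3.1-8B-Instruct table figures; FP16 size assumes ~16.1 GB
# for 8B parameters at 2 bytes each).
size_gb = {"fp16": 16.1, "q4_K_M": 4.7, "q3_K_M": 3.9}
mmlu = {"q4_K_M": 0.676, "q3_K_M": 0.663}

print(f'size cut FP16 -> Q4_K_M: {1 - size_gb["q4_K_M"] / size_gb["fp16"]:.0%}')    # ~71%
print(f'extra cut Q4 -> Q3:      {1 - size_gb["q3_K_M"] / size_gb["q4_K_M"]:.0%}')  # ~17%
print(f'MMLU drop Q4 -> Q3:      {(mmlu["q4_K_M"] - mmlu["q3_K_M"]) / mmlu["q4_K_M"]:.1%}')  # ~1.9%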
Takeaway
On single-board computers (Raspberry Pi 4, Raspberry Pi 5, Orange Pi 5 Pro), Llamafile outperforms Ollama by up to 4x in throughput and uses 30-40% less power, per arXiv:2511.07425's evaluation of 25 quantized open-source LLMs across three SBCs.
Ollama's subprocess HTTP boundary, which is acceptable overhead on a developer laptop or server, is a significant penalty on ARM hardware where every microsecond of serialization latency compounds across thousands of tokens. Llamafile bundles the model and runtime as a single self-contained executable with no subprocess boundary. On embedded targets, the architecture that eliminates all process communication overhead wins decisively. This is the correct tool selection insight that most edge AI deployment guides miss: Ollama is not the right choice for Raspberry Pi.
TL;DR For Engineers
Ollama is a Go-based model manager plus llama.cpp inference subprocess. The Go layer handles scheduling, GPU memory estimation, model lifecycle, and OpenAI API translation. Everything below is llama.cpp and GGML. The subprocess boundary costs 15-30% throughput vs. llama.cpp direct.
Use Q4_K_M as the default quantization for VRAM-constrained local deployment: 71% size reduction from FP16, less than 1% quality loss on standard benchmarks. Q5_K_M for one step up when memory allows. Avoid Q3_K_M and below unless forced by hardware constraints (arXiv:2601.14277).
On Apple Silicon, MLX delivers 2x Ollama's throughput. On SBCs (Raspberry Pi, Orange Pi), Llamafile delivers up to 4x Ollama's throughput with 30-40% lower power. Ollama wins on ergonomics and ecosystem, not raw performance (arXiv:2511.05502, arXiv:2511.07425).
OLLAMA_NUM_PARALLEL, OLLAMA_MAX_LOADED_MODELS, OLLAMA_KEEP_ALIVE, and OLLAMA_FLASH_ATTENTION are the four environment variables that control production behavior. All default to conservative values. All need explicit tuning for multi-user or long-context workloads.
Ollama does not support tensor-parallel multi-GPU splitting for a single model. For 70B+ models requiring multiple GPUs, use vLLM or llama.cpp with --split-mode layer.
The Registry Is the Product. The Inference Engine Is the Implementation Detail.
Ollama's 162,000 GitHub stars are not for llama.cpp behind an HTTP boundary. They are for ollama pull llama3.2 working correctly on the first try, with the right chat template, the right quantization, and an OpenAI-compatible API endpoint that requires zero integration work. The inference performance is a tradeoff that Ollama explicitly accepts in exchange for that ergonomic guarantee. The users who understand this use Ollama for what it is: the easiest correct path from "I want to run a local LLM" to "I have a running local LLM API my existing code can call." The users who misunderstand this use Ollama as a production serving engine and wonder why it underperforms vLLM. The architecture does not hide which use case it serves. The community just does not read it carefully enough.
References
Ollama GitHub Repository, 162k stars, MIT license
Production-Grade Local LLM Inference on Apple Silicon, arXiv:2511.05502, Rajesh et al., 2025
An Evaluation of LLMs Inference on Popular Single-board Computers, arXiv:2511.07425, Nguyen and Nguyen, 2025
llama.cpp GitHub Repository, 100k+ stars, primary inference engine underlying Ollama
Switching from Ollama to llama-swap, Bas Nijholt, power-user perspective on Ollama's limits
Ollama (162k GitHub stars, MIT) is a Go-based model management layer wrapping llama.cpp for local LLM inference, exposing a Docker-style CLI (pull, run, ps, rm), a content-addressed GGUF blob store, and an OpenAI-compatible REST API. Its scheduler estimates GPU memory, distributes transformer layers across CPU and GPU, and manages model lifecycle with configurable keep-alive and parallelism. Benchmarks show it trailing MLX by 2x on Apple Silicon throughput (arXiv:2511.05502) and Llamafile by up to 4x on single-board computers (arXiv:2511.07425), while Q4_K_M quantization (its default) delivers 71% size reduction from FP16 with less than 1% quality loss on MMLU (arXiv:2601.14277). Ollama's moat is its model registry with pre-quantized, correctly-templated models and the ergonomic guarantee that ollama pull works on first try; its real constraint is the subprocess HTTP boundary that costs 15-30% throughput versus llama.cpp direct.
Sponsored Ad
If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀
ChatGPT gives you generic answers because you give it generic prompts.
You know the fix: longer prompts, more context, clearer constraints. But typing all that takes five minutes per prompt, so you shortcut it. Every time.
Wispr Flow lets you speak your prompts instead of typing them. Talk through your thinking naturally — include context, constraints, examples — and get clean text ready to paste. No filler words. No cleanup.
Works inside ChatGPT, Claude, Cursor, Windsurf, and every other AI tool. System-level, so there's nothing to install per app. Tap and talk.
Millions of users worldwide. Teams at OpenAI, Vercel, and Clay use Flow daily. Free on Mac, Windows, and iPhone.


