SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | April 24, 2026
Every LLM deployment story eventually hits the same wall: you have a model, you have hardware, and between them is a gap that CUDA cannot cross. iPhones speak Metal. Android Snapdragon chips speak OpenCL. Browsers speak WebGPU. Mali GPUs exist. Each target requires hand-tuned kernels that no single team can maintain across every model and every device combination. MLC-LLM's answer to this problem is not to write more kernels. It is to build a compiler that generates them.
This newsletter dissects MLC-LLM not as a deployment tool but as a compiler infrastructure problem: what TensorIR actually represents, how MetaSchedule searches kernel optimization spaces without human intervention, how the same compilation pipeline deploys Llama-3 to a $100 Orange Pi and to a browser running WebGPU, and why WebLLM retains up to 80% of native GPU performance inside a browser tab.
What It Actually Does
MLC-LLM (Machine Learning Compilation for Large Language Models) is a universal deployment engine for LLMs built on Apache TVM Unity. It has 22,000 GitHub stars and 1,900 forks. The project produces the WebLLM browser runtime and powers LLM inference across iOS (Metal), Android (OpenCL), NVIDIA (CUDA), AMD (ROCm, Vulkan), Intel (Vulkan), browsers (WebGPU, WASM), and embedded GPUs like Mali.
The value proposition is precise: one compilation pipeline, one set of Python-first optimization passes, one runtime, targeting all of these. No separate kernel library per backend. No per-device hand-tuning by humans. The TVM compiler generates and optimizes backend-specific kernels automatically using TensorIR and MetaSchedule.
The deployment pipeline has three stages, always in this order:
1. Weight conversion: HuggingFace weights are converted to MLC format with quantization applied (mlc_llm convert_weight).
2. Model compilation: TVM compiles the model graph and generates device-specific kernel code (mlc_llm compile). Produces a .so, .dylib, or .wasm library per target.
3. Runtime: A lightweight C++ runtime loads the compiled library and quantized weights and drives inference. Python, iOS, Android, JavaScript, and REST APIs all call into this same runtime.
Supported quantization modes: q0f16 (fp16, no quantization), q0f32 (fp32), q3f16_1 (3-bit weights, fp16 activations), q4f16_1 (4-bit group quantization, fp16 activations, the standard production mode), q4f16_awq (AWQ, experimental), and FP8 variants for CUDA. The format qAfB_id encodes A bits for weight storage, B bits for activation storage.
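The qAfB naming scheme can be sanity-checked with rough arithmetic. The sketch below is illustrative only: it counts raw weight storage and ignores group-quantization scale overhead, higher-precision embedding layers, and KV cache memory, so real footprints run slightly higher (the article's ~4.3GB for q4f16_1 vs the ~4.0GB computed here).

```python
# Rough weight-storage arithmetic for the qAfB quantization naming scheme.
# Illustrative only: ignores scale/zero-point overhead, fp16 embedding
# layers, and KV cache memory.

def weight_storage_gb(num_params: float, weight_bits: int) -> float:
    """Approximate raw weight storage in GB for a given bit width."""
    return num_params * weight_bits / 8 / 1e9

LLAMA3_8B_PARAMS = 8.0e9  # approximate parameter count

for mode, bits in [("q0f32", 32), ("q0f16", 16), ("q4f16_1", 4), ("q3f16_1", 3)]:
    print(f"{mode}: ~{weight_storage_gb(LLAMA3_8B_PARAMS, bits):.1f} GB")
```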
The Architecture

Focus on the TensorIR layer. This is where hardware portability is actually achieved: by separating what to compute from how to compute it, the same model definition retargets to any backend by changing the schedule, not the algorithm.
The key insight in TensorIR is the decoupling of the compute specification from the schedule. A matrix multiplication is described once as a data-parallel loop nest specifying what each output element computes. The schedule, which controls tile size, loop ordering, memory placement (global, shared, register), and vectorization, is a separate object that MetaSchedule searches automatically. Different hardware gets different schedules for the same computation.
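The decoupling can be illustrated in plain Python. This is a conceptual sketch, not TVM's actual TensorIR API: the "spec" fixes what each output element is, and two different "schedules" (naive vs tiled loop structure) compute it with different loop orderings while producing identical results.

```python
# Conceptual sketch of the compute/schedule split (plain Python, NOT the
# real TVM TensorIR API). The compute spec says WHAT each output element
# is; the schedule says HOW the loops are structured.

def matmul_spec(A, B, n):
    """Compute spec: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    return lambda i, j: sum(A[i][k] * B[k][j] for k in range(n))

def schedule_naive(spec, n):
    # Straightforward row-major loop nest.
    return [[spec(i, j) for j in range(n)] for i in range(n)]

def schedule_tiled(spec, n, tile=2):
    # Same computation, different loop structure: iterate over tiles first,
    # the kind of reordering a real schedule controls for cache locality.
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, n)):
                    C[i][j] = spec(i, j)
    return C

n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j + 1 for j in range(n)] for i in range(n)]
spec = matmul_spec(A, B, n)
assert schedule_naive(spec, n) == schedule_tiled(spec, n)
```

In real TensorIR the schedule additionally controls memory placement (shared memory, registers) and vectorization, which is where the hardware-specific performance comes from; the equivalence property shown here is what makes the search safe.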
MetaSchedule (NeurIPS 2022) searches this schedule space using probabilistic programs. It proposes candidate schedules, measures their performance on the target hardware, and updates a learned cost model. After the search, the best schedule is compiled to the target language and cached. Subsequent runs skip the search entirely.
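The propose/measure/cache loop can be sketched as a toy in plain Python. This is not the real tvm.meta_schedule API: the candidates here are bare tile sizes and the cost function is a stand-in for compiling and timing a kernel on the target device.

```python
# Toy version of MetaSchedule's propose/measure/cache loop (plain Python,
# NOT the real tvm.meta_schedule API).

import random

def measure(tile_size: int) -> float:
    # Stand-in cost function: pretend tile 32 is optimal on this "device".
    # A real system compiles the candidate and times it on hardware.
    return abs(tile_size - 32) + 1.0

def search_best_schedule(candidates, trials=8, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(trials):
        tile = rng.choice(candidates)  # propose a candidate schedule
        cost = measure(tile)           # measure it on the target
        if cost < best_cost:
            best, best_cost = tile, cost
    return best

# The cache is keyed per model-hardware pair; subsequent runs skip the search.
schedule_cache = {}
key = ("llama3-8b", "cuda")
if key not in schedule_cache:
    schedule_cache[key] = search_best_schedule([8, 16, 32, 64, 128], trials=100)
print("best tile:", schedule_cache[key])
```

The real MetaSchedule also updates a learned cost model from the measurements so later proposals are guided rather than random; the cache-keyed-by-target structure is the part that makes "subsequent runs skip the search entirely" work.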
The Code
Snippet One: Full Deployment Pipeline (Python CLI, Llama-3 to CUDA)
# Step 1: Convert weights from HuggingFace format to MLC format
# q4f16_1 = 4-bit group quantization, fp16 activations
# ← All platforms share the same converted weights. Compile separately per target.
mlc_llm convert_weight ./dist/models/Llama-3-8B-Instruct/ \
--quantization q4f16_1 \
-o dist/Llama-3-8B-Instruct-q4f16_1-MLC/
# Step 2: Generate mlc-chat-config.json (model metadata, tokenizer, memory planning)
# ← This config drives everything downstream: the compiler reads it, the runtime reads it
mlc_llm gen_config ./dist/models/Llama-3-8B-Instruct/ \
--quantization q4f16_1 \
--conv-template llama-3 \
-o dist/Llama-3-8B-Instruct-q4f16_1-MLC/
# Step 3: Compile for CUDA target
# ← TVM runs MetaSchedule search here on first compile. Cached on subsequent runs.
# ← Output: dist/libs/Llama-3-8B-Instruct-q4f16_1-cuda.so (native shared library)
mlc_llm compile dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json \
--device cuda \
-o dist/libs/Llama-3-8B-Instruct-q4f16_1-cuda.so
# Same weights, same config, different target: WebGPU for browser deployment
# ← Output: a .wasm file that runs in any WebGPU-capable browser, no server needed
# ← --prefill-chunk-size reduced to fit browser GPU memory limits
mlc_llm compile dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json \
--device webgpu \
--prefill-chunk-size 1024 \
-o dist/libs/Llama-3-8B-Instruct-q4f16_1-webgpu.wasm
# One-liner shortcut: JIT compile on first run, then serve via OpenAI-compatible API
# ← Skips explicit compile step. TVM JIT compiles and caches automatically.
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --device cuda
The two compile commands for CUDA and WebGPU use identical weights and config. The only difference is --device. TVM's code generation handles the rest: CUDA kernels for one, WGSL shaders for the other, from the same TensorIR schedule search.
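Because only --device and the output path vary, the per-target builds script naturally. A minimal sketch, assuming the paths and device names from Snippet One (the loop prints the commands; uncomment the subprocess call to actually run them):

```python
# Sketch: scripting the per-target compile loop from Snippet One.
# Paths are the article's illustrative ones; the command is identical
# across targets except for --device and the output artifact.

import subprocess

CONFIG = "dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json"
TARGETS = {
    "cuda": "dist/libs/Llama-3-8B-Instruct-q4f16_1-cuda.so",
    "iphone": "dist/libs/Llama-3-8B-Instruct-q4f16_1-metal.tar",
    "webgpu": "dist/libs/Llama-3-8B-Instruct-q4f16_1-webgpu.wasm",
}

def compile_command(device: str, output: str) -> list:
    # One config, one weight directory; only the target differs.
    return ["mlc_llm", "compile", CONFIG, "--device", device, "-o", output]

for device, output in TARGETS.items():
    cmd = compile_command(device, output)
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually compile
```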
Snippet Two: Python API and OpenAI Compatibility (MLCEngine)
from mlc_llm import MLCEngine
# MLCEngine is the universal runtime. Same API for CUDA, Metal, Vulkan, WebGPU.
# ← HF:// prefix triggers auto-download of pre-compiled artifacts from Hub
# ← On first run: downloads weights, JIT compiles if no cached .so exists
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)
# ← THIS is the key design decision: OpenAI-compatible API, not a custom interface
# Existing OpenAI client code works with zero modification by changing the base_url
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Explain TensorIR in one paragraph."}],
    model=model,
    stream=True,  # streaming works on all backends including WebGPU
    temperature=0.7,
    max_tokens=256,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
engine.terminate()
# For Android (Java), TVM4J provides identical semantics via JNI bridge
# For iOS (Swift), a Swift wrapper calls the same C++ runtime
# For browsers, WebLLM exposes the same OpenAI-style JavaScript API
# ← The runtime is the same C++ code everywhere. Language bindings are thin wrappers.
# Multi-GPU tensor parallelism: one flag, no code change
engine_multi = MLCEngine(model, overrides={"tensor_parallel_shards": "2"})
# ← ZeRO-style sharding: each GPU holds a shard of the weight matrices
# ← NCCL AllReduce for attention and MLP outputs across shards
The OpenAI-compatible API is not a convenience wrapper. It is the primary API surface. Any application built against OpenAI's SDK can switch to local MLC-LLM inference by changing one URL. This is the adoption strategy, not a feature.
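The "one URL" claim can be made concrete without the SDK. The sketch below builds a raw request in OpenAI's chat completions shape; the hosted and local calls differ only in the base URL (the localhost port matches the serve example later in this article, and the model names are illustrative).

```python
# The one-line switch: the same OpenAI-style request targets a hosted API
# or a local MLC-LLM server depending only on the base URL. Raw payload
# construction shown here so no SDK is required.

import json

def chat_request(base_url: str, model: str, prompt: str):
    url = f"{base_url}/chat/completions"
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    })
    return url, payload

# Hosted API vs local MLC-LLM server: identical payload, different URL.
hosted = chat_request("https://api.openai.com/v1", "gpt-4o", "hi")
local = chat_request("http://localhost:8080/v1",
                     "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC", "hi")
```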
In Action: End-to-End Worked Example
Scenario: Deploy Llama-3 (8B, 4-bit quantized) to three targets from the same weights: CUDA server, iOS Metal, and Chrome browser via WebGPU.
Input: meta-llama/Meta-Llama-3-8B-Instruct from HuggingFace, target platforms: CUDA, iOS Metal, WebGPU.
Step 1: Weight conversion (one-time, platform-agnostic)
mlc_llm convert_weight meta-llama/Meta-Llama-3-8B-Instruct \
--quantization q4f16_1 \
-o ./dist/llama3-8b-q4f16_1-MLC/
# Output: quantized weights in .bin shards (~4GB at 4-bit vs ~16GB fp16)
# These shards are identical for all three target platforms.
Step 2: Compile for each target
# CUDA (Linux server, A100)
mlc_llm compile ./dist/llama3-8b-q4f16_1-MLC/mlc-chat-config.json \
--device cuda -o ./dist/libs/llama3-8b-cuda.so
# Compile time: ~2-5 minutes (MetaSchedule search + NVCC compilation)
# Output: ./dist/libs/llama3-8b-cuda.so (~80MB)
# iOS Metal (cross-compiled on macOS)
mlc_llm compile ./dist/llama3-8b-q4f16_1-MLC/mlc-chat-config.json \
--device iphone -o ./dist/libs/llama3-8b-metal.tar
# Output: Metal shader library, bundled with iOS app
# WebGPU (browser, Chrome/Edge with WebGPU enabled)
mlc_llm compile ./dist/llama3-8b-q4f16_1-MLC/mlc-chat-config.json \
--device webgpu --prefill-chunk-size 1024 \
-o ./dist/libs/llama3-8b-webgpu.wasm
# Output: WebAssembly + WGSL shader bundle (~85MB .wasm)
Step 3: Runtime results
CUDA (A100 80GB, bf16 activations, 4-bit weights):
Prefill throughput: ~3,200 tokens/sec
Decode throughput: ~85-110 tokens/sec
Memory: ~4.3GB VRAM at 4-bit (vs ~16GB at fp16)
iOS (iPhone 15 Pro, Metal, q4f16_1):
Decode throughput: ~15-25 tokens/sec on Apple A17 Pro
Model fits in 6GB RAM budget with 4-bit quantization
WebGPU (Chrome, M2 MacBook Air, q4f16_1):
Decode throughput: ~12-20 tokens/sec
Retains up to 80% of native Metal performance (per WebLLM paper, arXiv:2412.15803)
Runs entirely client-side, no server, no API key
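The ~4.3GB VRAM figure is consistent with back-of-envelope arithmetic. The overhead terms below (one fp16 scale per group of 32 weights) are assumptions for illustration, not measured values from MLC-LLM; KV cache memory comes on top and scales with context length.

```python
# Back-of-envelope check of the reported ~4.3 GB figure for Llama-3-8B at
# 4-bit. Group size of 32 and fp16 scales are ASSUMED for illustration.

params = 8.0e9
weights_4bit = params * 4 / 8 / 1e9     # ~4.0 GB of raw 4-bit weights
scales = params / 32 * 2 / 1e9          # one 2-byte fp16 scale per 32 weights

print(f"4-bit weights: ~{weights_4bit:.1f} GB")
print(f"group scales:  ~{scales:.2f} GB")
print(f"total weights: ~{weights_4bit + scales:.1f} GB (+ KV cache on top)")
```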
Step 4: OpenAI-compatible REST server (CUDA target)
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --device cuda --port 8080
# Any OpenAI client now works against local hardware:
# client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="none")
Why This Design Works, and What It Trades Away
The TensorIR compute-schedule separation is the correct abstraction for this problem because it cleanly separates concerns that have different owners. The model architecture (what to compute) is defined once in Python and does not change per hardware. The optimization schedule (how to compute it) is hardware-specific and searched automatically by MetaSchedule. Without this separation, adding a new hardware target requires rewriting kernels. With it, adding a new target requires writing a new code generation backend and reusing the existing schedule search infrastructure.
The Python-first development model is the correct choice for iteration speed. TVM Unity exposes the entire compilation pipeline as Python objects. An engineer can inspect the IRModule after any pass, write custom passes in Python, and compose new model architectures without touching C++. This enabled the Orange Pi Mali GPU deployment in roughly one week of effort by reusing all existing passes with OpenCL retargeting.
The OpenAI-compatible API as the primary interface is a deliberate adoption strategy. The alternative, a custom MLC-specific API, would require application developers to learn a new interface. By matching OpenAI's API surface, any application already using the OpenAI SDK deploys locally by changing one URL. This is not a minor convenience. It is the difference between "interesting research" and "production deployment".
What MLC-LLM trades away:
Compilation time. MetaSchedule's search-and-measure loop takes minutes to hours for a new model-hardware combination. The search result is cached, but first-run latency is real. llama.cpp avoids this entirely with pre-written GGML kernels. For production deployments where compile-once-run-many is acceptable, this tradeoff is correct. For rapid experimentation across many hardware targets without caching, it is a significant friction point.
Kernel authorship transparency. When llm.c runs, you can read exactly which C function computes your attention kernel. When MLC-LLM runs, the attention kernel was generated by TVM from a TensorIR schedule found by MetaSchedule. Debugging a performance regression requires understanding the compiler, not just the model. This is a real expertise barrier.
Continuous batching and PagedAttention. MLC-LLM's serving performance on multi-user scenarios is not competitive with vLLM for server-side workloads. vLLM's PagedAttention eliminates KV cache memory fragmentation entirely. MLC-LLM's serving stack is optimized for single-session mobile and browser inference, not for maximizing throughput across hundreds of concurrent users.
Technical Moats
The cross-platform kernel search cache. MetaSchedule's search produces tuned schedules for each model-hardware combination. The mlc-ai organization publishes pre-compiled libraries and pre-searched schedules on HuggingFace, meaning users running a supported model on a supported device skip the search entirely. Replicating this requires running the search across every model and hardware combination, which requires the hardware fleet to do it.
WebGPU inference at 80% native performance. WebGPU has no high-performance kernel library equivalent to cuBLAS or Metal Performance Shaders. MLC-LLM/WebLLM is the only system that generates tuned WGSL compute shaders via TVM, which is why WebLLM achieves 80% native performance while naive WebGPU implementations fall to 20-30%. This is a genuine compiler contribution, not a runtime optimization.
The Orange Pi proof. Running Llama-2/3 on a $100 Orange Pi with Mali GPU via OpenCL, with no manual kernel changes, required only retargeting the existing OpenCL codegen backend. This demonstrates that the abstraction is genuinely hardware-agnostic, not just claimed to be. Any team building a competing solution from scratch would need to write and tune OpenCL kernels for Mali from zero.
Insights
Insight One: MLC-LLM is not a deployment framework. It is a compiler that happens to have a deployment runtime. The community treats these as the same thing and misunderstands both.
The community compares MLC-LLM to llama.cpp and vLLM as if they are interchangeable deployment options. They are not. llama.cpp is a hand-tuned inference runtime with custom GGML kernels. vLLM is a serving engine with PagedAttention. MLC-LLM is a compiler that generates optimized inference code for any target hardware. The runtime is a consequence of the compiler, not the point of the project. This distinction matters because MLC-LLM's value is not in what it currently supports. It is in how trivially it extends to hardware that does not yet have an LLM inference stack: new mobile SoCs, embedded accelerators, emerging GPU architectures. Any of these are within reach by writing a TVM codegen backend, without touching model code.
Insight Two: The 80% native WebGPU performance number buries the real story. The other 20% is structurally unavailable, not an engineering gap.
WebLLM's paper reports retaining up to 80% of native performance on the same device. This sounds like a 20% penalty. The reality is more interesting: WebGPU's execution model lacks the fine-grained memory control that CUDA and Metal expose, specifically subgroup operations and explicit shared memory tiling patterns that are critical for attention kernels. The remaining 20% gap is not recoverable without changes to the WebGPU specification itself. MLC-LLM's TensorIR schedules for WebGPU are already at or near the ceiling of what WebGPU currently allows. Any competing implementation faces the same ceiling. The 20% is not an MLC-LLM limitation. It is a WebGPU limitation.
Takeaway
Deploying Llama-3 on a $100 Orange Pi with a Mali GPU took approximately one week and required zero new kernel code. The entire existing MLC-LLM pipeline (quantization passes, optimization passes, layout transformations) was reused verbatim. Only the OpenCL codegen backend, which already existed in TVM, needed retargeting. The compiler eliminated what would have been months of manual kernel engineering.
This is the actual proof of the architecture's generalization. Not the iPhone demo or the browser inference, which are impressive but expected for well-resourced hardware. A $100 single-board computer running a quantized 7B model at usable token rates, with no custom kernel work, is the benchmark that matters for understanding what "universal deployment" actually means in practice.
TL;DR For Engineers
MLC-LLM is a compiler first: TVM Unity + TensorIR + MetaSchedule generates hardware-specific kernels from a single Python-first model definition. The runtime (iOS, Android, CUDA, WebGPU) is what the compiler produces.
The three-stage pipeline: convert weights (quantize, platform-agnostic), compile (MetaSchedule search + codegen, per-target), run (universal C++ runtime with thin language bindings). Weights are shared across all targets; compiled libraries are per-target.
WebLLM retains up to 80% native GPU performance in browser via TVM-generated WGSL kernels. The remaining 20% is a WebGPU specification constraint, not an MLC-LLM engineering gap.
Quantization modes: q4f16_1 (4-bit weights, fp16 activations) is the standard production mode. 4-bit reduces Llama-3-8B VRAM from ~16GB to ~4.3GB with minimal accuracy loss.
Do not use MLC-LLM for multi-user server workloads. vLLM with PagedAttention wins on throughput per concurrent session. MLC-LLM wins when the hardware target is not CUDA or when browser/mobile deployment is required.
The Compiler Is the Product. Everything Else Is a Target.
MLC-LLM's value is not that it runs LLMs on phones today. It is that it provides a principled path to running LLMs on any hardware that receives a TVM codegen backend. Mali GPU, Snapdragon Adreno, Apple Neural Engine, future RISC-V AI accelerators, all of them are within reach of the same compilation pipeline without model-specific engineering work. The alternative, writing and maintaining hand-tuned kernels for every model and every hardware target, does not scale. No team is large enough to do it. The compiler is the only approach that scales with the combinatorial explosion of models and hardware. MLC-LLM built the right abstraction. The question is whether the community will recognize compiler infrastructure as the durable investment it is, rather than chasing whichever serving framework has the highest benchmark this week.
References
MLC-LLM GitHub Repository, 22k stars, Apache-2.0
WebLLM: A High-Performance In-Browser LLM Inference Engine, arXiv:2412.15803, Ruan et al., 2024
Sponsored Ad
If you enjoy practical AI insights, check out SnackOnAI and support the newsletter by subscribing, sharing, and exploring our sponsored ad — it helps us keep building and delivering value 🚀
