March 3, 2026

We Built the Fastest LLM Decode Engine for Apple Silicon. Here Are the Numbers.


How fast can you run an LLM on Apple Silicon if you throw away every abstraction and go straight to the metal?

658 tokens per second. All on a single M4 Max.

We benchmarked five engines, MetalRT and four alternatives, across four models. MetalRT won decode on 3 of 4 models and averaged 1.67x faster than llama.cpp.

We tested against:

  • uzu - production-grade Rust inference engine
  • mlx-lm - Apple's MLX inference framework
  • llama.cpp - the most widely-used open-source inference engine
  • Ollama - popular llama.cpp wrapper with REST API (v0.17.4)

Setup

Engine      Language            Benchmark Method
MetalRT     C++                 Native binary
uzu         Rust                Native cli bench
mlx-lm      Python + MLX C++    Python API
llama.cpp   C/C++               llama-bench v8190
Ollama      Go + llama.cpp      REST API (streaming)

  • Hardware: Apple M4 Max, 64 GB unified memory, macOS 26.3
  • Models: Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, LFM2.5-1.2B (all 4-bit quantized)
  • Runs: 5 per engine per model, best reported
  • Fairness: MetalRT and mlx-lm use the exact same model files. Ollama uses the same GGUF files as llama.cpp, with REST API overhead included.
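
The "5 runs, best reported" protocol can be sketched as a small timing harness. This is a minimal illustration, not the harness used for these benchmarks: `stream_tokens`, `measure_decode_tps`, and `best_of` are hypothetical names, and treating the first token as the end of prefill is an assumption about where decode timing starts.

```python
import time
from typing import Callable, Iterable


def measure_decode_tps(
    stream_tokens: Callable[[], Iterable[object]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return decode throughput in tokens per second for a single run.

    Assumption: the first token marks the end of prefill, so the clock
    starts after it and only the tokens that follow are counted.
    """
    tokens = iter(stream_tokens())
    next(tokens)                      # block until the first token arrives
    start = clock()
    decoded = sum(1 for _ in tokens)  # drain the stream, counting decoded tokens
    elapsed = clock() - start
    if decoded == 0 or elapsed <= 0:
        raise ValueError("run produced no measurable decode phase")
    return decoded / elapsed


def best_of(
    stream_tokens: Callable[[], Iterable[object]],
    runs: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Repeat the measurement and keep the fastest run (best-of-N)."""
    return max(measure_decode_tps(stream_tokens, clock) for _ in range(runs))
```

Passing the clock in as a parameter keeps the harness testable without real inference. In practice `stream_tokens` would wrap whichever engine is under test.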

Decode Speed

Decode speed is how fast tokens stream to the user. It is the metric that matters most for interactive chat.

[Figure: Decode throughput across all engines]
Model          MetalRT   uzu   mlx-lm   llama.cpp   Ollama
Qwen3-0.6B     658       627   552      295*        274
Qwen3-4B       186       165   170      87          120
Llama-3.2-3B   184       222   210      137         131
LFM2.5-1.2B    570       550   509      372         313

All values are decode throughput in tok/s.

*Qwen3-0.6B llama.cpp/Ollama use Q8_0 (8-bit), not directly comparable.

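MetalRT wins 3 of 4 models. The speedups:

  • 1.10-1.19x vs mlx-lm on the three models MetalRT wins (same model files)
  • 1.35-2.14x vs llama.cpp on the three models where both run 4-bit weights
  • 1.41-2.40x vs Ollama (the 2.40x figure is Qwen3-0.6B, where Ollama runs Q8_0)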

uzu wins Llama-3.2-3B at 222 tok/s, and mlx-lm (210 tok/s) also beats MetalRT (184 tok/s) on that model. We report this honestly.
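
The ratios above can be recomputed from the decode table. The sketch below copies the table values and assumes the llama.cpp and Ollama averages leave out Qwen3-0.6B, since those engines ran it at Q8_0. That assumption is ours, but it reproduces the 1.67x and 1.59x averages. The helper names are illustrative only.

```python
from statistics import mean

# Decode throughput in tok/s, copied from the table above.
DECODE_TPS = {
    "Qwen3-0.6B":   {"MetalRT": 658, "uzu": 627, "mlx-lm": 552, "llama.cpp": 295, "Ollama": 274},
    "Qwen3-4B":     {"MetalRT": 186, "uzu": 165, "mlx-lm": 170, "llama.cpp": 87,  "Ollama": 120},
    "Llama-3.2-3B": {"MetalRT": 184, "uzu": 222, "mlx-lm": 210, "llama.cpp": 137, "Ollama": 131},
    "LFM2.5-1.2B":  {"MetalRT": 570, "uzu": 550, "mlx-lm": 509, "llama.cpp": 372, "Ollama": 313},
}

# Assumption: Q8_0 runs are excluded from averages as "not directly comparable".
NOT_COMPARABLE = {"llama.cpp": {"Qwen3-0.6B"}, "Ollama": {"Qwen3-0.6B"}}


def speedups(baseline: str) -> dict[str, float]:
    """MetalRT throughput divided by the baseline's, per comparable model."""
    skip = NOT_COMPARABLE.get(baseline, set())
    return {
        model: row["MetalRT"] / row[baseline]
        for model, row in DECODE_TPS.items()
        if model not in skip
    }


def winner(model: str) -> str:
    """Engine with the highest decode throughput on the given model."""
    row = DECODE_TPS[model]
    return max(row, key=row.get)


if __name__ == "__main__":
    for baseline in ("mlx-lm", "llama.cpp", "Ollama"):
        ratios = speedups(baseline)
        print(baseline, {m: round(r, 2) for m, r in ratios.items()},
              "mean:", round(mean(ratios.values()), 2))
```

Run against mlx-lm, the same function returns a ratio below 1.0 for Llama-3.2-3B, which is why the 1.10-1.19x range covers only the three models MetalRT wins.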

MetalRT vs Apple MLX and llama.cpp

mlx-lm is Apple's official inference framework. MetalRT and mlx-lm use the exact same model files, so this is a pure engine-to-engine comparison.

[Figure: MetalRT decode speedup vs Apple MLX (mlx-lm) and llama.cpp]

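MetalRT is 1.10-1.19x faster than mlx-lm on decode for the three models it wins (same model files); on Llama-3.2-3B, mlx-lm is faster at 210 vs 184 tok/s. Against llama.cpp, MetalRT is faster on all four models, by 1.35-2.14x where both engines run 4-bit weights.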

What MetalRT Is Built For

Use Case                           Why MetalRT
Chat apps                          186 tok/s on a 4B model, responses stream instantly
Structured output / tool calling   Faster decode = faster JSON and function call generation
Agent workflows                    Compound latency savings across sequential LLM calls
Coding assistants                  Sub-7ms TTFT on small models
Privacy-first apps                 Cloud-competitive speed, entirely on-device
Voice pipelines                    Faster decode shrinks the gap between hearing and responding
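
To make the "compound latency savings" point concrete, here is a back-of-the-envelope sketch. The workload (5 sequential calls of 200 generated tokens each) is hypothetical, and TTFT is left at zero because it was only reported for Qwen3-0.6B. The throughput figures are the Qwen3-4B decode numbers from the table.

```python
def sequential_latency_s(
    calls: int,
    tokens_per_call: int,
    decode_tps: float,
    ttft_s: float = 0.0,
) -> float:
    """Total wall time for `calls` LLM calls that run one after another.

    Each call costs its time-to-first-token plus the time to decode
    `tokens_per_call` tokens at `decode_tps` tokens per second.
    """
    return calls * (ttft_s + tokens_per_call / decode_tps)


# Hypothetical agent workload: 5 sequential calls, 200 output tokens each.
CALLS, TOKENS_PER_CALL = 5, 200

metalrt_s = sequential_latency_s(CALLS, TOKENS_PER_CALL, decode_tps=186)   # Qwen3-4B, MetalRT
llama_cpp_s = sequential_latency_s(CALLS, TOKENS_PER_CALL, decode_tps=87)  # Qwen3-4B, llama.cpp

if __name__ == "__main__":
    print(f"MetalRT:   {metalrt_s:.2f} s")
    print(f"llama.cpp: {llama_cpp_s:.2f} s")
    print(f"saved:     {llama_cpp_s - metalrt_s:.2f} s")
```

Under these assumptions the chain takes roughly 5.4 s of decode time at 186 tok/s versus roughly 11.5 s at 87 tok/s, and the gap grows linearly with the number of calls.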

Summary

  • 658 tok/s peak decode (Qwen3-0.6B)
  • 1.67x faster than llama.cpp on decode, on average
  • Up to 1.19x faster than mlx-lm on decode (same model files)
  • 1.59x faster than Ollama on decode, on average
  • 6.6ms time-to-first-token (Qwen3-0.6B)

Output quality is identical across all engines. The model is the same. The speed is not.


Benchmarked on Apple M4 Max, 64 GB, macOS 26.3. Models: Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, LFM2.5-1.2B, all 4-bit. Greedy decoding, 5 runs, best reported. MetalRT + mlx-lm share identical MLX 4-bit model files. llama.cpp and Ollama use GGUF Q4_K_M (Q8_0 for Qwen3-0.6B). Ollama v0.17.4 via REST API, TTFT measured end-to-end.

RunAnywhere

Copyright © 2025 RunAnywhere, Inc.