Inference Engines
Live: MetalRT by RunAnywhere
The fastest AI inference engine for Apple Silicon. Every GPU kernel hand-written in Metal Shading Language — LLM, speech, vision, and speech-to-speech in one C++ runtime.
$0
marginal inference cost on-device
Cloud inference costs $0.08–0.35 per minute for voice alone. Serving AI to 8 billion people through centralized GPU clusters is economically impossible.
<7ms
time-to-first-token (Qwen3-0.6B, M4 Max)
A round-trip to the cloud takes 300–400ms minimum. For real-time voice, vision, and autonomous systems, physics sets the floor — on-device removes it.
658
tok/s on a single MacBook
Small models now match the quality of models 250x their size. The bottleneck isn’t the model — it’s the runtime. That’s what we build.
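To make the three headline numbers comparable, the sketch below converts a decode rate into per-token time and counts how many tokens fit inside a network wait. The function names are hypothetical and not part of any RunAnywhere API. It also assumes the time-to-first-token and tok/s figures can be combined for a single run, which this page does not state.

```cpp
#include <cassert>

// Average time per generated token, in milliseconds, at a given decode rate.
double ms_per_token(double tokens_per_second) {
    assert(tokens_per_second > 0.0);
    return 1000.0 / tokens_per_second;
}

// Number of tokens a decoder at this rate could emit during a wait of wait_ms.
double tokens_during(double wait_ms, double tokens_per_second) {
    assert(wait_ms >= 0.0);
    return tokens_per_second * wait_ms / 1000.0;
}

// Time until the first n tokens are available: time-to-first-token plus
// steady-state decode for the remaining n - 1 tokens.
// Assumption: ttft_ms and tokens_per_second describe the same model and run.
double ms_to_n_tokens(int n, double ttft_ms, double tokens_per_second) {
    assert(n >= 1);
    return ttft_ms + (n - 1) * ms_per_token(tokens_per_second);
}
```

With the figures quoted above, `ms_per_token(658.0)` is about 1.5 ms, and `tokens_during(300.0, 658.0)` is about 197 tokens, which is the output an on-device decoder at that rate could produce in the time of one 300 ms cloud round trip.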
Benchmarks · Apple M4 Max
LLM Decode (higher is better)
Time to First Token (lower is better)
Speech-to-Text (lower is better)
Speech-to-Speech (higher is better)
How it works
Built from the metal up.
We write GPU kernels from scratch — hand-designed memory layouts, fused operators, and custom Metal shaders that bypass every generic abstraction layer. MetalRT achieves 658 tok/s LLM decode on Apple Silicon. Every kernel targets the specific hardware it runs on.
The runtime orchestrates quantized weights, KV cache, and graph execution on unified memory — one C++ engine behind every modality the SDK exposes.
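As an illustration of what a fused operator over quantized weights means, here is a minimal CPU sketch in C++. It is not MetalRT's code or weight format: the `QuantizedRow` layout (symmetric int8 values with one float scale per block of 32) and both function names are assumptions made for this example. The unfused version writes a dequantized copy of the row before the dot product; the fused version dequantizes and accumulates in a single pass with no intermediate buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: int8 weights with one float scale per block of kBlock values.
constexpr std::size_t kBlock = 32;

struct QuantizedRow {
    std::vector<std::int8_t> q;  // quantized weights; size is a multiple of kBlock
    std::vector<float> scale;    // one scale per block
};

// Unfused: materialize a dequantized float row, then take the dot product.
float dot_unfused(const QuantizedRow& w, const std::vector<float>& x) {
    assert(w.q.size() == x.size());
    assert(w.q.size() == w.scale.size() * kBlock);
    std::vector<float> dequant(w.q.size());  // extra memory traffic
    for (std::size_t i = 0; i < w.q.size(); ++i)
        dequant[i] = w.scale[i / kBlock] * static_cast<float>(w.q[i]);
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i)
        acc += dequant[i] * x[i];
    return acc;
}

// Fused: dequantize and accumulate in one pass. The scale is applied once
// per block, and no intermediate buffer is written.
float dot_fused(const QuantizedRow& w, const std::vector<float>& x) {
    assert(w.q.size() == x.size());
    assert(w.q.size() == w.scale.size() * kBlock);
    float acc = 0.0f;
    for (std::size_t b = 0; b < w.scale.size(); ++b) {
        float block_acc = 0.0f;
        for (std::size_t j = 0; j < kBlock; ++j) {
            const std::size_t i = b * kBlock + j;
            block_acc += static_cast<float>(w.q[i]) * x[i];
        }
        acc += w.scale[b] * block_acc;
    }
    return acc;
}
```

Both functions compute the same quantity up to floating-point rounding. The fused form reads each weight once and never stores the float copy, which is the kind of saving that matters when decode speed is limited by memory traffic.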
Inference Stack
iOS · macOS · Android
RunAnywhere.load("llama-3.2-1b")
Swift · Kotlin · React Native · Flutter
Cross-platform bindings → C++ core
C++ Inference Engine · Quantized Weights · KV Cache
Orchestrates graph execution on unified memory
Hand-written Metal Shading Language
We write every GPU kernel from scratch
M1 · M2 · M3 · M4 · Unified Memory · 800 GB/s
simd_sum · threadgroup_barrier · [[buffer(0)]]
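The two Metal primitives named at the bottom of the stack, `simd_sum` and `threadgroup_barrier`, are commonly combined into a two-stage reduction. The C++ sketch below emulates that pattern sequentially on the CPU to show its shape; it is not a Metal kernel and not MetalRT's implementation. The function names and the SIMD-group width of 32 are assumptions for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kSimdWidth = 32;  // assumed SIMD-group width

// Stage 1, standing in for simd_sum: add up the lanes of one SIMD-group.
float simd_group_sum(const std::vector<float>& data, std::size_t group) {
    const std::size_t begin = group * kSimdWidth;
    assert(begin <= data.size());
    const std::size_t end = std::min(begin + kSimdWidth, data.size());
    float acc = 0.0f;
    for (std::size_t i = begin; i < end; ++i)
        acc += data[i];
    return acc;
}

// Stage 2: each SIMD-group stores its partial sum in shared scratch space,
// then the partials are combined into the final result.
float threadgroup_sum(const std::vector<float>& data) {
    const std::size_t groups = (data.size() + kSimdWidth - 1) / kSimdWidth;
    std::vector<float> partials(groups);  // stands in for threadgroup memory
    for (std::size_t g = 0; g < groups; ++g)
        partials[g] = simd_group_sum(data, g);
    // On a GPU, a threadgroup_barrier belongs here so that every partial is
    // visible before it is read. This sequential emulation needs no barrier.
    float total = 0.0f;
    for (float p : partials)
        total += p;
    return total;
}
```

On real hardware the first stage runs across the lanes of a SIMD-group without touching memory, so only one value per SIMD-group goes through threadgroup memory.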
Next: HexagonRT — our inference engine for Qualcomm NPUs
The same kernel-level discipline as MetalRT, built for the NPU inside billions of Android and Windows devices. Benchmarks first, launch second.
Get launch updates via Inference Radar