Inference — Research

Two engines.
Every kernel by hand.

MetalRT owns the Apple GPU — every kernel written from scratch in Metal Shading Language. QHexRT runs LLM, VLM, STT, TTS, and embeddings 100% on Qualcomm Hexagon NPUs. One C++ runtime behind both, and every claim published with numbers.

metalrt benchmark

LLM decode · Qwen3-0.6B · M4 Max (tok/s)

Ollama · 85
Apple MLX · 220
llama.cpp · 290
MetalRT · 658

marginal inference cost on-device

Cloud inference costs $0.08–0.35 per minute for voice alone. Serving AI to 8 billion people through centralized GPU clusters is economically impossible.

<7ms

time-to-first-token (Qwen3-0.6B, M4 Max)

A round-trip to the cloud takes 300–400ms minimum. For real-time voice, vision, and autonomous systems, physics sets the floor — on-device removes it.

658

tok/s on a single MacBook

Small models now match the quality of models 250x their size. The bottleneck isn’t the model — it’s the runtime. That’s what we build.

Benchmarks · Apple M4 Max

LLM Decode (higher is better)
RunAnywhere: 658 tok/s
Apple MLX: 553 tok/s
llama.cpp: 394 tok/s

Time to First Token (lower is better)
RunAnywhere: 6.6ms
Apple MLX: 8ms
llama.cpp: 11ms

Speech-to-Text (lower is better)
RunAnywhere: 101ms
Apple MLX: 465ms

Speech-to-Speech (higher is better)
RunAnywhere: 123 tok/s
mlx-audio: 81 tok/s
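
The ratios implied by the figures above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers published in this section; the data layout and the `speedup` and `report` helpers are illustrative names chosen here, not part of any RunAnywhere API.

```python
# Benchmark figures copied from the section above (Apple M4 Max).
# Layout and helper names are assumptions made for illustration.
BENCHMARKS = [
    # (metric, higher_is_better, RunAnywhere value, {baseline: value})
    ("LLM decode, tok/s", True, 658.0, {"Apple MLX": 553.0, "llama.cpp": 394.0}),
    ("Time to first token, ms", False, 6.6, {"Apple MLX": 8.0, "llama.cpp": 11.0}),
    ("Speech-to-text, ms", False, 101.0, {"Apple MLX": 465.0}),
    ("Speech-to-speech, tok/s", True, 123.0, {"mlx-audio": 81.0}),
]


def speedup(ours, theirs, higher_is_better):
    """Return how many times better `ours` is than `theirs`.

    For throughput (higher is better) this is ours / theirs;
    for latency (lower is better) it is theirs / ours.
    """
    return ours / theirs if higher_is_better else theirs / ours


def report():
    """Build (metric, baseline, ratio) rows, ratios rounded to 2 decimals."""
    rows = []
    for metric, higher, ours, baselines in BENCHMARKS:
        for name, value in baselines.items():
            rows.append((metric, name, round(speedup(ours, value, higher), 2)))
    return rows


if __name__ == "__main__":
    for metric, name, ratio in report():
        print(f"{metric}: {ratio:.2f}x vs {name}")
```

The speech-to-speech ratio works out to about 1.52x (123 / 81), which agrees with the headline figure quoted against mlx-audio.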

How it works

Built from the metal up.

We write GPU kernels from scratch — hand-designed memory layouts, fused operators, and custom Metal shaders that bypass every generic abstraction layer. MetalRT achieves 658 tok/s LLM decode on Apple Silicon. Every kernel targets the specific hardware it runs on.

The runtime orchestrates quantized weights, KV cache, and graph execution on unified memory — one C++ engine behind every modality the SDK exposes.

metalrt-binaries on GitHub · Use it through the SDK

Inference Stack

Your App

iOS · macOS · Android

RunAnywhere.load("llama-3.2-1b")

SDK Layer

Swift · Kotlin · React Native · Flutter

Cross-platform bindings → C++ core

MetalRT Runtime

C++ Inference Engine · Quantized Weights · KV Cache

Orchestrates graph execution on unified memory

Custom .metal Kernels

Hand-written Metal Shading Language

We write every GPU kernel from scratch

qmv.metal · attention_decode.metal · rms_norm.metal · rope.metal · swiglu.metal · kv_cache.metal

Apple Silicon GPU

M1 · M2 · M3 · M4 · Unified Memory · 800 GB/s

simd_sum · threadgroup_barrier · [[buffer(0)]]

Output: 658 tok/s decode | 101ms STT
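
Two of the kernel file names in the stack above, rms_norm.metal and swiglu.metal, presumably correspond to the standard RMSNorm and SwiGLU transformer operations. For readers unfamiliar with them, here is a plain Python reference of the textbook definitions. It is a sketch of the math only: it says nothing about how MetalRT lays out memory or fuses these steps, and the function names and the epsilon default are assumptions made for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Textbook RMSNorm: divide x by its root mean square, then scale by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation, v * sigmoid(v), written to avoid overflow."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)  # v < 0, so exp(v) stays in (0, 1)
    return v * e / (1.0 + e)


def swiglu(gate, up):
    """Textbook SwiGLU gating: silu(gate) * up, element by element."""
    return [silu(g) * u for g, u in zip(gate, up)]


if __name__ == "__main__":
    hidden = [3.0, 4.0]
    normed = rms_norm(hidden, weight=[1.0, 1.0])
    print(normed)
    print(swiglu(gate=normed, up=[0.5, 2.0]))
```

A GPU kernel would typically compute the sum of squares with a parallel reduction across threads instead of a sequential loop, but the result it has to match is the one defined here.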

Publications

Every claim comes with numbers.

MetalRT · Speech-to-Speech · Mar 15, 2026
MetalRT Now Does Speech-to-Speech. 1.52x Faster Than mlx-audio.
123 tok/s S2S throughput

MetalRT · Vision · Mar 13, 2026
MetalRT Now Runs Vision Language Models. Fastest on Apple Silicon.
287 tok/s vision decode

MetalRT · Speech · Mar 9, 2026
The First Complete AI Inference Engine for Apple Silicon. Now with Speech.
101ms STT latency

MetalRT · LLM · Mar 3, 2026
We Built the Fastest LLM Decode Engine for Apple Silicon.
658 tok/s LLM decode

QHexRT

100% on the NPU. Every modality.

QHexRT · Launch · Jun 25, 2026
QHexRT Is Live: Full-Stack NPU Inference for Qualcomm
12,540 tok/s NPU prefill

QHexRT — our inference engine for Qualcomm NPUs

The first inference engine built to run LLM, VLM, STT, TTS, and embeddings 100% on Qualcomm Hexagon NPUs. First model: LFM 2.5 230M — 12,540 tok/s prefill, 36ms flat TTFT on v81.

Read the launch post · LFM 2.5 230M on Hugging Face

Inference services

We do this for hire.
Your silicon, your models, our kernels.

The same team that wrote MetalRT and QHexRT takes on custom inference work: engines for your target hardware, model bring-up and quantization, and latency budgets that generic runtimes can't hit.

Custom engines

for your target silicon

Model bring-up

quantization · conversion · eval

Latency engineering

budgets, hit and measured
