Blog

Engineering notes from every layer of the on-device stack

Categories: All, QHexRT (1), MetalRT (4), SDKs (2), Agents (2), Voice (2)

June 25, 2026

QHexRT Is Live: Full-Stack NPU Inference for Qualcomm Hexagon

QHexRT is officially live — the first inference engine built to run LLM, VLM, STT, TTS, and embeddings 100% on Qualcomm Hexagon NPUs. First model: LFM 2.5 230M at 12,540 tok/s prefill and 36ms flat TTFT on v81.

MetalRT

March 15, 2026

MetalRT Now Does Speech-to-Speech. 1.52x Faster Than mlx-audio.

MetalRT adds native speech-to-speech support. 1.68s end-to-end latency, 123 tok/s generation throughput, 1.52x faster than mlx-audio on a single M4 Max.

MetalRT

March 13, 2026

MetalRT Now Runs Vision Language Models. Fastest on Apple Silicon.

MetalRT adds VLM support and wins every decode benchmark. 279 tok/s vision decode, 92ms time-to-output, 1.22x faster than mlx-vlm across all resolutions on a single M4 Max.

SDKs

March 13, 2026

How RunAnywhere SDK Powers On-Device AI Coaching in PickleRite

A deep-dive into how PickleRite — a pickleball performance tracker — runs a specialized LLM entirely on-device using RunAnywhere SDK. Zero cloud costs, full offline support, complete privacy.

MetalRT

March 9, 2026

MetalRT: The First Complete AI Inference Engine for Apple Silicon. Now with Speech.

MetalRT becomes the first inference engine to handle LLMs, Speech-to-Text, and Text-to-Speech on Apple Silicon. 101ms to transcribe 70 seconds of audio. 178ms to synthesize speech. 4.6x faster than Apple MLX.

MetalRT

March 3, 2026

We Built the Fastest LLM Decode Engine for Apple Silicon. Here Are the Numbers.

MetalRT delivers 658 tok/s decode and 6.6ms time-to-first-token, winning decode on 3 of 4 models we tested on a single M4 Max.

Voice

February 24, 2026

FastVoice RAG: Sub-200ms Voice AI with Retrieval-Augmented Generation, Entirely On-Device

We added hybrid retrieval (BM25 + vector search) to our on-device voice pipeline. Retrieval adds less than 4ms. The real cost is LLM prefill — but word-level flushing absorbs it. Sub-200ms first-audio on 5,016 chunks with zero cloud dependencies.

Voice

February 22, 2026

FastVoice: 63ms First-Audio Latency for On-Device Voice AI on Apple Silicon

FastVoice achieves 63ms first-audio latency — well under the 200ms perceptual threshold — by composing STT, LLM, and TTS into a single C++ pipeline on Apple Silicon. No cloud. No network. Just speed.

Agents

February 21, 2026

I Built a Fully Offline AI Agent on Android. It Listens, Thinks, Acts, and Speaks Back.

No server. No API key. No internet. Just a phone doing things on its own.

SDKs

February 19, 2026

I Tried Running an LLM on a $150 Android Phone. Here's What Actually Happened.

And the rabbit hole that taught me more about Android internals than 3 years of app development.

Agents

February 9, 2026

On-Device Browser Agent: AI Web Automation Without the Cloud

Automate web tasks with natural language using a Chrome extension powered by on-device AI. No API keys, no data leaving your browser, complete privacy.