June 25, 2026

QHexRT Is Live: Full-Stack NPU Inference for Qualcomm Hexagon

We shipped MetalRT for Apple Silicon — the first engine to run LLM, speech, vision, and speech-to-speech in one runtime, entirely on the GPU. Today we're launching QHexRT for Qualcomm: the same bet on the NPU.

QHexRT runs inference 100% on the Hexagon NPU. No Python in the hot path. No CPU fallback during inference. We are building the widest model catalog on any Qualcomm NPU stack, with same-day support for the models the community ships.

No one has shipped a single runtime that covers LLM, VLM, STT, TTS, and embeddings fully on Qualcomm NPUs. MetalRT did it for Apple Silicon. QHexRT does it for Hexagon.

First model: LFM 2.5 230M

LiquidAI released LFM 2.5 230M on June 25, 2026. QHexRT supports it on day one — our first catalog entry. The NPU bundle is on Hugging Face: runanywhere/lfm2_5_230m_HNPU.

Every tensor in the inference path stays on the HTP: decode graph, prefill graph, lm-head, embeddings. Greedy output matches the source model: the prompt "The capital of France is" continues with " Paris."

All modalities, one NPU runtime

| Modality | Status |
|---|---|
| LLM | Live: LFM 2.5 230M on Hexagon v81 |
| VLM | In development |
| STT | In development |
| TTS | In development |
| Embeddings | In development |

One engine, one deployment model, every modality on the NPU — the same architecture MetalRT proved on Apple Silicon, now on Qualcomm.

The headline numbers (LFM 2.5 230M · Hexagon v81)

Benchmarked on Hexagon v81 (Snapdragon 8 Elite Gen-2, SM8850). We also ran llama.cpp on the CPU of the same chip, same die, same phone. Both stacks produce identical output. The only variable is speed.

  • Prefill: 12,540 tok/s on the NPU vs 871 on the CPU (Q8_0). 14.4x faster.
  • Time-to-first-token: 36 ms flat for any prompt up to 512 tokens. The CPU takes 588 ms at 512 tokens.
  • Decode: the CPU wins at 250 tok/s vs 172 on the NPU. At 230M params, decode is memory-bandwidth-bound, and the Oryon CPU pulls more usable bandwidth from DDR on this weight set.
  • End-to-end: the NPU wins once the prompt exceeds ~1.5x the generation length.
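
The end-to-end break-even in the last bullet can be sanity-checked with a back-of-envelope model. The sketch below is not part of QHexRT. It assumes CPU prefill scales linearly at the reported 871 tok/s, that the NPU's 36 ms TTFT stays flat up to 512 tokens, and that decode runs at the reported steady-state rates. The function names are illustrative.

```python
# Back-of-envelope latency model built only from the numbers reported above.
# Assumptions: linear CPU prefill, flat NPU TTFT, constant decode rates.
NPU_TTFT_MS = 36.0           # flat for prompts up to 512 tokens
NPU_DECODE_MS = 1000 / 172   # about 5.8 ms per generated token
CPU_PREFILL_TOK_S = 871.0    # llama.cpp Q8_0
CPU_DECODE_MS = 1000 / 250   # 4.0 ms per generated token
MAX_PROMPT = 512


def npu_total_ms(prompt_tokens: int, gen_tokens: int) -> float:
    """Estimated end-to-end latency on the NPU."""
    if prompt_tokens > MAX_PROMPT:
        raise ValueError("model only covers prompts of up to 512 tokens")
    return NPU_TTFT_MS + gen_tokens * NPU_DECODE_MS


def cpu_total_ms(prompt_tokens: int, gen_tokens: int) -> float:
    """Estimated end-to-end latency on the CPU (Q8_0)."""
    prefill_ms = prompt_tokens / CPU_PREFILL_TOK_S * 1000
    return prefill_ms + gen_tokens * CPU_DECODE_MS


def break_even_prompt(gen_tokens: int) -> float:
    """Prompt length above which the NPU estimate beats the CPU estimate."""
    extra_decode_ms = gen_tokens * (NPU_DECODE_MS - CPU_DECODE_MS)
    return (NPU_TTFT_MS + extra_decode_ms) * CPU_PREFILL_TOK_S / 1000


if __name__ == "__main__":
    for gen in (32, 64, 128, 256):
        print(gen, round(break_even_prompt(gen)))
```

Under these assumptions the break-even prompt length grows by about 1.6 tokens per generated token, on top of a fixed offset of roughly 31 tokens that comes from the flat 36 ms TTFT. That is in line with the ~1.5x rule of thumb.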

QHexRT's W8 graph matched the Hugging Face fp32 oracle at a logits cosine similarity of 1.000000. Equal quality, NPU-only inference.
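
For readers unfamiliar with the metric, a logits cosine compares the direction of two logit vectors for the same input position. The sketch below is a minimal illustration and not the validation harness used for these results. The toy logit values are made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length logit vectors."""
    if len(a) != len(b):
        raise ValueError("logit vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def greedy_token(logits: list[float]) -> int:
    """Index of the largest logit, i.e. the greedy decoding choice."""
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical toy vectors standing in for reference and quantized logits.
fp32_logits = [2.0, -1.0, 0.5]
w8_logits = [1.99, -1.01, 0.51]

if __name__ == "__main__":
    print(f"{cosine_similarity(fp32_logits, w8_logits):.6f}")
    print(greedy_token(fp32_logits) == greedy_token(w8_logits))
```

A cosine near 1 means the quantized graph ranks tokens almost exactly as the reference does, which is why it pairs naturally with a greedy-output comparison.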

Setup

| Item | Value |
|---|---|
| Device | Qualcomm SM8850 (Snapdragon 8 Elite Gen-2 class) |
| NPU | Hexagon v81 (HTP) |
| CPU | 8-core Oryon (up to ~4.4 GHz), comparison baseline only |
| Model | LiquidAI/LFM 2.5 230M (229.69M params, hybrid decoder) |
| NPU bundle | runanywhere/lfm2_5_230m_HNPU (v81 context binaries, W8) |
| NPU stack | QHexRT, W8 weight-only (int8 weights, fp16 activations) |
| CPU stack | llama.cpp b1-beac530, Q8_0 + Q4_K_M (comparison only) |
| Fairness | HTP clock pinned to TURBO; CPU governor pinned to performance, all 8 cores |

Run it

Download the v81 NPU bundle from runanywhere/lfm2_5_230m_HNPU. It includes the QNN context binaries (decode, prefill, lm-head), embeddings, tokenizer, and manifest.

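```bash
# Fetch the NPU bundle from Hugging Face
hf download runanywhere/lfm2_5_230m_HNPU --local-dir lfm2_5_230m_HNPU
# Copy the v81 binaries to the device
adb push lfm2_5_230m_HNPU/v81 /data/local/tmp/lfm230
# Run generation on-device; shared libraries resolve from the current directory
adb shell "cd /data/local/tmp/lfm230 && LD_LIBRARY_PATH=. \
  ./qhx_generate lfm2-5-230m.json libQnnHtp.so libQnnSystem.so . 64 'The capital of France is'"
```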

Before running, stage the QAIRT v81 runtime libraries (libQnnHtp.so, libQnnSystem.so, libQnnHtpV81Skel.so, libQnnHtpV81Stub.so) and the qhx_generate tool in the same directory. The context binaries are pinned to Hexagon v81. Contact us for deployment access.

Throughput

| Engine | Prefill (tok/s) | Decode (tok/s) | Decode (ms/tok) | Peak RAM (MB) |
|---|---|---|---|---|
| QHexRT NPU (v81, W8) | 12,540 | 172 | 5.8 | 445 |
| llama.cpp CPU Q8_0 | 871 | 250 | 4.0 | 299 |
| llama.cpp CPU Q4_K_M | 680 | 264 | 3.83 | 209 |

Prefill is where the NPU wins by the widest margin. QHexRT runs it as one batched forward over a padded 512-token window, so the cost stays constant regardless of prompt length.
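
The fixed-window idea can be illustrated with a small padding helper. This is a sketch of the general technique, not QHexRT's actual input layout. The pad token id, right-side padding, and the separate validity mask are all assumptions made for the example.

```python
WINDOW = 512   # prefill window size stated above
PAD_ID = 0     # hypothetical pad token id


def pad_to_window(token_ids: list[int], window: int = WINDOW,
                  pad_id: int = PAD_ID) -> tuple[list[int], list[int]]:
    """Right-pad a prompt to a fixed window and return (ids, mask).

    The mask holds 1 for real tokens and 0 for padding, so a fixed-shape
    graph can ignore the padded positions.
    """
    n = len(token_ids)
    if n == 0:
        raise ValueError("prompt must contain at least one token")
    if n > window:
        raise ValueError(f"prompt has {n} tokens, window holds {window}")
    padding = window - n
    return list(token_ids) + [pad_id] * padding, [1] * n + [0] * padding


if __name__ == "__main__":
    ids, mask = pad_to_window([11, 42, 7])
    print(len(ids), sum(mask))
```

Because every prompt is presented to the graph with the same shape, a 16-token prompt and a 512-token prompt cost the same forward pass, which matches the flat prefill cost described above.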

Figure 1 — Prefill throughput (log scale). QHexRT on Hexagon v81 hits 12,540 tok/s prefill on LFM 2.5 230M. llama.cpp on the same die's Oryon CPU reaches 871 tok/s (Q8_0) and 680 tok/s (Q4_K_M). That is 14.4x and 18.4x faster prefill on the NPU.

Decode on this 230M model is DDR-bandwidth-bound. Larger models shift toward compute-bound decode, where the NPU's HMX units have the advantage.
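
A rough way to see the bandwidth argument is to multiply the bytes of weights read per token by the decode rate. The sketch below assumes every weight is read once per generated token and ignores activations, KV-cache traffic, and quantization scales on the NPU side, so treat the results as coarse estimates. The 8.5 bits per weight for Q8_0 reflects llama.cpp's block layout (32 int8 values plus one fp16 scale).

```python
PARAMS = 229.69e6  # parameter count from the setup table


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes of weights touched per token if each weight is read once."""
    return params * bits_per_weight / 8


def implied_bandwidth_gb_s(params: float, bits_per_weight: float,
                           decode_tok_s: float) -> float:
    """Effective weight-read bandwidth implied by a decode rate, in GB/s."""
    return weight_bytes(params, bits_per_weight) * decode_tok_s / 1e9


if __name__ == "__main__":
    npu = implied_bandwidth_gb_s(PARAMS, 8.0, 172)   # W8 on the NPU
    cpu = implied_bandwidth_gb_s(PARAMS, 8.5, 250)   # Q8_0 on the CPU
    print(f"NPU W8:   {npu:.1f} GB/s")
    print(f"CPU Q8_0: {cpu:.1f} GB/s")
```

With these assumptions the NPU figure works out to roughly 39.5 GB/s and the CPU Q8_0 figure to roughly 61 GB/s, consistent with the CPU pulling more usable bandwidth on this weight set.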

Time to first token

| Prompt (tokens) | NPU (ms) | CPU Q8_0 (ms) | CPU Q4_K_M (ms) |
|---|---|---|---|
| 16 | 36 | 17 | 26 |
| 128 | 36 | 132 | 188 |
| 512 | 36 | 588 | 753 |

Figure 2 — Time to first token vs prompt length (LFM 2.5 230M). The NPU holds a flat ~36 ms TTFT for any prompt up to 512 tokens. At 512 tokens the NPU delivers its first token 16x sooner than Q8_0 (36 ms vs 588 ms).

The model catalog

LFM 2.5 230M on Hexagon v81 is the first entry. We're adding models as fast as the community ships them — Qwen3, Gemma 3, Phi-4-mini, and other sub-4B models that fit the NPU memory budget, each with the same 100% NPU path.

Next on the roadmap:

  • VLM, STT, TTS, embeddings on Hexagon — completing the full multimodal stack on the NPU.
  • More LLMs, extending the Qualcomm NPU catalog.
  • W4 quantization — roughly halving per-token bytes, projecting ~300–340 tok/s decode on this model class.
  • Power metering — tok/s-per-watt vs CPU baseline.
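
One way to reason about the W4 projection is a simple bandwidth-bound scaling model. This is an illustrative sketch and not necessarily how the projection above was derived. The `weight_fraction` parameter (the share of per-token decode time spent reading weights) is a hypothetical knob introduced for the example.

```python
def projected_decode_tok_s(current_tok_s: float, weight_fraction: float,
                           byte_ratio: float = 0.5) -> float:
    """Project decode speed after shrinking weight bytes by `byte_ratio`.

    Only the weight-read share of per-token time scales with the byte
    count; the remaining share is assumed to stay fixed.
    """
    if not 0.0 <= weight_fraction <= 1.0:
        raise ValueError("weight_fraction must be between 0 and 1")
    current_ms = 1000 / current_tok_s
    new_ms = current_ms * (weight_fraction * byte_ratio + (1 - weight_fraction))
    return 1000 / new_ms


if __name__ == "__main__":
    for frac in (1.0, 0.95, 0.85):
        print(frac, round(projected_decode_tok_s(172, frac)))
```

Starting from the measured 172 tok/s, halving the weight bytes gives 344 tok/s if decode time is entirely weight reads, and about 299 tok/s if 15% of the time is fixed overhead.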

Summary

  • QHexRT is live — full-stack NPU inference for Qualcomm Hexagon
  • 100% on NPU — no Python, no CPU fallback during inference
  • First model: LFM 2.5 230M (runanywhere/lfm2_5_230m_HNPU)
  • 12,540 tok/s prefill · 36 ms flat TTFT · identical greedy output vs fp32 oracle
  • All modalities — LLM live; VLM, STT, TTS, embeddings in development
  • Widest model catalog for Qualcomm NPUs, expanding continuously

Contact us for QHexRT deployment access. Download the first model from Hugging Face.


Benchmarked on Qualcomm SM8850 / Hexagon v81. Model: LiquidAI/LFM 2.5 230M. NPU: QHexRT W8 weight-only, QAIRT 2.47.0.260601. CPU comparison: llama.cpp b1-beac530, Q8_0 + Q4_K_M, 8 threads.

RunAnywhere

© 2026 RunAnywhere, Inc.