
Backed by Y Combinator

Making every device an AI-native device.

We research and build inference engines from the metal up - custom kernels, operator fusion, unified memory optimization. For the hardware you already own.

MetalRT benchmark

LLM decode (tok/s) · Qwen3-0.6B · M4 Max

Ollama 85 · Apple MLX 220 · llama.cpp 290 · MetalRT 668

668

tok/s peak decode

3x

faster than Apple MLX

The problem

Most AI runs in the cloud. That won't scale.

We study inference at the hardware level. Here's what we've found.

Cost

Cloud inference costs $0.08–0.35 per minute for voice alone. At even $0.08 per minute, ten minutes of voice a day for 8 billion people is $6.4 billion a day. Serving AI at that scale through centralized GPU clusters is economically impossible. The compute has to move to the edge.

$0

marginal inference cost on-device

Latency

A round-trip to the cloud takes 300–400ms minimum. For real-time voice, vision, and autonomous systems, that’s too slow. Physics sets the network’s floor - moving inference on-device removes it.

<7ms

time-to-first-token (Qwen3-0.6B, M4 Max)

The models are ready

Small models now match the quality of models 250x their size. The bottleneck isn’t the model - it’s the runtime. That’s what we build.

668

tok/s on a single MacBook

What We Build

Engines. SDKs. Observability.

Three layers that take on-device AI from research to production.

Inference Engines

01

MetalRT

Custom kernel runtime for the hardware you already own. 668 tok/s LLM decode, 101ms speech-to-text, 287 tok/s vision. Every kernel hand-written from scratch.

Developer SDKs

02

Cross-Platform

Swift, Kotlin, React Native, Flutter. One API across iOS, Android, and edge. Ship on-device AI with a few lines of code - LLM, STT, TTS, vision, voice agents.
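
To show what “a few lines of code” means, here is a hedged Swift sketch. RunAnywhere.load(...) matches the call shown in the stack diagram further down; the module name and the streaming generate API are illustrative assumptions, not the published SDK surface.

import RunAnywhere  // hypothetical module name

// Illustrative sketch of on-device generation in Swift.
// RunAnywhere.load(...) matches the call in the stack diagram below;
// the async streaming API shown here is an assumption, not the shipped SDK.
let model = try await RunAnywhere.load("llama-3.2-1b")
for try await token in model.generate(prompt: "Summarize my meeting notes.") {
    print(token, terminator: "")  // tokens stream as they decode - no network round-trip
}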

Observability

03

Control Plane

Fleet dashboard, OTA model updates, policy-based routing, inference analytics. Manage thousands of devices without app store releases.

Our approach

We build from the metal up.

Custom GPU kernels, operator fusion, unified memory optimization. Our benchmarks speak for themselves: 668 tok/s LLM decode and 287 tok/s vision inference on a single MacBook.

We write GPU kernels from scratch - hand-designed memory layouts, fused operators, and custom Metal shaders that bypass every generic abstraction layer. Every kernel targets the specific hardware it runs on.
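
To make “from scratch” concrete, here is a minimal Swift sketch of dispatching a hand-written kernel through Apple’s Metal framework. The kernel name rms_norm is taken from the stack diagram below; its signature, buffer layout, and dispatch sizes are illustrative assumptions, not MetalRT internals.

import Metal

// Minimal sketch: launching a hand-written compute kernel from Swift.
// "rms_norm" is one of the .metal files named in the stack diagram;
// its buffer layout and sizes here are illustrative.
guard let device = MTLCreateSystemDefaultDevice(),
      let queue = device.makeCommandQueue(),
      let library = device.makeDefaultLibrary(),        // compiled .metal sources in the bundle
      let kernel = library.makeFunction(name: "rms_norm") else {
    fatalError("Metal unavailable")
}
let pipeline = try! device.makeComputePipelineState(function: kernel)

let n = 4096  // hidden dimension, illustrative
// .storageModeShared: one allocation visible to both CPU and GPU (unified memory)
let x = device.makeBuffer(length: n * MemoryLayout<Float>.stride, options: .storageModeShared)!
let y = device.makeBuffer(length: n * MemoryLayout<Float>.stride, options: .storageModeShared)!

let cmd = queue.makeCommandBuffer()!
let enc = cmd.makeComputeCommandEncoder()!
enc.setComputePipelineState(pipeline)
enc.setBuffer(x, offset: 0, index: 0)  // binds to [[buffer(0)]] in the shader
enc.setBuffer(y, offset: 0, index: 1)
// One 256-thread threadgroup; inside the kernel, the reduction would use
// simd_sum and threadgroup_barrier, the primitives named in the diagram.
enc.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                         threadsPerThreadgroup: MTLSize(width: 256, height: 1, depth: 1))
enc.endEncoding()
cmd.commit()
cmd.waitUntilCompleted()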

The shift from cloud to edge will be defined by whoever builds the best runtime. We publish our research openly, ship production SDKs across Swift, Kotlin, React Native, and Flutter, and make the engine available on GitHub.

Backed by Y Combinator, we are building the infrastructure layer for on-device AI at scale - starting with Apple Silicon, then Qualcomm, then Intel. We use the hardware people already own.

Inference Stack

Your App

iOS · macOS · Android

RunAnywhere.load("llama-3.2-1b")

SDK Layer

Swift · Kotlin · React Native · Flutter

Cross-platform bindings → C++ core

MetalRT Runtime

C++ Inference Engine · Quantized Weights · KV Cache

Orchestrates graph execution on unified memory

Custom .metal Kernels

Hand-written Metal Shading Language

We write every GPU kernel from scratch

qmv.metal · attention_decode.metal · rms_norm.metal · rope.metal · swiglu.metal · kv_cache.metal
Apple Silicon GPU

M1 · M2 · M3 · M4 · Unified Memory · 800 GB/s

simd_sum · threadgroup_barrier · [[buffer(0)]]

Output: 668 tok/s decode · 101ms STT

Team

“We left AWS and Intuit to write custom kernels by hand. Because the future of AI isn't in the cloud - it's on every device you already own.”

Founders

Sanchit Monga

Co-Founder & CEO

Built SDKs used by 50M+ users at Intuit. Leads product, go-to-market, and the vision for making every device AI-native.

Ex-Intuit · YC W26 · 50M+ SDK Users

Shubham Malhotra

Co-Founder & CTO

Formerly at AWS (EC2 Spot) and Microsoft (Azure Arc). Published ML researcher. Writes the custom kernels that power MetalRT.

Ex-AWS EC2 Spot · Ex-Microsoft Azure · Published ML Researcher

Read the research.

Try the engine.

RunAnywhere

On-device AI inference research and infrastructure. Building the fastest engines for the hardware you already own.

© 2026 RunAnywhere, Inc.
