Android Use Agent: Voice-Controlled Phone Automation with On-Device AI
What if you could tell your phone "play Epstein Files on YouTube" and watch it do the rest?
We built an Android Use Agent—a native Android app that turns spoken commands into autonomous phone actions. Speak your goal, and an AI agent taps, types, and swipes through your device to get it done.
The twist: your voice never leaves your phone. Speech recognition runs entirely on-device using the RunAnywhere SDK. Only the action planning hits the cloud.
What It Does
Tap the microphone. Say what you want. The agent handles the rest:
- "Open YouTube and search for lo-fi music" — Opens the app, types the query, hits search
- "Set a timer for 2 minutes" — Navigates to Clock, enters the digits, starts the timer
- "Open WhatsApp and message Mom" — Launches the app, finds the contact, opens the chat
No scripting. No Tasker workflows. No accessibility macros to configure. Just natural language in, autonomous actions out.
How It Works
The agent uses a hybrid architecture—local speech recognition paired with cloud-based reasoning:
```
You: [Tap mic] "Search for cheap flights to Tokyo on Google"

1. On-Device STT (Whisper Tiny)
   → Transcribes your voice locally: "Search for cheap flights to Tokyo on Google"

2. GPT-4o Planner (Cloud)
   → Creates a step-by-step plan:
      1. Open Chrome
      2. Tap the search bar
      3. Type "cheap flights to Tokyo"
      4. Press Enter

3. GPT-4o Navigator (Cloud)
   → Reads the screen elements at each step
   → Decides the next action: tap, type, swipe, enter, done

4. Accessibility Service (On-Device)
   → Executes each action on the actual UI
   → Reports results back to the navigator

Result: Google search results for "cheap flights to Tokyo" on screen
```
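In code, that flow reduces to a short control loop. Here's a minimal sketch in Kotlin, using hypothetical SpeechToText, Planner, Navigator, and ActionExecutor interfaces as stand-ins rather than the project's actual class names:

```kotlin
// Minimal sketch of the agent loop. All interface and type names here are
// illustrative assumptions, not the project's actual API.
data class ScreenElement(val id: Int, val label: String, val type: String) // simplified
data class UiAction(val type: String, val elementId: Int? = null, val text: String? = null)

interface SpeechToText { suspend fun transcribe(): String }          // on-device Whisper Tiny
interface Planner { suspend fun plan(goal: String): List<String> }   // GPT-4o: one-shot plan
interface Navigator {                                                 // GPT-4o: step-by-step
    suspend fun nextAction(goal: String, plan: List<String>, screen: List<ScreenElement>): UiAction
}
interface ActionExecutor {                                            // Accessibility Service
    fun currentScreen(): List<ScreenElement>
    suspend fun execute(action: UiAction)
}

suspend fun runAgent(stt: SpeechToText, planner: Planner, nav: Navigator, exec: ActionExecutor) {
    val goal = stt.transcribe()                          // 1. local transcription
    val plan = planner.plan(goal)                        // 2. cloud plan, kept as context
    while (true) {
        val screen = exec.currentScreen()                // 3. snapshot of on-screen elements
        val action = nav.nextAction(goal, plan, screen)  //    cloud picks the next action
        if (action.type == "done") break
        exec.execute(action)                             // 4. on-device execution
    }
}
```

A production loop would also cap the number of steps and watch for repeated actions, which the app does in basic form.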
Three layers work together:
- RunAnywhere SDK + Whisper Tiny (On-Device) — Captures audio at 16kHz and transcribes speech to text entirely on the phone. The model is 75MB, downloads once, and runs offline after that.
- GPT-4o (Cloud) — Receives the transcribed goal plus a snapshot of on-screen UI elements. Generates a plan, then produces one action at a time (tap element 5, type "hello", swipe up, press enter) with reasoning for each decision.
- Android Accessibility Service (On-Device) — The muscle. Dispatches taps at coordinates, inputs text into fields, presses Enter via IME actions, performs swipe gestures—all by interacting with the real Android UI tree (a minimal sketch follows this list).
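For concreteness, here's a hedged sketch of how the execution layer might dispatch a tap and type text using the standard AccessibilityService APIs. The project's actual service is more elaborate (gesture timing, retries, IME Enter handling), but the primitives look roughly like this:

```kotlin
import android.accessibilityservice.AccessibilityService
import android.accessibilityservice.GestureDescription
import android.graphics.Path
import android.os.Bundle
import android.view.accessibility.AccessibilityEvent
import android.view.accessibility.AccessibilityNodeInfo

// Sketch of an execution layer built on AccessibilityService.
class AgentAccessibilityService : AccessibilityService() {

    // Tap at absolute screen coordinates via gesture dispatch.
    fun tap(x: Float, y: Float) {
        val path = Path().apply { moveTo(x, y) }
        val gesture = GestureDescription.Builder()
            .addStroke(GestureDescription.StrokeDescription(path, 0, 50))
            .build()
        dispatchGesture(gesture, null, null)
    }

    // Set text on an editable node.
    fun typeText(node: AccessibilityNodeInfo, text: String) {
        val args = Bundle().apply {
            putCharSequence(
                AccessibilityNodeInfo.ACTION_ARGUMENT_SET_TEXT_CHARSEQUENCE, text
            )
        }
        node.performAction(AccessibilityNodeInfo.ACTION_SET_TEXT, args)
    }

    // System navigation, e.g. Back.
    fun pressBack() {
        performGlobalAction(GLOBAL_ACTION_BACK)
    }

    override fun onAccessibilityEvent(event: AccessibilityEvent?) { /* screen updates */ }
    override fun onInterrupt() {}
}
```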
The agent also has a tool calling system. When GPT-4o needs factual information (current time, a calculation, device info), it calls a built-in tool instead of wasting steps navigating the UI to find the answer.
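A tool registry can be as simple as a map from tool name to a local function. This is a hypothetical sketch, not the app's actual tool-calling API:

```kotlin
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical built-in tool registry: tool name -> local function.
// The real app's tool-calling layer may be structured differently.
object BuiltInTools {
    private val tools: Map<String, (String) -> String> = mapOf(
        "current_time" to { _ ->
            LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm"))
        },
        "calculate" to { expr ->
            // Toy evaluator: handles "a + b" style input only.
            val (a, op, b) = expr.split(" ")
            when (op) {
                "+" -> (a.toDouble() + b.toDouble()).toString()
                "-" -> (a.toDouble() - b.toDouble()).toString()
                "*" -> (a.toDouble() * b.toDouble()).toString()
                "/" -> (a.toDouble() / b.toDouble()).toString()
                else -> "unsupported operator"
            }
        }
    )

    // Called when the model emits a tool request instead of a UI action.
    fun call(name: String, argument: String): String =
        tools[name]?.invoke(argument) ?: "unknown tool: $name"
}
```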
Why Hybrid?
| Fully Cloud-Based | Hybrid (Local STT + Cloud Agent) |
|---|---|
| Voice audio sent to servers | Voice stays on your phone |
| Requires constant internet | STT works offline, agent needs network |
| Higher latency for speech | Near-instant local transcription |
| Single point of failure | STT still works if cloud is slow |
| Pay for speech API + reasoning API | Pay only for reasoning API |
The key insight: voice data is deeply personal. Every command you speak reveals intent, context, and habits. By keeping speech recognition on-device, the cloud only ever sees the transcribed text—never your voice.
Meanwhile, action planning genuinely benefits from cloud intelligence. GPT-4o can reason about complex multi-step UI navigation, recover from errors, and adapt to unexpected screen states in ways that small on-device models can't match yet.
The hybrid approach gives you the best of both: private input, powerful reasoning.
The Tech Stack
| Component | Technology | Runs Where |
|---|---|---|
| Speech-to-Text | Whisper Tiny (ONNX) | On-device |
| STT Runtime | RunAnywhere SDK + Sherpa-ONNX | On-device |
| Action Planning | GPT-4o | Cloud |
| Action Execution | Android Accessibility Service | On-device |
| Tool Calling | Built-in tools (time, math) | On-device |
| UI Framework | Jetpack Compose | On-device |
| Fallback LLM | Qwen 2.5 1.5B (LlamaCPP) | On-device |
The app also registers on-device LLMs (SmolLM2 360M, Qwen 2.5 1.5B, LFM 2.5 1.2B) as fallbacks. If GPT-4o is unavailable, the agent switches to local inference automatically—fully offline operation at the cost of some reasoning quality.
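Conceptually, the fallback is a try-cloud-then-local wrapper. A minimal sketch, assuming a generic ActionModel interface rather than the SDK's real types:

```kotlin
// Hypothetical wrapper: prefer the cloud model, fall back to a local one.
// Interface and class names are illustrative, not the SDK's actual API.
interface ActionModel {
    suspend fun nextAction(prompt: String): String
}

class FallbackModel(
    private val cloud: ActionModel,   // GPT-4o
    private val local: ActionModel    // e.g. Qwen 2.5 1.5B via LlamaCPP
) : ActionModel {
    override suspend fun nextAction(prompt: String): String =
        try {
            cloud.nextAction(prompt)
        } catch (e: Exception) {
            // Network failure or API error: degrade to on-device inference.
            local.nextAction(prompt)
        }
}
```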
Try It
```
git clone -b gpt-client-integration https://github.com/anantteotia/runanywhere-rn-agent-demo.git
cd runanywhere-rn-agent-demo
```
Add your OpenAI API key to local.properties:
```
GPT52_API_KEY=sk-your-key-here
```
Build and install:
```
./gradlew assembleDebug
adb install app/build/outputs/apk/debug/app-debug.apk
```
Requirements: Android 8.0+ (API 26), arm64 device. After installing, enable the Accessibility Service in Settings and grant microphone permission on first use.
The Power of Accessibility Services
Here's something most Android developers overlook: the Accessibility Service API is one of the most powerful interfaces on the platform.
Originally designed to help users with disabilities interact with their phones, it provides complete read/write access to the UI tree of any app. An accessibility service can:
- Read every text label, button, and interactive element on screen
- Tap at specific coordinates via gesture dispatch
- Input text into any editable field
- Perform swipe gestures in any direction
- Press system buttons (Back, Home, Recents)
- Take screenshots programmatically
This is exactly what an AI agent needs. Instead of relying on screenshot analysis (slow, imprecise, computationally expensive), the accessibility service gives the agent a structured representation of the screen—element labels, positions, capabilities—that an LLM can reason about directly.
The screen parser captures up to 30 interactive elements per frame, each with its label, type, position, and capabilities (tappable, editable, checkable). This compact representation fits easily in a prompt, letting GPT-4o make precise decisions about which element to interact with and how.
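A minimal version of that parser walks the accessibility tree depth-first and flattens it into a capped list. The real parser tracks more attributes, but the shape is roughly this (a fuller version of the ScreenElement type sketched earlier):

```kotlin
import android.graphics.Rect
import android.view.accessibility.AccessibilityNodeInfo

// Compact screen representation handed to the model.
data class ScreenElement(
    val id: Int,
    val label: String,
    val className: String,
    val bounds: Rect,
    val clickable: Boolean,
    val editable: Boolean,
    val checkable: Boolean
)

// Sketch of a screen parser: depth-first walk of the UI tree,
// keeping at most `limit` interactive elements.
fun parseScreen(root: AccessibilityNodeInfo, limit: Int = 30): List<ScreenElement> {
    val elements = mutableListOf<ScreenElement>()

    fun visit(node: AccessibilityNodeInfo?) {
        if (node == null || elements.size >= limit) return
        if (node.isClickable || node.isEditable || node.isCheckable) {
            val bounds = Rect().also { node.getBoundsInScreen(it) }
            val label = node.text?.toString()
                ?: node.contentDescription?.toString()
                ?: ""
            elements += ScreenElement(
                id = elements.size,
                label = label,
                className = node.className?.toString() ?: "",
                bounds = bounds,
                clickable = node.isClickable,
                editable = node.isEditable,
                checkable = node.isCheckable
            )
        }
        for (i in 0 until node.childCount) visit(node.getChild(i))
    }

    visit(root)
    return elements
}
```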
Where We're Headed
This project is a working prototype, but it opens up several exciting directions:
Multi-Modal Understanding
Currently the agent reads the UI tree as text. Adding screenshot analysis would let it understand visual context—colors, layouts, images—that the accessibility tree doesn't capture. Hybrid text + vision could handle complex UIs that confuse text-only parsing.
Cross-App Workflows
Real tasks span multiple apps. "Book a restaurant from my email confirmation" means reading Gmail, opening Maps, checking reviews, making a reservation. Multi-app orchestration with persistent context across switches is the next frontier.
Fully On-Device Agent
As on-device models improve, the cloud dependency for action planning can shrink. A fine-tuned 3B model specifically trained on Android UI navigation could replace GPT-4o for common tasks—making the entire pipeline private and offline.
Learning From Failure
The agent currently has basic loop detection and recovery. But imagine an agent that learns from its mistakes—remembering that a particular app's search button is in an unusual place, or that a specific workflow requires an extra confirmation step.
Cross-Platform
The same RunAnywhere SDK runs on iOS, Flutter, React Native, and Kotlin Multiplatform. The agent architecture could extend to iOS with VoiceOver APIs, or to desktop with platform accessibility frameworks.
The Bigger Picture
We're entering an era where AI doesn't just answer questions—it takes action. Browser agents automate the web. Phone agents automate your device. The common thread: AI that operates where your data lives, not in a distant data center.
Android's accessibility services provide the perfect foundation. They're built into every device, work with every app, and expose the UI in a format that LLMs understand natively. Combined with on-device speech recognition, you get a private, natural interface to an autonomous agent.
The cloud still matters—GPT-4o's reasoning capabilities are genuinely hard to replicate locally today. But the trend is clear: more processing moves to the edge, more data stays private, and the line between "assistant" and "agent" keeps blurring.
We built this to explore what's possible when you combine on-device AI with Android's built-in automation capabilities. The answer: your phone can do a lot more than you're asking it to.
This is a working prototype demonstrating voice-controlled Android automation with on-device speech recognition and cloud-powered reasoning.
Check out the source code: github.com/anantteotia/runanywhere-rn-agent-demo