Android Use Agent: Voice-Controlled Phone Automation with On-Device AI
What if you could tell your phone "play Epstein Files on YouTube" and watch it do the rest?
We built an Android Use Agent—a native Android app that turns spoken commands into autonomous phone actions. Speak your goal, and an AI agent taps, types, and swipes through your device to get it done.
The twist: your voice never leaves your phone. Speech recognition runs entirely on-device using the RunAnywhere SDK. Only the action planning hits the cloud.
What It Does
Tap the microphone. Say what you want. The agent handles the rest:
- "Open YouTube and search for lo-fi music" — Opens the app, types the query, hits search
- "Set a timer for 2 minutes" — Navigates to Clock, enters the digits, starts the timer
- "Open WhatsApp and message Mom" — Launches the app, finds the contact, opens the chat
No scripting. No Tasker workflows. No accessibility macros to configure. Just natural language in, autonomous actions out.
How It Works
The agent uses a hybrid architecture—local speech recognition paired with cloud-based reasoning:
```
You: [Tap mic] "Search for cheap flights to Tokyo on Google"

1. On-Device STT (Whisper Tiny)
   → Transcribes your voice locally: "Search for cheap flights to Tokyo on Google"

2. GPT-4o Planner (Cloud)
   → Creates a step-by-step plan:
      1. Open Chrome
      2. Tap the search bar
      3. Type "cheap flights to Tokyo"
      4. Press Enter

3. GPT-4o Navigator (Cloud)
   → Reads the screen elements at each step
   → Decides the next action: tap, type, swipe, enter, done

4. Accessibility Service (On-Device)
   → Executes each action on the actual UI
   → Reports results back to the navigator

Result: Google search results for "cheap flights to Tokyo" on screen
```
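In code, that flow reduces to a short control loop. Here's a minimal sketch in Kotlin, using hypothetical SpeechToText, Planner, Navigator, and ActionExecutor interfaces as stand-ins rather than the project's actual class names:

```kotlin
// Minimal sketch of the agent loop. All interface and type names here are
// illustrative assumptions, not the project's actual API.
data class ScreenElement(val id: Int, val label: String, val type: String) // simplified
data class UiAction(val type: String, val elementId: Int? = null, val text: String? = null)

interface SpeechToText { suspend fun transcribe(): String }          // on-device Whisper Tiny
interface Planner { suspend fun plan(goal: String): List<String> }   // GPT-4o: one-shot plan
interface Navigator {                                                 // GPT-4o: step-by-step
    suspend fun nextAction(goal: String, plan: List<String>, screen: List<ScreenElement>): UiAction
}
interface ActionExecutor {                                            // Accessibility Service
    fun currentScreen(): List<ScreenElement>
    suspend fun execute(action: UiAction)
}

suspend fun runAgent(stt: SpeechToText, planner: Planner, nav: Navigator, exec: ActionExecutor) {
    val goal = stt.transcribe()                          // 1. local transcription
    val plan = planner.plan(goal)                        // 2. cloud plan, kept as context
    while (true) {
        val screen = exec.currentScreen()                // 3. snapshot of on-screen elements
        val action = nav.nextAction(goal, plan, screen)  //    cloud picks the next action
        if (action.type == "done") break
        exec.execute(action)                             // 4. on-device execution
    }
}
```

A production loop would also cap the number of steps and watch for repeated actions, which the app does in basic form.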
Three layers work together:
- RunAnywhere SDK + Whisper Tiny (On-Device) — Captures audio at 16kHz and transcribes speech to text entirely on the phone. The model is 75MB, downloads once, and runs offline after that.
- GPT-4o (Cloud) — Receives the transcribed goal plus a snapshot of on-screen UI elements. Generates a plan, then produces one action at a time (tap element 5, type "hello", swipe up, press enter) with reasoning for each decision.
- Android Accessibility Service (On-Device) — The muscle. Dispatches taps at coordinates, inputs text into fields, presses Enter via IME actions, performs swipe gestures—all by interacting with the real Android UI tree (a minimal sketch follows this list).
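For concreteness, here's a hedged sketch of how the execution layer might dispatch a tap and type text using the standard AccessibilityService APIs. The project's actual service is more elaborate (gesture timing, retries, IME Enter handling), but the primitives look roughly like this:

```kotlin
import android.accessibilityservice.AccessibilityService
import android.accessibilityservice.GestureDescription
import android.graphics.Path
import android.os.Bundle
import android.view.accessibility.AccessibilityEvent
import android.view.accessibility.AccessibilityNodeInfo

// Sketch of an execution layer built on AccessibilityService.
class AgentAccessibilityService : AccessibilityService() {

    // Tap at absolute screen coordinates via gesture dispatch.
    fun tap(x: Float, y: Float) {
        val path = Path().apply { moveTo(x, y) }
        val gesture = GestureDescription.Builder()
            .addStroke(GestureDescription.StrokeDescription(path, 0, 50))
            .build()
        dispatchGesture(gesture, null, null)
    }

    // Set text on an editable node.
    fun typeText(node: AccessibilityNodeInfo, text: String) {
        val args = Bundle().apply {
            putCharSequence(
                AccessibilityNodeInfo.ACTION_ARGUMENT_SET_TEXT_CHARSEQUENCE, text
            )
        }
        node.performAction(AccessibilityNodeInfo.ACTION_SET_TEXT, args)
    }

    // System navigation, e.g. Back.
    fun pressBack() {
        performGlobalAction(GLOBAL_ACTION_BACK)
    }

    override fun onAccessibilityEvent(event: AccessibilityEvent?) { /* screen updates */ }
    override fun onInterrupt() {}
}
```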
The agent also has a tool calling system. When GPT-4o needs factual information (current time, a calculation, device info), it calls a built-in tool instead of wasting steps navigating the UI to find the answer.
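A tool registry can be as simple as a map from tool name to a local function. This is a hypothetical sketch, not the app's actual tool-calling API:

```kotlin
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical built-in tool registry: tool name -> local function.
// The real app's tool-calling layer may be structured differently.
object BuiltInTools {
    private val tools: Map<String, (String) -> String> = mapOf(
        "current_time" to { _ ->
            LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm"))
        },
        "calculate" to { expr ->
            // Toy evaluator: handles "a + b" style input only.
            val (a, op, b) = expr.split(" ")
            when (op) {
                "+" -> (a.toDouble() + b.toDouble()).toString()
                "-" -> (a.toDouble() - b.toDouble()).toString()
                "*" -> (a.toDouble() * b.toDouble()).toString()
                "/" -> (a.toDouble() / b.toDouble()).toString()
                else -> "unsupported operator"
            }
        }
    )

    // Called when the model emits a tool request instead of a UI action.
    fun call(name: String, argument: String): String =
        tools[name]?.invoke(argument) ?: "unknown tool: $name"
}
```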
Why Hybrid?
| Fully Cloud-Based | Hybrid (Local STT + Cloud Agent) |
|---|---|
| Voice audio sent to servers | Voice stays on your phone |
| Requires constant internet | STT works offline, agent needs network |
| Higher latency for speech | Near-instant local transcription |
| Single point of failure | STT still works if cloud is slow |
| Pay for speech API + reasoning API | Pay only for reasoning API |
The key insight: voice data is deeply personal. Every command you speak reveals intent, context, and habits. By keeping speech recognition on-device, the cloud only ever sees the transcribed text—never your voice.
Meanwhile, action planning genuinely benefits from cloud intelligence. GPT-4o can reason about complex multi-step UI navigation, recover from errors, and adapt to unexpected screen states in ways that small on-device models can't match yet.
The hybrid approach gives you the best of both: private input, powerful reasoning.
The Tech Stack
| Component | Technology | Runs Where |
|---|---|---|
| Speech-to-Text | Whisper Tiny (ONNX) | On-device |
| STT Runtime | RunAnywhere SDK + Sherpa-ONNX | On-device |
| Action Planning | GPT-4o | Cloud |
| Action Execution | Android Accessibility Service | On-device |
| Tool Calling | Built-in tools (time, math) | On-device |
| UI Framework | Jetpack Compose | On-device |
| Fallback LLM | Qwen 2.5 1.5B (LlamaCPP) | On-device |
The app also registers on-device LLMs (SmolLM2 360M, Qwen 2.5 1.5B, LFM 2.5 1.2B) as fallbacks. If GPT-4o is unavailable, the agent switches to local inference automatically—fully offline operation at the cost of some reasoning quality.
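Conceptually, the fallback is a try-cloud-then-local wrapper. A minimal sketch, assuming a generic ActionModel interface rather than the SDK's real types:

```kotlin
// Hypothetical wrapper: prefer the cloud model, fall back to a local one.
// Interface and class names are illustrative, not the SDK's actual API.
interface ActionModel {
    suspend fun nextAction(prompt: String): String
}

class FallbackModel(
    private val cloud: ActionModel,   // GPT-4o
    private val local: ActionModel    // e.g. Qwen 2.5 1.5B via LlamaCPP
) : ActionModel {
    override suspend fun nextAction(prompt: String): String =
        try {
            cloud.nextAction(prompt)
        } catch (e: Exception) {
            // Network failure or API error: degrade to on-device inference.
            local.nextAction(prompt)
        }
}
```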
Try It
```
git clone -b gpt-client-integration https://github.com/anantteotia/runanywhere-rn-agent-demo.git
cd runanywhere-rn-agent-demo
```
Add your OpenAI API key to local.properties:
```
GPT52_API_KEY=sk-your-key-here
```
Build and install:
```
./gradlew assembleDebug
adb install app/build/outputs/apk/debug/app-debug.apk
```
Requirements: Android 8.0+ (API 26), arm64 device. After installing, enable the Accessibility Service in Settings and grant microphone permission on first use.
The Power of Accessibility Services
Here's something most Android developers overlook: the Accessibility Service API is one of the most powerful interfaces on the platform.
Originally designed to help users with disabilities interact with their phones, it provides complete read/write access to the UI tree of any app. An accessibility service can:
- Read every text label, button, and interactive element on screen
- Tap at specific coordinates via gesture dispatch
- Input text into any editable field
- Perform swipe gestures in any direction
- Press system buttons (Back, Home, Recents)
- Take screenshots programmatically
This is exactly what an AI agent needs. Instead of relying on screenshot analysis (slow, imprecise, computationally expensive), the accessibility service gives the agent a structured representation of the screen—element labels, positions, capabilities—that an LLM can reason about directly.
The screen parser captures up to 30 interactive elements per frame, each with its label, type, position, and capabilities (tappable, editable, checkable). This compact representation fits easily in a prompt, letting GPT-4o make precise decisions about which element to interact with and how.
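A minimal version of that parser walks the accessibility tree depth-first and flattens it into a capped list. The real parser tracks more attributes, but the shape is roughly this (a fuller version of the ScreenElement type sketched earlier):

```kotlin
import android.graphics.Rect
import android.view.accessibility.AccessibilityNodeInfo

// Compact screen representation handed to the model.
data class ScreenElement(
    val id: Int,
    val label: String,
    val className: String,
    val bounds: Rect,
    val clickable: Boolean,
    val editable: Boolean,
    val checkable: Boolean
)

// Sketch of a screen parser: depth-first walk of the UI tree,
// keeping at most `limit` interactive elements.
fun parseScreen(root: AccessibilityNodeInfo, limit: Int = 30): List<ScreenElement> {
    val elements = mutableListOf<ScreenElement>()

    fun visit(node: AccessibilityNodeInfo?) {
        if (node == null || elements.size >= limit) return
        if (node.isClickable || node.isEditable || node.isCheckable) {
            val bounds = Rect().also { node.getBoundsInScreen(it) }
            val label = node.text?.toString()
                ?: node.contentDescription?.toString()
                ?: ""
            elements += ScreenElement(
                id = elements.size,
                label = label,
                className = node.className?.toString() ?: "",
                bounds = bounds,
                clickable = node.isClickable,
                editable = node.isEditable,
                checkable = node.isCheckable
            )
        }
        for (i in 0 until node.childCount) visit(node.getChild(i))
    }

    visit(root)
    return elements
}
```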
Where We're Headed
This project is a working prototype, but it opens up several exciting directions:
Multi-Modal Understanding
Currently the agent reads the UI tree as text. Adding screenshot analysis would let it understand visual context—colors, layouts, images—that the accessibility tree doesn't capture. Hybrid text + vision could handle complex UIs that confuse text-only parsing.
Cross-App Workflows
Real tasks span multiple apps. "Book a restaurant from my email confirmation" means reading Gmail, opening Maps, checking reviews, making a reservation. Multi-app orchestration with persistent context across switches is the next frontier.
Fully On-Device Agent
As on-device models improve, the cloud dependency for action planning can shrink. A fine-tuned 3B model specifically trained on Android UI navigation could replace GPT-4o for common tasks—making the entire pipeline private and offline.
Learning From Failure
The agent currently has basic loop detection and recovery. But imagine an agent that learns from its mistakes—remembering that a particular app's search button is in an unusual place, or that a specific workflow requires an extra confirmation step.
Cross-Platform
The same RunAnywhere SDK runs on iOS, Flutter, React Native, and Kotlin Multiplatform. The agent architecture could extend to iOS with VoiceOver APIs, or to desktop with platform accessibility frameworks.
The Bigger Picture
We're entering an era where AI doesn't just answer questions—it takes action. Browser agents automate the web. Phone agents automate your device. The common thread: AI that operates where your data lives, not in a distant data center.
Android's accessibility services provide the perfect foundation. They're built into every device, work with every app, and expose the UI in a format that LLMs understand natively. Combined with on-device speech recognition, you get a private, natural interface to an autonomous agent.
The cloud still matters—GPT-4o's reasoning capabilities are genuinely hard to replicate locally today. But the trend is clear: more processing moves to the edge, more data stays private, and the line between "assistant" and "agent" keeps blurring.
We built this to explore what's possible when you combine on-device AI with Android's built-in automation capabilities. The answer: your phone can do a lot more than you're asking it to.
This is a working prototype demonstrating voice-controlled Android automation with on-device speech recognition and cloud-powered reasoning.
Check out the source code: github.com/anantteotia/runanywhere-rn-agent-demo