RunAnywhere Kotlin SDK Part 4: Building a Voice Assistant with VAD
A Complete Voice Assistant Running Entirely On-Device
This is Part 4 of our RunAnywhere Kotlin SDK tutorial series:
- Chat with LLMs — Project setup and streaming text generation
- Speech-to-Text — Real-time transcription with Whisper
- Text-to-Speech — Natural voice synthesis with Piper
- Voice Pipeline (this post) — Full voice assistant with VAD
This is the culmination of the series: a voice assistant that automatically detects when you stop speaking, processes your request with an LLM, and responds with synthesized speech—all running on-device on Android.
Prerequisites
- Complete Parts 1-3 to have all three model types (LLM, STT, TTS) working in your project
- Physical device required — the pipeline uses microphone input
- All three models downloaded (~495MB total: 400MB LLM + 75MB STT + 20MB TTS)
The Voice Pipeline Flow
```
┌─────────────────────────────────────────────────────────────────┐
│                    Voice Assistant Pipeline                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐   │
│   │ Record  │ ->  │   STT   │ ->  │   LLM   │ ->  │   TTS   │   │
│   │  + VAD  │     │ Whisper │     │ SmolLM2 │     │  Piper  │   │
│   └─────────┘     └─────────┘     └─────────┘     └─────────┘   │
│        │                                               │        │
│        │            Auto-stop when                     │        │
│        └────────── silence detected ───────────────────┘        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
The streamVoiceSession API
Instead of manually wiring up VAD, STT, LLM, and TTS, the RunAnywhere SDK provides streamVoiceSession—a single API that handles the entire voice pipeline. You provide a continuous audio stream, and the SDK emits events as it listens, transcribes, generates responses, and synthesizes speech.
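In sketch form, the whole pipeline collapses into one call. Here is a minimal shape, using the config fields and event types shown in full later in this post (and assuming the remaining VoiceSessionConfig fields have sensible defaults):

```kotlin
// Minimal sketch: one call drives VAD, STT, LLM, and TTS.
val audio: Flow<ByteArray> = AudioCapture.startCapture() // mic audio, built below

RunAnywhere.streamVoiceSession(
    audio,
    VoiceSessionConfig(silenceDuration = 1.5, speechThreshold = 0.1f)
).collect { event ->
    // React to Listening, Transcribed, Responded, TurnCompleted, ...
}
```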
Audio Capture as a Flow
First, create an audio capture utility that provides a Flow<ByteArray> for streamVoiceSession. Create AudioCapture.kt:
```kotlin
package com.example.localaiplayground.domain.services

import android.Manifest
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import androidx.annotation.RequiresPermission
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow
import kotlinx.coroutines.isActive
import kotlinx.coroutines.launch

object AudioCapture {
    private const val SAMPLE_RATE = 16000
    private const val CHUNK_SIZE_MS = 100

    @RequiresPermission(Manifest.permission.RECORD_AUDIO)
    fun startCapture(): Flow<ByteArray> = callbackFlow {
        val bufferSize = AudioRecord.getMinBufferSize(
            SAMPLE_RATE,
            AudioFormat.CHANNEL_IN_MONO,
            AudioFormat.ENCODING_PCM_16BIT
        )

        val recorder = AudioRecord(
            MediaRecorder.AudioSource.MIC,
            SAMPLE_RATE,
            AudioFormat.CHANNEL_IN_MONO,
            AudioFormat.ENCODING_PCM_16BIT,
            bufferSize * 2
        )

        recorder.startRecording()

        // 16,000 samples/s * 2 bytes/sample * 0.1 s = 3,200 bytes per chunk
        val chunkSize = SAMPLE_RATE * 2 * CHUNK_SIZE_MS / 1000
        val buffer = ByteArray(chunkSize)

        // Read on the IO dispatcher so the blocking read() never stalls the collector.
        val readJob = launch(Dispatchers.IO) {
            while (isActive) {
                val bytesRead = recorder.read(buffer, 0, chunkSize)
                if (bytesRead > 0) {
                    trySend(buffer.copyOf(bytesRead))
                }
            }
        }

        // Release the recorder when the collector cancels or completes.
        awaitClose {
            readJob.cancel()
            recorder.stop()
            recorder.release()
        }
    }
}
```
This produces a continuous stream of PCM audio chunks at 16kHz mono—exactly what streamVoiceSession expects.
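One thing the @RequiresPermission annotation only documents: RECORD_AUDIO is a runtime permission, so request it before collecting the flow. A minimal Compose-side sketch (this launcher wiring is ordinary app code, not part of the SDK):

```kotlin
import android.Manifest
import androidx.activity.compose.rememberLauncherForActivityResult
import androidx.activity.result.contract.ActivityResultContracts
import androidx.compose.material3.Button
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable

@Composable
fun MicButton(onPermissionGranted: () -> Unit) {
    // Launcher that requests RECORD_AUDIO and reports the result.
    val micPermission = rememberLauncherForActivityResult(
        ActivityResultContracts.RequestPermission()
    ) { granted -> if (granted) onPermissionGranted() }

    // Request the permission before starting capture.
    Button(onClick = { micPermission.launch(Manifest.permission.RECORD_AUDIO) }) {
        Text("Start")
    }
}
```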
Pipeline State
Create VoiceAssistantViewModel.kt:
```kotlin
package com.example.localaiplayground.presentation.voice

import android.app.Application
import androidx.lifecycle.AndroidViewModel
import androidx.lifecycle.viewModelScope
import com.example.localaiplayground.domain.services.AudioCapture
import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.VoiceAgent.VoiceSessionConfig
import com.runanywhere.sdk.public.extensions.VoiceAgent.VoiceSessionEvent
import com.runanywhere.sdk.public.extensions.isVoiceAgentReady
import com.runanywhere.sdk.public.extensions.streamVoiceSession
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

enum class PipelineState {
    IDLE,
    LISTENING,
    PROCESSING,
    SPEAKING
}

data class VoiceMessage(val text: String, val role: String) // "user" or "ai"

data class VoiceAssistantUiState(
    val pipelineState: PipelineState = PipelineState.IDLE,
    val messages: List<VoiceMessage> = emptyList(),
    val audioLevel: Float = 0f,
    val error: String? = null,
    val isReady: Boolean = false
)

class VoiceAssistantViewModel(application: Application) : AndroidViewModel(application) {
    private val _uiState = MutableStateFlow(VoiceAssistantUiState())
    val uiState: StateFlow<VoiceAssistantUiState> = _uiState.asStateFlow()

    private var sessionJob: Job? = null

    init {
        checkReadiness()
    }

    private fun checkReadiness() {
        viewModelScope.launch {
            val isReady = RunAnywhere.isVoiceAgentReady()
            _uiState.update { it.copy(isReady = isReady) }
        }
    }

    fun start() {
        if (_uiState.value.pipelineState != PipelineState.IDLE) return
        if (!_uiState.value.isReady) {
            _uiState.update { it.copy(error = "Models not loaded") }
            return
        }

        sessionJob = viewModelScope.launch {
            _uiState.update {
                it.copy(
                    pipelineState = PipelineState.LISTENING,
                    error = null
                )
            }

            try {
                val audioFlow = AudioCapture.startCapture()

                val config = VoiceSessionConfig(
                    silenceDuration = 1.5,  // seconds of silence to trigger processing
                    speechThreshold = 0.1f, // audio level threshold (0.0-1.0)
                    autoPlayTTS = false,    // we'll handle playback manually
                    continuousMode = true   // auto-resume after each turn
                )

                RunAnywhere.streamVoiceSession(audioFlow, config).collect { event ->
                    when (event) {
                        is VoiceSessionEvent.Listening -> {
                            _uiState.update {
                                it.copy(
                                    pipelineState = PipelineState.LISTENING,
                                    audioLevel = event.audioLevel
                                )
                            }
                        }
                        is VoiceSessionEvent.SpeechStarted -> {
                            // User started talking
                        }
                        is VoiceSessionEvent.Processing -> {
                            _uiState.update {
                                it.copy(pipelineState = PipelineState.PROCESSING)
                            }
                        }
                        is VoiceSessionEvent.Transcribed -> {
                            _uiState.update {
                                it.copy(
                                    messages = it.messages + VoiceMessage(event.text, "user")
                                )
                            }
                        }
                        is VoiceSessionEvent.Responded -> {
                            _uiState.update {
                                it.copy(
                                    messages = it.messages + VoiceMessage(event.text, "ai")
                                )
                            }
                        }
                        is VoiceSessionEvent.TurnCompleted -> {
                            _uiState.update {
                                it.copy(pipelineState = PipelineState.SPEAKING)
                            }
                            // Play the synthesized audio
                            event.audio?.let { audio ->
                                playWavAudio(audio)
                            }
                            // continuousMode resumes listening automatically
                        }
                        is VoiceSessionEvent.Error -> {
                            _uiState.update {
                                it.copy(error = event.message)
                            }
                        }
                        is VoiceSessionEvent.Stopped -> {
                            _uiState.update {
                                it.copy(pipelineState = PipelineState.IDLE)
                            }
                        }
                        else -> { /* Handle other events as needed */ }
                    }
                }

            } catch (e: Exception) {
                _uiState.update {
                    it.copy(
                        pipelineState = PipelineState.IDLE,
                        error = e.message
                    )
                }
            }
        }
    }

    fun stop() {
        sessionJob?.cancel()
        sessionJob = null
        _uiState.update { it.copy(pipelineState = PipelineState.IDLE) }
    }

    private fun playWavAudio(audioData: ByteArray) {
        // Parse WAV header and play via AudioTrack
        // (See Part 3 for the WAV playback implementation)
    }

    override fun onCleared() {
        super.onCleared()
        sessionJob?.cancel()
    }
}
```
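The playWavAudio stub above is implemented in Part 3. For completeness, here is a minimal sketch of what it has to do, assuming Piper's output arrives as a standard WAV file (canonical 44-byte header, 22,050 Hz mono Int16, per the audio format summary below); Part 3 has the full implementation:

```kotlin
import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioTrack

// Minimal sketch: strip the canonical 44-byte WAV header and play the PCM
// through a static AudioTrack matching Piper's 22,050 Hz mono Int16 output.
private fun playWavAudio(audioData: ByteArray) {
    if (audioData.size <= 44) return
    val pcm = audioData.copyOfRange(44, audioData.size)

    val track = AudioTrack.Builder()
        .setAudioAttributes(
            AudioAttributes.Builder()
                .setUsage(AudioAttributes.USAGE_MEDIA)
                .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
                .build()
        )
        .setAudioFormat(
            AudioFormat.Builder()
                .setSampleRate(22_050)
                .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
                .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
                .build()
        )
        .setTransferMode(AudioTrack.MODE_STATIC)
        .setBufferSizeInBytes(pcm.size)
        .build()

    track.write(pcm, 0, pcm.size)
    track.play()
    // Remember to call track.release() once playback finishes.
}
```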
Key difference from Parts 1-3: Instead of manually calling transcribe(), chat(), and synthesize() in sequence, streamVoiceSession handles the entire pipeline internally. You just provide audio input and react to events.
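For contrast, here is a single turn wired by hand, roughly as the earlier posts built it (helper names as used in Parts 1-3):

```kotlin
// One manually-wired turn from Parts 1-3, now handled inside streamVoiceSession:
val userText = transcribe(recordedAudio)  // Part 2: Whisper STT
val replyText = chat(userText)            // Part 1: SmolLM2 LLM
val replyWav = synthesize(replyText)      // Part 3: Piper TTS
playWavAudio(replyWav)
// ...plus all the VAD, turn-taking, and state management around it.
```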
Voice Assistant Screen
Create VoiceAssistantScreen.kt:
```kotlin
package com.example.localaiplayground.presentation.voice

import androidx.compose.foundation.background
import androidx.compose.foundation.clickable
import androidx.compose.foundation.layout.*
import androidx.compose.foundation.lazy.LazyColumn
import androidx.compose.foundation.lazy.items
import androidx.compose.foundation.shape.CircleShape
import androidx.compose.foundation.shape.RoundedCornerShape
import androidx.compose.material3.*
import androidx.compose.runtime.*
import androidx.compose.ui.Alignment
import androidx.compose.ui.Modifier
import androidx.compose.ui.draw.clip
import androidx.compose.ui.graphics.Color
import androidx.compose.ui.unit.dp
import androidx.lifecycle.viewmodel.compose.viewModel

@Composable
fun VoiceAssistantScreen(
    viewModel: VoiceAssistantViewModel = viewModel()
) {
    val uiState by viewModel.uiState.collectAsState()

    Column(
        modifier = Modifier
            .fillMaxSize()
            .background(Color.Black)
            .padding(24.dp),
        horizontalAlignment = Alignment.CenterHorizontally
    ) {
        // State indicator
        StateIndicator(state = uiState.pipelineState)

        Spacer(modifier = Modifier.height(24.dp))

        // Error message
        uiState.error?.let { error ->
            Surface(
                shape = RoundedCornerShape(8.dp),
                color = Color.Red.copy(alpha = 0.1f),
                modifier = Modifier.fillMaxWidth()
            ) {
                Text(
                    text = error,
                    color = Color.Red,
                    modifier = Modifier.padding(12.dp)
                )
            }
            Spacer(modifier = Modifier.height(16.dp))
        }

        // Conversation messages
        LazyColumn(
            modifier = Modifier.weight(1f),
            verticalArrangement = Arrangement.spacedBy(8.dp)
        ) {
            items(uiState.messages) { message ->
                ConversationBubble(
                    label = if (message.role == "user") "You:" else "Assistant:",
                    text = message.text,
                    color = if (message.role == "user") Color(0xFF007AFF) else Color(0xFF44FF44)
                )
            }
        }

        // Audio level indicator
        if (uiState.pipelineState == PipelineState.LISTENING) {
            LinearProgressIndicator(
                progress = { uiState.audioLevel },
                modifier = Modifier
                    .fillMaxWidth()
                    .height(4.dp),
                color = Color.Red,
                trackColor = Color.DarkGray
            )
            Spacer(modifier = Modifier.height(16.dp))
        }

        // Main button
        MainButton(
            state = uiState.pipelineState,
            isReady = uiState.isReady,
            onClick = {
                when (uiState.pipelineState) {
                    PipelineState.IDLE -> viewModel.start()
                    else -> viewModel.stop()
                }
            }
        )

        Spacer(modifier = Modifier.height(16.dp))

        Text(
            text = getStateHint(uiState.pipelineState),
            color = Color.Gray,
            style = MaterialTheme.typography.bodySmall
        )

        if (!uiState.isReady) {
            Spacer(modifier = Modifier.height(8.dp))
            Text(
                text = "Please load LLM, STT, and TTS models first",
                color = Color(0xFFFFAA00),
                style = MaterialTheme.typography.bodySmall
            )
        }
    }
}

@Composable
private fun StateIndicator(state: PipelineState) {
    Row(
        verticalAlignment = Alignment.CenterVertically,
        horizontalArrangement = Arrangement.Center
    ) {
        Box(
            modifier = Modifier
                .size(12.dp)
                .clip(CircleShape)
                .background(getStateColor(state))
        )
        Spacer(modifier = Modifier.width(8.dp))
        Text(
            text = getStateText(state),
            color = Color.White,
            style = MaterialTheme.typography.titleMedium
        )
    }
}

@Composable
private fun ConversationBubble(label: String, text: String, color: Color) {
    Surface(
        modifier = Modifier.fillMaxWidth(),
        shape = RoundedCornerShape(12.dp),
        color = color.copy(alpha = 0.1f)
    ) {
        Column(modifier = Modifier.padding(16.dp)) {
            Text(text = label, color = Color.Gray, style = MaterialTheme.typography.labelSmall)
            Spacer(modifier = Modifier.height(4.dp))
            Text(text = text, color = Color.White, style = MaterialTheme.typography.bodyLarge)
        }
    }
}

@Composable
private fun MainButton(state: PipelineState, isReady: Boolean, onClick: () -> Unit) {
    Box(
        modifier = Modifier
            .size(100.dp)
            .clip(CircleShape)
            .background(
                when {
                    !isReady -> Color.Gray
                    state == PipelineState.IDLE -> Color(0xFF007AFF)
                    else -> Color.Red
                }
            )
            .clickable(enabled = isReady, onClick = onClick),
        contentAlignment = Alignment.Center
    ) {
        Text(
            text = if (state == PipelineState.IDLE) "🎤" else "⬛",
            style = MaterialTheme.typography.headlineLarge
        )
    }
}

private fun getStateColor(state: PipelineState): Color = when (state) {
    PipelineState.IDLE -> Color.Gray
    PipelineState.LISTENING -> Color.Red
    PipelineState.PROCESSING -> Color(0xFFFFAA00)
    PipelineState.SPEAKING -> Color(0xFF44FF44)
}

private fun getStateText(state: PipelineState): String = when (state) {
    PipelineState.IDLE -> "Ready"
    PipelineState.LISTENING -> "Listening..."
    PipelineState.PROCESSING -> "Processing..."
    PipelineState.SPEAKING -> "Speaking..."
}

private fun getStateHint(state: PipelineState): String = when (state) {
    PipelineState.IDLE -> "Tap to start"
    PipelineState.LISTENING -> "Stops automatically when you pause"
    PipelineState.PROCESSING -> "Transcribing and generating response..."
    PipelineState.SPEAKING -> "Playing audio response..."
}
```

Best Practices
1. Preload Models on App Start
```kotlin
// In RunAnywhereApp.kt or during onboarding
suspend fun preloadAllModels() {
    downloadAndLoadLLM("smollm2-360m-instruct-q8_0")
    downloadAndLoadSTT("sherpa-onnx-whisper-tiny.en")
    downloadAndLoadTTS("vits-piper-en_US-lessac-medium")
}
```
2. Audio Format Summary
| Component | Sample Rate | Format | Channels |
|---|---|---|---|
| AudioRecord | 16,000 Hz | Int16 | 1 |
| Whisper STT | 16,000 Hz | Int16 | 1 |
| Piper TTS Output | 22,050 Hz | WAV/Int16 | 1 |
| AudioTrack | Match TTS output (22,050 Hz) | Int16 | 1 |
3. Check Model State
```kotlin
// The SDK provides a single check for voice agent readiness
val isReady = RunAnywhere.isVoiceAgentReady() // checks LLM + STT + TTS
```
4. Session Configuration Tuning
```kotlin
// For noisy environments, adjust the session config
val config = VoiceSessionConfig(
    silenceDuration = 2.0,  // Longer pause tolerance (seconds)
    speechThreshold = 0.2f, // Higher for noisy environments
    autoPlayTTS = false,
    continuousMode = true
)
```
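As a rule of thumb: if the assistant keeps cutting you off mid-sentence, raise silenceDuration; if it never detects the end of your turn in a quiet room, lower speechThreshold.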
5. Prevent Concurrent Operations
```kotlin
fun start() {
    if (_uiState.value.pipelineState != PipelineState.IDLE) return // Prevent double-starts
    // ...
}
```
Models Reference
| Type | Model ID | Size | Notes |
|---|---|---|---|
| LLM | smollm2-360m-instruct-q8_0 | ~400MB | SmolLM2, recommended |
| STT | sherpa-onnx-whisper-tiny.en | ~75MB | English |
| TTS | vits-piper-en_US-lessac-medium | ~20MB | US English |
Conclusion
You've built a complete voice assistant that:
- Listens with automatic speech detection
- Transcribes using on-device Whisper
- Thinks with a local LLM
- Responds with natural TTS
All processing happens on-device. No data ever leaves the phone. No API keys. No cloud costs. Pure native Android performance with Kotlin and Jetpack Compose.
This is the future of private, native Android AI applications.
Complete Source Code
The full source code is available on GitHub. It includes:
- Starter Kotlin app matching this tutorial series
- MVVM architecture with ViewModel + StateFlow
- Jetpack Compose UI
- LLM chat, STT, TTS, and voice pipeline
Resources
Questions? Open an issue on GitHub or reach out on Twitter/X.