RunAnywhere Kotlin SDK Part 2: Speech-to-Text with Whisper
Real-Time Transcription with On-Device Whisper
This is Part 2 of our RunAnywhere Kotlin SDK tutorial series:
- Chat with LLMs — Project setup and streaming text generation
- Speech-to-Text (this post) — Real-time transcription with Whisper
- Text-to-Speech — Natural voice synthesis with Piper
- Voice Pipeline — Full voice assistant with VAD
Speech recognition unlocks natural interaction with your app. With RunAnywhere, you can run Whisper entirely on-device—no network requests, no privacy concerns, no API costs.
The key challenge on Android is configuring the AudioRecord API to output audio in the format Whisper expects.
Prerequisites
- Complete Part 1 first to set up your project with the RunAnywhere SDK
- Physical device required — emulator microphone support is limited
- ~75MB additional storage for the Whisper model
Register the STT Model
Add Whisper to your model registration in RunAnywhereApp.kt:
```kotlin
import com.runanywhere.sdk.core.types.InferenceFramework
import com.runanywhere.sdk.public.extensions.Models.ModelCategory
import com.runanywhere.sdk.public.extensions.registerModel

// Register STT model (Whisper)
RunAnywhere.registerModel(
    id = "sherpa-onnx-whisper-tiny.en",
    name = "Whisper Tiny English",
    url = "https://github.com/RunanywhereAI/sherpa-onnx/releases/download/runanywhere-models-v1/sherpa-onnx-whisper-tiny.en.tar.gz",
    framework = InferenceFramework.ONNX,
    modality = ModelCategory.SPEECH_RECOGNITION,
    memoryRequirement = 75_000_000
)
```
Critical: Audio Format Requirements
Whisper requires a very specific audio format:
| Parameter | Required Value |
|---|---|
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Format | 16-bit signed integer (Int16) PCM |
Android's AudioRecord defaults to different settings. You MUST configure it correctly.
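A quick arithmetic check helps here: at 16,000 Hz mono with 2 bytes per Int16 sample, one second of audio is exactly 32,000 bytes. The helper below (a standalone sketch, not part of the SDK) turns that into a sanity check you can run against a recording's byte count:

```kotlin
// Expected PCM byte count for 16 kHz mono Int16 audio.
// A recording whose size is wildly off from this estimate usually
// means the sample rate or channel count was configured wrong.
const val SAMPLE_RATE = 16_000
const val BYTES_PER_SAMPLE = 2 // Int16

fun expectedPcmBytes(seconds: Double): Int =
    (seconds * SAMPLE_RATE * BYTES_PER_SAMPLE).toInt()

fun approxDurationSeconds(pcmBytes: Int): Double =
    pcmBytes.toDouble() / (SAMPLE_RATE * BYTES_PER_SAMPLE)
```

For example, a 5-second recording should come out near `expectedPcmBytes(5.0)` = 160,000 bytes; a value around 441,000 would suggest the recorder is still running at 44.1 kHz.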
Audio Recording Service
Create AudioCaptureService.kt:
```kotlin
package com.example.localaiplayground.domain.services

import android.Manifest
import android.content.Context
import android.content.pm.PackageManager
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import android.util.Log
import androidx.core.app.ActivityCompat
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.withContext
import java.io.ByteArrayOutputStream

class AudioCaptureService(private val context: Context) {
    companion object {
        private const val TAG = "AudioCaptureService"
        private const val SAMPLE_RATE = 16000 // Whisper requires 16kHz
        private const val CHANNEL_CONFIG = AudioFormat.CHANNEL_IN_MONO
        private const val AUDIO_FORMAT = AudioFormat.ENCODING_PCM_16BIT
    }

    private var audioRecord: AudioRecord? = null
    private var isRecording = false
    private val audioBuffer = ByteArrayOutputStream()

    private val _audioLevel = MutableStateFlow(0f)
    val audioLevel: StateFlow<Float> = _audioLevel

    fun hasPermission(): Boolean {
        return ActivityCompat.checkSelfPermission(
            context,
            Manifest.permission.RECORD_AUDIO
        ) == PackageManager.PERMISSION_GRANTED
    }

    suspend fun startRecording(): Result<Unit> = withContext(Dispatchers.IO) {
        if (!hasPermission()) {
            return@withContext Result.failure(SecurityException("Microphone permission not granted"))
        }

        try {
            val bufferSize = AudioRecord.getMinBufferSize(
                SAMPLE_RATE,
                CHANNEL_CONFIG,
                AUDIO_FORMAT
            )

            audioRecord = AudioRecord(
                MediaRecorder.AudioSource.MIC,
                SAMPLE_RATE,
                CHANNEL_CONFIG,
                AUDIO_FORMAT,
                bufferSize * 2 // Double buffer for safety
            )

            audioBuffer.reset()
            isRecording = true

            audioRecord?.startRecording()
            Log.d(TAG, "Recording started at $SAMPLE_RATE Hz")

            // Read audio data in a loop
            val buffer = ShortArray(bufferSize)
            while (isRecording) {
                val read = audioRecord?.read(buffer, 0, buffer.size) ?: 0
                if (read > 0) {
                    // Convert shorts to little-endian bytes
                    val byteBuffer = ByteArray(read * 2)
                    for (i in 0 until read) {
                        byteBuffer[i * 2] = (buffer[i].toInt() and 0xFF).toByte()
                        byteBuffer[i * 2 + 1] = (buffer[i].toInt() shr 8).toByte()
                    }
                    audioBuffer.write(byteBuffer)

                    // Calculate audio level for visualization
                    val rms = calculateRMS(buffer, read)
                    _audioLevel.value = rms
                }
            }

            Result.success(Unit)
        } catch (e: Exception) {
            Log.e(TAG, "Recording error", e)
            Result.failure(e)
        }
    }

    fun stopRecording(): ByteArray {
        isRecording = false

        audioRecord?.apply {
            stop()
            release()
        }
        audioRecord = null

        _audioLevel.value = 0f

        val audioData = audioBuffer.toByteArray()
        Log.d(TAG, "Recording stopped: ${audioData.size} bytes")
        return audioData
    }

    private fun calculateRMS(buffer: ShortArray, length: Int): Float {
        var sum = 0.0
        for (i in 0 until length) {
            sum += buffer[i] * buffer[i]
        }
        val rms = kotlin.math.sqrt(sum / length)
        // Normalize to 0-1 range (max short value is 32767)
        return (rms / 32767f).toFloat().coerceIn(0f, 1f)
    }
}
```
Important: The 16kHz sample rate and mono configuration are non-negotiable. Sending audio in a different format will produce garbage output.
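The service above writes samples as little-endian Int16 bytes. If you ever need to hand that buffer to a component expecting normalized Float32 samples (many Whisper runtimes operate on floats in [-1, 1] internally; the SDK's `transcribe()` accepts the raw bytes directly), the reverse conversion looks like this — a standalone sketch, not an SDK API:

```kotlin
// Convert little-endian Int16 PCM bytes to Float samples in [-1, 1].
// Mirrors the byte layout written by AudioCaptureService above.
fun pcm16ToFloats(bytes: ByteArray): FloatArray {
    val floats = FloatArray(bytes.size / 2)
    for (i in floats.indices) {
        val lo = bytes[i * 2].toInt() and 0xFF  // low byte, unsigned
        val hi = bytes[i * 2 + 1].toInt()       // high byte, sign-extended
        val sample = (hi shl 8) or lo           // reassemble signed Int16
        floats[i] = sample / 32768f
    }
    return floats
}
```

Getting the byte order or sign extension wrong here is exactly the kind of mistake that produces silent-but-garbage transcriptions, so it is worth a unit test.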
STT ViewModel
Create SpeechToTextViewModel.kt:
```kotlin
package com.example.localaiplayground.presentation.stt

import android.app.Application
import androidx.lifecycle.AndroidViewModel
import androidx.lifecycle.viewModelScope
import com.example.localaiplayground.domain.services.AudioCaptureService
import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.availableModels
import com.runanywhere.sdk.public.extensions.downloadModel
import com.runanywhere.sdk.public.extensions.loadSTTModel
import com.runanywhere.sdk.public.extensions.transcribe
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.launch

data class STTUiState(
    val isLoading: Boolean = true,
    val isModelLoaded: Boolean = false,
    val downloadProgress: Float = 0f,
    val isRecording: Boolean = false,
    val isTranscribing: Boolean = false,
    val transcription: String = "",
    val audioLevel: Float = 0f,
    val error: String? = null
)

class SpeechToTextViewModel(application: Application) : AndroidViewModel(application) {
    private val _uiState = MutableStateFlow(STTUiState())
    val uiState: StateFlow<STTUiState> = _uiState.asStateFlow()

    private val audioService = AudioCaptureService(application)
    private val modelId = "sherpa-onnx-whisper-tiny.en"

    init {
        loadModel()
        observeAudioLevel()
    }

    private fun loadModel() {
        viewModelScope.launch {
            try {
                val models = RunAnywhere.availableModels()
                val isDownloaded = models.any { it.id == modelId && it.localPath != null }

                if (!isDownloaded) {
                    RunAnywhere.downloadModel(modelId).collect { progress ->
                        _uiState.update {
                            it.copy(downloadProgress = progress.progress)
                        }
                    }
                }

                // Load STT model
                RunAnywhere.loadSTTModel(modelId)

                _uiState.update {
                    it.copy(isLoading = false, isModelLoaded = true)
                }

            } catch (e: Exception) {
                _uiState.update {
                    it.copy(isLoading = false, error = e.message)
                }
            }
        }
    }

    private fun observeAudioLevel() {
        viewModelScope.launch {
            audioService.audioLevel.collect { level ->
                _uiState.update { it.copy(audioLevel = level) }
            }
        }
    }

    fun toggleRecording() {
        if (_uiState.value.isRecording) {
            stopAndTranscribe()
        } else {
            startRecording()
        }
    }

    private fun startRecording() {
        if (!audioService.hasPermission()) {
            _uiState.update { it.copy(error = "Microphone permission required") }
            return
        }

        viewModelScope.launch {
            _uiState.update {
                it.copy(isRecording = true, transcription = "", error = null)
            }

            audioService.startRecording()
        }
    }

    private fun stopAndTranscribe() {
        viewModelScope.launch {
            _uiState.update {
                it.copy(isRecording = false, isTranscribing = true)
            }

            try {
                val audioData = audioService.stopRecording()

                if (audioData.isNotEmpty()) {
                    val text = RunAnywhere.transcribe(audioData)
                    _uiState.update {
                        it.copy(transcription = text, isTranscribing = false)
                    }
                } else {
                    _uiState.update {
                        it.copy(transcription = "No audio recorded", isTranscribing = false)
                    }
                }

            } catch (e: Exception) {
                _uiState.update {
                    it.copy(
                        transcription = "Error: ${e.message}",
                        isTranscribing = false
                    )
                }
            }
        }
    }
}
```
STT Screen
Create SpeechToTextScreen.kt:
```kotlin
package com.example.localaiplayground.presentation.stt

import androidx.compose.foundation.background
import androidx.compose.foundation.clickable
import androidx.compose.foundation.layout.*
import androidx.compose.foundation.shape.CircleShape
import androidx.compose.foundation.shape.RoundedCornerShape
import androidx.compose.material3.*
import androidx.compose.runtime.*
import androidx.compose.ui.Alignment
import androidx.compose.ui.Modifier
import androidx.compose.ui.draw.clip
import androidx.compose.ui.graphics.Color
import androidx.compose.ui.unit.dp
import androidx.lifecycle.viewmodel.compose.viewModel

@Composable
fun SpeechToTextScreen(
    viewModel: SpeechToTextViewModel = viewModel()
) {
    val uiState by viewModel.uiState.collectAsState()

    Column(
        modifier = Modifier
            .fillMaxSize()
            .background(Color.Black)
            .padding(24.dp),
        horizontalAlignment = Alignment.CenterHorizontally,
        verticalArrangement = Arrangement.Center
    ) {
        // Loading state
        if (uiState.isLoading) {
            CircularProgressIndicator()
            Spacer(modifier = Modifier.height(16.dp))
            Text(
                "Downloading model... ${(uiState.downloadProgress * 100).toInt()}%",
                color = Color.White
            )
            LinearProgressIndicator(
                progress = { uiState.downloadProgress },
                modifier = Modifier
                    .fillMaxWidth()
                    .padding(top = 8.dp)
            )
            return
        }

        // Transcription display
        Surface(
            modifier = Modifier
                .fillMaxWidth()
                .heightIn(min = 100.dp),
            shape = RoundedCornerShape(12.dp),
            color = Color(0xFF111111)
        ) {
            Text(
                text = uiState.transcription.ifEmpty { "Tap the microphone to record..." },
                color = Color.White,
                modifier = Modifier.padding(16.dp)
            )
        }

        Spacer(modifier = Modifier.height(48.dp))

        // Audio level indicator
        if (uiState.isRecording) {
            LinearProgressIndicator(
                progress = { uiState.audioLevel },
                modifier = Modifier
                    .fillMaxWidth()
                    .height(4.dp),
                color = Color.Red,
                trackColor = Color.DarkGray
            )
            Spacer(modifier = Modifier.height(16.dp))
        }

        // Record button
        Box(
            modifier = Modifier
                .size(100.dp)
                .clip(CircleShape)
                .background(
                    if (uiState.isRecording) Color.Red else Color(0xFF007AFF)
                )
                .clickable(
                    enabled = uiState.isModelLoaded && !uiState.isTranscribing,
                    onClick = { viewModel.toggleRecording() }
                ),
            contentAlignment = Alignment.Center
        ) {
            Text(
                text = if (uiState.isRecording) "⬛" else "🎤",
                style = MaterialTheme.typography.headlineLarge
            )
        }

        Spacer(modifier = Modifier.height(16.dp))

        if (uiState.isTranscribing) {
            Row(verticalAlignment = Alignment.CenterVertically) {
                CircularProgressIndicator(
                    modifier = Modifier.size(16.dp),
                    strokeWidth = 2.dp
                )
                Spacer(modifier = Modifier.width(8.dp))
                Text("Transcribing...", color = Color.White)
            }
        }
    }
}
```

Requesting Permissions
Declare the permission with `<uses-permission android:name="android.permission.RECORD_AUDIO" />` in AndroidManifest.xml, then request it at runtime in your Activity:
```kotlin
import android.Manifest
import android.os.Bundle
import androidx.activity.ComponentActivity
import androidx.activity.compose.setContent
import androidx.activity.result.contract.ActivityResultContracts

class MainActivity : ComponentActivity() {
    private val requestPermissionLauncher = registerForActivityResult(
        ActivityResultContracts.RequestPermission()
    ) { isGranted ->
        if (isGranted) {
            // Permission granted, can start recording
        }
    }

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)

        // Request microphone permission
        requestPermissionLauncher.launch(Manifest.permission.RECORD_AUDIO)

        setContent {
            // Your app content
        }
    }
}
```
Memory Management
When you're done with STT, unload the model:
```kotlin
import com.runanywhere.sdk.public.extensions.unloadSTTModel

// Unload STT model
RunAnywhere.unloadSTTModel()
```
STT models can be loaded independently alongside the LLM—they don't conflict.
Models Reference
| Model ID | Size | Notes |
|---|---|---|
| sherpa-onnx-whisper-tiny.en | ~75MB | English, real-time capable |
What's Next
In Part 3, we'll add text-to-speech with Piper using Android's AudioTrack for playback.
Resources
Questions? Open an issue on GitHub or reach out on Twitter/X.