RunAnywhere Kotlin SDK Part 2: Speech-to-Text with Whisper
Real-Time Transcription with On-Device Whisper
This is Part 2 of our RunAnywhere Kotlin SDK tutorial series:
- Chat with LLMs — Project setup and streaming text generation
- Speech-to-Text (this post) — Real-time transcription with Whisper
- Text-to-Speech — Natural voice synthesis with Piper
- Voice Pipeline — Full voice assistant with VAD
Speech recognition unlocks natural interaction with your app. With RunAnywhere, you can run Whisper entirely on-device—no network requests, no privacy concerns, no API costs.
The key challenge on Android is configuring the AudioRecord API to output audio in the format Whisper expects.
Prerequisites
- Complete Part 1 first to set up your project with the RunAnywhere SDK
- Physical device required — emulator microphone support is limited
- ~75MB additional storage for the Whisper model
Register the STT Model
Add Whisper to your model registration in RunAnywhereApp.kt:
```kotlin
import com.runanywhere.sdk.core.types.InferenceFramework
import com.runanywhere.sdk.public.extensions.Models.ModelCategory
import com.runanywhere.sdk.public.extensions.registerModel

// Register STT model (Whisper)
RunAnywhere.registerModel(
    id = "sherpa-onnx-whisper-tiny.en",
    name = "Whisper Tiny English",
    url = "https://github.com/RunanywhereAI/sherpa-onnx/releases/download/runanywhere-models-v1/sherpa-onnx-whisper-tiny.en.tar.gz",
    framework = InferenceFramework.ONNX,
    modality = ModelCategory.SPEECH_RECOGNITION,
    memoryRequirement = 75_000_000
)
```
Critical: Audio Format Requirements
Whisper requires a very specific audio format:
| Parameter | Required Value |
|---|---|
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Format | 16-bit signed integer (Int16) PCM |
Android's AudioRecord defaults to different settings. You MUST configure it correctly.
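A quick arithmetic check helps here: at 16,000 Hz mono with 2 bytes per Int16 sample, one second of audio is exactly 32,000 bytes. The helper below (a standalone sketch, not part of the SDK) turns that into a sanity check you can run against a recording's byte count:

```kotlin
// Expected PCM byte count for 16 kHz mono Int16 audio.
// A recording whose size is wildly off from this estimate usually
// means the sample rate or channel count was configured wrong.
const val SAMPLE_RATE = 16_000
const val BYTES_PER_SAMPLE = 2 // Int16

fun expectedPcmBytes(seconds: Double): Int =
    (seconds * SAMPLE_RATE * BYTES_PER_SAMPLE).toInt()

fun approxDurationSeconds(pcmBytes: Int): Double =
    pcmBytes.toDouble() / (SAMPLE_RATE * BYTES_PER_SAMPLE)
```

For example, a 5-second recording should come out near `expectedPcmBytes(5.0)` = 160,000 bytes; a value around 441,000 would suggest the recorder is still running at 44.1 kHz.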
Audio Recording Service
Create AudioCaptureService.kt:
```kotlin
package com.example.localaiplayground.domain.services

import android.Manifest
import android.content.Context
import android.content.pm.PackageManager
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import android.util.Log
import androidx.core.app.ActivityCompat
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.withContext
import java.io.ByteArrayOutputStream

class AudioCaptureService(private val context: Context) {
    companion object {
        private const val TAG = "AudioCaptureService"
        private const val SAMPLE_RATE = 16000 // Whisper requires 16kHz
        private const val CHANNEL_CONFIG = AudioFormat.CHANNEL_IN_MONO
        private const val AUDIO_FORMAT = AudioFormat.ENCODING_PCM_16BIT
    }

    private var audioRecord: AudioRecord? = null
    private var isRecording = false
    private val audioBuffer = ByteArrayOutputStream()

    private val _audioLevel = MutableStateFlow(0f)
    val audioLevel: StateFlow<Float> = _audioLevel

    fun hasPermission(): Boolean {
        return ActivityCompat.checkSelfPermission(
            context,
            Manifest.permission.RECORD_AUDIO
        ) == PackageManager.PERMISSION_GRANTED
    }

    suspend fun startRecording(): Result<Unit> = withContext(Dispatchers.IO) {
        if (!hasPermission()) {
            return@withContext Result.failure(SecurityException("Microphone permission not granted"))
        }

        try {
            val bufferSize = AudioRecord.getMinBufferSize(
                SAMPLE_RATE,
                CHANNEL_CONFIG,
                AUDIO_FORMAT
            )

            audioRecord = AudioRecord(
                MediaRecorder.AudioSource.MIC,
                SAMPLE_RATE,
                CHANNEL_CONFIG,
                AUDIO_FORMAT,
                bufferSize * 2 // Double buffer for safety
            )

            audioBuffer.reset()
            isRecording = true

            audioRecord?.startRecording()
            Log.d(TAG, "Recording started at $SAMPLE_RATE Hz")

            // Read audio data in a loop
            val buffer = ShortArray(bufferSize)
            while (isRecording) {
                val read = audioRecord?.read(buffer, 0, buffer.size) ?: 0
                if (read > 0) {
                    // Convert shorts to little-endian bytes
                    val byteBuffer = ByteArray(read * 2)
                    for (i in 0 until read) {
                        byteBuffer[i * 2] = (buffer[i].toInt() and 0xFF).toByte()
                        byteBuffer[i * 2 + 1] = (buffer[i].toInt() shr 8).toByte()
                    }
                    audioBuffer.write(byteBuffer)

                    // Calculate audio level for visualization
                    val rms = calculateRMS(buffer, read)
                    _audioLevel.value = rms
                }
            }

            Result.success(Unit)
        } catch (e: Exception) {
            Log.e(TAG, "Recording error", e)
            Result.failure(e)
        }
    }

    fun stopRecording(): ByteArray {
        isRecording = false

        audioRecord?.apply {
            stop()
            release()
        }
        audioRecord = null

        _audioLevel.value = 0f

        val audioData = audioBuffer.toByteArray()
        Log.d(TAG, "Recording stopped: ${audioData.size} bytes")
        return audioData
    }

    private fun calculateRMS(buffer: ShortArray, length: Int): Float {
        var sum = 0.0
        for (i in 0 until length) {
            sum += buffer[i] * buffer[i]
        }
        val rms = kotlin.math.sqrt(sum / length)
        // Normalize to 0-1 range (max short value is 32767)
        return (rms / 32767f).toFloat().coerceIn(0f, 1f)
    }
}
```
Important: The 16kHz sample rate and mono configuration are non-negotiable. Sending audio in a different format will produce garbage output.
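The service above writes samples as little-endian Int16 bytes. If you ever need to hand that buffer to a component expecting normalized Float32 samples (many Whisper runtimes operate on floats in [-1, 1] internally; the SDK's `transcribe()` accepts the raw bytes directly), the reverse conversion looks like this — a standalone sketch, not an SDK API:

```kotlin
// Convert little-endian Int16 PCM bytes to Float samples in [-1, 1].
// Mirrors the byte layout written by AudioCaptureService above.
fun pcm16ToFloats(bytes: ByteArray): FloatArray {
    val floats = FloatArray(bytes.size / 2)
    for (i in floats.indices) {
        val lo = bytes[i * 2].toInt() and 0xFF  // low byte, unsigned
        val hi = bytes[i * 2 + 1].toInt()       // high byte, sign-extended
        val sample = (hi shl 8) or lo           // reassemble signed Int16
        floats[i] = sample / 32768f
    }
    return floats
}
```

Getting the byte order or sign extension wrong here is exactly the kind of mistake that produces silent-but-garbage transcriptions, so it is worth a unit test.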
STT ViewModel
Create SpeechToTextViewModel.kt:
```kotlin
package com.example.localaiplayground.presentation.stt

import android.app.Application
import androidx.lifecycle.AndroidViewModel
import androidx.lifecycle.viewModelScope
import com.example.localaiplayground.domain.services.AudioCaptureService
import com.runanywhere.sdk.public.RunAnywhere
import com.runanywhere.sdk.public.extensions.availableModels
import com.runanywhere.sdk.public.extensions.downloadModel
import com.runanywhere.sdk.public.extensions.loadSTTModel
import com.runanywhere.sdk.public.extensions.transcribe
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.launch

data class STTUiState(
    val isLoading: Boolean = true,
    val isModelLoaded: Boolean = false,
    val downloadProgress: Float = 0f,
    val isRecording: Boolean = false,
    val isTranscribing: Boolean = false,
    val transcription: String = "",
    val audioLevel: Float = 0f,
    val error: String? = null
)

class SpeechToTextViewModel(application: Application) : AndroidViewModel(application) {
    private val _uiState = MutableStateFlow(STTUiState())
    val uiState: StateFlow<STTUiState> = _uiState.asStateFlow()

    private val audioService = AudioCaptureService(application)
    private val modelId = "sherpa-onnx-whisper-tiny.en"

    init {
        loadModel()
        observeAudioLevel()
    }

    private fun loadModel() {
        viewModelScope.launch {
            try {
                val models = RunAnywhere.availableModels()
                val isDownloaded = models.any { it.id == modelId && it.localPath != null }

                if (!isDownloaded) {
                    RunAnywhere.downloadModel(modelId).collect { progress ->
                        _uiState.update {
                            it.copy(downloadProgress = progress.progress)
                        }
                    }
                }

                // Load STT model
                RunAnywhere.loadSTTModel(modelId)

                _uiState.update {
                    it.copy(isLoading = false, isModelLoaded = true)
                }

            } catch (e: Exception) {
                _uiState.update {
                    it.copy(isLoading = false, error = e.message)
                }
            }
        }
    }

    private fun observeAudioLevel() {
        viewModelScope.launch {
            audioService.audioLevel.collect { level ->
                _uiState.update { it.copy(audioLevel = level) }
            }
        }
    }

    fun toggleRecording() {
        if (_uiState.value.isRecording) {
            stopAndTranscribe()
        } else {
            startRecording()
        }
    }

    private fun startRecording() {
        if (!audioService.hasPermission()) {
            _uiState.update { it.copy(error = "Microphone permission required") }
            return
        }

        viewModelScope.launch {
            _uiState.update {
                it.copy(isRecording = true, transcription = "", error = null)
            }

            audioService.startRecording()
        }
    }

    private fun stopAndTranscribe() {
        viewModelScope.launch {
            _uiState.update {
                it.copy(isRecording = false, isTranscribing = true)
            }

            try {
                val audioData = audioService.stopRecording()

                if (audioData.isNotEmpty()) {
                    val text = RunAnywhere.transcribe(audioData)
                    _uiState.update {
                        it.copy(transcription = text, isTranscribing = false)
                    }
                } else {
                    _uiState.update {
                        it.copy(transcription = "No audio recorded", isTranscribing = false)
                    }
                }

            } catch (e: Exception) {
                _uiState.update {
                    it.copy(
                        transcription = "Error: ${e.message}",
                        isTranscribing = false
                    )
                }
            }
        }
    }
}
```
STT Screen
Create SpeechToTextScreen.kt:
```kotlin
package com.example.localaiplayground.presentation.stt

import androidx.compose.foundation.background
import androidx.compose.foundation.clickable
import androidx.compose.foundation.layout.*
import androidx.compose.foundation.shape.CircleShape
import androidx.compose.foundation.shape.RoundedCornerShape
import androidx.compose.material3.*
import androidx.compose.runtime.*
import androidx.compose.ui.Alignment
import androidx.compose.ui.Modifier
import androidx.compose.ui.draw.clip
import androidx.compose.ui.graphics.Color
import androidx.compose.ui.unit.dp
import androidx.lifecycle.viewmodel.compose.viewModel

@Composable
fun SpeechToTextScreen(
    viewModel: SpeechToTextViewModel = viewModel()
) {
    val uiState by viewModel.uiState.collectAsState()

    Column(
        modifier = Modifier
            .fillMaxSize()
            .background(Color.Black)
            .padding(24.dp),
        horizontalAlignment = Alignment.CenterHorizontally,
        verticalArrangement = Arrangement.Center
    ) {
        // Loading state
        if (uiState.isLoading) {
            CircularProgressIndicator()
            Spacer(modifier = Modifier.height(16.dp))
            Text(
                "Downloading model... ${(uiState.downloadProgress * 100).toInt()}%",
                color = Color.White
            )
            LinearProgressIndicator(
                progress = { uiState.downloadProgress },
                modifier = Modifier
                    .fillMaxWidth()
                    .padding(top = 8.dp)
            )
            return
        }

        // Transcription display
        Surface(
            modifier = Modifier
                .fillMaxWidth()
                .heightIn(min = 100.dp),
            shape = RoundedCornerShape(12.dp),
            color = Color(0xFF111111)
        ) {
            Text(
                text = uiState.transcription.ifEmpty { "Tap the microphone to record..." },
                color = Color.White,
                modifier = Modifier.padding(16.dp)
            )
        }

        Spacer(modifier = Modifier.height(48.dp))

        // Audio level indicator
        if (uiState.isRecording) {
            LinearProgressIndicator(
                progress = { uiState.audioLevel },
                modifier = Modifier
                    .fillMaxWidth()
                    .height(4.dp),
                color = Color.Red,
                trackColor = Color.DarkGray
            )
            Spacer(modifier = Modifier.height(16.dp))
        }

        // Record button
        Box(
            modifier = Modifier
                .size(100.dp)
                .clip(CircleShape)
                .background(
                    if (uiState.isRecording) Color.Red else Color(0xFF007AFF)
                )
                .clickable(
                    enabled = uiState.isModelLoaded && !uiState.isTranscribing,
                    onClick = { viewModel.toggleRecording() }
                ),
            contentAlignment = Alignment.Center
        ) {
            Text(
                text = if (uiState.isRecording) "⬛" else "🎤",
                style = MaterialTheme.typography.headlineLarge
            )
        }

        Spacer(modifier = Modifier.height(16.dp))

        if (uiState.isTranscribing) {
            Row(verticalAlignment = Alignment.CenterVertically) {
                CircularProgressIndicator(
                    modifier = Modifier.size(16.dp),
                    strokeWidth = 2.dp
                )
                Spacer(modifier = Modifier.width(8.dp))
                Text("Transcribing...", color = Color.White)
            }
        }
    }
}
```

Requesting Permissions
Declare the permission with `<uses-permission android:name="android.permission.RECORD_AUDIO" />` in AndroidManifest.xml, then request it at runtime in your Activity:
```kotlin
import android.Manifest
import android.os.Bundle
import androidx.activity.ComponentActivity
import androidx.activity.compose.setContent
import androidx.activity.result.contract.ActivityResultContracts

class MainActivity : ComponentActivity() {
    private val requestPermissionLauncher = registerForActivityResult(
        ActivityResultContracts.RequestPermission()
    ) { isGranted ->
        if (isGranted) {
            // Permission granted, can start recording
        }
    }

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)

        // Request microphone permission
        requestPermissionLauncher.launch(Manifest.permission.RECORD_AUDIO)

        setContent {
            // Your app content
        }
    }
}
```
Memory Management
When you're done with STT, unload the model:
```kotlin
import com.runanywhere.sdk.public.extensions.unloadSTTModel

// Unload STT model
RunAnywhere.unloadSTTModel()
```
STT models can be loaded independently alongside the LLM—they don't conflict.
Models Reference
| Model ID | Size | Notes |
|---|---|---|
| sherpa-onnx-whisper-tiny.en | ~75MB | English, real-time capable |
What's Next
In Part 3, we'll add text-to-speech with Piper using Android's AudioTrack for playback.
Resources
Questions? Open an issue on GitHub or reach out on Twitter/X.