January 20, 2026

RunAnywhere Swift SDK Part 2: Speech-to-Text with Whisper


Real-Time Transcription with On-Device Whisper


This is Part 2 of our RunAnywhere Swift SDK tutorial series:

  1. Chat with LLMs — Project setup and streaming text generation
  2. Speech-to-Text (this post) — Real-time transcription with Whisper
  3. Text-to-Speech — Natural voice synthesis with Piper
  4. Voice Pipeline — Full voice assistant with VAD

Speech recognition unlocks natural interaction with your app. With RunAnywhere, you can run Whisper entirely on-device—no network requests, no privacy concerns, no API costs.

But there's a catch that trips up most developers: audio format conversion. iOS microphones don't output audio in the format Whisper expects. This tutorial covers both the API and the critical audio handling.

Prerequisites

  • Complete Part 1 first to set up your project with the RunAnywhere SDK
  • Physical iOS device required — the iOS Simulator doesn't have microphone access
  • Add the NSMicrophoneUsageDescription key to your app's Info.plist; without it, iOS terminates the app the first time it touches the microphone
  • ~75MB additional storage for the Whisper model

Register the STT Model

Add Whisper to your model registration:

swift
// Register STT model (Whisper)
RunAnywhere.registerModel(
    id: "sherpa-onnx-whisper-tiny.en",
    name: "Whisper Tiny English",
    url: URL(string: "https://github.com/RunanywhereAI/sherpa-onnx/releases/download/runanywhere-models-v1/sherpa-onnx-whisper-tiny.en.tar.gz")!,
    framework: .onnx,
    modality: .speechRecognition,
    artifactType: .archive(.tarGz, structure: .nestedDirectory),
    memoryRequirement: 75_000_000
)

Critical: Audio Format Requirements

This is where most tutorials fail you. Whisper requires a very specific audio format:

Parameter      Required Value
Sample Rate    16,000 Hz
Channels       1 (mono)
Format         16-bit signed integer (Int16) PCM

But iOS microphones typically capture audio at 44,100 or 48,000 Hz in Float32 format. You MUST resample and convert the sample format yourself.

Setting Up Audio Conversion

swift
import AVFoundation

class AudioService: ObservableObject {
    private let audioEngine = AVAudioEngine()
    private var audioConverter: AVAudioConverter?
    private let dataLock = NSLock() // Thread safety for audio callbacks
    private var recordedData = Data()

    private let targetFormat = AVAudioFormat(
        commonFormat: .pcmFormatInt16,
        sampleRate: 16000,
        channels: 1,
        interleaved: true
    )!

    enum AudioError: Error {
        case converterCreationFailed
        case engineStartFailed
    }

    func startRecording() throws {
        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.playAndRecord, mode: .default)
        try session.setActive(true)

        let inputNode = audioEngine.inputNode
        let inputFormat = inputNode.outputFormat(forBus: 0) // Usually 48kHz Float32

        // Create converter from iOS format to Whisper format
        guard let converter = AVAudioConverter(from: inputFormat, to: targetFormat) else {
            throw AudioError.converterCreationFailed
        }
        audioConverter = converter

        dataLock.lock()
        recordedData = Data()
        dataLock.unlock()

        inputNode.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { [weak self] buffer, _ in
            self?.processAudioBuffer(buffer)
        }

        try audioEngine.start()
    }

    private func processAudioBuffer(_ inputBuffer: AVAudioPCMBuffer) {
        guard let converter = audioConverter else { return }

        // Calculate output frame count based on sample rate ratio
        let ratio = targetFormat.sampleRate / inputBuffer.format.sampleRate
        let outputFrameCount = AVAudioFrameCount(Double(inputBuffer.frameLength) * ratio)

        guard let outputBuffer = AVAudioPCMBuffer(
            pcmFormat: targetFormat,
            frameCapacity: outputFrameCount
        ) else { return }

        // The converter may call the input block more than once per convert();
        // hand it the buffer exactly once, then report that no more data is
        // available so the same audio isn't fed in twice.
        var bufferProvided = false
        var error: NSError?
        let status = converter.convert(to: outputBuffer, error: &error) { _, outStatus in
            if bufferProvided {
                outStatus.pointee = .noDataNow
                return nil
            }
            bufferProvided = true
            outStatus.pointee = .haveData
            return inputBuffer
        }

        if status != .error, outputBuffer.frameLength > 0, let int16Data = outputBuffer.int16ChannelData {
            // 2 bytes per Int16 sample
            let data = Data(bytes: int16Data[0], count: Int(outputBuffer.frameLength) * 2)
            // Thread-safe append (audio callback runs on audio thread)
            dataLock.lock()
            recordedData.append(data)
            dataLock.unlock()
        }
    }

    func stopRecording() -> Data {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)

        dataLock.lock()
        let data = recordedData
        dataLock.unlock()

        // Deactivate audio session when done
        try? AVAudioSession.sharedInstance().setActive(false)

        return data
    }
}

Important: The resampling step is non-negotiable. Sending 48kHz audio to Whisper will produce garbage output or crash.
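
A quick way to catch format mistakes early is to check the captured duration: 16 kHz mono Int16 PCM occupies exactly 32,000 bytes per second, so the byte count should line up with how long you actually recorded. A minimal, illustrative check:

swift
// Illustrative sanity check: 16 kHz mono Int16 PCM is 32,000 bytes per second.
// If the computed duration doesn't match how long you recorded, the conversion
// is misconfigured.
let audioData = audioService.stopRecording()
let seconds = Double(audioData.count) / (16_000.0 * 2.0) // 2 bytes per sample
print(String(format: "Captured %.1f s of 16 kHz audio", seconds))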

Thread Safety: The processAudioBuffer callback runs on the audio thread, not the main thread. We use NSLock to safely append data. In production, consider using a lock-free ring buffer for better performance.
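
To make that last suggestion concrete, here is a minimal sketch of a single-producer/single-consumer ring buffer, assuming the swift-atomics package; the AudioRingBuffer type, its sizing, and its method names are illustrative and not part of the RunAnywhere SDK:

swift
import Atomics
import Foundation

// Minimal SPSC ring buffer sketch (assumes the swift-atomics package).
// The audio thread writes, your own thread reads; indices only ever grow,
// so head <= tail and (tail - head) <= capacity.
final class AudioRingBuffer {
    private let samples: UnsafeMutablePointer<Int16>
    private let capacity: Int
    private let head = ManagedAtomic<Int>(0) // total samples read so far
    private let tail = ManagedAtomic<Int>(0) // total samples written so far

    init(capacity: Int = 16_000 * 30) { // ~30 seconds at 16 kHz
        self.capacity = capacity
        self.samples = .allocate(capacity: capacity)
        self.samples.initialize(repeating: 0, count: capacity)
    }

    deinit { samples.deallocate() }

    // Called from the audio thread: never blocks, drops samples when full.
    func write(_ source: UnsafePointer<Int16>, count: Int) {
        let writeIndex = tail.load(ordering: .relaxed)
        let readIndex = head.load(ordering: .acquiring)
        let toWrite = min(count, capacity - (writeIndex - readIndex))
        for i in 0..<toWrite {
            samples[(writeIndex + i) % capacity] = source[i]
        }
        tail.store(writeIndex + toWrite, ordering: .releasing)
    }

    // Called from the consumer thread: drains everything currently available.
    func readAll() -> [Int16] {
        let readIndex = head.load(ordering: .relaxed)
        let writeIndex = tail.load(ordering: .acquiring)
        let available = writeIndex - readIndex
        var output = [Int16](repeating: 0, count: available)
        for i in 0..<available {
            output[i] = samples[(readIndex + i) % capacity]
        }
        head.store(readIndex + available, ordering: .releasing)
        return output
    }
}

In processAudioBuffer you would call write(_:count:) with outputBuffer.int16ChannelData![0] and Int(outputBuffer.frameLength) instead of appending to recordedData, then drain with readAll() when the user stops recording.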

Loading and Using STT

swift
// Download the model (one-time, ~75MB)
if !(await RunAnywhere.isModelDownloaded("sherpa-onnx-whisper-tiny.en")) {
    let progressStream = try await RunAnywhere.downloadModel("sherpa-onnx-whisper-tiny.en")
    for await progress in progressStream {
        print("Download: \(Int(progress.overallProgress * 100))%")
        if progress.stage == .completed { break }
    }
}

// Load STT model into memory
try await RunAnywhere.loadSTTModel("sherpa-onnx-whisper-tiny.en")

// Transcribe audio data (must be 16kHz Int16 PCM!)
let audioData = audioService.stopRecording()
let text = try await RunAnywhere.transcribe(audioData)
print("Transcription: \(text)")

Why loadSTTModel() instead of loadModel()? The SDK uses a separate method for each modality: loadModel() for LLMs, loadSTTModel() for speech-to-text, and loadTTSVoice() for text-to-speech. Each modality runs on a different runtime (LlamaCPP vs. ONNX), so the models can be loaded simultaneously without conflicts.
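
For example, an app that does both chat and dictation can keep an LLM and Whisper resident at the same time. A sketch, where the LLM model ID stands in for whichever model you registered in Part 1:

swift
// Sketch: one load call per modality; the two runtimes coexist in memory.
// "your-llm-model-id" is a placeholder for the model registered in Part 1.
try await RunAnywhere.loadModel("your-llm-model-id")              // LLM (LlamaCPP)
try await RunAnywhere.loadSTTModel("sherpa-onnx-whisper-tiny.en") // Whisper (ONNX)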

[Screenshot: speech-to-text recording with waveform]

Complete Recording Flow

Here's how to wire it up in a SwiftUI view:

swift
import SwiftUI

struct SpeechToTextView: View {
    @StateObject private var audioService = AudioService()
    @State private var isRecording = false
    @State private var transcription = ""
    @State private var isTranscribing = false

    var body: some View {
        VStack(spacing: 24) {
            // Transcription display
            Text(transcription.isEmpty ? "Tap to record..." : transcription)
                .font(.body)
                .padding()
                .frame(maxWidth: .infinity, minHeight: 100)
                .background(Color.secondary.opacity(0.1))
                .cornerRadius(12)

            // Record button
            Button(action: toggleRecording) {
                Image(systemName: isRecording ? "stop.circle.fill" : "mic.circle.fill")
                    .font(.system(size: 64))
                    .foregroundColor(isRecording ? .red : .blue)
            }
            .disabled(isTranscribing)

            if isTranscribing {
                ProgressView("Transcribing...")
            }
        }
        .padding()
    }

    private func toggleRecording() {
        if isRecording {
            stopAndTranscribe()
        } else {
            startRecording()
        }
    }

    private func startRecording() {
        do {
            try audioService.startRecording()
            isRecording = true
        } catch {
            print("Failed to start recording: \(error)")
        }
    }

    private func stopAndTranscribe() {
        isRecording = false
        let audioData = audioService.stopRecording()

        Task {
            isTranscribing = true
            do {
                transcription = try await RunAnywhere.transcribe(audioData)
            } catch {
                transcription = "Error: \(error.localizedDescription)"
            }
            isTranscribing = false
        }
    }
}
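
One practical addition: request microphone permission before starting the engine, so the first tap doesn't fail while iOS is still showing the permission prompt. A small sketch using the AVAudioSession permission API; the helper name and the async wrapping are ours, not part of the SDK:

swift
import AVFoundation

// Illustrative helper: ask for microphone permission before recording.
// AVAudioSession.requestRecordPermission is the long-standing API; on iOS 17+
// you can use AVAudioApplication instead.
func ensureMicrophonePermission() async -> Bool {
    await withCheckedContinuation { continuation in
        AVAudioSession.sharedInstance().requestRecordPermission { granted in
            continuation.resume(returning: granted)
        }
    }
}

You could call this from a Task at the top of the view's startRecording(), and only call audioService.startRecording() (or show an alert) once it returns.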

Memory Management

When you're done with STT, unload the model to free memory:

swift
// Unload STT model (no parameters needed)
try await RunAnywhere.unloadSTTModel()

STT models can be loaded independently alongside the LLM—they don't conflict.

Models Reference

Model ID                       Size     Notes
sherpa-onnx-whisper-tiny.en    ~75MB    English, real-time capable

What's Next

In Part 3, we'll add text-to-speech with Piper, including the inverse audio conversion challenge.


Resources


Questions? Open an issue on GitHub or reach out on Twitter/X.
