Voxr converts your voice into polished text through a three-stage pipeline that runs entirely on your Mac. Each stage is handled by a dedicated component, and the whole process typically takes just a few seconds from the moment you stop speaking to the text appearing in your active app.
Here’s how each piece works.
Stage 1: Speech capture and transcription
When you activate Voxr, the SpeechRecorder component takes over. It uses two Apple frameworks working in tandem:
- AVAudioEngine captures raw audio from your microphone and routes it through an audio processing pipeline.
- SFSpeechRecognizer receives audio buffers in real time and converts them to text using Apple’s on-device speech recognition.
The key detail here is that Voxr uses SFSpeechAudioBufferRecognitionRequest, which streams audio to Apple’s recognizer rather than recording a file and processing it after the fact. This means you see a live transcription building up as you speak. The text updates in real time, word by word.
When you stop recording, the recognizer finalizes the transcription and hands it off to the next stage.
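In code, the streaming setup described above might look like the following sketch. The class name SpeechRecorder comes from the article, but its internals here are assumptions; the Apple APIs shown (AVAudioEngine's input tap, SFSpeechAudioBufferRecognitionRequest, and partial-result reporting) are real.

```swift
import AVFoundation
import Speech

// Sketch of a streaming recognizer: audio buffers are appended to the
// request as they arrive, so partial transcriptions appear while you speak.
final class SpeechRecorder {
    private let audioEngine = AVAudioEngine()
    private let recognizer = SFSpeechRecognizer()

    func startRecording(onPartialResult: @escaping (String) -> Void) throws {
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.requiresOnDeviceRecognition = true   // keep audio on the Mac
        request.shouldReportPartialResults = true    // live, word-by-word updates

        // Tap the microphone input and stream each buffer to the recognizer.
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }

        recognizer?.recognitionTask(with: request) { result, _ in
            if let result {
                onPartialResult(result.bestTranscription.formattedString)
            }
        }

        audioEngine.prepare()
        try audioEngine.start()
    }
}
```

Setting shouldReportPartialResults to true is what produces the live transcription effect: the result handler fires repeatedly with progressively refined text rather than once at the end.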
Stage 2: Local LLM processing
Raw speech transcription is rarely perfect. You might say “um” or pause mid-sentence. Punctuation is often missing or incorrect. Run-on sentences are common. This is where the LLMProcessor comes in.
The LLMProcessor takes the raw transcription and sends it to a local AI model running on your Mac. The model cleans up the text while preserving your original meaning.
The model handles several things:
- Punctuation and capitalization. Adding proper sentence structure that speech recognition often misses.
- Filler word removal. Cleaning up “um,” “uh,” “like,” and other verbal artifacts.
- Minor corrections. Fixing obvious transcription errors where the speech recognizer misheard a word.
The processing happens entirely on your Mac, which is central to Voxr’s privacy-first design. The model runs using your machine’s CPU and GPU, and returns the cleaned-up text.
On Apple Silicon Macs, this processing step typically takes 1-3 seconds depending on the length of your transcription and which model you’re running.
Stage 3: Automatic pasting
Once the LLM returns the processed text, the TextPaster component handles getting it into your active application. This happens in two steps:
- The processed text is copied to the system clipboard via NSPasteboard.
- Voxr simulates a Cmd+V keystroke to paste the text into whatever app currently has focus.
The paste simulation uses CGEvent to create a synthetic keyboard event. This approach works across virtually all macOS applications: text editors, email clients, chat apps, browsers, and anything else that accepts keyboard input.
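A minimal sketch of those two steps might look like this. The type name TextPaster comes from the article; the body is an assumption, though the NSPasteboard and CGEvent calls are the standard APIs for this. Key code 9 is "V" on ANSI keyboard layouts, and posting synthetic events requires the app to have Accessibility permission.

```swift
import AppKit
import CoreGraphics

// Sketch: copy text to the clipboard, then synthesize Cmd+V.
enum TextPaster {
    static func paste(_ text: String) {
        // Step 1: place the processed text on the system clipboard.
        let pasteboard = NSPasteboard.general
        pasteboard.clearContents()
        pasteboard.setString(text, forType: .string)

        // Step 2: post a synthetic Cmd+V key-down/key-up pair
        // to whatever app currently has keyboard focus.
        let source = CGEventSource(stateID: .hidSystemState)
        let keyDown = CGEvent(keyboardEventSource: source, virtualKey: 9, keyDown: true)
        let keyUp = CGEvent(keyboardEventSource: source, virtualKey: 9, keyDown: false)
        keyDown?.flags = .maskCommand
        keyUp?.flags = .maskCommand
        keyDown?.post(tap: .cghidEventTap)
        keyUp?.post(tap: .cghidEventTap)
    }
}
```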
The result is seamless: you speak, wait a moment, and the polished text appears right where your cursor is. No switching apps, no manual copy-paste, no cleanup needed. For tips on getting the best results from each stage, check out our guide to Voxr tips and shortcuts.
The full picture
Putting it all together, the flow looks like this:
You speak
→ Microphone captures audio (AVAudioEngine)
→ Audio streamed to speech recognizer (SFSpeechRecognizer)
→ Raw transcription produced
→ Sent to local AI for processing
→ Cleaned text returned
→ Placed on clipboard (NSPasteboard)
→ Pasted into active app (CGEvent Cmd+V)
You see the text
Each component is independent and communicates through published properties using SwiftUI’s reactive data flow. The main ContentView observes state changes and triggers the next stage automatically when the previous one completes.
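The hand-off described above can be sketched as follows. The property and method names here are illustrative assumptions, not Voxr's actual API; the pattern shown (ObservableObject components with @Published output, observed by the view) is the standard SwiftUI mechanism the article describes.

```swift
import SwiftUI

// Minimal stubs: each stage publishes its output when it finishes.
final class SpeechRecorder: ObservableObject {
    @Published var finalTranscription: String?
}

final class LLMProcessor: ObservableObject {
    @Published var cleanedText: String?
    func process(_ raw: String) {
        // Run the local model, then publish the result (stubbed here).
        cleanedText = raw
    }
}

// The view observes each stage and triggers the next one automatically.
struct ContentView: View {
    @StateObject private var recorder = SpeechRecorder()
    @StateObject private var processor = LLMProcessor()

    var body: some View {
        Text("Voxr")
            .onChange(of: recorder.finalTranscription) { raw in
                guard let raw else { return }
                processor.process(raw)          // stage 1 → stage 2
            }
            .onChange(of: processor.cleanedText) { text in
                guard let text else { return }
                // stage 2 → stage 3: clipboard + simulated paste
            }
    }
}
```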
Built with native frameworks
One thing worth noting is that Voxr doesn’t use any third-party speech recognition SDKs or audio processing libraries. The entire speech capture and transcription pipeline is built on Apple’s own frameworks, AVFoundation and Speech. This means it benefits directly from Apple’s ongoing improvements to on-device recognition and integrates naturally with macOS permission prompts and system settings.
There are no external dependencies. The entire application is self-contained, using only Apple’s native frameworks and on-device AI for processing.