Voice-to-text without the cloud: why privacy matters

Cloud voice services process some of the most personal data imaginable: your actual voice. Here's why that's a problem and how local processing solves it.

Voice data is uniquely personal. It’s not just the words you say. It’s how you say them, the context you say them in, and the biometric signature of your voice itself. When you use a cloud-based voice-to-text service, you’re handing over all of that to a third party.

Most people don’t think much about it. But maybe they should.

What happens to your audio in the cloud

When you use a cloud transcription service, your audio is typically:

  1. Recorded on your device
  2. Transmitted over the internet to the provider’s servers
  3. Processed by their speech recognition models
  4. Stored for some period of time (varies by provider)
  5. Potentially used to improve their models

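In code, that flow looks roughly like the Swift sketch below. The endpoint, function name, and audio format are hypothetical, purely to illustrate where your voice data leaves the device:

```swift
import Foundation

// Illustrative only: what a typical cloud transcription call looks like.
// "api.example-transcribe.com" is a hypothetical endpoint, not a real provider.
func transcribeInCloud(audio: Data, completion: @escaping (String?) -> Void) {
    var request = URLRequest(url: URL(string: "https://api.example-transcribe.com/v1/transcribe")!)
    request.httpMethod = "POST"
    request.setValue("audio/wav", forHTTPHeaderField: "Content-Type")
    request.httpBody = audio  // step 2: your raw voice leaves the device here

    URLSession.shared.dataTask(with: request) { data, _, _ in
        // Steps 3-5 (processing, storage, possible model training) now happen
        // on the provider's infrastructure, outside your control.
        completion(data.flatMap { String(data: $0, encoding: .utf8) })
    }.resume()
}
```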
That last point is where things get interesting. Many providers include language in their terms of service that allows them to use your data for model training and improvement. Some let you opt out. Some don’t. The details are buried in privacy policies that change regularly.

Even providers with strong privacy commitments still have your audio on their servers, at least temporarily. That means it’s subject to their security practices, their employees’ access controls, and their compliance with government data requests.

The voice biometric problem

Text can be anonymized. Voice can’t, at least not easily. Your voice is a biometric identifier, as unique as a fingerprint. When you send audio to a cloud service, you’re sending data that can identify you specifically, not just what you said.

This creates risks that don’t exist with text-based inputs:

  • Voice profiles can be built and linked across services
  • Audio recordings can be subpoenaed or leaked
  • More samples of your voice in external databases make convincing deepfakes easier to produce

This isn’t theoretical. Major tech companies have faced scrutiny for having contractors listen to voice assistant recordings. Data breaches at cloud providers have exposed sensitive audio. The risks are real.

What you’re actually dictating

Think about what people typically dictate: emails to colleagues, notes about projects, personal journal entries, medical symptoms, legal communications, financial details. This is often sensitive content that you’d be careful about sharing in any other context.

Yet the default voice-to-text workflow sends all of it to a cloud server. The same person who would never type their medical symptoms into a random website happily dictates them through a cloud service without a second thought.

How local processing changes the equation

With local voice-to-text, the entire pipeline runs on your machine:

  • Audio is captured by your microphone
  • Speech recognition happens on-device
  • Text processing happens on-device via a local LLM
  • The result goes to your clipboard and into your app
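The recognition step above can be sketched with Apple’s Speech framework, forcing on-device processing. This is a minimal illustration, not Voxr’s actual implementation; the helper name and locale are assumptions, and it presumes speech-recognition authorization has already been granted:

```swift
import Speech

// Minimal sketch: transcribe a recorded audio file entirely on-device.
func transcribeLocally(fileURL: URL, completion: @escaping (String) -> Void) {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
          recognizer.supportsOnDeviceRecognition else { return }

    let request = SFSpeechURLRecognitionRequest(url: fileURL)
    request.requiresOnDeviceRecognition = true  // never fall back to Apple's servers

    recognizer.recognitionTask(with: request) { result, _ in
        if let result = result, result.isFinal {
            // The audio and the transcript never left this machine.
            completion(result.bestTranscription.formattedString)
        }
    }
}
```

With `requiresOnDeviceRecognition` set, recognition fails outright rather than silently routing audio to a server, which is exactly the guarantee local-first software wants.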

At no point does your audio or text leave your computer. There’s no server to breach, no terms of service to parse, no privacy policy that might change next quarter. Your voice data exists in your computer’s memory for the duration of processing, and that’s it.

This isn’t a privacy policy. It’s a technical guarantee. The data physically cannot leave your machine because the software never sends it anywhere.

Voxr’s approach

Voxr was designed from the start around this principle. The entire application architecture ensures your data stays local:

  • Speech recognition uses Apple’s on-device SFSpeechRecognizer
  • Text processing runs on-device through Voxr’s three-stage pipeline; the request never touches the network
  • The processed text goes to NSPasteboard (your system clipboard) and is pasted locally
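The final step is a plain local clipboard write. A sketch of what that looks like with NSPasteboard (the helper name is illustrative):

```swift
import AppKit

// Sketch of the last stage: place processed text on the system clipboard.
func copyToClipboard(_ text: String) {
    let pasteboard = NSPasteboard.general
    pasteboard.clearContents()                    // take ownership of the pasteboard
    pasteboard.setString(text, forType: .string)  // purely local write; nothing is transmitted
}
```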

There are no analytics, no telemetry, and no phone-home behavior. Voxr doesn’t even have a server to send data to if it wanted to.

Privacy shouldn’t require trust

The best privacy architecture is one where you don’t have to trust the software vendor at all. You shouldn’t need to read a privacy policy, check for opt-out settings, or hope that a company keeps its promises.

With local processing, privacy is enforced by the architecture itself. That’s the approach we think voice-to-text software should take, and it’s what we built Voxr to do.