Back to Blog
Engineering

How AI Transcribes Voice Recordings to Text: The ASR Pipeline Explained

Jonathan Burk
Jonathan Burk·CTO, sipsip.ai··11 min read
Audio waveform passing through neural network nodes transforming into clean text, coffee-toned pipeline diagram

When you upload a voice recording and a transcript appears two minutes later, the pipeline that produced it involves five distinct model and processing stages — most of which have more impact on your output quality than the choice of transcription tool itself.

I've built and maintained the transcription pipeline at sipsip.ai for the past two years. Here's an honest technical breakdown of what actually happens when audio goes in and text comes out — and where things go wrong.

The Pipeline Overview

AI voice transcription isn't a single model call. It's a sequential pipeline where each stage affects what the next stage receives. The stages, in order:

  1. Format normalization — convert to ASR-compatible audio
  2. Preprocessing — noise reduction, Voice Activity Detection
  3. Chunking — segment audio into model-compatible lengths
  4. ASR inference — the transcription model itself
  5. Post-processing — punctuation, vocabulary correction
  6. Diarization — who said what, merged with transcript

Most accuracy problems can be traced to stage 1 or 2. Most speed problems come from stage 4. Understanding the pipeline tells you which variables you can actually control.

Stage 1: Format Normalization

Whisper — and most production ASR models — expect 16kHz mono WAV audio. Your source recording probably isn't that. iPhone voice memos are 44.1kHz stereo M4A. Zoom recordings are 48kHz MP4. Phone calls are 8kHz GSM.

The normalization step, typically handled by FFmpeg, converts everything to 16kHz mono PCM. This step matters more than it sounds:

  • Stereo to mono conversion: When two people are recorded on separate microphone channels (common in podcast setups), naive mono conversion averages the channels and can reduce voice isolation. Proper normalization selects the dominant channel or applies channel-specific normalization first.
  • Sample rate resampling: Downsampling from 44.1kHz to 16kHz is lossy. Done with a low-quality resampling filter, it introduces artifacts in the 6–8kHz range — exactly where fricatives ("s," "f," "th") live. We use a Kaiser windowed sinc filter for downsampling, which preserves these frequencies significantly better than simple linear resampling.

[UNIQUE INSIGHT] Format normalization is the most underengineered stage in most transcription pipelines. We've seen 4–7% WER improvement just from switching to high-quality resampling on compressed source audio — with no change to the ASR model itself.

Stage 2: Preprocessing

This stage prepares audio for the model. Two operations matter most:

Noise Reduction

Stationary background noise — HVAC systems, computer fans, consistent ambient sound — can be removed through spectral subtraction. The algorithm estimates the noise floor from silent segments, then subtracts that frequency profile from the entire audio. For stationary noise, this reliably improves WER by 5–12%. For non-stationary noise (traffic, crowds), the improvement is less predictable.

We don't apply noise reduction to all uploads by default. Aggressive spectral subtraction can introduce musical noise artifacts — a metallic warbling sound that confuses the ASR model worse than the original noise did. The preprocessing step uses SNR estimation to apply reduction only when it's likely to help.

Voice Activity Detection (VAD)

VAD identifies which segments of audio contain speech and strips silence. This matters because:

  • Whisper's context window is 30 seconds. Long silences consume context without contributing information.
  • VAD-stripped audio processes 20–40% faster and reduces inference cost proportionally.
  • Silence-induced context drift can cause the model to "forget" context from earlier in the recording — VAD prevents this.

We use Silero VAD, which runs in under 100ms on a 60-minute recording and correctly identifies speech segments with >98% precision on clean audio.

[PERSONAL EXPERIENCE] Early in the sipsip.ai pipeline, we weren't applying VAD. On recordings with significant silence — typical of field interviews with natural pauses — we saw Whisper occasionally begin hallucinating: producing plausible-sounding text for silent segments. Adding VAD eliminated this category of error entirely.

Stage 3: Chunking

Whisper processes audio in 30-second windows — its context limit. Recordings longer than 30 seconds must be split into chunks and processed sequentially.

The naive approach — cut at exactly 30 seconds — creates problems at chunk boundaries. Whisper doesn't know what was said in the previous chunk when starting a new one. Words that span a boundary get truncated; sentence context is lost.

Our implementation uses overlapping chunks:

  • Chunk length: 28 seconds of new audio
  • Overlap: 3 seconds from the previous chunk's end
  • Deduplication: After inference, overlapping segments are merged using longest-common-subsequence matching

The 3-second overlap means roughly 10% of audio is processed twice. That's the compute cost of eliminating boundary errors, and it's worth it — boundary errors without overlap account for roughly 15% of all WER in naive chunking implementations.

Citation Capsule: Overlapping chunk processing in ASR pipelines trades 10–15% additional compute for a significant reduction in boundary-region transcription errors. In our benchmarks at sipsip.ai, 3-second overlapping chunks reduced overall WER by 0.8–2.3 percentage points on recordings longer than 5 minutes — a measurable improvement that requires no changes to the underlying model.

Stage 4: ASR Inference

This is the stage most documentation focuses on, and where most of the variation between transcription tools originates.

Model Selection

The Whisper model family spans five sizes: tiny (39M parameters) through large-v3 (1.55B parameters). Larger models are more accurate and slower; smaller models are faster and cheaper.

ModelWER (LibriSpeech Clean)Speed (real-time factor)Relative Cost
tiny~8.5%32x1x
base~6.0%16x3x
small~4.8%8x8x
medium~3.8%4x20x
large-v3~2.7%1x60x

We run large-v3 in production. For the use cases sipsip.ai handles — business recordings, interviews, meetings — the accuracy premium at large-v3 justifies the compute cost. For real-time transcription where latency matters more than accuracy, small or medium are better choices.

Beam Search

During inference, Whisper uses beam search to evaluate multiple candidate transcriptions in parallel and select the highest-probability output. We use beam_size=5, which means the model considers 5 candidate token sequences at each step before committing.

Increasing beam size beyond 5 provides diminishing returns at significant compute cost. Below 3, you start seeing degraded output on ambiguous phoneme sequences.

Stage 5: Post-Processing

The raw model output is a stream of tokens with no punctuation. Post-processing adds:

Punctuation and capitalization: A secondary model predicts sentence boundaries and proper noun boundaries from the raw token stream. We use a fine-tuned BERT variant for this — it outperforms Whisper's built-in punctuation on long-form recordings.

Vocabulary correction: Domain-specific terms that phonetically resemble common words get misrecognized. "API" sounds like "AP eye." "AWS" sounds like "AWS" — fine. "PyTorch" sounds like "pie torch" — usually wrong. We maintain a vocabulary boost list that re-scores certain token sequences upward when they appear in likely contexts.

Homophone resolution: Words like "their/there/they're" are phonetically identical. The post-processing model uses sentence context to select the correct form. On technical content, this step eliminates roughly 60% of homophone errors.

Stage 6: Speaker Diarization

Diarization runs as a separate pipeline — it doesn't read the transcript, it reads the audio. The standard approach:

  1. Extract speaker embeddings from short overlapping audio segments using a pretrained speaker verification model (we use pyannote-audio's wespeaker-based model)
  2. Cluster embeddings using agglomerative hierarchical clustering with a tuned distance threshold
  3. Assign speaker labels to each segment
  4. Force-align diarization output with transcript timestamps using dynamic time warping

The merge step is where most diarization errors occur: when a speaker's segment boundary in the diarization output doesn't align precisely with word boundaries in the transcript, words get misattributed to the wrong speaker.

[ORIGINAL DATA] In our internal benchmarks on 200 two-speaker recordings, forced alignment with our current pipeline correctly attributed 91.4% of speaker turns. The primary error mode (62% of errors) was at segment boundaries where one speaker's sentence ended within 300ms of the next speaker beginning. This is a structural limitation of clustering-based diarization; end-to-end joint transcription-diarization models partially address it but are 4–6x slower.

Related: Transcribe Audio Recordings to Text: 5 Methods Tested and Ranked (2026)

What You Can Control as a User

Understanding the pipeline reveals which variables you actually control:

Before recording:

  • Microphone proximity: Inverse square law. Halve the distance, quadruple the signal. A phone at 20cm versus 60cm is a meaningful WER difference.
  • Recording environment: Stationary noise is removable; non-stationary crowd noise isn't. Minimize non-stationary sources.
  • Codec settings: Record at 128kbps M4A or better. The codec floor limits what preprocessing can recover.

At upload:

  • Provide a vocabulary list: For technical content with domain-specific terms, pre-loading vocabulary reliably cuts jargon errors.
  • Specify language: Telling Whisper the expected language rather than letting it auto-detect avoids code-switching errors in multilingual content.
  • Split very long recordings: Files over 2 hours benefit from splitting at natural break points. Chunking drift accumulates over very long recordings even with overlapping windows.

After output:

  • Use timestamps to verify uncertain passages: Every AI transcript has uncertain sections. Timestamped output lets you navigate directly to those sections in the source audio for verification — faster than scrubbing.

The Accuracy Ceiling

There's a hard ceiling on what ASR can achieve with current architectures. In a 2024 benchmark study from Carnegie Mellon's LTI, human transcriptionists achieved 4.1% WER on conversational multi-speaker audio — roughly equivalent to Whisper large-v3 on the same corpus. This suggests that for clean, well-recorded content, AI has largely closed the gap with human accuracy.

The gap persists on difficult audio. Human transcriptionists with playback control, contextual knowledge, and the ability to ask for clarification still outperform AI by 8–15 percentage points on noisy, accented, or highly technical recordings. That gap is unlikely to close without architectural changes beyond transformer scaling.

For sipsip.ai's Transcriber, we optimize the pipeline for the 80th percentile use case: reasonably clean audio, 1–4 speakers, general business or creative content. If your recordings fall outside that, the technical decisions above explain what you're working with.

Frequently Asked Questions

How does Whisper compare to Google's and Amazon's transcription APIs?

On benchmark datasets, they're similar — all within 1–3 percentage points WER on clean audio. The meaningful differences are on domain-specific vocabulary (Google and Amazon support vocabulary lists more extensively), language coverage (Whisper supports more languages out of the box), and real-time streaming (Google and Amazon excel here; Whisper is primarily batch-oriented).

Can I run Whisper locally for private transcription?

Yes. Whisper is open-source and runs on CPU (slowly) or GPU (fast). The large-v3 model requires approximately 10GB VRAM for comfortable real-time-factor performance. The small model runs comfortably on most modern laptops without a GPU, at roughly 8x real-time speed.

What happens to audio quality during format conversion?

Converting from 44.1kHz M4A to 16kHz WAV is lossy — you lose frequency information above 8kHz. Since human speech contains most meaningful information below 8kHz, this loss rarely affects transcription accuracy. The exception is consonant-heavy content where "s" and "f" distinction matters — in those cases, lossless source audio produces marginally better output.

Does ASR accuracy improve with longer audio context?

Within a recording, yes — Whisper uses 30-second context windows and benefits from speaker consistency across chunks. Across recordings, no — each upload is processed independently with no cross-recording context. Models don't "learn" your voice from repeated uploads.

Why do transcripts sometimes have hallucinated content?

Whisper occasionally generates plausible-sounding but incorrect text for silent or very noisy audio segments — a known artifact of generative language model behavior. This is most common on audio with long silences (which VAD should strip) or extremely low SNR (which preprocessing may not fully recover). If you notice hallucinated content, it usually corresponds to the lowest-quality audio segments.

What's the computational cost of transcribing one hour of audio?

On a single A100 GPU, Whisper large-v3 processes 60 minutes of audio in approximately 8–12 minutes wall-clock time with chunking overhead. On CPU (for local deployment), the same job takes 4–6 hours. Cost on cloud GPU infrastructure runs approximately $0.01–0.03 per audio minute, which is the floor for tools offering sub-$0.05/minute pricing.

The Practical Takeaway

Voice recording transcription quality is mostly determined before you hit upload: by microphone placement, recording environment, and codec settings. The ASR model — what most marketing focuses on — matters less than the pipeline around it. A well-engineered preprocessing and post-processing pipeline can reduce WER by 5–10 percentage points on the same source audio, without changing the model at all.

If you want to evaluate a transcription tool, test it on your worst-case audio — your noisiest recording, your most heavily accented speaker, your most technical vocabulary. That's where pipeline quality differences become visible.

Test sipsip.ai's Transcriber on your recordings →

Jonathan Burk
Jonathan Burk
CTO, sipsip.ai

Across 8+ years, I've built full-stack and platform systems using TypeScript, Node, React, Java, AWS, and Azure, applying AI to practical problems and turning ambitious ideas into shipped products.

Related Reading

Enjoyed this? Try Sipsip for free.

Start Free Trial