Engineering

Building Production-Grade Transcription with Faster-Whisper

Jonathan Burk · CTO of sipsip.ai · 10 min read

OpenAI Whisper is remarkable. Running it reliably at scale requires a lot of engineering you won't find in the docs.

Why Faster-Whisper Instead of OpenAI's API?

OpenAI's transcription API is simple and accurate, but at scale the cost adds up quickly — $0.006 per minute means a 2-hour video costs $0.72. Running Faster-Whisper on your own GPU drops that to a fraction of a cent per minute.
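The cost comparison is simple linear per-minute pricing. A quick sketch (the self-hosted rate below is an illustrative assumption standing in for "a fraction of a cent per minute", not a measured figure):

```python
def transcription_cost_usd(minutes: float, rate_per_min: float) -> float:
    """Linear per-minute pricing: total cost = duration * rate."""
    return round(minutes * rate_per_min, 4)

# OpenAI's hosted API at $0.006/min for a 2-hour (120-minute) video.
api_cost = transcription_cost_usd(120, 0.006)        # 0.72
# Self-hosted Faster-Whisper at an assumed ~$0.0005/min of GPU time.
self_hosted_cost = transcription_cost_usd(120, 0.0005)
```

At library scale (thousands of hours), the gap between these two lines is what justifies the infrastructure work described below.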

The trade-off: infrastructure complexity. You need to manage GPU instances, handle model loading latency, implement batching, and deal with audio preprocessing edge cases that the hosted API handles for you.

The Batching Problem

Faster-Whisper processes audio in chunks. For long videos, naive chunking without overlap produces transcript breaks at sentence midpoints. We decode with a 200-token overlap between chunks, then run a merging step that deduplicates the overlap and stitches chunk boundaries back into clean sentences.
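The stitching step can be sketched as follows (a minimal illustration, not the exact production code): find the longest suffix of the previous chunk's tokens, up to the overlap size, that is also a prefix of the next chunk, and drop that duplicated region when concatenating.

```python
def merge_chunks(prev_tokens: list[str],
                 next_tokens: list[str],
                 max_overlap: int = 200) -> list[str]:
    """Stitch two transcript chunks decoded with a token overlap.

    Finds the longest suffix of prev_tokens (at most max_overlap tokens)
    that is also a prefix of next_tokens, then appends only the
    non-duplicated tail of next_tokens.
    """
    limit = min(max_overlap, len(prev_tokens), len(next_tokens))
    for size in range(limit, 0, -1):
        if prev_tokens[-size:] == next_tokens[:size]:
            return prev_tokens + next_tokens[size:]
    # No overlap found (e.g. silence at the boundary): concatenate as-is.
    return prev_tokens + next_tokens
```

Exact token matching works here because both chunks decode the same overlapping audio; in practice a fuzzier match helps when the two decodings disagree slightly at the boundary.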

Language Detection Edge Cases

Whisper has excellent multilingual support, but its language detection can fail on: very short audio clips (< 30 seconds), content with multilingual code-switching, and instrumental music with sparse lyrics. We run detection on the first 30 seconds and the middle 30 seconds and take the consensus.
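The consensus step reduces to a small decision rule over the two detection windows. A sketch, assuming each window yields a `(language, probability)` pair like the one Faster-Whisper reports alongside a transcription (the helper itself is illustrative):

```python
def consensus_language(first: tuple[str, float],
                       middle: tuple[str, float]) -> str:
    """Combine two (language, probability) detections into one decision.

    If both 30-second windows agree, use that language; otherwise fall
    back to whichever detection is more confident.
    """
    lang_a, prob_a = first
    lang_b, prob_b = middle
    if lang_a == lang_b:
        return lang_a
    return lang_a if prob_a >= prob_b else lang_b
```

Sampling the middle of the file guards against intros, music, or silence at the start skewing the first window's detection.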

Jonathan Burk
CTO of sipsip.ai

Across 8+ years, I've built full-stack and platform systems using TypeScript, Node, React, Java, AWS, and Azure, applying AI to practical problems and turning ambitious ideas into shipped products.

