faster-whisper
Whisper, up to 4× faster, same accuracy. CTranslate2 backend.
Open Source · 2–6 GB VRAM
- Min VRAM: 2 GB
- GPU class: Entry GPU
- Quant: INT8
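As a rough illustration of the INT8 setting listed above, the following sketch loads a model through faster-whisper's `WhisperModel` with `compute_type="int8"` and prints timestamped segments. The function names `format_segment` and `transcribe_int8`, the `"small"` model size, and the default device are assumptions chosen for the example, not settings taken from this listing.

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one transcript segment as '[0.00s -> 2.50s] text'."""
    return f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"


def transcribe_int8(audio_path: str, model_size: str = "small",
                    device: str = "cuda") -> list:
    """Transcribe a file with INT8 weights (hypothetical helper)."""
    # Imported lazily so format_segment works without the package installed.
    from faster_whisper import WhisperModel

    # compute_type="int8" asks CTranslate2 for INT8 quantisation; the
    # model size and device here are illustrative assumptions.
    model = WhisperModel(model_size, device=device, compute_type="int8")

    # transcribe() returns a lazy generator of segments plus metadata;
    # decoding happens as the generator is consumed.
    segments, info = model.transcribe(audio_path, beam_size=5)
    return [format_segment(s.start, s.end, s.text) for s in segments]
```

Calling `transcribe_int8("episode.mp3")` (a hypothetical file name) would return one formatted line per segment. On a machine without a GPU, passing `device="cpu"` keeps the same INT8 setting.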
Actually Free · No Signup · Open Source · Watermark-Free
Whisper + speaker diarisation + word-level timestamps.
Runs locally · Entry GPU (6–8 GB)
Diarisation model adds ~2 GB VRAM on top of Whisper.
WhisperX takes faster-whisper and adds the things production transcription actually needs: forced alignment for word-accurate timestamps (via wav2vec2), speaker diarisation (via pyannote), and VAD-based chunking. It is the de facto open-source pipeline for subtitling and podcast transcripts.
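To make the combination of word-level timestamps and diarisation concrete, here is a minimal sketch of one way to label each aligned word with a speaker: pick the diarisation turn that overlaps the word the most in time. This is an illustration of the idea only, not WhisperX's own code; the `Word` and `Turn` types and the `assign_speakers` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float  # seconds, as produced by forced alignment
    end: float
    speaker: Optional[str] = None


@dataclass
class Turn:
    speaker: str
    start: float  # seconds, as produced by diarisation
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Word]:
    """Give each word the speaker whose turn overlaps it the most."""
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(word.start, word.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        # Words that overlap no turn keep speaker=None.
        word.speaker = best_speaker
    return words
```

A word that straddles a speaker change goes to whichever turn covers more of it, which is why word-accurate timestamps matter for diarised subtitles: segment-level timestamps would assign the whole sentence to one speaker.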
BSD-4. Note: pyannote diarisation requires Hugging Face model access (free).
The reference open-source speech-to-text model.