AI Tools Finder

WhisperX

Whisper + speaker diarisation + word-level timestamps.

Open Source · 4–8 GB VRAM · Runs locally
Actually Free · Open Source · Watermark-Free · Hobbyist-Friendly
Updated 2026-03-12

Hardware requirements

Runs locally · Entry GPU (6–8 GB)

4–8 GB VRAM
Min VRAM: 4 GB
Rec. VRAM: 8 GB
Min RAM: 16 GB
Rec. RAM: 16 GB
Disk: 12 GB
GPU class: Entry GPU
CUDA 11.8+ · No Apple Silicon · GPU Required · Quant: INT8, FP16

Diarisation model adds ~2 GB VRAM on top of Whisper.
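As a rough budgeting aid, the note above can be written as simple arithmetic. This is a minimal sketch: the function names are hypothetical, and it assumes the Whisper footprint you pass in covers transcription only, with diarisation adding the ~2 GB quoted above. Actual usage depends on model size, batch size, and quantisation.

```python
def estimated_vram_gb(whisper_gb: float, diarisation: bool = True,
                      diarisation_overhead_gb: float = 2.0) -> float:
    """Rough VRAM budget: Whisper footprint plus ~2 GB when diarisation is enabled."""
    return whisper_gb + (diarisation_overhead_gb if diarisation else 0.0)


def fits(gpu_vram_gb: float, whisper_gb: float, diarisation: bool = True) -> bool:
    """True if the estimated budget fits within the GPU's VRAM."""
    return estimated_vram_gb(whisper_gb, diarisation) <= gpu_vram_gb
```

Under these assumptions, a 4 GB Whisper footprint with diarisation enabled lands at about 6 GB, which lines up with the lower end of the "Entry GPU (6–8 GB)" class listed on this page.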


What is WhisperX?

WhisperX builds on faster-whisper and adds what production transcription actually needs: forced alignment for word-accurate timestamps (via wav2vec2), speaker diarisation (via pyannote), and chunking based on voice activity detection (VAD). It is the de facto open-source pipeline for subtitling and podcast transcripts.
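To illustrate how word-level timestamps and diarisation output can be combined, here is a minimal sketch that labels each word with the speaker whose turn overlaps it most. It is not WhisperX's own code: the `Word` and `Turn` types and the `assign_speakers` function are hypothetical names chosen for this example, and the overlap rule is one simple way to do the merge.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Optional[str]]:
    """For each word, pick the speaker turn with the largest time overlap.

    Words that overlap no turn are labelled None.
    """
    labels: List[Optional[str]] = []
    for w in words:
        best, best_ov = None, 0.0
        for t in turns:
            ov = overlap(w.start, w.end, t.start, t.end)
            if ov > best_ov:
                best, best_ov = t.speaker, ov
        labels.append(best)
    return labels
```

The sketch only works because the timestamps are per word: with segment-level timestamps, a segment that spans a speaker change could not be split cleanly between the two speakers.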

Pros & cons

Pros

  • Word-level timestamps that are actually word-level
  • Speaker diarisation in the same pipeline
  • Built on faster-whisper, so it inherits its speed

Cons

  • Diarisation step requires accepting the Hugging Face (HF) model terms
  • Heavier dependency stack (pyannote, wav2vec2)
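Since subtitling is the main use for word-level timestamps, the sketch below groups words into SubRip (SRT) cues. It assumes the words arrive as plain `(text, start, end)` tuples with times in seconds. That input shape, the `words_to_srt` name, and the fixed words-per-cue rule are illustrative assumptions, not WhisperX's output format or its subtitle writer.

```python
from typing import List, Tuple


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: List[Tuple[str, float, float]], max_words: int = 7) -> str:
    """Group consecutive words into cues of at most max_words words each."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start, end = group[0][1], group[-1][2]  # first word's start, last word's end
        text = " ".join(w for w, _, _ in group)
        cues.append(f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)
```

A real subtitle writer would usually also break cues on pauses, punctuation, or a character limit. The fixed word count just keeps the example short.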

What's actually free?

BSD-4. Note: the pyannote diarisation models are gated on Hugging Face (HF); access is free but requires accepting the model terms.

✓ Actually Free · Open Source · Watermark-Free

Alternatives

faster-whisper

Whisper, up to 4× faster at the same accuracy. CTranslate2 backend.

Open Source · 2–6 GB VRAM
Min VRAM: 2 GB
GPU class: Entry GPU
Quant: INT8
Actually Free · No Signup · Open Source · Watermark-Free

OpenAI Whisper

The reference open-source speech-to-text model.

Open Source · 2–10 GB VRAM
Min VRAM: 2 GB
GPU class: Entry GPU
Quant: FP16
Actually Free · No Signup · Open Source · Watermark-Free