XTTS v2 (Coqui)
Multilingual voice cloning in 6 seconds.
Open Source · 4–6 GB VRAM
- Min VRAM: 4 GB
- GPU class: Entry GPU
- Quant: FP16
Actually Free · No Signup · Open Source · Watermark-Free
Zero-shot voice cloning TTS — 15 s of audio is enough.
Runs locally · Mid GPU (12 GB)
8 GB VRAM viable with shorter contexts; 12 GB for full sequence lengths.
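As a rough illustration of the VRAM guidance above, the thresholds can be written as a small lookup. This is only a sketch: the function name `f5_tts_vram_tier` and the returned labels are hypothetical, and treating anything under 8 GB as unsupported is an assumption, since the page gives no guidance below that figure.

```python
def f5_tts_vram_tier(vram_gb: float) -> str:
    """Map available VRAM (in GB) to the tiers quoted on this page.

    Hypothetical helper for illustration; not part of any F5-TTS package.
    """
    if vram_gb >= 12:
        # 12 GB is the figure given for full sequence lengths.
        return "full sequence lengths"
    if vram_gb >= 8:
        # 8 GB is described as viable with shorter contexts.
        return "shorter contexts"
    # Assumption: the page lists nothing below 8 GB, so flag it.
    return "below listed minimum"


if __name__ == "__main__":
    for gb in (6, 8, 12, 24):
        print(f"{gb} GB -> {f5_tts_vram_tier(gb)}")
```

Actual memory use will depend on the runtime, precision, and input length, so the tiers are best read as a starting point.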
F5-TTS is the current state-of-the-art open-weight zero-shot text-to-speech model. Give it a 15-second voice sample plus target text and it produces natural-sounding speech in that voice. Trained on 100k hours of multilingual audio; runs on a single 12 GB GPU.
CC-BY-NC-4.0 — free for non-commercial use.
Suno's expressive transformer-based TTS.
Slow, but the quality is worth the wait.