AI Tools Finder

F5-TTS

Zero-shot voice cloning TTS — 15 s of audio is enough.

Open Source · 8–12 GB VRAM · Runs locally
Actually Free · No Signup · Open Source · Watermark-Free · Hobbyist-Friendly
Updated 2026-04-18

Hardware requirements

Runs locally · Mid GPU (12 GB)

8–12 GB VRAM
Min VRAM: 8 GB
Rec. VRAM: 12 GB
Min RAM: 16 GB
Rec. RAM: 32 GB
Disk: 15 GB
GPU class: Mid GPU
11.8+ · Apple Silicon ✓ · GPU Required · Quant: FP16

8 GB VRAM viable with shorter contexts; 12 GB for full sequence lengths.
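The VRAM guidance above can be expressed as a small check. The following is a minimal sketch that only encodes the 8 GB and 12 GB figures from this listing; the function name `f5_tts_vram_tier` and the returned labels are hypothetical, and real memory use will depend on the input length and the runtime.

```python
def f5_tts_vram_tier(vram_gb: float) -> str:
    """Classify a GPU against this listing's figures (8 GB min, 12 GB recommended).

    The thresholds are copied from the listing; they are not measured here.
    """
    if vram_gb >= 12:
        return "full sequence lengths"
    if vram_gb >= 8:
        return "shorter contexts"
    return "below listed minimum"


if __name__ == "__main__":
    # Print the tier for a few common card sizes.
    for size in (6, 8, 12, 24):
        print(f"{size} GB: {f5_tts_vram_tier(size)}")
```

The value passed in is whatever your GPU tool reports as total memory, converted to GB.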


What is F5-TTS?

F5-TTS is the current state-of-the-art open-weight zero-shot text-to-speech model. Give it a 15-second voice sample plus target text and it produces natural-sounding speech in that voice. It was trained on 100k hours of multilingual audio and runs on a single 12 GB GPU.
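As a rough illustration of that workflow, the sketch below checks the length of a reference clip and assembles a command line for the project's `f5-tts_infer-cli` tool. Several things here are assumptions, not details from this listing: the helper names `clip_seconds` and `build_infer_command` are hypothetical, the default model name `F5TTS_v1_Base` may differ in your installed version, and the flags should be confirmed against the tool's `--help` output before use.

```python
import shlex
import wave


def clip_seconds(wav_path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def build_infer_command(ref_audio: str, ref_text: str, gen_text: str,
                        model: str = "F5TTS_v1_Base") -> list[str]:
    """Assemble an argument list for f5-tts_infer-cli (flags assumed, verify with --help)."""
    return [
        "f5-tts_infer-cli",
        "--model", model,          # assumed default model name
        "--ref_audio", ref_audio,  # the voice sample to clone
        "--ref_text", ref_text,    # transcript of the voice sample
        "--gen_text", gen_text,    # the target text to speak
    ]


if __name__ == "__main__":
    # Hypothetical file name and texts, for illustration only.
    cmd = build_infer_command("voice_sample.wav",
                              "Transcript of the sample.",
                              "Text to speak in the cloned voice.")
    print(shlex.join(cmd))
```

The command is built as a list so it can be passed to `subprocess.run` without shell quoting problems; `clip_seconds` can be used beforehand to confirm the sample is close to the 15 seconds the listing describes.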

Pros & cons

Pros

  • Zero-shot cloning genuinely works on 15 s of clean audio
  • Naturalness comparable to closed commercial TTS
  • Active research lab maintenance (SWivid)

Cons

  • Non-commercial license — not for paid products
  • English / Chinese strongest; other languages weaker

What's actually free?

CC-BY-NC-4.0 — free for non-commercial use.

✓ Actually Free · No Signup · Open Source · Watermark-Free

Alternatives

XTTS v2 (Coqui)

Multilingual voice cloning in 6 seconds.

Open Source · 4–6 GB VRAM
Min VRAM: 4 GB
GPU class: Entry GPU
Quant: FP16
Actually Free · No Signup · Open Source · Watermark-Free

Bark

Suno's expressive transformer-based TTS.

Open Source · 8–12 GB VRAM
Min VRAM: 8 GB
GPU class: Entry GPU
Quant: FP16
Actually Free · No Signup · Open Source · Watermark-Free

Tortoise TTS

Slow, but the quality is worth the wait.

Open Source · 4–8 GB VRAM
Min VRAM: 4 GB
GPU class: Entry GPU
Quant: FP16
Actually Free · No Signup · Open Source · Watermark-Free