AI Tools Finder

XTTS v2 (Coqui)

Multilingual voice cloning in 6 seconds.

Open Source · 4–6 GB VRAM · Runs locally
Actually Free · No Signup · Open Source · Watermark-Free · Hobbyist-Friendly · API
Updated 2025-09-30

Hardware requirements

Runs locally · Entry GPU (6–8 GB)

VRAM: 4–6 GB
Min VRAM: 4 GB
Rec. VRAM: 6 GB
Min RAM: 8 GB
Rec. RAM: 16 GB
Disk: 4 GB
GPU class: Entry GPU
11.7+ · Apple Silicon ✓ · CPU-Capable · Quant: FP16

Real-time synthesis on GPUs with 4 GB or more of VRAM. The Apple Silicon MPS backend works.
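As a rough illustration, the figures listed above can be turned into a quick pre-install check. This is only a sketch: the names `Requirements`, `XTTS_V2`, and `check_hardware` are hypothetical, the thresholds are copied from this listing rather than from any official specification, and the "cpu-only" result assumes the CPU-Capable badge means the model still runs (more slowly) without a suitable GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Hardware figures as shown in this listing (all values in GB)."""
    min_vram_gb: float = 4
    rec_vram_gb: float = 6
    min_ram_gb: float = 8
    rec_ram_gb: float = 16
    disk_gb: float = 4


# Hypothetical constant holding the listing's numbers for XTTS v2.
XTTS_V2 = Requirements()


def check_hardware(vram_gb: float, ram_gb: float, free_disk_gb: float,
                   req: Requirements = XTTS_V2) -> str:
    """Classify a machine against the listed requirements."""
    if free_disk_gb < req.disk_gb:
        return "insufficient disk"
    if ram_gb < req.min_ram_gb:
        return "insufficient RAM"
    if vram_gb < req.min_vram_gb:
        # Assumption: "CPU-Capable" means a CPU fallback is still possible.
        return "cpu-only"
    if vram_gb >= req.rec_vram_gb and ram_gb >= req.rec_ram_gb:
        return "recommended"
    return "minimum"
```

For example, `check_hardware(vram_gb=4, ram_gb=8, free_disk_gb=10)` lands in the "minimum" tier, while a machine with 6 GB of VRAM and 16 GB of RAM reaches "recommended".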


What is XTTS v2 (Coqui)?

Coqui's XTTS v2 is a widely used open-weights TTS model: it clones a voice from about 6 seconds of reference audio, generates speech in 17 languages, and runs on a 4 GB GPU. Coqui the company has shut down, but the model remains available under the Coqui Public Model Licence, and many open-source voice apps use it as their engine.
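The cloning workflow described above can be sketched with the Coqui `TTS` Python package. This is a minimal example under stated assumptions, not the only way to run the model: it assumes the package is installed (for instance via `pip install TTS` or a community-maintained fork), that `reference.wav` is your own short clip of the target speaker, and that the first call is allowed to download the model weights. The helper names `pick_device` and `clone_voice` are illustrative, not part of the library.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so this helper works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def clone_voice(text: str, speaker_wav: str, language: str = "en",
                out_path: str = "output.wav") -> str:
    """Speak `text` in the voice from `speaker_wav` and write a WAV file."""
    # Lazy import: the Coqui TTS package is only needed when synthesising.
    from TTS.api import TTS

    # Loads XTTS v2; the weights are downloaded on first use.
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(pick_device())
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,  # short reference clip of the target voice
        language=language,        # language code of the output speech
        file_path=out_path,
    )
    return out_path
```

A call such as `clone_voice("Hello from a cloned voice.", "reference.wav")` would write `output.wav`. Passing a different `language` code with the same reference clip is how cross-lingual cloning is typically exercised.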

Pros & cons

Pros

  • 6-second voice cloning that actually works
  • 17 languages including cross-lingual cloning
  • Real-time on a single mid-range GPU
  • Used as the engine inside many higher-level apps

Cons

  • Coqui (the company) shut down — community-maintained from here
  • Licence (Coqui Public Model Licence) is not OSI-approved and restricts commercial use; read it before any commercial deployment

What's actually free?

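Coqui Public Model Licence: free to download and run locally for non-commercial use. The licence terms restrict commercial use of the model and its outputs, so read them before building a product on it.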

✓ Actually Free · No Signup · Open Source · Watermark-Free

Alternatives

Bark

Suno's expressive transformer-based TTS.

Open Source · 8–12 GB VRAM
Min VRAM: 8 GB
GPU class: Entry GPU
Quant: FP16
Actually Free · No Signup · Open Source · Watermark-Free

Tortoise TTS

Slow, but the quality is worth the wait.

Open Source · 4–8 GB VRAM
Min VRAM: 4 GB
GPU class: Entry GPU
Quant: FP16
Actually Free · No Signup · Open Source · Watermark-Free

F5-TTS

Zero-shot voice cloning TTS — 15 s of audio is enough.

Open Source · 8–12 GB VRAM
Min VRAM: 8 GB
GPU class: Mid GPU
Quant: FP16
Actually Free · No Signup · Open Source · Watermark-Free