KaniTTS2: Fast Local Text-to-Speech with Cloning

KaniTTS2 provides a fast, locally-run text-to-speech system with voice cloning capabilities, enabling users to generate natural-sounding speech from text.

Someone just open-sourced KaniTTS2, a pretty fast text-to-speech model that runs locally and includes voice cloning.

The interesting bits:

  • Hits ~0.2 RTF on an RTX 5090 (roughly 5x faster than real time)
  • Only needs 3GB VRAM
  • Supports English and Spanish, with accents
  • They released the full training code, not just inference
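
Quick note on that RTF number: real-time factor is the time spent generating divided by the length of the audio you get out, so lower is faster and anything under 1.0 beats real time. Here's a minimal sketch of the arithmetic. The function names are made up for illustration and are not part of KaniTTS2, and the 2 s / 10 s timings are example values picked to match the quoted ~0.2, not measurements.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio generated."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Seconds of audio produced per second of compute (the inverse of RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Illustrative timings only: 10 s of speech generated in 2 s gives RTF 0.2.
rtf = real_time_factor(synthesis_seconds=2.0, audio_seconds=10.0)
print(f"RTF {rtf:.2f}, about {speedup_over_real_time(rtf):.0f}x real time")
```

To measure it on your own hardware, time one synthesis call with something like `time.perf_counter()` and divide by the output length (samples / sample rate).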

The training code is the cool part: you can actually train your own TTS model from scratch for specific languages or accents. They trained theirs in 6 hours on 8x H100s using 10k hours of speech data. It's Apache 2.0 licensed, so no weird restrictions.
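
To put the training numbers in perspective, here's a back-of-the-envelope sketch. It assumes the 6 hours is wall-clock time with all 8 GPUs busy the whole run, which the post doesn't spell out. The helper names are hypothetical, and the result says nothing about how many passes over the data were made.

```python
def gpu_hours(wall_clock_hours: float, num_gpus: int) -> float:
    """Total GPU time, assuming every GPU runs for the full wall-clock duration."""
    return wall_clock_hours * num_gpus


def audio_hours_per_gpu_hour(dataset_hours: float, total_gpu_hours: float) -> float:
    """Hours of speech data in the dataset per GPU-hour spent training."""
    if total_gpu_hours <= 0:
        raise ValueError("total_gpu_hours must be positive")
    return dataset_hours / total_gpu_hours


# Figures quoted in the post: 6 hours, 8x H100, 10k hours of speech.
total = gpu_hours(wall_clock_hours=6, num_gpus=8)
ratio = audio_hours_per_gpu_hour(dataset_hours=10_000, total_gpu_hours=total)
print(f"{total:.0f} GPU-hours, roughly {ratio:.0f} hours of audio per GPU-hour")
```

So a run like theirs is on the order of 48 H100-hours, which is useful when estimating what a custom language or accent run might cost on rented hardware.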