KaniTTS2: Fast Local Text-to-Speech with Cloning

The nineninesix team just open-sourced KaniTTS2, a fast text-to-speech model that runs locally and includes voice cloning.

The interesting bits:

Hits ~0.2 RTF on an RTX 5090 (roughly 5x faster than real time)
Only needs 3GB VRAM
Supports English and Spanish, with accents
They released the full training code, not just inference
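
For anyone unfamiliar with RTF (real-time factor): it is synthesis time divided by the duration of the audio produced, so anything below 1.0 is faster than real time. Here is a minimal sketch of that arithmetic. The function names and the 10-second clip are made up for illustration and are not part of KaniTTS2.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock seconds needed to generate a clip at a given RTF."""
    return audio_seconds * rtf


# Reported figure from the post: ~0.2 RTF on an RTX 5090.
reported_rtf = 0.2

# Hypothetical 10-second clip, chosen only for illustration.
clip_seconds = 10.0

estimated_seconds = synthesis_time(clip_seconds, reported_rtf)
speedup = 1.0 / reported_rtf

print(f"Estimated synthesis time: {estimated_seconds:.1f} s")
print(f"Speed relative to real time: {speedup:.1f}x")
```

At 0.2 RTF, a 10-second clip takes about 2 seconds to generate. Actual numbers will vary with hardware, text length, and batch settings.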

Links to grab:

Pretrained model: https://huggingface.co/nineninesix/kani-tts-2-pt
English version: https://huggingface.co/nineninesix/kani-tts-2-en
Training code: https://github.com/nineninesix-ai/kani-tts-2-pretrain

The training code is the cool part - you can actually train your own TTS from scratch for specific languages or accents. They trained theirs in 6 hours on 8x H100s using 10k hours of speech data. Apache 2.0 licensed, so no weird restrictions.
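
To put the training budget in perspective, here is a quick back-of-the-envelope calculation using only the figures quoted above. The variable names are illustrative, and the per-GPU-hour figure is just a ratio of dataset size to compute. It is not a measured throughput, since the post does not say how many passes over the data were made.

```python
# Figures quoted in the post.
num_gpus = 8             # 8x H100
wall_clock_hours = 6     # total training time
speech_hours = 10_000    # size of the speech dataset

# Total compute budget in GPU-hours.
gpu_hours = num_gpus * wall_clock_hours

# Hours of speech data per GPU-hour of compute (a ratio, not a throughput claim).
speech_per_gpu_hour = speech_hours / gpu_hours

print(f"Compute budget: {gpu_hours} GPU-hours")
print(f"Dataset hours per GPU-hour: {speech_per_gpu_hour:.0f}")
```

That works out to 48 H100 GPU-hours, which is a small budget for a from-scratch TTS model and makes language- or accent-specific runs plausible on rented hardware.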

