KaniTTS2: Fast Local TTS with Voice Cloning
What It Is
KaniTTS2 is an open-source text-to-speech system that generates natural-sounding speech on consumer hardware. The model performs voice cloning, meaning it can mimic a target speaker’s voice characteristics from a short audio sample. Unlike cloud-based TTS services, KaniTTS2 runs entirely on local machines, requiring just 3GB of VRAM to operate.
The system achieves approximately 0.2 real-time factor (RTF) on an RTX 5090 GPU, meaning it generates audio roughly five times faster than playback speed. Current language support covers English and Spanish, with built-in handling for various accents within those languages. The project ships under an Apache 2.0 license, allowing commercial use without licensing complications.
What distinguishes this release is the inclusion of complete training code alongside inference models. Most TTS projects share only pre-trained weights, but KaniTTS2 provides the full pipeline for training custom models from scratch at https://github.com/nineninesix-ai/kani-tts-2-pretrain.
Why It Matters
The release addresses several pain points in the current TTS landscape. Many high-quality voice synthesis systems either require expensive API calls or demand server-grade hardware. KaniTTS2’s modest VRAM requirements put professional-grade voice synthesis within reach of developers working on standard gaming GPUs.
Publishing the training code creates opportunities for language communities underserved by existing TTS systems. Research teams can adapt the architecture for low-resource languages without reverse-engineering proprietary systems. The developers trained their model in six hours using eight H100 GPUs and 10,000 hours of speech data, suggesting that organizations with moderate compute budgets can produce specialized models.
Voice cloning capabilities open applications in accessibility tools, content creation, and interactive systems. Developers building audiobook narration tools, voice assistants, or game dialogue systems gain a foundation that doesn’t depend on third-party services. The local-first architecture also addresses privacy concerns around sending voice data to external servers.
Getting Started
The pretrained multilingual model is available at https://huggingface.co/nineninesix/kani-tts-2-pt, while an English-specific version exists at https://huggingface.co/nineninesix/kani-tts-2-en. Installation typically follows the standard Hugging Face workflow:
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("nineninesix/kani-tts-2-en")
tokenizer = AutoTokenizer.from_pretrained("nineninesix/kani-tts-2-en")

# Generate speech from text, cloning the voice in reference_audio
audio = model.generate(
    text="Sample text for synthesis",
    speaker_embedding=reference_audio
)
For voice cloning, the system requires a reference audio sample of the target speaker. The model extracts speaker characteristics from this sample and applies them during synthesis. Shorter reference clips work for basic cloning, though longer samples typically improve quality.
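Since clip length affects cloning quality, it can help to sanity-check reference audio before synthesis. A minimal sketch, assuming only that you know the clip's sample count and sample rate; the function name and duration thresholds are illustrative, not part of the KaniTTS2 API:

```python
# Hypothetical pre-flight check for a voice-cloning reference clip.
# Thresholds are illustrative assumptions, not documented KaniTTS2 limits.

def check_reference_clip(num_samples: int, sample_rate: int,
                         min_seconds: float = 3.0,
                         max_seconds: float = 30.0) -> str:
    """Classify a reference clip by duration before passing it to the model."""
    duration = num_samples / sample_rate
    if duration < min_seconds:
        return "too short: cloning quality will likely suffer"
    if duration > max_seconds:
        return "long: consider trimming to the cleanest segment"
    return "ok"

# Example: a 10-second clip recorded at 22.05 kHz
print(check_reference_clip(220500, 22050))  # -> ok
```

A check like this catches the common failure mode early: a one-second clip will technically run, but gives the model little speaker information to work with.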
Teams interested in training custom models should examine the repository at https://github.com/nineninesix-ai/kani-tts-2-pretrain. The training pipeline requires prepared speech datasets with corresponding transcriptions, along with multi-GPU infrastructure for reasonable training times.
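The repository's exact dataset layout is not reproduced here, but a common convention for TTS training data is a JSONL manifest pairing each audio file with its transcript. A sketch under that assumption; the field names (`audio`, `text`, `speaker`, `duration_sec`) are hypothetical, not taken from the kani-tts-2-pretrain repository:

```python
import json

# Hypothetical JSONL manifest; field names are illustrative assumptions,
# not the format documented by kani-tts-2-pretrain.
samples = [
    {"audio": "clips/spk01_0001.wav", "text": "Sample text for synthesis.",
     "speaker": "spk01", "duration_sec": 3.4},
    {"audio": "clips/spk02_0001.wav", "text": "Otra frase de ejemplo.",
     "speaker": "spk02", "duration_sec": 2.9},
]

# One JSON object per line keeps the manifest streamable for large corpora.
with open("train_manifest.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")

# Total speech duration in the manifest, in hours
total_hours = sum(s["duration_sec"] for s in samples) / 3600
print(f"{len(samples)} samples, {total_hours:.6f} h")
```

Whatever the actual format, the pipeline needs the same ingredients: audio paths, transcripts, and enough metadata to filter or balance speakers before training.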
Context
KaniTTS2 enters a crowded field that includes Coqui TTS, Bark, and various commercial offerings. Coqui TTS provides similar local synthesis capabilities but has faced maintenance challenges since the company’s closure. Bark excels at expressiveness but runs significantly slower and demands more VRAM.
The 0.2 RTF performance metric means generating five seconds of audio takes roughly one second of processing time. This speed makes real-time applications feasible, though it still lags behind the fastest streaming TTS systems. The 3GB VRAM requirement excludes older or mobile GPUs but remains accessible compared to models requiring 16GB or more.
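The arithmetic behind that claim is simple enough to verify directly; a quick sketch:

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration.
    Values below 1.0 mean faster-than-real-time synthesis."""
    return processing_seconds / audio_seconds

# The article's numbers: 5 s of audio in roughly 1 s of processing
print(rtf(1.0, 5.0))   # -> 0.2

# At RTF 0.2, a 60-second clip takes 0.2 * 60 = 12 seconds to synthesize
print(0.2 * 60)        # -> 12.0
```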
Voice cloning quality depends heavily on reference audio characteristics. Background noise, compression artifacts, or unusual recording conditions can degrade results. The system also inherits typical neural TTS limitations around pronunciation of rare words, proper nouns, and specialized terminology.
The Apache 2.0 license removes legal ambiguity around commercial deployment, contrasting with models released under research-only or non-commercial licenses. This licensing choice, combined with published training code, positions KaniTTS2 as infrastructure rather than a finished product: a foundation for building specialized voice synthesis systems rather than a drop-in replacement for commercial APIs.