Fast CPU-Only TTS: Sopro Clones Voices in 0.25 RTF

Sopro is a surprisingly fast text-to-speech (TTS) model that runs on ordinary CPUs, with no GPU required.

The interesting bit is the speed: it hits a real-time factor (RTF) of 0.25, meaning it generates audio in a quarter of the time the audio takes to play. Thirty seconds of speech takes about 7.5 seconds on a CPU. Most TTS models either need serious hardware or take far longer to process.
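To make the RTF arithmetic concrete, here is a minimal sketch. The helper names are made up for illustration and are not part of Sopro; the only numbers used are the ones quoted above.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def generation_time(rtf: float, audio_seconds: float) -> float:
    """Expected wall-clock time to synthesize audio_seconds of speech."""
    return rtf * audio_seconds


rtf = real_time_factor(7.5, 30.0)
print(rtf)                          # 0.25
print(generation_time(0.25, 30.0))  # 7.5
print(rtf < 1.0)                    # True: faster than real time
```

Any RTF below 1.0 means audio is produced faster than it plays back, which is the basic condition for streaming output without stalls.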

Key specs:

- 169M parameters (pretty small)
- Zero-shot voice cloning with just 3-12 seconds of reference audio
- Streaming support for real-time applications
- Apache 2.0 license (completely open)
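If you want to check that a reference clip falls inside the 3-12 second range before trying it, a small sketch using only Python's standard library could look like this. The function names are hypothetical, the check assumes an uncompressed PCM WAV file, and whether Sopro itself enforces this range is not stated here.

```python
import wave


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(duration: float, lo: float = 3.0, hi: float = 12.0) -> bool:
    """True if the clip length falls in the 3-12 second range quoted above."""
    return lo <= duration <= hi


print(is_usable_reference(5.0))   # True
print(is_usable_reference(20.0))  # False
```

In practice you would pass the result of clip_duration_seconds("your_clip.wav") into is_usable_reference, where the file name is a placeholder for your own recording.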

The creator admits it’s not perfect: voice cloning can be hit-or-miss, and output sometimes becomes unstable. It is also English-only, and it was trained on a modest budget of a single L40S GPU.

Still, for a side project that runs locally without GPU requirements, it’s a solid option for quick prototypes.

Repo: https://github.com/samuel-vitorino/sopro
