
Fast CPU-Only TTS: Sopro Clones Voices at 0.25 RTF

Sopro delivers fast CPU-only text-to-speech with voice cloning, reaching a real-time factor (RTF) of 0.25 without requiring a GPU.

Someone built a surprisingly fast text-to-speech model called Sopro that runs on regular CPUs without needing a GPU.

The interesting bit is the speed - it hits 0.25 RTF, which means generating 30 seconds of audio only takes 7.5 seconds on a CPU. Most TTS models either need serious hardware or take forever to process.
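For anyone who wants to check that arithmetic, here is a minimal sketch. It assumes the usual definition of RTF as processing time divided by the duration of the audio produced; the function names are made up for illustration and are not part of Sopro.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock seconds needed to generate `audio_seconds` of audio."""
    return audio_seconds * rtf


# The numbers quoted above: 30 seconds of audio at 0.25 RTF.
print(synthesis_time(30.0, 0.25))    # 7.5
print(real_time_factor(7.5, 30.0))   # 0.25
```

Anything below 1.0 RTF is faster than real time, which is roughly what streaming playback needs.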

Key specs:

  • 169M parameters (pretty small)
  • Zero-shot voice cloning with just 3-12 seconds of reference audio
  • Streaming support for real-time applications
  • Apache 2.0 license (completely open)
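
Since the reference clip is supposed to be 3-12 seconds long, it can help to check a file before handing it to the model. The sketch below uses only Python's standard `wave` module and assumes the clip is an uncompressed PCM WAV file; the helper names are hypothetical and nothing here calls Sopro itself.

```python
import wave

MIN_REF_SECONDS = 3.0    # lower bound quoted in the specs above
MAX_REF_SECONDS = 12.0   # upper bound quoted in the specs above


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, from its frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str) -> bool:
    """True if the clip length falls inside the 3-12 second window."""
    return MIN_REF_SECONDS <= wav_duration_seconds(path) <= MAX_REF_SECONDS
```

A clip outside that window is not guaranteed to fail, but trimming it to fit is a cheap first step when cloning results look off.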

The creator admits it’s not perfect - voice cloning can be hit-or-miss, and output sometimes gets unstable. It also only supports English, as it was trained on a single L40S GPU.

Still, for a side project that runs locally without GPU requirements, it’s a solid option for quick prototypes.

Repo: https://github.com/samuel-vitorino/sopro