TTS Model Stops Throat Singing, Gets 50% Better

Researchers improved text-to-speech model performance by 50% after discovering and removing throat singing samples from the training dataset that caused audio

Someone fixed their text-to-speech model that kept randomly breaking into Mongolian throat singing, which is pretty hilarious but not ideal for a TTS system.

Soprano 1.1 cut those weird vocal hallucinations by 95% and dropped the word error rate by 50%. The developer also extended max sentence length from 15 to 30 seconds and cleaned up the audio artifacts from the original undertrained model.
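For anyone wondering what a 50% drop in word error rate actually means: WER is usually computed as the word-level edit distance between a reference transcript and what a speech recognizer hears in the generated audio, divided by the number of reference words. The post doesn't say how Soprano's WER was measured, so the sketch below is just the standard textbook definition, and the function names and sample numbers are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old: float, new: float) -> float:
    """Fraction by which a metric dropped, e.g. 0.5 for a 50% reduction."""
    return (old - new) / old


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from the Soprano evaluation.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
    # Hypothetical before/after WER values illustrating a 50% drop.
    print(f"Relative reduction: {relative_reduction(0.10, 0.05):.0%}")
```

Note that a 50% reduction is relative: going from a WER of 10% to 5% counts, and so does going from 2% to 1%.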

The best part? They ran a “blind study on my family (against their will)” and got a 63% preference rate for the new version.

Turns out training your model properly makes a huge difference. Who knew?