Supertonic: 66M Param TTS at 166x Real-Time Speed
Supertonic is a 66-million-parameter text-to-speech model that synthesizes speech at 166 times real-time speed, demonstrating efficient neural voice generation.
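from supertonic import SupertonicTTS
# Load the pretrained 66M-parameter checkpoint (downloaded on first use)
model = SupertonicTTS.from_pretrained("collabora/supertonic-66m")
# Synthesize one sentence at the default speaking rate
audio = model.synthesize("Neural speech synthesis just got significantly faster.", speed=1.0)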
This code snippet generates natural-sounding speech from text using Supertonic, a compact text-to-speech model from Collabora that generates audio at 166 times real-time speed on consumer hardware. With just 66 million parameters, the model produces intelligible speech while keeping its computational footprint small enough to run on edge devices and older GPUs.
Training Approach
Supertonic builds on the GAN-TTS architecture but introduces several optimizations that reduce both model size and inference latency. The training process uses a multi-speaker dataset combining LibriTTS and VCTK, totaling approximately 580 hours of English speech from 2,456 speakers. This diversity helps the model generalize across different voice characteristics without requiring speaker-specific fine-tuning.
The architecture employs a feed-forward transformer for text encoding paired with a convolutional generator that produces mel-spectrograms. Rather than using autoregressive generation, which processes one timestep at a time, Supertonic generates entire utterances in parallel. This non-autoregressive approach accounts for much of the speed improvement, though it requires careful alignment modeling during training.
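The alignment step mentioned above is often handled in non-autoregressive TTS with a duration-based length regulator, which repeats each text-token encoding for as many frames as a duration predictor assigns to it. The sketch below illustrates that general idea only. Whether Supertonic aligns text and audio this way is an assumption, and the names `length_regulate`, `phoneme_encodings`, and `predicted_durations` are hypothetical.

```python
from typing import List, Sequence


def length_regulate(
    encodings: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat each token encoding durations[i] times, giving one vector per frame."""
    if len(encodings) != len(durations):
        raise ValueError("need exactly one duration per token encoding")
    frames: List[List[float]] = []
    for vector, duration in zip(encodings, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so frames do not alias each other.
        frames.extend(list(vector) for _ in range(duration))
    return frames


# Toy input: three phoneme encodings with hypothetical predicted durations.
phoneme_encodings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
predicted_durations = [2, 1, 3]

frames = length_regulate(phoneme_encodings, predicted_durations)
print(len(frames))  # one entry per output frame: 2 + 1 + 3
```

Because every frame position is known once the durations are fixed, a generator can produce all frames in a single parallel pass instead of waiting on the previous timestep.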
Collabora’s team applied knowledge distillation from larger teacher models, allowing the 66M parameter student to capture acoustic patterns that would typically require networks ten times its size. The discriminator network during training uses multiple scales to evaluate both fine-grained phonetic details and broader prosodic patterns. Training ran for 500,000 steps with a batch size of 32 on four NVIDIA A100 GPUs, taking approximately six days to converge.
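One common way to build a multi-scale discriminator is to feed it the same signal at several resolutions, each obtained by average pooling. The sketch below shows only that input preparation on a toy 1-D sequence (for example, one mel channel over time). The pooling factors and the function names are assumptions for illustration, not details from Supertonic's training code.

```python
from typing import Dict, List, Sequence


def average_pool(signal: Sequence[float], factor: int) -> List[float]:
    """Downsample by averaging non-overlapping windows of `factor` samples."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    usable = len(signal) - len(signal) % factor  # drop the incomplete tail window
    return [sum(signal[i:i + factor]) / factor for i in range(0, usable, factor)]


def multi_scale_views(
    signal: Sequence[float], factors: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[float]]:
    """Return one pooled copy of the signal per scale, keyed by pooling factor."""
    return {factor: average_pool(signal, factor) for factor in factors}


views = multi_scale_views([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
for factor, view in views.items():
    print(factor, view)
```

The unpooled view keeps fine detail that matters for phonetic accuracy, while the coarser views smooth it away and leave slower-moving structure such as prosodic contours.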
Notable Results
Benchmark tests show Supertonic running at 166x real-time speed on an NVIDIA RTX 3090, meaning it generates 166 seconds of audio per second of processing time (a real-time factor of roughly 0.006). On more modest hardware like the RTX 2060, the model still maintains 89x real-time speed. CPU-only inference on an AMD Ryzen 9 5900X reaches 12x real-time, making it viable for server deployments without dedicated accelerators.
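To make these multipliers concrete, the following sketch converts each reported speed into the processing time needed for one hour of audio. The speed figures are the ones quoted above. The helper names are illustrative, and the calculation assumes the speedup holds constant across utterance lengths.

```python
def realtime_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, speedup: float) -> float:
    """Compute time needed to synthesize `audio_seconds` at a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Speeds as reported in the article.
reported_speedups = {
    "RTX 3090": 166.0,
    "RTX 2060": 89.0,
    "Ryzen 9 5900X (CPU)": 12.0,
}

for device, speedup in reported_speedups.items():
    seconds = processing_time(3600.0, speedup)
    print(f"{device}: about {seconds:.1f} s per hour of audio")
```

At 12x, an hour of speech takes five minutes of CPU time, and on the RTX 3090 it takes under 22 seconds.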
Subjective listening tests place Supertonic’s mean opinion score (MOS) at 3.8 out of 5.0 for naturalness, compared to 4.2 for ground truth recordings. This trails state-of-the-art models like VITS-2 (4.1 MOS) and Tortoise-TTS (4.0 MOS) by 0.2 to 0.3 points, but the trade-off looks more favorable once inference speed is considered: Supertonic generates speech roughly 40 times faster than VITS-2 and 300 times faster than Tortoise-TTS.
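The relative speed figures can be turned into implied real-time multipliers for the other two models. This is a back-of-the-envelope sketch that assumes the 40x and 300x ratios were measured against the 166x RTX 3090 result, which the benchmarks do not state explicitly.

```python
SUPERTONIC_SPEEDUP = 166.0  # reported RTX 3090 figure
SUPERTONIC_MOS = 3.8

# Model -> (how many times faster Supertonic is, reported MOS)
comparisons = {
    "VITS-2": (40.0, 4.1),
    "Tortoise-TTS": (300.0, 4.0),
}

implied_speedup = {}
mos_gap = {}
for model, (times_faster, mos) in comparisons.items():
    # If Supertonic is N times faster, the other model runs at 166 / N.
    implied_speedup[model] = SUPERTONIC_SPEEDUP / times_faster
    mos_gap[model] = round(mos - SUPERTONIC_MOS, 2)
    print(
        f"{model}: about {implied_speedup[model]:.2f}x real time, "
        f"{mos_gap[model]:+.1f} MOS relative to Supertonic"
    )
```

Under that assumption VITS-2 would run at around 4x real time and Tortoise-TTS below real time, so a 0.2 to 0.3 MOS advantage costs one to two orders of magnitude in throughput.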
The model handles punctuation-based prosody effectively, inserting appropriate pauses and intonation changes for commas, periods, and question marks. Speaker consistency remains stable across long-form generation, with minimal drift in voice characteristics over multi-paragraph synthesis. Phoneme accuracy tests show 96.2% correct pronunciation on standard English words, though performance degrades on technical jargon and proper nouns not well-represented in training data.
Running Locally
Installation requires PyTorch 2.0 or later. A CUDA-enabled build is needed only for GPU acceleration, since the model also runs on CPU:
pip install supertonic-tts torch torchaudio
The model downloads automatically on first use, requiring approximately 250MB of disk space for weights and configuration files. Basic synthesis requires minimal code:
from supertonic import SupertonicTTS
tts = SupertonicTTS.from_pretrained("collabora/supertonic-66m")
waveform = tts.synthesize("Speech synthesis in three lines of code.")
tts.save_wav(waveform, "output.wav", sample_rate=22050)
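If you would rather write the file without a library helper, Python's standard `wave` module is enough for 16-bit mono output. The sketch below assumes the waveform is a flat sequence of floats in the range -1.0 to 1.0. The article does not specify the return type of `synthesize`, so a generated test tone stands in for real model output, and `write_wav_16bit` is a hypothetical helper name.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Iterable


def write_wav_16bit(path: str, samples: Iterable[float], sample_rate: int = 22050) -> None:
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM WAV."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for model output: one second of a 440 Hz tone at 22.05 kHz.
SAMPLE_RATE = 22050
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

demo_path = os.path.join(tempfile.gettempdir(), "supertonic_demo_tone.wav")
write_wav_16bit(demo_path, tone, SAMPLE_RATE)
```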
Advanced users can adjust speaking rate, add silence padding, and control pitch variance through optional parameters. The model outputs 22.05kHz audio by default. Upsampling to 44.1kHz or 48kHz with standard audio processing libraries introduces minimal quality degradation, though it cannot restore frequency content above the original 11.025kHz Nyquist limit.
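As a rough illustration of what upsampling does, the sketch below resamples by linear interpolation using only the standard library. It is deliberately simplistic. Production code should use a band-limited resampler from an audio library, and `resample_linear` is a hypothetical helper rather than part of any Supertonic API.

```python
from typing import List, Sequence


def resample_linear(samples: Sequence[float], old_rate: int, new_rate: int) -> List[float]:
    """Resample by linearly interpolating between neighbouring input samples."""
    if old_rate <= 0 or new_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []
    out_len = round(len(samples) * new_rate / old_rate)
    last = len(samples) - 1
    output: List[float] = []
    for i in range(out_len):
        position = i * old_rate / new_rate   # fractional index into the input
        lower = min(int(position), last)
        upper = min(lower + 1, last)         # hold the final sample at the edge
        fraction = position - lower
        output.append(samples[lower] * (1.0 - fraction) + samples[upper] * fraction)
    return output


upsampled = resample_linear([0.0, 1.0, 2.0], 22050, 44100)
print(upsampled)
```

Doubling the rate inserts one interpolated value between each pair of original samples. No new high-frequency detail appears, which is why upsampled output does not sound richer than the 22.05kHz original.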
For production deployments, Collabora provides ONNX export functionality that further reduces inference latency by 15-20% through graph optimization and operator fusion. Docker containers with pre-configured environments are available at https://github.com/collabora/supertonic for streamlined deployment.
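A 15-20% latency reduction translates into a somewhat larger gain in real-time speed, because speed is the inverse of latency. The sketch below works through that arithmetic, assuming the reduction applies uniformly to the 166x RTX 3090 figure. The source gives no ONNX benchmark, so these are derived estimates, not measurements.

```python
def speedup_after_latency_cut(base_speedup: float, latency_reduction: float) -> float:
    """New real-time multiplier after latency shrinks by the given fraction."""
    if not 0.0 <= latency_reduction < 1.0:
        raise ValueError("latency_reduction must be in [0, 1)")
    # Latency scales by (1 - reduction), so throughput scales by its inverse.
    return base_speedup / (1.0 - latency_reduction)


BASE_SPEEDUP = 166.0  # reported PyTorch figure on the RTX 3090

low_estimate = speedup_after_latency_cut(BASE_SPEEDUP, 0.15)
high_estimate = speedup_after_latency_cut(BASE_SPEEDUP, 0.20)
print(f"Estimated ONNX speed: {low_estimate:.0f}x to {high_estimate:.0f}x real time")
```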
Trade-offs
Supertonic’s speed comes with compromises in expressiveness and voice quality. The model lacks fine-grained prosody control found in slower alternatives, making it less suitable for applications requiring emotional speech or dramatic narration. Voice cloning capabilities are absent—the model produces a single, averaged voice rather than mimicking specific speakers.
Audio quality occasionally exhibits artifacts during rapid phoneme transitions, particularly with consonant clusters and diphthongs. Background noise in synthesized speech, while subtle, becomes noticeable in quiet listening environments or when compressed for streaming. The 22.05kHz sample rate, while adequate for most applications, falls short of the 44.1kHz standard used in music production and high-fidelity audio applications.
Despite these limitations, Supertonic occupies a valuable niche for applications where speed and efficiency outweigh maximum quality—real-time assistants, accessibility tools, and embedded systems all benefit from its compact architecture and rapid inference.