Supertonic: 66M Parameter TTS at 166x Real-Time Speed

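from supertonic import SupertonicTTS

# Load the pretrained 66M-parameter checkpoint (downloaded on first use)
model = SupertonicTTS.from_pretrained("collabora/supertonic-66m")
# Synthesize speech; the speed argument sets the speaking rate
audio = model.synthesize("Neural speech synthesis just got significantly faster.", speed=1.0)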

The snippet above generates natural-sounding speech from text using Supertonic, a compact text-to-speech model from Collabora that synthesizes audio 166 times faster than real time on consumer hardware. With just 66 million parameters, the model produces intelligible speech while keeping a computational footprint small enough to run on edge devices and older GPUs.

Training Approach

Supertonic builds on the GAN-TTS architecture but introduces several optimizations that reduce both model size and inference latency. The training process uses a multi-speaker dataset combining LibriTTS and VCTK, totaling approximately 580 hours of English speech from 2,456 speakers. This diversity helps the model generalize across different voice characteristics without requiring speaker-specific fine-tuning.

The architecture employs a feed-forward transformer for text encoding paired with a convolutional generator that produces mel-spectrograms. Rather than using autoregressive generation, which processes one timestep at a time, Supertonic generates entire utterances in parallel. This non-autoregressive approach accounts for much of the speed improvement, though it requires careful alignment modeling during training.
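The article does not say how Supertonic handles alignment. One common approach in non-autoregressive TTS, used by FastSpeech-style models, is to predict a frame count for each phoneme and then expand the phoneme encodings to that length so the whole utterance can be decoded in one pass. The following is a minimal sketch of that expansion step only, assuming durations are already known; the function name and toy values are illustrative and not taken from Supertonic.

```python
from typing import List, Sequence


def length_regulate(
    encodings: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each per-phoneme encoding by its duration in frames."""
    if len(encodings) != len(durations):
        raise ValueError("need exactly one duration per encoding")
    frames: List[List[float]] = []
    for vector, n_frames in zip(encodings, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy each vector so the output frames do not alias one another.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Three toy phoneme encodings (hypothetical values) and their frame counts.
phonemes = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]

frames = length_regulate(phonemes, durations)
print(len(frames))  # 6 frames: 2 + 1 + 3
```

Because the output length is fixed before decoding starts, every frame can be generated in parallel instead of one timestep at a time.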

Collabora’s team applied knowledge distillation from larger teacher models, allowing the 66M parameter student to capture acoustic patterns that would typically require networks ten times its size. The discriminator network during training uses multiple scales to evaluate both fine-grained phonetic details and broader prosodic patterns. Training ran for 500,000 steps with a batch size of 32 on four NVIDIA A100 GPUs, taking approximately six days to converge.
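To put the training budget in perspective, the figures above can be turned into a few derived numbers. This is back-of-the-envelope arithmetic only, and it assumes the batch size of 32 is the global batch across all four GPUs and that "six days" means six full 24-hour days; neither detail is stated in the article.

```python
STEPS = 500_000   # training steps reported above
BATCH_SIZE = 32   # assumed to be the global batch size
NUM_GPUS = 4      # NVIDIA A100s
DAYS = 6          # approximate wall-clock training time

seconds = DAYS * 24 * 3600
steps_per_second = STEPS / seconds
samples_seen = STEPS * BATCH_SIZE
gpu_hours = NUM_GPUS * DAYS * 24

print(f"{steps_per_second:.2f} steps/s")           # 0.96 steps/s
print(f"{samples_seen:,} utterances processed")    # 16,000,000 utterances processed
print(f"{gpu_hours} GPU-hours")                    # 576 GPU-hours
```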

Notable Results

Benchmark tests show Supertonic running 166 times faster than real time on an NVIDIA RTX 3090, meaning it generates 166 seconds of audio per second of processing time. On more modest hardware like the RTX 2060, the model still reaches 89x real-time speed. CPU-only inference on an AMD Ryzen 9 5900X reaches 12x real time, making it viable for server deployments without dedicated accelerators.
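The speed figures above are ratios of audio duration to processing time, so they are straightforward to check on your own hardware. Below is a rough timing sketch using only the standard library. The `speed_factor` helper and the stand-in `fake_synthesize` function are hypothetical, and the sketch assumes the synthesizer returns a flat sequence of samples at 22,050 Hz.

```python
import time
from typing import Callable, Sequence


def speed_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int = 22050,
    runs: int = 3,
) -> float:
    """Return seconds of audio generated per second of wall-clock time."""
    best_elapsed = float("inf")
    n_samples = 0
    for _ in range(runs):
        start = time.perf_counter()
        samples = synthesize(text)
        elapsed = time.perf_counter() - start
        # Keep the fastest run to reduce the effect of warm-up and noise.
        if elapsed < best_elapsed:
            best_elapsed = elapsed
            n_samples = len(samples)
    audio_seconds = n_samples / sample_rate
    return audio_seconds / max(best_elapsed, 1e-9)


def fake_synthesize(text: str) -> list:
    """Stand-in for a real model: 'renders' one second of silence."""
    time.sleep(0.01)
    return [0.0] * 22050


factor = speed_factor(fake_synthesize, "hello")
print(f"{factor:.0f}x real time")
```

When timing GPU inference, make sure the call returns only after the computation has actually finished (for example by synchronizing the device), otherwise the measured speed will be too optimistic.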

Subjective listening tests place Supertonic’s mean opinion score (MOS) at 3.8 out of 5.0 for naturalness, compared to 4.2 for ground-truth recordings. This trails state-of-the-art models like VITS-2 (4.1 MOS) and Tortoise-TTS (4.0 MOS) by 0.2 to 0.3 points, but the trade-off looks more favorable once inference speed is taken into account: Supertonic generates speech roughly 40 times faster than VITS-2 and 300 times faster than Tortoise-TTS.
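Combining these ratios with the 166x figure gives a rough idea of where the slower models land in absolute terms. The calculation below assumes the 40x and 300x comparisons were made on the same hardware as the 166x benchmark (the RTX 3090), which the article does not state, so treat the results as estimates rather than measurements.

```python
SUPERTONIC_SPEED = 166.0  # x real time on an RTX 3090, from the benchmarks above

# How many times faster Supertonic is than each model, per the article.
speedup_over = {"VITS-2": 40.0, "Tortoise-TTS": 300.0}

# Mean opinion scores quoted in the article.
mos = {"Supertonic": 3.8, "VITS-2": 4.1, "Tortoise-TTS": 4.0}

implied = {name: SUPERTONIC_SPEED / ratio for name, ratio in speedup_over.items()}

for name, speed in implied.items():
    gap = mos[name] - mos["Supertonic"]
    print(f"{name}: ~{speed:.2f}x real time, MOS gap {gap:+.1f}")
# VITS-2: ~4.15x real time, MOS gap +0.3
# Tortoise-TTS: ~0.55x real time, MOS gap +0.2
```

Under that assumption, Tortoise-TTS would run slower than real time, which rules it out for interactive use regardless of its quality advantage.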

The model handles punctuation-based prosody effectively, inserting appropriate pauses and intonation changes for commas, periods, and question marks. Speaker consistency remains stable across long-form generation, with minimal drift in voice characteristics over multi-paragraph synthesis. Phoneme accuracy tests show 96.2% correct pronunciation on standard English words, though performance degrades on technical jargon and proper nouns not well-represented in training data.

Running Locally

Installation requires PyTorch 2.0 or later with CUDA support for GPU acceleration:

pip install supertonic-tts torch torchaudio

The model downloads automatically on first use, requiring approximately 250MB of disk space for weights and configuration files. Basic synthesis requires minimal code:

from supertonic import SupertonicTTS

tts = SupertonicTTS.from_pretrained("collabora/supertonic-66m")
waveform = tts.synthesize("Speech synthesis in three lines of code.")
tts.save_wav(waveform, "output.wav", sample_rate=22050)

Advanced users can adjust speaking rate, add silence padding, and control pitch variance through optional parameters. The model outputs 22.05kHz audio by default, though upsampling to 44.1kHz or 48kHz using standard audio processing libraries introduces minimal quality degradation.

For production deployments, Collabora provides ONNX export functionality that further reduces inference latency by 15-20% through graph optimization and operator fusion. Docker containers with pre-configured environments are available at https://github.com/collabora/supertonic for streamlined deployment.

Trade-offs

Supertonic’s speed comes with compromises in expressiveness and voice quality. The model lacks the fine-grained prosody control found in slower alternatives, making it less suitable for applications requiring emotional speech or dramatic narration. Voice cloning is not supported: the model produces a single, averaged voice rather than mimicking specific speakers.

Audio quality occasionally exhibits artifacts during rapid phoneme transitions, particularly with consonant clusters and diphthongs. Background noise in synthesized speech, while subtle, becomes noticeable in quiet listening environments or when compressed for streaming. The 22.05kHz sample rate, while adequate for most applications, falls short of the 44.1kHz standard used in music production and high-fidelity audio applications.
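What the sample-rate limitation means in practice follows from the Nyquist limit: audio sampled at a given rate can only represent frequencies up to half that rate. A quick calculation for the rates mentioned in this article (the helper name is illustrative):

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2.0


for rate in (22050, 44100, 48000):
    print(f"{rate} Hz sampling -> content up to {nyquist_hz(rate):.0f} Hz")
# 22050 Hz sampling -> content up to 11025 Hz
# 44100 Hz sampling -> content up to 22050 Hz
# 48000 Hz sampling -> content up to 24000 Hz
```

This is also why upsampling the 22.05kHz output to 44.1kHz or 48kHz makes the file compatible with other pipelines but cannot restore content above roughly 11kHz.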

Despite these limitations, Supertonic occupies a valuable niche for applications where speed and efficiency outweigh maximum quality: real-time assistants, accessibility tools, and embedded systems all benefit from its compact architecture and rapid inference.
