Fish Audio S2: Text-to-Speech with Natural Language Control

What It Is

Fish Audio’s S2 model represents a shift in how text-to-speech systems handle emotional expression. Rather than adjusting numerical parameters or toggling settings, the model accepts plain English descriptions of how speech should sound. Developers can embed tags like [whispers sweetly] or [speaking confidently] directly into text, and the model interprets these instructions to generate corresponding vocal characteristics.

The system supports over 80 languages and achieves 100ms time-to-first-audio latency. One notable capability: generating multi-speaker conversations in a single inference pass, eliminating the need to swap between different voice models or configurations. The model is available as an open-source release at https://huggingface.co/fishaudio/s2-pro/.

Why It Matters

This approach addresses a persistent friction point in voice synthesis workflows. Traditional TTS systems require users to understand acoustic parameters—pitch curves, speaking rate multipliers, energy levels—that don’t map intuitively to desired outcomes. A developer wanting “nervous laughter” might spend time adjusting multiple sliders before approximating the right sound.

Natural language control lowers the barrier for non-specialists. Content creators, game developers, and accessibility tool builders can describe vocal qualities the same way they’d direct a voice actor. This matters particularly for rapid prototyping scenarios where iterating on emotional tone shouldn’t require deep technical knowledge of speech synthesis.

The benchmark performance adds weight to the approach. S2 scored above proprietary models from Google and OpenAI on both the Audio Turing Test and EmergentTTS-Eval assessments. These results suggest that training models to understand emotional descriptors doesn’t compromise output quality—it may actually improve how well synthesized speech matches human expectations.

For teams building conversational AI or narrative experiences, the multi-speaker capability streamlines production. Generating dialogue between characters without model switching reduces complexity in the audio pipeline and maintains consistent processing overhead.

Getting Started

The model lives on Hugging Face at https://huggingface.co/fishaudio/s2-pro/. Developers can access it through standard Hugging Face inference APIs or download weights for local deployment.

A basic implementation might look like:


# Sketch using the Hugging Face transformers API; the exact loading and
# generation calls may differ from the model card, so treat this as a template.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("fishaudio/s2-pro")
tokenizer = AutoTokenizer.from_pretrained("fishaudio/s2-pro")

text = "[speaking nervously] I'm not sure this is the right approach. [pauses] [more confidently] But let's try it anyway."
inputs = tokenizer(text, return_tensors="pt")
audio = model.generate(**inputs)

The emotional tags integrate directly into the text string. The model parses these annotations and applies corresponding acoustic modifications during generation. Developers can experiment with different descriptors to find what produces desired results—there’s no fixed vocabulary, though common emotional states and speaking styles tend to work reliably.
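Because there is no fixed tag vocabulary, it can help to sanity-check a script before synthesis. A minimal sketch, independent of the model itself, that separates the bracketed directives from the spoken transcript (the bracket format follows the examples above; the model's own parser is not exposed, so this is purely illustrative):

```python
import re

TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")

def split_tags(script: str) -> tuple[list[str], str]:
    """Return the bracketed directives and the plain spoken text."""
    tags = TAG_PATTERN.findall(script)
    transcript = TAG_PATTERN.sub("", script)
    # Collapse the whitespace left behind by removed tags.
    transcript = re.sub(r"\s+", " ", transcript).strip()
    return tags, transcript

tags, transcript = split_tags(
    "[speaking nervously] I'm not sure this is the right approach. "
    "[pauses] [more confidently] But let's try it anyway."
)
```

A quick pass like this makes it easy to review what the listener will actually hear versus what is being requested stylistically.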

For multi-speaker scenarios, tags can differentiate between voices within the same generation call, though specific syntax may vary based on implementation details in the model documentation.
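As a purely hypothetical illustration (the speaker-tag syntax here is invented for the example and must be checked against the model documentation), a two-speaker exchange could be assembled into a single annotated string from structured turns:

```python
# Hypothetical speaker-tag syntax -- confirm against the s2-pro documentation.
turns = [
    ("Speaker 1", "cheerfully", "Did you see the benchmark results?"),
    ("Speaker 2", "skeptically", "I did. Let's reproduce them first."),
]

def render_dialogue(turns):
    """Join (speaker, style, line) tuples into one annotated string."""
    return " ".join(f"[{speaker}] [{style}] {line}" for speaker, style, line in turns)

dialogue = render_dialogue(turns)
```

Keeping turns as structured data and rendering tags at the last step makes it easy to adapt if the documented syntax differs.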

Context

Most production TTS systems still rely on parameter-based control. Google’s WaveNet derivatives, Amazon Polly, and Azure Speech Services expose settings for pitch, rate, and volume but require users to translate creative intent into numerical adjustments. Some newer models like Bark have explored prompt-based control, though S2’s benchmark scores suggest its approach to emotional interpretation may be more refined.

The 100ms latency positions S2 competitively for real-time applications, though actual performance depends on hardware and batch size. Developers should test latency under production conditions rather than assuming benchmark figures will hold across all deployment scenarios.
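A straightforward way to check latency under your own conditions is to time the synthesis call directly. A minimal harness, where `synthesize` is a stand-in for whatever generation call your deployment uses (the lambda below is a stub, not a real TTS call):

```python
import statistics
import time

def measure_latency(synthesize, text, runs=20, warmup=3):
    """Time a synthesis callable; return (median, p95) in seconds."""
    for _ in range(warmup):
        synthesize(text)  # warm caches and lazy initialization first
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return statistics.median(samples), samples[int(0.95 * (len(samples) - 1))]

# Stub standing in for a real synthesis call, just to show usage.
median_s, p95_s = measure_latency(lambda text: time.sleep(0.001), "[calmly] Hello.")
```

Reporting a percentile alongside the median matters here: real-time audio pipelines are constrained by worst-case latency, not the average.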

Limitations worth noting: open-source models typically lack the infrastructure support and SLA guarantees that come with commercial services. Teams need to handle hosting, scaling, and maintenance themselves. The 80+ language claim also warrants testing—quality often varies significantly across languages in multilingual models, with less common languages receiving less training attention.

The natural language control paradigm raises questions about consistency. Unlike numerical parameters that produce deterministic results, text descriptions introduce ambiguity. “[speaking nervously]” might generate different acoustic patterns across runs or model versions, which could complicate workflows requiring exact reproducibility.