Hibiki Zero: Direct Speech-to-Speech Translation
Kyutai's Hibiki Zero is a 3 billion parameter speech-to-speech translation model that converts audio directly into translated audio without an intermediate text step.
What It Is
Hibiki Zero represents a different approach to voice translation. Rather than following the traditional pipeline of speech-to-text, text translation, then text-to-speech, this 3 billion parameter model processes audio directly into translated audio. The architecture skips intermediate text representations entirely, working with speech signals from input to output.
Kyutai designed the model to handle conversational nuances that typically get lost in cascade systems. It preserves timing elements like natural pauses, manages overlapping speech between speakers, and maintains prosodic patterns that make conversations sound human. The model supports English, French, Spanish, and Japanese, with weights released under an open license at https://huggingface.co/kyutai/hibiki-zero-3b-pytorch-bf16.
The 3B parameter count puts Hibiki Zero in an unusual category: small enough to run on consumer GPUs, yet large enough to capture complex speech patterns. This contrasts sharply with models that require 70B+ parameters or specialized inference infrastructure.
Why It Matters
Direct speech-to-speech processing solves problems that plague traditional translation pipelines. When systems convert speech to text, translate the text, then synthesize new speech, they lose paralinguistic information: the tone, rhythm, and emotional content that carry meaning beyond words. Hibiki Zero’s approach preserves these elements throughout the translation process.
Developers building real-time translation tools gain a practical option that doesn’t require chaining multiple models together. Each step in a cascade system introduces latency and potential failure points. A unified model reduces both concerns while maintaining conversational flow that sounds less robotic.
The accessibility factor matters for smaller teams and individual developers. Running a 3B model requires roughly 6GB of VRAM in bfloat16 precision, achievable on mid-range consumer GPUs. This opens voice translation capabilities to projects that can’t afford enterprise-grade infrastructure or API costs for high-volume processing.
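The arithmetic behind that estimate is straightforward. A quick sketch (weights only; activations and the KV cache add overhead on top, so real usage runs higher):

```python
def weight_vram_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone. bfloat16 stores each parameter
    in 2 bytes; fp32 would double this, int8 quantization would halve it."""
    return n_params * bytes_per_param / 1e9

# A 3B-parameter model in bf16:
print(weight_vram_gb(3e9))  # 6.0 (GB)
```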
Research teams working on multilingual voice interfaces now have a reference implementation for end-to-end speech translation. The open weights allow fine-tuning for specific domains or language pairs, potentially improving performance for specialized use cases like medical interpretation or technical support.
Getting Started
The model weights are available through Hugging Face’s model hub. Developers can download them directly:
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "kyutai/hibiki-zero-3b-pytorch-bf16",
    trust_remote_code=True,
)
Audio samples demonstrating the model’s capabilities across different language pairs can be heard at https://huggingface.co/spaces/kyutai/hibiki-zero-samples. These examples show how the model handles various conversational scenarios, from formal presentations to casual dialogue.
Kyutai’s technical blog post at https://kyutai.org/blog/2026-02-12-hibiki-zero provides implementation details, including preprocessing requirements and inference optimization strategies. The documentation covers audio format specifications and expected latency characteristics.
For production deployments, developers should test the model with representative audio samples from their target use case. Background noise, accent variation, and domain-specific terminology can all affect performance in ways that general benchmarks don’t capture.
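One concrete check worth running on those representative samples is the real-time factor: processing time divided by clip duration, where values below 1.0 mean the model keeps up with live audio. A minimal sketch, assuming a `process_fn` stand-in for whatever inference call you wire up (this is not Hibiki Zero's actual API):

```python
import time

def real_time_factor(process_fn, clip, clip_seconds: float) -> float:
    """Run one inference pass and return elapsed_time / clip_duration.
    RTF < 1.0 means processing is faster than real time."""
    start = time.perf_counter()
    process_fn(clip)
    elapsed = time.perf_counter() - start
    return elapsed / clip_seconds

# Example with a no-op stand-in for model inference on a 10-second clip:
rtf = real_time_factor(lambda clip: None, clip=b"...", clip_seconds=10.0)
```

Averaging RTF over clips that match your deployment conditions (noise level, accents, domain vocabulary) gives a more honest picture than a single clean benchmark recording.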
Context
Hibiki Zero competes with cascade approaches that chain Whisper for transcription, a machine translation model, and a TTS system such as Bark or XTTS. While cascade systems offer flexibility in swapping components, they struggle to maintain conversational naturalness and introduce cumulative latency.
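The cascade structure can be sketched abstractly as three chained stages; the functions below are stubs standing in for real components (ASR, MT, TTS), with illustrative names rather than any real API:

```python
def transcribe(audio: bytes) -> str:
    """ASR stage (e.g. a Whisper-class model). Stubbed for illustration."""
    return "bonjour le monde"

def translate_text(text: str) -> str:
    """MT stage. Stubbed for illustration."""
    return "hello world"

def synthesize(text: str) -> bytes:
    """TTS stage (e.g. a Bark- or XTTS-class model). Stubbed for illustration."""
    return text.encode()

def cascade(audio: bytes) -> bytes:
    # Each hop adds latency and a failure point, and paralinguistic cues
    # (tone, pauses, speaker overlap) are discarded once audio becomes text.
    return synthesize(translate_text(transcribe(audio)))
```

A direct model like Hibiki Zero collapses all three stages into a single audio-to-audio mapping, which is what lets it carry prosody through to the output.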
Meta’s SeamlessM4T takes a similar end-to-end approach but at significantly larger scale. Hibiki Zero’s smaller footprint makes it more practical for edge deployment or applications where cloud API calls aren’t viable.
Limitations include the current language support: four languages cover major use cases but leave gaps for many language pairs. The model’s performance on low-resource languages or heavily accented speech remains unclear without extensive testing. Audio quality degrades with poor input recordings, though this affects all speech models.
The open release strategy contrasts with proprietary voice translation services from major cloud providers. Teams gain control over their deployment and data but assume responsibility for model hosting and maintenance.