Direct Speech-to-Speech Translation Without Text
Hibiki Zero: Direct Speech-to-Speech Translation
Hibiki Zero eliminates the text intermediary in speech translation by converting spoken language directly into another spoken language without generating written transcripts.
Breaking the Text Barrier
Traditional speech translation systems follow a three-step process: converting speech to text, translating that text, then synthesizing new speech. Hibiki Zero, developed by researchers at Kyoto University and NTT Corporation, collapses this pipeline into a single neural network that processes audio waveforms directly. The system learns to map acoustic features from one language to another while preserving the speaker’s vocal characteristics, including pitch, tone, and speaking rhythm.
This architecture relies on self-supervised learning from massive amounts of unlabeled speech data. Rather than requiring parallel corpora of translated sentences, Hibiki Zero trains on monolingual audio in both source and target languages. The model learns internal representations of speech that capture both linguistic content and paralinguistic features—the subtle vocal qualities that convey emotion and identity beyond mere words.
Technical Architecture
The system employs a multi-stage encoder-decoder framework. The encoder processes raw audio from the source language and extracts hierarchical representations at different time scales. Lower layers capture phonetic details while higher layers encode semantic meaning. A cross-lingual attention mechanism then maps these representations to the target language space.
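To make the attention step more concrete, here is a minimal sketch of generic scaled dot-product cross-attention in plain Python. It is not taken from the Hibiki Zero code base: the function names, the two-dimensional toy vectors, and the choice of this particular attention variant are all assumptions made for illustration. The idea it shows is that each target-side state forms a weighted mix of the source-side encoder frames.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    Each query produces a weighted average of `values`, where the weights
    reflect how strongly the query matches each key.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy data: three source-language encoder frames and
# two target-side states that query them.
source_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_states = [[4.0, 0.0], [0.0, 4.0]]

context, weights = cross_attention(target_states, source_frames, source_frames)
```

Each row of `weights` sums to one, so every entry in `context` is a convex combination of the source frames. A real model would learn projection matrices for the queries, keys, and values instead of using the raw vectors as this sketch does.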
The decoder generates speech in the target language using a neural vocoder that produces high-fidelity audio waveforms. Unlike traditional text-to-speech systems that require explicit phoneme sequences, this vocoder works directly from the encoder’s latent representations. This design preserves prosodic information—stress patterns, intonation, and rhythm—that would be lost in text-based intermediaries.
Training involves multiple objectives simultaneously. The model learns to reconstruct speech in both languages, align semantic content across language pairs, and maintain speaker identity. A key innovation is the use of contrastive learning to ensure that semantically equivalent utterances in different languages produce similar internal representations, even when acoustic properties differ substantially.
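The contrastive objective can be illustrated with an InfoNCE-style loss, a common formulation of contrastive learning. The article does not say which contrastive loss Hibiki Zero uses, so the loss form, the temperature value, and the three-dimensional toy embeddings below are assumptions. The sketch only demonstrates the stated goal: utterances that mean the same thing should score as more similar to each other than to unrelated utterances.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(source_embs, target_embs, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    source_embs[i] and target_embs[i] are assumed to be embeddings of the
    same utterance in two languages; every other target in the batch
    serves as a negative example.
    """
    losses = []
    for i, src in enumerate(source_embs):
        logits = [cosine(src, tgt) / temperature for tgt in target_embs]
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the true translation.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Hypothetical utterance embeddings (not real model outputs).
ja = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
en_aligned = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
en_shuffled = [en_aligned[1], en_aligned[2], en_aligned[0]]

aligned_loss = info_nce(ja, en_aligned)
shuffled_loss = info_nce(ja, en_shuffled)
```

With these toy vectors the correctly paired batch yields a much smaller loss than the shuffled one. That gap is the signal a training loop would use to pull equivalent utterances together in the representation space.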
The system initially handles 12 language pairs across Japanese, English, Mandarin Chinese, and Spanish. Reported translation accuracy is comparable to that of cascaded systems, with latency 40-60% lower. The model has approximately 180 million parameters and runs inference on standard GPU hardware.
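The figures above can be sanity-checked with a few lines of arithmetic. The sketch assumes that the 12 pairs count each translation direction separately (Japanese to English and English to Japanese as two pairs), which is what four languages produce. The 1000 ms cascaded baseline is a made-up number used only to show what a 40-60% reduction would mean; the article gives no absolute latency.

```python
from itertools import permutations

languages = ["Japanese", "English", "Mandarin Chinese", "Spanish"]

# Ordered pairs: (source, target) and (target, source) count separately.
directed_pairs = list(permutations(languages, 2))


def reduced_latency(cascaded_ms, reduction):
    """Latency left after removing `reduction` (a fraction) of a baseline."""
    return cascaded_ms * (1.0 - reduction)


# Hypothetical baseline, assumed for illustration only.
baseline_ms = 1000.0
best_case_ms = reduced_latency(baseline_ms, 0.60)
worst_case_ms = reduced_latency(baseline_ms, 0.40)

print(len(directed_pairs), "directed language pairs")
print(f"latency range: {best_case_ms:.0f} to {worst_case_ms:.0f} ms")
```

Under the assumed baseline, the direct model would respond in roughly 400 to 600 ms where the cascaded pipeline takes a full second.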
Code and model weights are available at https://github.com/ntt-hilab-gensp/hibiki-zero for research purposes, though commercial deployment requires additional licensing.
Real-World Applications
Medical settings benefit significantly from this technology. During emergency consultations with non-native speakers, doctors can communicate while hearing translated responses that maintain the patient’s emotional tone. A trembling voice or hesitation carries diagnostic information that text-based systems discard. Hospitals in Tokyo have piloted Hibiki Zero for triage interviews, where detecting anxiety or pain levels through vocal cues improves assessment accuracy.
International business negotiations gain nuance when participants hear not just translated words but also confidence levels and emphasis patterns. A firm statement versus a tentative suggestion becomes apparent through preserved prosody. Remote teams at multinational corporations report more natural conversations compared to reading translated transcripts or listening to monotone synthetic voices.
The system also serves accessibility needs. Individuals with reading difficulties can participate in multilingual conversations without relying on written text. Language learners hear authentic pronunciation in their target language while speaking their native language, creating a scaffold for gradual acquisition.
Limitations remain significant. The system struggles with low-resource languages lacking sufficient training data. Background noise degrades performance more severely than in text-based systems because acoustic details matter throughout the pipeline. Domain-specific terminology, particularly technical jargon, sometimes produces awkward translations since the model lacks explicit lexical grounding.
Future Developments
Research teams are exploring real-time streaming translation where the system begins outputting target language speech before the source utterance completes. This requires predicting likely sentence continuations and managing the risk of incorrect anticipations. Early experiments show promise for language pairs with similar word order but face challenges with structurally divergent languages.
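One well-known scheduling policy from simultaneous text translation is wait-k: the system waits for k source chunks, then alternates between emitting one target chunk and reading one more source chunk. The article does not say that Hibiki Zero uses this policy, so the sketch below is only an illustration of the latency trade-off, with a hypothetical function name and chunk counts.

```python
def wait_k_schedule(num_source_chunks, num_target_chunks, k):
    """Return the READ/WRITE action sequence of a wait-k policy.

    The output stays k chunks behind the input until the source runs out,
    after which the remaining target chunks are written back to back.
    The schedule ends once all target chunks have been written.
    """
    actions = []
    read = written = 0
    while written < num_target_chunks:
        if read < min(written + k, num_source_chunks):
            actions.append(("READ", read))
            read += 1
        else:
            actions.append(("WRITE", written))
            written += 1
    return actions


# Hypothetical utterance split into 4 source chunks and 4 target chunks.
schedule = wait_k_schedule(num_source_chunks=4, num_target_chunks=4, k=2)
for action, index in schedule:
    print(action, index)
```

A small k keeps the delay short but forces the model to commit before it has heard much of the sentence. That is why language pairs with divergent word order, where a key word may arrive only at the end of the source sentence, are harder for this kind of streaming.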
Integration with video conferencing platforms represents the next deployment frontier. Combining Hibiki Zero with lip-sync technology could create the illusion of speakers naturally conversing in each other’s languages. Privacy concerns around voice cloning and deepfakes require careful consideration as these capabilities mature.
Expanding to low-resource languages depends on developing better transfer learning techniques. Researchers are investigating whether models trained on high-resource pairs can bootstrap learning for related languages with minimal additional data. Cross-lingual phonetic similarities might enable faster adaptation than current methods allow.
The shift from text-mediated to direct speech translation marks a fundamental change in how machines process language—treating it as inherently acoustic rather than symbolic.