By Promptsicle Team

MOVA: Unified Model for Synced Video-Audio Generation

MOVA is a unified diffusion transformer model that jointly generates synchronized video and audio, enabling coherent multimodal media creation.

ByteDance researchers introduced MOVA, a diffusion-based model that generates synchronized video and audio from text prompts. Unlike previous approaches that treat visual and audio generation as separate tasks, MOVA produces both modalities simultaneously while maintaining tight temporal alignment between what viewers see and hear.

Benchmarks

MOVA demonstrates substantial improvements over existing video-audio generation systems across multiple metrics. The model achieves a Fréchet Video Distance (FVD) of 87.3 on the Landscape dataset, outperforming prior methods by approximately 15%. For audio quality, MOVA scores a Fréchet Audio Distance (FAD) of 1.82. Lower is better for both metrics, so these scores indicate that generated video and audio closely match real-world distributions.
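
Both FVD and FAD are Fréchet distances between Gaussian fits of feature embeddings taken from real and generated samples. The sketch below illustrates the computation in a simplified form. It assumes diagonal covariances and plain lists of feature vectors, whereas published FVD and FAD numbers use full covariance matrices over embeddings from pretrained networks. The function name frechet_distance_diag and the toy features are illustrative and do not come from the MOVA paper.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(real, generated):
    """Fréchet distance between Gaussian fits of two feature sets.

    Simplifying assumption: each Gaussian has a diagonal covariance, so the
    distance reduces to a sum over feature dimensions.
    """
    dims = len(real[0])
    total = 0.0
    for d in range(dims):
        r = [row[d] for row in real]
        g = [row[d] for row in generated]
        mean_gap = (fmean(r) - fmean(g)) ** 2
        var_r, var_g = pvariance(r), pvariance(g)
        # Per-dimension term: squared mean gap plus the variance mismatch.
        total += mean_gap + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)
    return total


# Toy 2-D "embeddings" (hypothetical values, for illustration only).
real_features = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
shifted_features = [[x + 1.0, y + 1.0] for x, y in real_features]

same = frechet_distance_diag(real_features, real_features)
shifted = frechet_distance_diag(real_features, shifted_features)
print(same, shifted)
```

Identical feature sets give a distance near zero, and shifting every feature away from the reference increases the distance, which is why a lower FVD or FAD indicates a closer match.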

The synchronization metrics reveal MOVA’s core strength. The model achieves an Audio-Visual Synchronization Score (AVSS) of 0.91 on the VGGSound benchmark, compared to 0.73 for cascaded generation pipelines that produce video first, then add audio. In head-to-head comparisons against MM-Diffusion and other joint generation approaches, human evaluators preferred MOVA outputs 78% of the time.

Testing across diverse prompts shows consistent performance. For prompts like “waves crashing on a rocky shore,” MOVA generates 4-second clips where visual wave impacts align within 50 milliseconds of corresponding audio peaks. The model handles complex scenarios including “fireworks exploding over a city” and “jazz band performing in a smoky club” with appropriate audio-visual correspondence.
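
The 50-millisecond alignment claim can be made concrete with a small check. The sketch below pairs each visual event with its nearest audio peak and tests whether every offset stays inside a tolerance. The timestamps are hypothetical, and the function names sync_offsets_ms and within_tolerance are illustrative; the paper's own measurement procedure may differ.

```python
def sync_offsets_ms(visual_events_s, audio_peaks_s):
    """Signed offset in milliseconds from each visual event to its nearest audio peak."""
    offsets = []
    for v in visual_events_s:
        nearest = min(audio_peaks_s, key=lambda a: abs(a - v))
        offsets.append((nearest - v) * 1000.0)
    return offsets


def within_tolerance(offsets_ms, tolerance_ms=50.0):
    """True when every offset is within the tolerance, in either direction."""
    return all(abs(o) <= tolerance_ms for o in offsets_ms)


# Hypothetical timestamps (seconds) inside a 4-second clip.
visual_impacts_s = [0.80, 2.10, 3.40]
audio_peaks_s = [0.83, 2.08, 3.44]

offsets = sync_offsets_ms(visual_impacts_s, audio_peaks_s)
aligned = within_tolerance(offsets)
print([round(o) for o in offsets], aligned)
```

A positive offset means the sound arrives after the visual event, and a negative offset means it arrives before.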

How to Run It

MOVA remains a research project without a public code release as of early 2025. The technical paper describes the architecture, but ByteDance has not published model weights or inference code at https://github.com/bytedance-research.

Based on the paper’s methodology, running MOVA would require substantial computational resources. The model uses a joint latent diffusion framework with separate encoders for visual and audio streams, then applies cross-modal attention mechanisms during the denoising process. Training involved 8 NVIDIA A100 GPUs for approximately 12 days on a dataset of 2.3 million video-audio pairs.
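
To give a feel for what cross-modal attention does, here is a minimal sketch in which video tokens act as queries over audio tokens. It is not MOVA's implementation: it assumes a single head, omits the learned query, key, and value projections, and uses tiny hand-written vectors. The names cross_attention, video_tokens, and audio_tokens are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values by key similarity."""
    d = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


# Hypothetical latent tokens: 2 video tokens attend over 3 audio tokens.
video_tokens = [[1.0, 0.0], [0.0, 1.0]]
audio_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Audio tokens serve as both keys and values (an assumption of this sketch).
fused = cross_attention(video_tokens, audio_tokens, audio_tokens)
print(fused)
```

Each output row is a weighted average of the audio tokens, so every video token receives audio information in proportion to how similar the two are. Running the same operation in the opposite direction would let audio tokens attend to video.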

For researchers interested in similar capabilities, alternative approaches exist. AudioLDM 2 (https://github.com/haoheliu/AudioLDM2) generates audio from text and can be paired with video generation models, though synchronization requires manual alignment. Make-A-Video extended with audio components provides another research direction, though results typically show weaker temporal coupling than MOVA’s joint generation approach.

The inference process described in the paper suggests generation times of 45-60 seconds per 4-second clip on high-end GPUs, using 50 diffusion steps. The model accepts text prompts of up to 77 tokens and outputs 256x256 video at 8 frames per second with 16 kHz audio.
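
The figures above imply a fixed output budget per clip, worked through below. The arithmetic uses only the numbers quoted in this section. The per-step estimate assumes all generation time is spent in the denoising steps, which ignores encoding and decoding overhead, and the sample count assumes single-channel audio.

```python
CLIP_SECONDS = 4
VIDEO_FPS = 8
AUDIO_SAMPLE_RATE_HZ = 16_000
DIFFUSION_STEPS = 50
GENERATION_SECONDS = (45, 60)  # reported range per clip

# Output size of one clip.
video_frames = CLIP_SECONDS * VIDEO_FPS
audio_samples = CLIP_SECONDS * AUDIO_SAMPLE_RATE_HZ  # assumes mono audio
samples_per_frame = AUDIO_SAMPLE_RATE_HZ // VIDEO_FPS

# Rough cost estimates derived from the reported generation times.
seconds_per_step = tuple(t / DIFFUSION_STEPS for t in GENERATION_SECONDS)
slowdown_vs_realtime = tuple(t / CLIP_SECONDS for t in GENERATION_SECONDS)

print(video_frames, audio_samples, samples_per_frame)
print(seconds_per_step, slowdown_vs_realtime)
```

Each clip therefore holds 32 video frames and 64,000 audio samples, or 2,000 audio samples per video frame, and generation runs roughly 11 to 15 times slower than real time.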

Limitations

MOVA faces several constraints that limit practical deployment. The 256x256 resolution falls below current standards for video generation models like Runway Gen-2 or Pika, which produce 720p or higher outputs. The 4-second maximum duration restricts applications requiring longer content.

The model struggles with fine-grained synchronization in specific scenarios. When generating “person speaking dialogue,” lip movements often drift out of sync after 2-3 seconds. Musical instrument generation shows similar issues: a “guitarist playing chords” may display finger positions that don’t match the generated audio frequencies.

Training data composition introduces biases. MOVA performs best on nature scenes, musical performances, and common sound events well-represented in web-scraped video datasets. Prompts involving rare instruments, specific cultural contexts, or technical audio phenomena produce less reliable results. The model occasionally generates audio-visual combinations that are physically implausible, such as small objects producing disproportionately loud sounds.

Computational requirements present barriers to widespread adoption. The multi-GPU training setup and lengthy inference times exceed resources available to most researchers. The model architecture’s complexity makes fine-tuning or adaptation challenging without significant engineering effort.

Verdict

MOVA represents meaningful progress in joint audio-visual generation, particularly in temporal synchronization between modalities. The benchmark improvements over cascaded approaches validate the joint training strategy, and the synchronization metrics demonstrate practical advantages for applications requiring tight audio-visual coupling.

However, the resolution limitations, duration constraints, and lack of public availability limit immediate impact. The model serves primarily as a research contribution showing the viability of joint diffusion-based generation. For production applications, current separate video and audio generation tools with post-processing alignment remain more practical.

The architectural insights from MOVA will likely influence future multimodal generation systems. The cross-modal attention mechanisms and joint latent space design provide blueprints for other researchers. Whether ByteDance releases the model publicly will determine if MOVA becomes a widely used tool or remains a reference point in academic literature.