By Promptsicle Team

ACE Studio Open-Sources Singing Voice AI Model

ACE Studio releases its singing voice synthesis AI model as open-source software, enabling developers and researchers to create realistic vocal performances.

ACE Studio has released its singing voice synthesis model under an Apache 2.0 license, making it one of the first commercially viable vocal AI systems available under a permissive open-source license. The model, trained on more than 1,000 hours of professional vocal recordings across multiple languages, generates singing performances from MIDI files and lyrics.

Key Specs

The ACE Studio model operates as a diffusion-based system that converts musical notation into realistic vocal performances. Unlike text-to-speech systems, it handles pitch variation, vibrato, breath control, and phoneme timing specific to singing.
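To make the difference from speech concrete, here is a minimal sketch of a sung pitch curve: one sustained note with sinusoidal vibrato. The MIDI-to-frequency formula is standard equal temperament; the vibrato rate, depth, and frame rate are illustrative assumptions, not values taken from the ACE Studio model.

```python
import math

def midi_to_hz(note: float) -> float:
    """Convert a MIDI note number to Hz (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def vibrato_curve(note, duration_s, rate_hz=5.5, depth_cents=50.0, frame_rate=100):
    """Per-frame frequencies for one held note with sinusoidal vibrato.

    rate_hz, depth_cents, and frame_rate are illustrative defaults only.
    """
    base = midi_to_hz(note)
    n_frames = int(duration_s * frame_rate)
    curve = []
    for i in range(n_frames):
        t = i / frame_rate
        # Deviation from the written pitch, in cents (1200 cents = 1 octave).
        cents = depth_cents * math.sin(2 * math.pi * rate_hz * t)
        curve.append(base * 2.0 ** (cents / 1200.0))
    return curve

# One second of A4 with vibrato, sampled at 100 frames per second.
curve = vibrato_curve(69, duration_s=1.0)
```

A speech system has no equivalent of this curve: the target pitch is dictated by the score, and the expressive deviation around it is part of what the model has to generate.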

The base model supports English, Japanese, Chinese, and Korean vocals out of the box. It processes standard MIDI files with accompanying lyric sheets and outputs 44.1 kHz audio suitable for professional music production. The architecture uses a two-stage approach: a linguistic frontend converts text to phonemes, and a diffusion backend generates the acoustic features.
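The two-stage data flow can be sketched as follows. This is a toy illustration of the shape of the pipeline, not the project's code: the lexicon, the `PhonemeNote` type, both function names, the frame rate, and the 80-bin feature size are all hypothetical, and the backend is a placeholder that returns empty frames rather than running any diffusion process.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon; a real frontend would use a full
# grapheme-to-phoneme model for each supported language.
TOY_LEXICON = {"la": ["l", "a"], "do": ["d", "o"], "mi": ["m", "i"]}

@dataclass
class PhonemeNote:
    phonemes: list    # phoneme symbols sung on this note
    midi_pitch: int   # pitch from the MIDI file
    n_frames: int     # note length in acoustic frames

def linguistic_frontend(syllables, pitches, durations_s, frame_rate=100):
    """Stage 1: map syllables to phonemes and attach pitch and length."""
    notes = []
    for syl, pitch, dur in zip(syllables, pitches, durations_s):
        phonemes = TOY_LEXICON.get(syl.lower(), ["<unk>"])
        notes.append(PhonemeNote(phonemes, pitch, round(dur * frame_rate)))
    return notes

def acoustic_backend(notes, n_bins=80):
    """Stage 2 placeholder: returns zero-filled frames of the right shape.

    A real diffusion backend would start from noise and denoise it
    step by step, conditioned on the phonemes and pitches in `notes`.
    """
    total_frames = sum(n.n_frames for n in notes)
    return [[0.0] * n_bins for _ in range(total_frames)]

notes = linguistic_frontend(["la", "do"], [60, 62], [0.5, 0.25])
features = acoustic_backend(notes)
```

The point of the split is that the frontend is deterministic and language-specific, while the backend is the learned, language-agnostic part that turns the annotated notes into sound.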

Training data came from licensed studio recordings, with each vocalist providing 20-50 hours of material across different musical styles. The model weights total approximately 2.4 GB, requiring a GPU with at least 8 GB of VRAM for real-time inference. CPU-based generation works but takes roughly 10 times longer per phrase.
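A script can check the hardware before loading the weights. The sketch below uses standard PyTorch calls (`torch.cuda.is_available` and `torch.cuda.get_device_properties`); the `pick_device` helper and its threshold are assumptions for illustration and are not part of the ACE Studio repository.

```python
def pick_device(min_vram_gb: float = 7.5) -> str:
    """Return "cuda" if a GPU with enough memory is visible, else "cpu".

    The default is set slightly below 8 because cards sold as 8 GB
    usually report a little less than 8 GiB of total memory.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so fall back to CPU.
        return "cpu"
    if not torch.cuda.is_available():
        return "cpu"
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return "cuda" if total_bytes / 1024**3 >= min_vram_gb else "cpu"

device = pick_device()
print(f"Running inference on: {device}")
```

If this prints `cpu`, expect generation to run at roughly a tenth of the GPU speed, per the figures above.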

The GitHub repository (https://github.com/acestudioai/ace-vocal-model) includes pre-trained weights, inference code, and a simplified training pipeline for custom voice creation. Documentation covers the phoneme format, pitch curve specifications, and parameter tuning for different musical genres.

Who Benefits

Independent musicians gain access to vocal synthesis previously limited to studios with expensive software licenses. The model handles demo creation, reference tracks, and background vocals without hiring session singers. Producers working in genres like electronic music, lo-fi, or experimental pop can integrate AI vocals directly into their DAW workflow.

Game developers and content creators find particular value in the multi-language support. A single model generates vocals for international markets without managing multiple voice actors. The Apache 2.0 license allows commercial use in games, videos, and streaming content without royalty payments.

Researchers studying singing voice synthesis now have a reproducible baseline for comparison. The training code enables experiments with different architectures, loss functions, and data augmentation strategies. Academic teams can fine-tune the model on specialized datasets like opera, folk music, or historical recordings.

Karaoke and music education applications benefit from the controllable output. The model accepts detailed pitch and timing adjustments, making it useful for generating practice tracks or demonstrating vocal techniques. Developers can build interactive tools where users modify MIDI parameters and hear immediate vocal changes.

Quick Start

Installation requires Python 3.9+ and PyTorch 2.0 or later. Clone the repository and install dependencies:

git clone https://github.com/acestudioai/ace-vocal-model
cd ace-vocal-model
pip install -r requirements.txt

Download the pre-trained weights (2.4 GB) from the releases page. Basic inference takes a MIDI file and a lyrics text file as input:

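from ace_vocal import VocalSynthesizer

# Load the pre-trained checkpoint downloaded from the releases page.
synth = VocalSynthesizer(model_path="ace_base_v1.ckpt")

# Generate a vocal performance from a melody and its lyrics.
audio = synth.generate(
    midi_file="melody.mid",       # path to the MIDI melody
    lyrics="path/to/lyrics.txt",  # path to the lyrics file, one syllable per line
    language="en"                 # language of the lyrics
)

# Write the rendered vocal to a WAV file.
audio.save("output.wav")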

The lyrics file uses a simple format with one syllable per line, aligned with MIDI note timing. Advanced users can specify phonemes directly for precise pronunciation control.
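Because each syllable must line up with a note, a mismatch in counts is the most likely source of errors. A minimal sketch of a pre-flight check follows, assuming exactly one syllable per note and that the notes have already been read from the MIDI file into `(onset_s, duration_s, midi_pitch)` tuples; the `align_syllables` helper and this tuple layout are hypothetical, not part of the repository.

```python
def align_syllables(syllables, notes):
    """Pair each syllable with one note.

    `notes` is a list of (onset_s, duration_s, midi_pitch) tuples.
    Blank lines in the lyrics are ignored (an assumption of this sketch).
    """
    syllables = [s.strip() for s in syllables if s.strip()]
    if len(syllables) != len(notes):
        raise ValueError(
            f"{len(syllables)} syllables but {len(notes)} notes"
        )
    return [
        (syl, onset, dur, pitch)
        for syl, (onset, dur, pitch) in zip(syllables, notes)
    ]

# Example lyrics content: one syllable per line.
lyrics_text = "twin\nkle\ntwin\nkle\n"
notes = [(0.0, 0.5, 60), (0.5, 0.5, 60), (1.0, 0.5, 67), (1.5, 0.5, 67)]
aligned = align_syllables(lyrics_text.splitlines(), notes)
```

Running a check like this before synthesis gives a clear error message instead of a vocal line that drifts out of sync with the melody.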

Fine-tuning on custom voices requires 30-60 minutes of clean vocal recordings. The training script handles data preprocessing, though manual cleanup improves results. A consumer GPU (RTX 3080 or better) completes training in 12-18 hours.
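Before starting a long training run, it helps to confirm the dataset meets the 30-minute minimum. The sketch below sums the length of the recordings in a folder using Python's standard `wave` module, which reads uncompressed PCM WAV files only; the helper names and the flat folder layout are assumptions for illustration, not part of the training script.

```python
import wave
from pathlib import Path

def total_minutes(wav_dir):
    """Sum the duration of all .wav files in wav_dir, in minutes."""
    seconds = 0.0
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            # Duration = number of sample frames / sample rate.
            seconds += wf.getnframes() / wf.getframerate()
    return seconds / 60.0

def check_dataset(wav_dir, min_minutes=30.0):
    """Return (is_enough, minutes) for the recordings in wav_dir."""
    minutes = total_minutes(wav_dir)
    return minutes >= min_minutes, minutes
```

For example, `check_dataset("my_voice")` returns a pair such as `(False, 12.4)` when the folder holds only about 12 minutes of audio, which signals that more material is needed.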

Alternatives

Synthesizer V offers a commercial alternative with a polished interface and extensive voice libraries. Its voice databases come in both sample-based (Standard) and neural (AI) variants, so it is not a purely concatenative system. The free tier limits export length, while the paid version costs $89.

Vocaloid remains the industry standard for professional vocal synthesis. Version 6 includes AI-assisted features but requires purchasing individual voice banks at $150-200 each. The proprietary format locks users into the Yamaha ecosystem.

Diff-SVC provides another open-source option focused on voice conversion rather than synthesis from scratch. It transforms existing vocal recordings into different timbres, useful for style transfer but requiring source audio. The model architecture shares similarities with ACE Studio’s diffusion approach.

NNSVS is an open-source toolkit for neural network-based singing voice synthesis, built around a parametric pipeline of separate timing, duration, and acoustic models. It requires more manual tuning than end-to-end diffusion models but runs on lower-end hardware. The Japanese vocal synthesis community actively develops NNSVS voice banks.

ACE Studio’s release fills a gap between expensive commercial tools and research-grade systems. The permissive license and reasonable hardware requirements make professional-quality vocal synthesis accessible to individual creators and small teams.