Running Meta’s SAM-Audio on 4GB GPUs with AudioGhost

What It Is

Meta’s SAM-Audio applies the Segment Anything Model architecture to audio processing, enabling natural language-based stem separation. Instead of clicking visual masks, users type prompts like “extract the violin” or “isolate the drums” to pull specific instruments from mixed audio tracks.

The original implementation demanded 20GB+ of VRAM, making it accessible only to workstation-class hardware. AudioGhost AI solves this by removing unnecessary components - specifically the vision encoders and ranking modules that ship with SAM-Audio but serve no purpose for audio-only workflows. This optimization brings memory requirements down to 4-6GB for the Small model and roughly 10GB for the Large variant, putting the technology within reach of consumer laptops and mid-range desktop GPUs.
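The arithmetic behind the savings can be sketched as a simple memory budget. The component names and per-component sizes below are illustrative assumptions chosen to be consistent with the article's figures (20GB+ before trimming, 4-6GB after), not measured values from SAM-Audio:

```python
# Hypothetical VRAM budget (GB) for a SAM-Audio-style checkpoint.
# Component sizes are illustrative assumptions, not measurements.
COMPONENTS = {
    "audio_encoder": 4.5,    # needed for stem separation
    "prompt_encoder": 0.5,   # needed to embed text prompts
    "vision_encoder": 12.0,  # unused in audio-only workflows
    "ranking_module": 5.0,   # unused in audio-only workflows
}

AUDIO_ONLY = {"audio_encoder", "prompt_encoder"}

def vram_required(components, keep):
    """Sum the VRAM of only the components actually loaded."""
    return sum(gb for name, gb in components.items() if name in keep)

full = vram_required(COMPONENTS, COMPONENTS.keys())  # everything loaded
trimmed = vram_required(COMPONENTS, AUDIO_ONLY)      # vision/ranking dropped
print(f"full model: {full:.1f} GB, audio-only: {trimmed:.1f} GB")
```

With these placeholder numbers the full load lands above 20GB while the audio-only subset fits in the 4-6GB range, which is the whole trick: nothing is compressed or quantized, unused modules simply never reach the GPU.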

The project packages everything into a Windows-friendly installer that handles FFmpeg, TorchCodec, and other dependencies automatically. The interface displays real-time waveforms and includes mixing controls for extracted stems, eliminating the need to juggle multiple audio tools.

Why It Matters

Audio stem separation has traditionally required either cloud services with usage limits or specialized models trained for specific instrument types. SAM-Audio’s prompt-based approach offers more flexibility - the same model handles vocals, percussion, strings, or any other sound source described in plain language.

Making this accessible on consumer hardware changes who can experiment with advanced audio processing. Music producers working on laptops, podcast editors cleaning up recordings, or researchers analyzing field recordings can now run sophisticated separation models without cloud costs or hardware upgrades.

The local processing aspect matters for privacy-sensitive work. Studio sessions, unreleased tracks, or confidential audio content never leaves the machine. Processing a 4-minute song in under a minute on an RTX 4090 means the workflow stays interactive rather than becoming a batch job sent to remote servers.

For the broader AI ecosystem, this demonstrates how model optimization can democratize access more effectively than raw compute scaling. Stripping unused components and targeting specific use cases often delivers better results than running bloated multi-modal systems.

Getting Started

Clone the repository from https://github.com/0x0funky/audioghost-ai and run the included install.bat script on Windows systems. The installer configures the Python environment and downloads required dependencies automatically.

After installation completes, launch the interface and load an audio file. The waveform display shows the input signal, and the prompt field accepts natural language descriptions of what to extract. Typing “piano” or “background vocals” generates a separated stem that can be previewed, adjusted, and exported.
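AudioGhost's actual Python API is not documented here, so the following is only a shape-of-the-workflow sketch: the class, method names, and mask behavior are all hypothetical stand-ins for the real model.

```python
import numpy as np

class MockSeparator:
    """Stand-in for a SAM-Audio-style model. The real model predicts a
    learned time-frequency mask from the text prompt; this mock only
    scales the input so the workflow's shapes are visible."""
    def separate(self, mixture: np.ndarray, prompt: str) -> np.ndarray:
        mask = 0.5  # placeholder for the predicted soft mask
        return mixture * mask

def extract_stem(model, mixture, prompt):
    """Prompt-driven extraction: one call per natural-language prompt."""
    stem = model.separate(mixture, prompt)
    residual = mixture - stem  # everything the prompt did not match
    return stem, residual

model = MockSeparator()
mixture = np.random.randn(2, 44100 * 4)  # 4 s of stereo audio at 44.1 kHz
stem, residual = extract_stem(model, mixture, "background vocals")
assert stem.shape == mixture.shape
```

The useful property of this interface is that the same model call serves every request: "piano", "background vocals", or "crowd noise" differ only in the prompt string, not in which model is loaded.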

The Small model provides faster processing with lower quality, while Large delivers better separation at the cost of roughly double the VRAM and longer processing times. Most users find the Small model sufficient for initial experiments before committing to the larger variant.
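A practical way to choose between the two variants is to key off available VRAM. The helper below is hypothetical; its thresholds follow the article's figures (4-6GB for Small, roughly 10GB for Large), and on a real system the free-memory number could come from a query such as PyTorch's `torch.cuda.mem_get_info()`:

```python
def pick_model(vram_gb: float) -> str:
    """Choose a SAM-Audio variant from available VRAM in GB.
    Thresholds follow the article: Small needs 4-6 GB, Large ~10 GB."""
    if vram_gb >= 10:
        return "large"
    if vram_gb >= 4:
        return "small"
    raise ValueError("below the 4 GB floor; consider cloud services "
                     "or lighter models such as Spleeter")

print(pick_model(6))   # typical mid-range laptop GPU
print(pick_model(24))  # high-end card, e.g. an RTX 4090
```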

Context

Traditional stem separation tools like Spleeter or Demucs use fixed categories - typically vocals, drums, bass, and other. These models excel at their specific task but cannot adapt to unusual requests like “extract the accordion” or “isolate the crowd noise.” SAM-Audio’s prompt-based approach trades some separation quality for flexibility.

Ultimate Vocal Remover and similar tools offer more polished interfaces and better results for standard use cases like karaoke track creation. AudioGhost AI targets scenarios where the sound source does not fit predefined categories or where experimentation with different prompts matters more than perfect isolation.

The 4GB memory floor still excludes older laptops with integrated graphics. Users with 2GB cards or less will need to explore cloud options or lighter models like basic Spleeter implementations.

Processing speed scales with GPU capability - the sub-minute timing applies to high-end cards like the RTX 4090. Mid-range GPUs may take several minutes per track, though this remains faster than uploading to cloud services and waiting in queue.