NVIDIA Adapts Text-to-3D Sampling for Prompt-Based Audio Generation and Source Separation
- By John K. Waters
- 05/20/2025
Researchers at NVIDIA, in collaboration with MIT CSAIL, have extended Score Distillation Sampling (SDS)—originally developed for text-to-3D generation—into the audio domain, introducing a new framework called Audio-SDS. The approach repurposes large pretrained audio diffusion models for a diverse set of text-guided audio tasks, including source separation, FM synthesis calibration, and impact sound simulation.
The core contribution, detailed in a new arXiv preprint, centers on using SDS as an optimization loop that adjusts the parameters of procedural or synthesized audio until the result aligns with a user-supplied natural language prompt. The team shows that Audio-SDS works across generative pipelines without requiring task-specific fine-tuning or labeled training data.
Distillation for Audio: From Text to Timbre
Score Distillation Sampling was initially proposed to generate 3D models from text by leveraging image diffusion models as priors. In Audio-SDS, the same principle applies: render a baseline audio signal (e.g., via a synthesizer), measure how far it deviates from a diffusion-model-guided expectation conditioned on a prompt, and backpropagate updates to the source parameters.
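In rough terms, each iteration renders audio from the current parameters, perturbs its latent with noise, asks the frozen diffusion model what noise it expects given the prompt, and uses the mismatch as a gradient. The sketch below illustrates that loop in PyTorch-style pseudocode; the renderer, encoder, and diffusion wrapper (render_audio, encode, predict_noise, and friends) are hypothetical stand-ins under the usual latent-diffusion assumptions, not the authors' actual code.

```python
import torch

def audio_sds_step(render_audio, encode, diffusion, prompt_emb, optimizer):
    """One hypothetical Audio-SDS update: nudge the renderer's parameters so its
    output better matches the text prompt under a frozen audio diffusion prior."""
    audio = render_audio()                        # differentiable synth/renderer output
    z = encode(audio)                             # latent representation of the audio
    t = torch.randint(20, 980, (1,), device=z.device)  # random diffusion timestep
    noise = torch.randn_like(z)
    z_t = diffusion.add_noise(z, noise, t)        # forward-noise the latent

    with torch.no_grad():                         # the prior stays frozen
        eps_pred = diffusion.predict_noise(z_t, t, prompt_emb)

    w = diffusion.sds_weight(t)                   # timestep-dependent weighting
    grad = w * (eps_pred - noise)                 # classic SDS gradient direction

    optimizer.zero_grad()
    z.backward(gradient=grad)                     # inject the gradient at the latent
    optimizer.step()                              # update the renderer's parameters
```

Because the gradient is injected directly at the latent, no gradients ever flow through the diffusion model itself, which is what lets a single pretrained prior drive many different renderers.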
The framework supports multiple backends, including:
- Physically informed impact synthesis, where object and reverb impulses are modulated to reflect prompts like "striking a metal pot with a wooden spoon."
- FM synthesis control, with optimization over modulation matrices, frequency ratios, and envelope parameters (see the sketch after this list).
- Prompt-guided source separation, where a mixed audio buffer is decomposed into component sources aligned with descriptive prompts.
Notably, all tasks are performed using a single pretrained text-to-audio diffusion model with no specialized training.
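To make the FM synthesis case concrete, the snippet below sketches a hypothetical two-operator FM voice whose frequency ratio, modulation index, and envelope decay are ordinary learnable parameters; plugged into an SDS loop like the one above, a text prompt would pull these values around. It is an illustration of the idea, not the paper's synthesizer.

```python
import torch

class TinyFMVoice(torch.nn.Module):
    """Hypothetical two-operator FM voice with learnable synthesis parameters."""
    def __init__(self, sample_rate=44100, duration=1.0):
        super().__init__()
        self.sr = sample_rate
        self.n = int(sample_rate * duration)
        self.carrier_freq = torch.nn.Parameter(torch.tensor(220.0))  # Hz
        self.freq_ratio   = torch.nn.Parameter(torch.tensor(2.0))    # modulator/carrier ratio
        self.mod_index    = torch.nn.Parameter(torch.tensor(1.5))    # modulation depth
        self.decay        = torch.nn.Parameter(torch.tensor(3.0))    # envelope decay rate

    def forward(self):
        t = torch.arange(self.n, device=self.carrier_freq.device) / self.sr
        env = torch.exp(-self.decay * t)                             # exponential amplitude envelope
        modulator = torch.sin(2 * torch.pi * self.carrier_freq * self.freq_ratio * t)
        return env * torch.sin(2 * torch.pi * self.carrier_freq * t
                               + self.mod_index * modulator)
```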
Prompt-Driven Separation with No Labels
In the source separation setup, Audio-SDS jointly optimizes the latent representations of individual sources, guided by different prompts. For example, from a clip mixing traffic noise and a saxophone solo, the method can extract components matching prompts like "cars passing by on a busy street" and "jazzy, modal saxophone melody."
The model does not rely on ground-truth source labels during optimization. Instead, it uses prompt alignment scores (e.g., CLAP embeddings) to keep each source faithful to its description, plus a reconstruction loss to ensure the separated signals sum to the original mixture.
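Read that way, the separation objective is easy to sketch: one prompt-guided term per source plus a penalty for failing to reconstruct the mixture. The helper functions below (sds_loss, decode) are hypothetical placeholders standing in for the diffusion-guided loss and latent decoder described above; this is a plausible reading of the setup, not the paper's implementation.

```python
import torch

def separation_loss(source_latents, mixture, decode, sds_loss, prompt_embs, lam=1.0):
    """Hypothetical Audio-SDS separation objective: each latent source is pulled
    toward its own text prompt, while the decoded sources must sum to the mixture."""
    # Prompt-alignment term: one SDS-style loss per (source, prompt) pair.
    prompt_term = sum(sds_loss(z, p) for z, p in zip(source_latents, prompt_embs))

    # Reconstruction term: decoded sources should add back up to the original clip.
    reconstruction = sum(decode(z) for z in source_latents)
    recon_term = torch.nn.functional.mse_loss(reconstruction, mixture)

    return prompt_term + lam * recon_term
```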
The team also demonstrated a fully automated pipeline that pulls audio from YouTube, captions it with an audio captioner, and uses an LLM (e.g., ChatGPT) to generate candidate source prompts, which are then fed into Audio-SDS for separation, pointing toward a scalable, zero-label workflow for in-the-wild audio.
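As a rough illustration of how such a pipeline fits together, the sketch below wires the three stages end to end; caption_audio, suggest_source_prompts, and audio_sds_separate are purely hypothetical placeholders for the captioner, the LLM call, and the separation routine, passed in rather than named after any real API.

```python
def zero_label_separation(audio_clip, caption_audio, suggest_source_prompts,
                          audio_sds_separate):
    """Hypothetical end-to-end flow: caption a clip, ask an LLM for likely
    component sources, then run prompt-guided separation on each of them."""
    caption = caption_audio(audio_clip)                # e.g., "traffic noise with a saxophone"
    prompts = suggest_source_prompts(caption)          # LLM proposes per-source descriptions
    sources = audio_sds_separate(audio_clip, prompts)  # one separated track per prompt
    return dict(zip(prompts, sources))
```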
Versatility with Limitations
Audio-SDS offers a unified method for audio generation and editing tasks under natural language control, but its performance depends heavily on the pretrained model. The current implementation uses the Stable Audio Open diffusion model, which the authors note has known biases—particularly toward Western musical instruments and audio with silence-padding artifacts.
The framework also operates on clips under 10 seconds in length. Scaling to longer durations may require hierarchical scheduling or memory-efficient architectures.
Despite these limitations, Audio-SDS highlights a broader trend in AI research: reuse of large generative priors to drive optimization in other domains. As seen previously in 3D generation, distillation-based strategies are proving effective in cross-modal synthesis, now extended to audio without retraining.
About the Author
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at [email protected].