News
Nvidia Unveils Fugatto, an AI Model for Sound Creation and Transformation
- By John K. Waters
- 11/26/2024
The sound of music may never be the same. Nvidia, the Silicon Valley chipmaking giant synonymous with powering the AI revolution, this week introduced a generative AI model that can create or transform any mix of music, voices, and sounds, described through prompts that combine text and audio files.
Dubbed Fugatto, short for "Foundational Generative Audio Transformer Opus 1," the model can compose music, modify voices, and even create entirely new sounds, the company says. It doesn't just generate music; it reshapes the very fabric of sound. Nvidia says the model can generate audio never heard before, making it a potentially revolutionary tool for creatives across industries. Its developers call it a "Swiss Army knife for sound" because of its broad capabilities: everything from adding or removing instruments in a song to altering the accent and emotion of a voice.
"This thing is wild," said Ido Zmishlany, a multi-platinum music producer and cofounder of One Take Audio, a startup in Nvidia’s Inception program, in a statement. "The idea that I can create entirely new sounds on the fly in the studio is incredible."
Fugatto represents Nvidia's foray into the burgeoning world of generative AI music creation. The company says it's the first foundational generative AI model to combine numerous audio generation and transformation tasks in a single system. Built on 2.5 billion parameters and trained on Nvidia's DGX systems, the model exhibits "emergent properties" that allow it to synthesize and adapt audio in unprecedented ways.
"We wanted to create a model that understands and generates sound like humans do, said Rafael Valle, NVIDIA’s manager of applied audio research and an orchestral conductor, in a blog post. "Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale."
Nvidia sees a wide range of applications for Fugatto. Music producers can prototype songs, test different styles, or enhance audio quality, while advertisers can adjust voiceovers with different accents and emotions for localized campaigns.
"AI is writing the next chapter of music," Zmishlany said. "We have a new tool for making music, and that’s super exciting."
The model’s potential extends to gaming, the company says, where developers can use it to create real-time audio assets that adapt to gameplay. Language learning tools could use the technology to let users choose personalized voices, including those of family or friends.
Fugatto can also combine attributes "creatively." For example, users can prompt the model to generate speech in a French accent with a sorrowful tone, blending artistic elements in novel ways.
Key to Fugatto’s versatility is a feature called ComposableART, which enables the model to merge instructions it encountered separately during training. It also introduces temporal interpolation, allowing users to craft sounds that evolve over time, such as a rainstorm transitioning into birdsong at dawn.
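Nvidia hasn't released Fugatto or documented its interface, so the mechanics can only be illustrated in the abstract. The Python sketch below, in which every function and name is hypothetical, shows the general shape of the two ideas the article describes: instructions the model learned separately are blended with weights, and two blended conditions are crossfaded over time.

```python
import hashlib
import numpy as np

# Hypothetical sketch only: Fugatto has no public API, so every name
# below is invented to illustrate the two ideas described above,
# weighted blending of instructions and a timed crossfade between them.

def embed_instruction(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a real text encoder: hash the instruction into a
    reproducible pseudo-embedding so the example runs anywhere."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

def compose(instructions: list[str], weights: list[float]) -> np.ndarray:
    """Blend instructions the model saw separately during training,
    e.g. 'French accent' plus 'sorrowful tone', with per-instruction weights."""
    vecs = np.stack([embed_instruction(t) for t in instructions])
    return (np.asarray(weights)[:, None] * vecs).sum(axis=0)

def temporal_interpolation(cond_a: np.ndarray, cond_b: np.ndarray, steps: int):
    """Crossfade one conditioning vector into another over time,
    like a rainstorm easing into birdsong at dawn."""
    for i in range(steps):
        alpha = i / (steps - 1)
        yield (1.0 - alpha) * cond_a + alpha * cond_b

rain = compose(["heavy rainstorm", "distant thunder"], [0.8, 0.2])
dawn = compose(["birdsong", "gentle wind"], [0.7, 0.3])
frames = list(temporal_interpolation(rain, dawn, steps=10))
print(len(frames), frames[0].shape)  # 10 conditioning frames, each 8-dim
```

In a real system, those conditioning frames would steer an audio generator rather than being printed, but the weighting and interpolation arithmetic is the heart of what both features appear to do.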
The Nvidia team spent more than a year refining Fugatto, the company says, an effort that required assembling a blended dataset of millions of audio samples. The project drew on contributors from India, Brazil, China, and beyond, strengthening the model's multilingual and multi-accent capabilities.
"When it generated music from a prompt for the first time, it blew our minds," Valle said.
Fugatto also offers users fine-grained control, from balancing accents in speech to crafting dynamic soundscapes. Unlike many AI models, it can produce entirely new outputs beyond its training data, such as a saxophone mimicking a cat’s meow.
Nvidia isn't the first in this space. Meta, OpenAI, and Runway AI all have GenAI models designed to create new music and audio from natural language prompts. Nvidia has no immediate plans to release Fugatto publicly, but its promotional YouTube video offers a preview of the model's capabilities.
About the Author
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at jwaters@converge360.com.