Facebook Unveils First Audio-Visual Platform for Embodied AI

Facebook AI just announced that it has open sourced a new audio simulation platform designed to allow researchers to train artificial intelligence (AI) agents in 3D environments with realistic acoustics. In other words, the agents will be able to navigate via sound, as well as sight.

Billed as the first audio-visual platform for embodied AI, SoundSpaces provides a new audio sensor that makes it possible to insert high-fidelity, realistic simulations of virtually any sound source from real-world scanned environments. It's built on top of Facebook AI's AI Habitat simulation platform, and it includes audio renderings from two sets of publicly available 3D environments: Matterport3D and Replica Dataset.

"Today's embodied agents are deaf, lacking this multimodal semantic understanding of the 3D world around them," Facebook AI research scientists Kristen Grauman and Dhruv Batra said in a blog post. "We've built and are now open-sourcing SoundSpaces to address this need."

Embodied AI is one of those science-fictiony terms emerging from the artificial intelligence space that, at one level, simply refers to smart software combined with a real-world physical system: essentially, a robot. But on another level, it addresses the idea that computational intelligence is enhanced by a physical context; a brain in a body equipped with sensors is more intelligent.

Computer vision systems have been around in one form or another since the 1950s, but Facebook AI is breaking new ground with SoundSpaces. "Adding sound not only yields faster training and more accurate navigation at inference," Grauman and Batra wrote, "but also enables the agent to discover the goal on its own from afar." Giving AI agents the ability to "hear" means they can do things like navigate toward a sound-emitting target or learn via echolocation. "With SoundSpaces, researchers can train an agent to identify and move toward a sound source, even if it's behind a couch, for example, or to respond to sounds it has never heard before," they wrote.

SoundSpaces will help embodied AI agents learn more humanlike skills, the Facebook AI researchers predicted, "from multimodal sensory understanding to complex reasoning about objects and places."

To help the AI community more easily reproduce and build on their work, the Facebook AI researchers are providing precomputed audio simulations to allow on-the-fly audio sensing in Matterport3D and the Replica Dataset. "By extending these AI Habitat-compatible 3D assets with our audio simulator, we enable researchers to take advantage of the efficient Habitat API and easily incorporate audio for AI agent training," they wrote.
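To make the idea of precomputed audio concrete, here is a minimal sketch of one common way such data can be used. It assumes the precomputed simulations take the form of room impulse responses stored per source and listener position, which the article does not spell out. The function name, the lookup keys, and the toy impulse response are all hypothetical, and this is not the SoundSpaces or Habitat API.

```python
import numpy as np


def render_at_listener(source, impulse_response):
    """Convolve a dry mono signal with an impulse response.

    The result approximates what a listener would hear if the room's
    acoustics between source and listener were captured by the response.
    """
    source = np.asarray(source, dtype=np.float64)
    impulse_response = np.asarray(impulse_response, dtype=np.float64)
    return np.convolve(source, impulse_response)


# A 0.1 second, 440 Hz test tone sampled at 16 kHz (illustrative values).
sample_rate = 16000
t = np.arange(sample_rate // 10) / sample_rate
source = np.sin(2 * np.pi * 440.0 * t)

# Toy impulse response: the direct path plus one quieter echo 80 samples
# (5 ms) later. A real response would come from an acoustic simulation.
ir = np.zeros(200)
ir[0] = 1.0
ir[80] = 0.5

# Hypothetical lookup table keyed by (source position, listener position).
precomputed = {("source_a", "listener_b"): ir}

heard = render_at_listener(source, precomputed[("source_a", "listener_b")])
```

Because the expensive acoustic simulation happens ahead of time, the per-step cost in a scheme like this is a table lookup and a convolution, which is what makes on-the-fly audio sensing plausible during agent training.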

Matterport3D is a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. An "RGB-D" image pairs a color image (red, green, blue) with a corresponding depth image, so each pixel records how far the surface is from the camera as well as its color. Replica Dataset is a set of high-quality reconstructions of indoor spaces. "Each reconstruction has clean dense geometry, high resolution and high dynamic range textures, glass and mirror surface information, and planar segmentation, as well as semantic class and instance segmentation," the GitHub page reads.
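For readers who have not worked with RGB-D data, the sketch below shows one way to think about it: a color image and a depth map stacked into a single four-channel array, from which a pixel can be turned into a 3D point. The image size, the depth values, and the camera parameters (fx, fy, cx, cy) are made up for illustration and are not taken from Matterport3D or Replica.

```python
import numpy as np

# A tiny synthetic frame: 4 rows by 6 columns.
height, width = 4, 6

# Color channels: every pixel is pure red, stored as 8-bit values.
rgb = np.zeros((height, width, 3), dtype=np.uint8)
rgb[..., 0] = 255

# Depth channel: every pixel is 2.0 meters from the camera.
depth = np.full((height, width), 2.0, dtype=np.float32)

# Stack color (scaled to the range 0..1) and depth into one (H, W, 4) array.
rgbd = np.dstack([rgb.astype(np.float32) / 255.0, depth])


def backproject(u, v, z, fx, fy, cx, cy):
    """Map pixel (u, v) at depth z to a 3D point with a pinhole camera model."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])


# Hypothetical intrinsics with the principal point at pixel (3, 2).
point = backproject(u=3, v=2, z=rgbd[2, 3, 3], fx=5.0, fy=5.0, cx=3.0, cy=2.0)
```

The depth channel is what lets a dataset of flat images be assembled into the 3D scenes an embodied agent can move through.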

"By pursuing these related research agendas and sharing our work with the wider AI community, we hope to accelerate progress in building embodied AI systems and AI assistants that can help people accomplish a wide range of complex tasks in the physical world," they wrote.

The blog post includes details of their experiments with SoundSpaces, useful graphics, and video clips. The team also posted a research paper ("SoundSpaces: Audio-Visual Navigation in 3D Environments") with even more information.

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.