
Meta Unveiled 'Fastest' AI Supercomputer

Meta, the company formerly known as Facebook, unveiled a new supercomputer this week that it claims will be the fastest in the world once it is fully built out in mid-2022.

Called the AI Research SuperCluster (RSC), the next-gen supercomputer will be able to perform quintillions of operations per second, the company says, helping its artificial intelligence (AI) researchers build better AI models that can learn from trillions of examples. Those researchers have already begun using RSC to train large models in natural language processing (NLP) and computer vision for research, with the aim of eventually training models with trillions of parameters, the company said.

"We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together," said technical program manager Kevin Lee and software engineer Shubho Sengupta, in a company blog post. "Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role."

AI supercomputers are built by combining multiple GPUs into compute nodes, Lee and Sengupta explained, which are then connected by a high-performance network fabric to allow fast communication among those GPUs. RSC today comprises a total of 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs, with each A100 GPU being more powerful than the V100 used in the company's previous system. Each DGX communicates via an NVIDIA Quantum 1600 Gb/s InfiniBand two-level Clos fabric that has no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
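As a quick sanity check on the figures above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the variable and function names are made up for this example, and the numbers are the ones quoted in the paragraph above.

```python
# Figures quoted in the article for the current RSC build.
DGX_NODES = 760
TOTAL_GPUS = 6_080
STORAGE_PB = {
    "Pure Storage FlashArray": 175,
    "Penguin Computing Altus cache": 46,
    "Pure Storage FlashBlade": 10,
}


def gpus_per_node(total_gpus: int, nodes: int) -> int:
    """Return the GPU count per node, assuming GPUs are spread evenly."""
    if total_gpus % nodes != 0:
        raise ValueError("GPU count is not an even multiple of the node count")
    return total_gpus // nodes


per_node = gpus_per_node(TOTAL_GPUS, DGX_NODES)
total_storage_pb = sum(STORAGE_PB.values())

print(f"GPUs per DGX A100 node: {per_node}")
print(f"Total storage across tiers: {total_storage_pb} PB")
```

The quoted totals work out to eight GPUs per DGX A100 node and 231 petabytes across the three storage tiers.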

Development of RSC began during the pandemic as a completely remote project, the company said, starting from a single shared document and growing into a functioning cluster in about a year and a half.

COVID-19 and industry-wide wafer supply constraints also caused supply chain issues that made it difficult to get everything from chips to components like optics and GPUs, and even construction materials, the company said.

"To build this cluster efficiently, we had to design it from scratch, creating many entirely new Meta-specific conventions and rethinking previous ones along the way," Lee and Sengupta wrote. "We had to write new rules around our data center designs, including their cooling, power, rack layout, cabling, and networking (including a completely new control plane), among other important considerations. We had to ensure that all the teams, from construction to hardware to software and AI, were working in lockstep and in coordination with our partners."

In its announcement, Meta emphasized the benefits of AI for ferreting out harmful content, and implied that RSC would provide the new computing infrastructure to "accelerate progress" in that area.

"With RSC, we can more quickly train models that use multimodal signals to determine whether an action, sound or image is harmful or benign," the company said in a statement. "This research will not only help keep people safe on our services today, but also in the future, as we build for the metaverse. As RSC moves into its next phase, we plan for it to grow bigger and more powerful, as we begin laying the groundwork for the metaverse."

RSC is up and running today, the company says, but its development is ongoing. When phase two of the project is completed later this year, Meta expects to have the fastest AI supercomputer in the world, performing at nearly 5 exaflops of mixed-precision compute. The company says it will continue to work through the coming year to increase the number of GPUs from 6,080 to 16,000, which will increase AI training performance by more than 2.5x. The InfiniBand fabric will expand to support 16,000 ports in a two-layer topology with no oversubscription, the company says, and the storage system will have a target delivery bandwidth of 16 TB/s and exabyte-scale capacity to meet increased demand.
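The phase-two targets can be put in per-GPU terms with some rough arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, "nearly 5 exaflops" is rounded up to 5, and 1 TB is assumed to mean 1,000 GB. The per-GPU values are implied averages, not specifications published by Meta.

```python
# Phase-two targets quoted in the article.
CURRENT_GPUS = 6_080
PLANNED_GPUS = 16_000
TARGET_EXAFLOPS = 5        # "nearly 5 exaflops", rounded up (assumption)
TARGET_STORAGE_TB_S = 16   # target storage delivery bandwidth in TB/s

# Ratio of planned to current GPU count.
gpu_growth = PLANNED_GPUS / CURRENT_GPUS

# Implied average mixed-precision throughput per GPU, in teraflops.
implied_tflops_per_gpu = TARGET_EXAFLOPS * 1e18 / PLANNED_GPUS / 1e12

# Implied average storage bandwidth per GPU, in GB/s (1 TB = 1,000 GB assumed).
storage_gb_s_per_gpu = TARGET_STORAGE_TB_S * 1_000 / PLANNED_GPUS

print(f"GPU count grows by about {gpu_growth:.2f}x")
print(f"Implied throughput per GPU: {implied_tflops_per_gpu:.1f} teraflops")
print(f"Implied storage bandwidth per GPU: {storage_gb_s_per_gpu:.1f} GB/s")
```

The GPU count grows by roughly 2.63x, which is consistent with the company's claim of more than 2.5x, although training performance does not necessarily scale in direct proportion to GPU count.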

"We expect such a step function change in compute capability to enable us not only to create more accurate AI models for our existing services," Lee and Sengupta wrote, "but also to enable completely new user experiences, especially in the metaverse. Our long-term investments in self-supervised learning and in building next-generation AI infrastructure with RSC are helping us create the foundational technologies that will power the metaverse and advance the broader AI community as well."

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at jwaters@converge360.com.
