Facebook Open Sources PyTorch Tool for 'Extremely Large' Graphs
Facebook today announced that it has developed and released PyTorch-BigGraph (PBG), a new open source tool that "makes it much faster and easier to produce graph embeddings for extremely large graphs."
Being able to work effectively with such graphs -- for example, embedding multi-relation graphs whose models are too large to fit in memory -- is crucial to advancing AI research and its real-world applications, Facebook's AI team said.
"PBG is faster than commonly used embedding software and produces embeddings of comparable quality to state-of-the-art models on standard benchmarks," the company said in its announcement.
"With this new tool, anyone can take a large graph and quickly produce high-quality embeddings using a single machine or multiple machines in parallel."
Graphs are used across all types of programming to represent data, but in AI projects they can grow particularly large and complicated because of the sheer amount of data involved. As an example of what PBG can handle, the PBG development team has released "the first published embeddings of the full Wikidata graph of 50 million Wikipedia concepts." The .tsv file for this project can be downloaded here.
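The published file is a plain tab-separated table. The following is a minimal loading sketch, assuming each row holds an entity identifier followed by its embedding values; the file name and exact column layout are illustrative assumptions and should be checked against the actual release before use.

```python
import numpy as np

def load_embeddings_tsv(path, max_rows=None):
    """Load a tab-separated embeddings file into a dict of name -> vector.

    Assumes each line looks like: <entity_id>\t<v1>\t<v2>\t...\t<vN>.
    This layout is an assumption for illustration; verify it against
    the downloaded file before relying on it.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if max_rows is not None and i >= max_rows:
                break
            parts = line.rstrip("\n").split("\t")
            name, values = parts[0], parts[1:]
            try:
                embeddings[name] = np.asarray(values, dtype=np.float32)
            except ValueError:
                # Skip a header or malformed row rather than failing outright.
                continue
    return embeddings

# Example: peek at the first few rows of the downloaded file
# (file name is a placeholder, not the official release name).
# vectors = load_embeddings_tsv("wikidata_embeddings.tsv", max_rows=1000)
```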
Facebook also explained how the tool improves on the current process of embedding extremely large graphs through block partitioning and multiple training modes, among other techniques:
"There are two challenges for embedding graphs of this size. First, an embedding system must be fast enough to allow for practical research and production uses. With existing methods, for example, training a graph with a trillion edges could take weeks or even years. Memory is a second significant challenge. For example, embedding two billion nodes with 128 float parameters per node would require 1 terabyte of parameters. That exceeds the memory capacity of commodity servers.
PBG uses a block partitioning of the graph to overcome the memory limitations of graph embeddings. Nodes are randomly divided into P partitions that are sized so that two partitions can fit in memory. The edges are then divided into P² buckets based on their source and destination node. ... Once the nodes and edges are partitioned, training can be performed on one bucket at a time. The training of bucket (i, j) only requires the embeddings for partitions i and j to be stored in memory.
PBG provides two ways to train embeddings of partitioned graph data. In single-machine training, embeddings and edges are swapped out to disk when they are not being used. In distributed training, embeddings are distributed across the memory of multiple machines."
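To make the quoted memory arithmetic and the bucketed training loop concrete, here is a minimal, hypothetical sketch in plain Python/NumPy (not PBG's actual API): it first checks the roughly 1 TB figure for two billion 128-float embeddings, then shows how training one bucket (i, j) at a time only needs partitions i and j resident in memory.

```python
import numpy as np

# Memory check from the quote: 2 billion nodes x 128 floats x 4 bytes each.
num_nodes, dim = 2_000_000_000, 128
print(num_nodes * dim * 4 / 1e12, "TB")   # ~1.02 TB of parameters

# --- Toy illustration of block partitioning (not PBG's real code) ---
P = 4                       # number of node partitions; any two must fit in memory
rng = np.random.default_rng(0)

toy_nodes, toy_dim = 1_000, 16                           # tiny graph so the sketch runs
partition_of = rng.integers(0, P, size=toy_nodes)        # random node -> partition assignment
edges = rng.integers(0, toy_nodes, size=(5_000, 2))      # (source, destination) node pairs

# Group edges into P^2 buckets keyed by (source partition, destination partition).
buckets = {}
for src, dst in edges:
    key = (partition_of[src], partition_of[dst])
    buckets.setdefault(key, []).append((src, dst))

# Per-partition embeddings stand in for data swapped out to disk;
# only the two partitions for the current bucket are "loaded" at once.
stored = {p: rng.normal(size=(int(np.sum(partition_of == p)), toy_dim)).astype(np.float32)
          for p in range(P)}

for (i, j), bucket_edges in buckets.items():
    emb_i = stored[i]                          # load partition i into memory
    emb_j = stored[j] if j != i else emb_i     # and partition j
    # ... run the embedding updates for the edges in bucket (i, j) here ...
    stored[i], stored[j] = emb_i, emb_j        # write back before moving to the next bucket
```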
PBG also has features to more effectively deal with negative sampling, including the use of entity types.
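PBG's exact negative-sampling routine is described in its documentation; the snippet below is only a conceptual sketch of type-aware negative sampling, under the assumption that negatives are produced by corrupting one side of a true edge and that replacement candidates are restricted to nodes of the same entity type. The data and function names are hypothetical.

```python
import random

# Hypothetical data: each node has an entity type; edges are (source, relation, destination).
entity_type = {"paris": "city", "london": "city", "france": "country", "uk": "country"}
nodes_by_type = {}
for node, etype in entity_type.items():
    nodes_by_type.setdefault(etype, []).append(node)

def sample_negatives(edge, num_negatives=3, corrupt="destination"):
    """Corrupt one side of a true edge, drawing replacements only from nodes
    that share the original node's entity type (an illustration of type-aware
    negative sampling, not PBG's actual implementation)."""
    src, rel, dst = edge
    original = dst if corrupt == "destination" else src
    candidates = [n for n in nodes_by_type[entity_type[original]] if n != original]
    negatives = []
    for _ in range(num_negatives):
        replacement = random.choice(candidates)
        negatives.append((src, rel, replacement) if corrupt == "destination"
                         else (replacement, rel, dst))
    return negatives

# Corrupting the destination of a true edge yields same-type negatives only.
print(sample_negatives(("paris", "capital_of", "france")))
```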
More information on how PBG works can be found here. The GitHub page for the project is located here.
About the Author
Becky Nagel serves as vice president of AI for 1105 Media specializing in developing media, events and training for companies around AI and generative AI technology. She also regularly writes and reports on AI news, and is the founding editor of PureAI.com. She's the author of "ChatGPT Prompt 101 Guide for Business Users" and other popular AI resources with a real-world business perspective. She regularly speaks, writes and develops content around AI, generative AI and other business tech. Find her on X/Twitter @beckynagel.