Facebook Open Sources PyTorch Tool for 'Extremely Large' Graphs

Facebook today announced that it has developed and released PyTorch-BigGraph (PBG), a new open source tool that "makes it much faster and easier to produce graph embeddings for extremely large graphs."

Working effectively with such graphs -- for example, embedding multi-relation graphs whose models are too large to fit in memory -- is crucial to advancing artificial intelligence (AI) research and applications, Facebook's AI team commented.

"PBG is faster than commonly used embedding software and produces embeddings of comparable quality to state-of-the-art models on standard benchmarks," the company said in its announcement.

"With this new tool, anyone can take a large graph and quickly produce high-quality embeddings using a single machine or multiple machines in parallel."

Graphs are widely used in programming to represent data, but they can become particularly large and complicated in AI projects because of the sheer amount of data involved. As an example of what PBG can handle, the PBG development team has released "the first published embeddings of the full Wikidata graph of 50 million Wikipedia concepts." The .tsv file for this project can be downloaded here.

Facebook is also promoting how the tool improves upon the current process of embedding extremely large graphs by using block partitioning and different training methods for embeddings, among other techniques:

"There are two challenges for embedding graphs of this size. First, an embedding system must be fast enough to allow for practical research and production uses. With existing methods, for example, training a graph with a trillion edges could take weeks or even years. Memory is a second significant challenge. For example, embedding two billion nodes with 128 float parameters per node would require 1 terabyte of parameters. That exceeds the memory capacity of commodity servers.

PBG uses a block partitioning of the graph to overcome the memory limitations of graph embeddings. Nodes are randomly divided into P partitions that are sized so that two partitions can fit in memory. The edges are then divided into P² buckets based on their source and destination node ... Once the nodes and edges are partitioned, training can be performed on one bucket at a time. The training of bucket (i, j) only requires the embeddings for partitions i and j to be stored in memory.

PBG provides two ways to train embeddings of partitioned graph data. In single-machine training, embeddings and edges are swapped out to disk when they are not being used. In distributed training, embeddings are distributed across the memory of multiple machines."
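
To make the scheme in Facebook's description concrete, here is a minimal Python sketch of the memory arithmetic and the block partitioning. It is an illustration only, not PBG's code: the function names are hypothetical, the 4-byte float size is an assumption (the announcement says only "128 float parameters"), and dealing shuffled nodes out evenly is just one way to "randomly divide" them.

```python
import random
from collections import defaultdict


def embedding_bytes(num_nodes, dim, bytes_per_float=4):
    """Memory needed for the embedding table (assumes 4-byte floats)."""
    return num_nodes * dim * bytes_per_float


def partition_nodes(num_nodes, num_partitions, seed=0):
    """Randomly assign each node to one of P partitions of near-equal size."""
    rng = random.Random(seed)
    order = list(range(num_nodes))
    rng.shuffle(order)
    partition = [0] * num_nodes
    for rank, node in enumerate(order):
        # Deal the shuffled nodes out round-robin so partitions stay balanced.
        partition[node] = rank % num_partitions
    return partition


def bucket_edges(edges, node_partition):
    """Group edges into up to P*P buckets keyed by (source part, dest part)."""
    buckets = defaultdict(list)
    for src, dst in edges:
        key = (node_partition[src], node_partition[dst])
        buckets[key].append((src, dst))
    return buckets


if __name__ == "__main__":
    # Two billion nodes x 128 floats x 4 bytes is about 1 terabyte.
    print(embedding_bytes(2_000_000_000, 128))  # 1024000000000

    parts = partition_nodes(num_nodes=8, num_partitions=2)
    toy_edges = [(0, 1), (2, 3), (4, 5), (6, 7), (1, 6)]
    for (i, j), bucket in sorted(bucket_edges(toy_edges, parts).items()):
        # Training this bucket would touch only the embeddings of partitions i and j.
        print(f"bucket ({i}, {j}): {bucket}")
```

Because every edge in bucket (i, j) has its source in partition i and its destination in partition j, a trainer working through the buckets one at a time never needs more than two partitions' embeddings in memory, which is the property the announcement relies on.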

PBG also has features to more effectively deal with negative sampling, including the use of entity types.
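
The announcement does not spell out how entity types enter negative sampling. One common approach, sketched below in Python as an assumption rather than as PBG's actual implementation, is to build "negative" (false) edges by swapping the destination of a real edge for another entity of the same type, so the model is not trained on trivially implausible examples. All names and the toy data here are hypothetical.

```python
import random


def sample_negatives(edge, entities_by_type, entity_type, num_negatives, rng):
    """Corrupt the destination of `edge` with same-type entities (illustrative)."""
    src, rel, dst = edge
    # Only entities sharing the true destination's type are valid replacements.
    candidates = [e for e in entities_by_type[entity_type[dst]] if e != dst]
    k = min(num_negatives, len(candidates))
    return [(src, rel, neg) for neg in rng.sample(candidates, k)]


if __name__ == "__main__":
    entity_type = {"alice": "person", "bob": "person",
                   "paris": "city", "rome": "city", "oslo": "city"}
    entities_by_type = {"person": ["alice", "bob"],
                        "city": ["paris", "rome", "oslo"]}
    rng = random.Random(0)
    # Every negative keeps a city in the destination slot, never a person.
    print(sample_negatives(("alice", "lives_in", "paris"),
                           entities_by_type, entity_type, 2, rng))
```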

More information on how PBG works can be found here. The GitHub page for the project is located here.

About the Author

Becky Nagel is the vice president of Web & Digital Strategy for 1105's Converge360 Group, where she oversees the front-end Web team and deals with all aspects of digital projects at the company, including launching and running the group's popular virtual summit and Coffee talk series. She is an experienced tech journalist (20 years) and, before her current position, was the editorial director of the group's sites. A few years ago she gave a talk at a leading technical publishers conference about how changes in Web browser technology would impact online advertising for publishers. Follow her on Twitter @beckynagel.