Microsoft Open Sources AI Tool for Exploring Large Document Datasets

Earlier this month Microsoft announced GraphRAG, a "complex data discovery" tool that allows "a graph-based approach to retrieval-augmented generation (RAG) that enables question-answering over private or previously unseen datasets."

Available on GitHub, the tool "uses a large language model (LLM) to automate the extraction of a rich knowledge graph from any collection of text documents," the company said in its announcement of the release.

One of the most exciting features of this graph-based data index is its ability to report on the semantic structure of the data prior to any user queries. It does this by detecting communities of densely connected nodes in a hierarchical fashion, partitioning the graph at multiple levels from high-level themes to low-level topics…[creating] a hierarchical summary of the data, providing an overview of a dataset without needing to know which questions to ask in advance. Each community serves as the basis of a community summary that describes its entities and their relationships.
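To make the idea of multi-level communities concrete, here is a minimal Python sketch. It is not GraphRAG's algorithm or API: the entity names, edge weights, and the `communities` and `hierarchy` helpers are all hypothetical, and simple connected components over thresholded edge weights stand in for a real community detection method. It only illustrates how one graph can be partitioned coarsely (broad themes) and finely (narrow topics), with each group available as the basis for a summary.

```python
from collections import defaultdict

# Hypothetical entity graph: (entity_a, entity_b, weight). The weight is an
# assumed stand-in for how strongly two entities are related in the text.
EDGES = [
    ("alice", "bob", 0.9),
    ("bob", "carol", 0.8),
    ("carol", "dave", 0.4),
    ("dave", "erin", 0.9),
    ("erin", "frank", 0.2),
    ("frank", "grace", 0.7),
]


def communities(edges, min_weight):
    """Group nodes into connected components, using only edges whose
    weight is at least min_weight (a stand-in for community detection)."""
    adjacency = defaultdict(set)
    for a, b, weight in edges:
        adjacency[a]  # register both nodes even if the edge is dropped
        adjacency[b]
        if weight >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, result = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:  # depth-first walk of one component
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(adjacency[node] - members)
        seen |= members
        result.append(sorted(members))
    return result


def hierarchy(edges, thresholds):
    """Map each threshold to its partition; higher thresholds give
    smaller groups nested inside the lower-threshold groups."""
    return {t: communities(edges, t) for t in sorted(thresholds)}


if __name__ == "__main__":
    for threshold, groups in hierarchy(EDGES, [0.3, 0.6]).items():
        print(f"threshold {threshold}:")
        for group in groups:
            # A real pipeline would ask an LLM to summarize each group here.
            print("  community:", ", ".join(group))
```

With this toy data, the lower threshold yields two broad groups and the higher threshold splits the larger one in two, which mirrors the themes-to-topics hierarchy the announcement describes.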

Along with making GraphRAG open source, Microsoft also released a solution accelerator to help speed up the implementation process.

Microsoft suggests using the tool when naive RAG options aren't pulling the output needed for LLM training. However, it may not work in every scenario:

LLMs can successfully derive rich knowledge graphs from unstructured text inputs, and these graphs can support a new class of global queries for which (a) naive RAG cannot generate appropriate responses, and (b) hierarchical source text summarization is prohibitively expensive per query. The overall suitability of GraphRAG for any given use case, however, depends on whether the benefits of structured knowledge representations, readymade community summaries, and support for global queries outweigh the upfront costs of graph index construction.

You can read the full announcement here.

About the Author

Becky Nagel serves as vice president of AI for 1105 Media, specializing in developing media, events and training for companies around AI and generative AI technology. She also regularly writes and reports on AI news, and is the founding editor of PureAI.com. She's the author of "ChatGPT Prompt 101 Guide for Business Users" and other popular AI resources with a real-world business perspective. She regularly speaks, writes and develops content around AI, generative AI and other business tech. Find her on X/Twitter @beckynagel.
