NVIDIA's ChatQA Model Reaches GPT-4 Level Accuracies, Researchers Say

NVIDIA researchers have unveiled a new conversational QA model that they claim matches GPT-4's accuracy. Dubbed ChatQA, the model uses a novel two-stage instruction-tuning approach that, the researchers say, markedly improves the zero-shot conversational QA results of large language models.

The researchers describe ChatQA as "a family of conversational question answering (QA) models that obtain GPT-4 level accuracies."

In their paper ("ChatQA: Building GPT-4 Level Conversational QA Models"), the researchers propose a two-stage instruction-tuning method that could significantly improve the zero-shot conversational QA results of large language models (LLMs). This approach allows LLMs to better integrate context provided by users or retrieved during the conversation, which the researchers found to be critical for accurate responses.

"Zero-shot conversational QA" is a type of question-answering system that can understand and respond to questions in a conversational context without having been specifically trained on that topic or type of question. ("Zero-shot learning" is an area of machine learning in which models are designed to correctly handle tasks that they haven't explicitly been trained to perform.)

The researchers also optimized the Retrieval-Augmented Generation (RAG) framework for conversational QA. RAG is an AI framework for retrieving facts from an external knowledge base to ground LLMs on the most accurate, up-to-date information and to give users insight into LLMs' generative process. The researchers found that fine-tuning state-of-the-art, single-turn query retrievers on a multi-turn QA dataset annotated by humans was as effective as using advanced LLM-based query rewriting models like GPT-3.5-turbo, released by OpenAI in 2022.
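
To make the multi-turn retrieval idea concrete, here is a minimal Python sketch. It assumes the dialogue history is simply concatenated with the latest question to form the retrieval query, and it ranks passages with a toy word-overlap score. The function names, the scoring method and the sample data are all hypothetical illustrations, not the retriever described in the paper, which would be a fine-tuned neural model.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_multiturn_query(history, question):
    """Join earlier turns and the current question into one retrieval query.

    Assumption: plain concatenation stands in for the input that a
    retriever fine-tuned on multi-turn data would receive.
    """
    return " ".join(list(history) + [question])


def overlap_score(query, passage):
    """Toy relevance score: count of shared word tokens (with multiplicity)."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum((q & p).values())


def retrieve(query, passages, top_k=1):
    """Return the top_k passages by overlap score, best first."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:top_k]


# Hypothetical sample data for illustration only.
passages = [
    "ChatQA-70B is built on the Llama2-70B architecture.",
    "The weather in Santa Clara is mild for most of the year.",
]
history = [
    "User: Tell me about ChatQA-70B.",
    "Assistant: It is a conversational QA model.",
]
question = "User: Which architecture is it built on?"

multi_turn_hits = retrieve(build_multiturn_query(history, question), passages)
print(multi_turn_hits[0])
```

The point of the sketch is only the shape of the input: the retriever sees the whole conversation rather than a rewritten, standalone version of the last question.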

Building on this, the researchers developed a series of ChatQA models, leveraging architectures from Llama2-7B up to Llama2-70B and an in-house 8B pretrained GPT model. An extensive study across 10 conversational QA datasets (including five with long documents requiring retrieval and three with tables) showed that their ChatQA-70B model, with an average score of 54.14, outperforms both GPT-3.5-turbo, which scored 50.37, and GPT-4, which scored 53.90, a margin of 0.24 points over the latter. It does so without relying on synthetic data from ChatGPT models.

The study also addressed so-called unanswerable scenarios, where the required answer isn't present in the given or retrieved context, which can lead LLMs to "hallucinate" responses. By incorporating a small number of unanswerable samples during instruction tuning, the models can be guided to produce a "cannot answer" response when appropriate, reducing instances of hallucination, they found. In this area, ChatQA-70B has been shown to outperform GPT-3.5-turbo, though it still trails GPT-4 by approximately 3.5%.
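
As a rough illustration of mixing unanswerable samples into an instruction-tuning set, consider the Python sketch below. It assumes each training sample is a dictionary with "context", "question" and "answer" fields, and it creates unanswerable variants by swapping in an unrelated context and setting a refusal as the target. The refusal wording, the `fraction` parameter and the helper names are hypothetical and not taken from the paper.

```python
import random

# Hypothetical refusal text used as the training target.
CANNOT_ANSWER = "Sorry, I cannot find the answer based on the context."


def make_unanswerable(sample, distractor_context):
    """Swap in a context that lacks the answer and set the refusal as target."""
    return {
        "context": distractor_context,
        "question": sample["question"],
        "answer": CANNOT_ANSWER,
    }


def mix_in_unanswerable(samples, distractor_contexts, fraction=0.1, seed=0):
    """Return the original samples plus a small share of unanswerable variants.

    `fraction` is an illustrative knob, not a value from the paper.
    The input list is left unmodified.
    """
    rng = random.Random(seed)
    n_extra = min(len(samples), max(1, int(len(samples) * fraction)))
    chosen = rng.sample(samples, n_extra)
    extras = [make_unanswerable(s, rng.choice(distractor_contexts)) for s in chosen]
    mixed = samples + extras
    rng.shuffle(mixed)
    return mixed


# Hypothetical toy data for illustration only.
samples = [
    {"context": f"Doc {i}", "question": f"Q{i}?", "answer": f"A{i}"}
    for i in range(10)
]
mixed = mix_in_unanswerable(samples, ["Unrelated passage."], fraction=0.2)
print(len(mixed), sum(s["answer"] == CANNOT_ANSWER for s in mixed))
```

Keeping the unanswerable share small reflects the article's description of "a small number" of such samples; the right proportion in practice would have to be tuned.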

These findings mark a considerable leap forward in conversational AI, potentially leading to more reliable and contextually aware conversational agents in the future.

The paper was authored by a team that includes Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi and Bryan Catanzaro. It was published on arXivLabs, a framework for enabling contributions to Cornell's arXiv, a repository for electronic pre-prints and post-prints approved for posting after moderation, but not peer review. Nearly 2.4 million scholarly articles have been published on this open-access archive in fields ranging from computer science to electrical engineering and systems science, and from quantitative biology to quantitative finance.

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS. He can be reached at