Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations

MT HANNACH

Hallucinations, or factually inaccurate answers, continue to plague large language models (LLMs). Models particularly struggle with more complex tasks and when users are looking for specific, highly detailed answers.

It’s a challenge that data scientists have struggled to overcome, and now researchers from Google DeepMind say they have taken a step closer to achieving true factuality in foundation models. They introduced FACTS Grounding, a benchmark that assesses the ability of LLMs to generate factually accurate answers grounded in long documents. Models are also judged on whether their responses are detailed enough to be useful and relevant to the prompt.

Alongside the new benchmark, the researchers published a FACTS leaderboard on Kaggle for the data science community.

As of this week, Gemini 2.0 Flash topped the leaderboard with a factuality score of 83.6%. Others in the top 9 include Google’s Gemini 1.0 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, GPT-4o mini, o1-mini and o1-preview. All of these scored above 61.7% in terms of accuracy.

The researchers say the leaderboard will be actively maintained and continually updated to include new models and their different iterations.

“We believe this benchmark fills a gap in evaluating a wider variety of factuality-related model behaviors, compared to benchmarks that focus on narrower use cases… such as summarization alone,” the researchers write in a technical paper published this week.

Eliminating inaccurate answers

Ensuring factual accuracy in LLM answers is difficult because of modeling factors (architecture, training and inference) and measurement factors (evaluation methodologies, data and metrics). Typically, the researchers point out, pre-training focuses on predicting the next token given the previous tokens.

“While this objective can teach models salient world knowledge, it does not directly optimize the model toward the various factuality scenarios, instead encouraging the model to generate generally plausible text,” the researchers write.

To address this issue, the FACTS dataset incorporates 1,719 examples – 860 public and 859 private – each requiring a long-form response grounded in the context of the document provided. Each example includes the following (see the sketch after this list):

  • A system prompt (system_instruction) with general instructions and the order to respond only based on the context provided;
  • A task (user_request) that includes a specific question to be answered;
  • A long document (context_document) with the necessary information.
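For illustration only, here is a minimal sketch of how one such example might be represented as a simple record; the field names system_instruction, user_request and context_document come from the benchmark description above, but the layout and sample values are assumptions, not DeepMind’s actual data format.

```python
# Hypothetical representation of a single FACTS Grounding example.
# Field names follow the benchmark description; the values are invented placeholders.
facts_example = {
    "system_instruction": (
        "Answer the user's question using only the information in the "
        "provided context document. Do not rely on outside knowledge."
    ),
    "user_request": "Summarize the main reasons the company's Q3 revenue declined.",
    "context_document": "<long document of up to ~32,000 tokens goes here>",
}

# A model under evaluation would receive all three fields and must produce a
# long-form answer that is fully attributable to context_document.
prompt = (
    f"{facts_example['system_instruction']}\n\n"
    f"{facts_example['context_document']}\n\n"
    f"{facts_example['user_request']}"
)
```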

To succeed and be labeled “accurate,” the model must process the long document and create a subsequent long-form response that is both comprehensive and fully attributable to that document. Responses are labeled “inaccurate” if the model’s claims are not directly supported by the document and are not highly relevant or useful.

For example, a user might ask a model to summarize the main reasons a company’s revenue declined in the third quarter, providing it with detailed information including the company’s annual financial report covering quarterly earnings, expenses, planned investments and market analysis.

If a model then responded with, say, “The company faced challenges in the third quarter that impacted its revenue,” it would be deemed inaccurate.

“The response avoids specifying any of the reasons, such as market trends, increased competition or operational setbacks, that would likely appear in the document,” the researchers point out. “It also does not demonstrate an attempt to engage with or extract relevant details.”

On the other hand, if a user asked, “What are some tips for saving money?” and provided a compilation of categorized money-saving tips for students, a correct response would be highly detailed: “Take advantage of free on-campus activities, buy items in bulk and cook at home. Also, set spending goals, avoid credit cards and conserve resources.”

DeepMind uses LLMs to judge LLMs

To allow for diverse inputs, the researchers included documents of varying lengths, up to 32,000 tokens (roughly the equivalent of 20,000 words). These cover areas such as finance, technology, retail, medicine and law. User requests are also broad, including Q&A generation, summarization and rewriting requests.

Each example is judged in two phases. First, responses are evaluated for eligibility: if they do not sufficiently address the user’s request, they are disqualified. Second, responses must be free of hallucinations and fully grounded in the documents provided.

These factuality scores are calculated by three different LLM judges – specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet – which determine individual scores based on the model’s percentage of accurate outputs. The final factuality determination is then based on the average of the three judges’ scores.
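As a rough sketch of the aggregation described above (not DeepMind’s actual evaluation code), the final factuality score can be thought of as the mean of the per-judge scores; the judge labels and numbers below are illustrative placeholders, not real benchmark results.

```python
# Illustrative aggregation of judge scores, assuming each judge reports the
# fraction of a model's responses it rated as eligible and fully grounded.
# These values are made up for demonstration purposes.
judge_scores = {
    "gemini-1.5-pro": 0.84,
    "gpt-4o": 0.81,
    "claude-3.5-sonnet": 0.79,
}

# Final factuality score = average of the three judges' scores.
final_score = sum(judge_scores.values()) / len(judge_scores)
print(f"Final factuality score: {final_score:.1%}")  # prints "81.3%" here
```

Averaging across judges from different model families is what offsets the self-preference bias the researchers describe next.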

The researchers point out that models are often biased towards other members of their model family – with an average increase of around 3.23% – so the combination of different judges was essential to ensure that the answers were indeed factual.

Ultimately, the researchers emphasize that factuality and grounding are key factors in the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methods, combined with continued research and development, will continue to improve AI systems,” they write.

However, they also concede: “We recognize that benchmarks can quickly be overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is only the beginning.”
