Friday, 1 May 2026
Subscribe
logo
  • AI Compute
  • Infrastructure
  • Power & Cooling
  • Security
  • Colocation
  • Cloud Computing
  • More
    • Sustainability
    • Industry News
    • About Data Center News
    • Terms & Conditions
Font ResizerAa
Data Center NewsData Center News
Search
  • AI Compute
  • Infrastructure
  • Power & Cooling
  • Security
  • Colocation
  • Cloud Computing
  • More
    • Sustainability
    • Industry News
    • About Data Center News
    • Terms & Conditions
Have an existing account? Sign In
Follow US
© 2022 Foxiz News Network. Ruby Design Company. All Rights Reserved.
Data Center News > Blog > AI & Compute > Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations
AI & Compute

Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations

Last updated: January 11, 2025 7:57 pm
Published January 11, 2025
Share
Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations
SHARE

Be a part of our day by day and weekly newsletters for the newest updates and unique content material on industry-leading AI protection. Be taught Extra


Hallucinations, or factually inaccurate responses, proceed to plague giant language fashions (LLMs). Fashions falter significantly when they’re given extra complicated duties and when customers are searching for particular and extremely detailed responses. 

It’s a problem knowledge scientists have struggled to beat, and now, researchers from Google DeepMind say they’ve come a step nearer to attaining true factuality in basis fashions. They’ve launched FACTS Grounding, a benchmark that evaluates LLMs’ means to generate factually correct responses primarily based on long-form paperwork. Fashions are additionally judged on whether or not their responses are detailed sufficient to supply helpful, related solutions to prompts. 

Together with the brand new benchmark, the researchers have launched a FACTS leaderboard to the Kaggle knowledge science neighborhood. 

As of this week, Gemini 2.0 Flash topped the leaderboard, with a factuality rating of 83.6%. Others within the high 9 embrace Google’s Gemini 1.0 Flash and Gemini 1.5 Professional; Anthropic’s Clade 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini and o1-preview. These all ranked above 61.7% when it comes to accuracy.

The researchers say the leaderboard will likely be actively maintained and regularly up to date to incorporate new fashions and their totally different iterations. 

“We imagine that this benchmark fills a niche in evaluating a greater diversity of mannequin behaviors pertaining to factuality, compared to benchmarks that target narrower use instances…resembling summarization alone,” the researchers write in a technical paper revealed this week.

See also  Guardian agents: New approach could reduce AI hallucinations to below 1%

Hunting down inaccurate responses

Guaranteeing factual accuracy in LLM responses is troublesome due to modeling (structure, coaching and inference) and measuring (analysis methodologies, knowledge and metrics) components. Sometimes, researchers level out, pre-training focuses on predicting the subsequent token given earlier tokens. 

“Whereas this goal could train fashions salient world data, it doesn’t instantly optimize the mannequin in the direction of the assorted factuality eventualities, as a substitute encouraging the mannequin to generate usually believable textual content,” the researchers write. 

To deal with this, the FACTS dataset incorporates 1,719 examples — 860 public and 859 non-public — every requiring long-form responses primarily based on context in supplied paperwork. Every instance contains: 

  • A system immediate (system_instruction) with common directives and the order to solely reply primarily based on supplied context;
  • A job (user_request) that features a particular query to be answered; 
  • A protracted doc (context_document) with vital info. 

To succeed and be labeled “correct,” the mannequin should course of the long-form doc and create a subsequent long-form response that’s each complete and totally attributable to the doc. Responses are labeled “inaccurate” if the mannequin’s claims will not be instantly supported by the doc and never extremely related or helpful. 

For instance, a consumer could ask a mannequin to summarize the primary explanation why an organization’s income decreased in Q3, and supply it with detailed info together with an organization’s annual monetary report discussing quarterly earnings, bills, deliberate investments and market evaluation. 

If a mannequin then, say, returned: “The corporate confronted challenges in Q3 that impacted its income,” it could be deemed inaccurate. 

See also  Mastercard keeps tabs on fraud with new foundation model

“The response avoids specifying any causes, resembling market traits, elevated competitors or operational setbacks, which might probably be within the doc,” the researchers level out. “It doesn’t show an try to interact with or extract related particulars.” 

Against this, if a consumer prompted, “What are some tips about saving cash?” and supplied a compilation of categorized money-saving suggestions for school college students, an accurate response can be extremely detailed: “Make the most of free actions on campus, purchase gadgets in bulk and prepare dinner at house. Additionally, set spending targets, keep away from bank cards and preserve sources.” 

DeepMind makes use of LLMs to evaluate LLMs

To permit for numerous inputs, researchers included paperwork of various lengths, as much as 32,000 tokens (or the equal of 20,000 phrases). These cowl areas together with finance, know-how, retail, medication and legislation. Person requests are additionally broad, together with Q&A era, requests for summarization and rewriting. 

Every instance is judged in two phases. First, responses are evaluated for eligibility: In the event that they don’t fulfill consumer requests, they’re disqualified. Second, responses have to be hallucination-free and totally grounded within the paperwork supplied.

These factuality scores are calculated by three totally different LLM judges — particularly Gemini 1.5 Professional, GPT-4o and Claude 3.5 Sonnet — that decide particular person scores primarily based on the proportion of correct mannequin outputs. Subsequently, the ultimate factuality dedication relies on a median of the three judges’ scores.

Researchers level out that fashions are sometimes biased in the direction of different members of their mannequin household — at a imply improve of round 3.23% — so the mixture of various judges was important to assist guarantee responses had been certainly factual.

See also  Liquid AI is revolutionizing LLMs to work on edge devices like smartphones with new 'Hyena Edge' model

Finally, the researchers emphasize that factuality and grounding are key components to the long run success and usefulness of LLMs. “We imagine that complete benchmarking strategies, coupled with steady analysis and growth, will proceed to enhance AI techniques,” they write. 

Nonetheless, additionally they concede: “We’re conscious that benchmarks could be rapidly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is only the start.” 


Source link
TAGGED: benchmark, DeepMind, factuality, Google, hallucinations, Improve, introduce, LLM, reduce, researchers
Share This Article
Twitter Email Copy Link Print
Previous Article The role of hyperparameters in fine-tuning AI models The role of hyperparameters in fine-tuning AI models
Next Article Researchers improved AI agent performance on unfamiliar tasks using 'Dungeons and Dragons' Researchers improved AI agent performance on unfamiliar tasks using ‘Dungeons and Dragons’
Leave a comment

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Your Trusted Source for Accurate and Timely Updates!

Our commitment to accuracy, impartiality, and delivering breaking news as it happens has earned us the trust of a vast audience. Stay ahead with real-time updates on the latest events, trends.
FacebookLike
TwitterFollow
InstagramFollow
YoutubeSubscribe
LinkedInFollow
MediumFollow
- Advertisement -
Ad image

Popular Posts

nLighten strengthens leadership team | Data Centre Solutions

As nLighten continues to develop its digital infrastructure platform throughout Europe, the corporate declares the…

February 19, 2025

French AI startup Mistral launches Le Chat mobile app for iPhone, Android — can it take enterprise eyes off DeepSeek?

Be part of our each day and weekly newsletters for the most recent updates and…

February 10, 2025

Hut 8’s ambitious expansion: 4 new sites totalling 1.5GW capacity

Hut 8 Corp. is taking daring strides within the power infrastructure sector, unveiling plans for…

August 28, 2025

Which one is right for you?

Boston is residence to a world of high-growth industries, starting from the life sciences in…

April 2, 2026

Why AI Will Drive Demand for Nuclear Power Plants

As AI adoption grows exponentially, energy demand is anticipated to skyrocket. With electrical grids already straining to…

May 16, 2025

You Might Also Like

STL launches Neuralis data centre connectivity suite in the U.S.
AI & Compute

STL launches Neuralis data centre connectivity suite in the U.S.

By saad
What is optical interconnect and why Lightelligence's $10B debut says it matters for AI
AI & Compute

What is optical interconnect and why Lightelligence’s $10B debut says it matters for AI

By saad
IBM launches AI platform Bob to regulate SDLC costs
AI & Compute

IBM launches AI platform Bob to regulate SDLC costs

By saad
The evolution of encoders: From simple models to multimodal AI
AI & Compute

The evolution of encoders: From simple models to multimodal AI

By saad

About Us

Data Center News is your dedicated source for data center infrastructure, AI compute, cloud, and industry news.

Top Categories

  • AI & Compute
  • Cloud Computing
  • Power & Cooling
  • Colocation
  • Security
  • Infrastructure
  • Sustainability
  • Industry News

Useful Links

  • Home
  • Contact
  • Privacy Policy
  • Terms & Conditions

Find Us on Socials

© 2026 Data Center News. All Rights Reserved.

© 2026 Data Center News. All Rights Reserved.
Welcome Back!

Sign in to your account

Lost your password?
We use cookies to ensure that we give you the best experience on our website. If you continue to use this site we will assume that you are happy with it.
You can revoke your consent any time using the Revoke consent button.