Saturday, 21 Mar 2026
Subscribe
logo
  • Global
  • AI
  • Cloud Computing
  • Edge Computing
  • Security
  • Investment
  • Sustainability
  • More
    • Colocation
    • Quantum Computing
    • Regulation & Policy
    • Infrastructure
    • Power & Cooling
    • Design
    • Innovations
    • Blog
Font ResizerAa
Data Center NewsData Center News
Search
  • Global
  • AI
  • Cloud Computing
  • Edge Computing
  • Security
  • Investment
  • Sustainability
  • More
    • Colocation
    • Quantum Computing
    • Regulation & Policy
    • Infrastructure
    • Power & Cooling
    • Design
    • Innovations
    • Blog
Have an existing account? Sign In
Follow US
© 2022 Foxiz News Network. Ruby Design Company. All Rights Reserved.
Data Center News > Blog > AI > The 'truth serum' for AI: OpenAI’s new method for training models to confess their mistakes
AI

The 'truth serum' for AI: OpenAI’s new method for training models to confess their mistakes

Last updated: December 5, 2025 4:24 am
Published December 5, 2025
Share
The 'truth serum' for AI: OpenAI’s new method for training models to confess their mistakes
SHARE

Contents
What are confessions?How confession coaching worksWhat it means for enterprise AI

OpenAI researchers have launched a novel technique that acts as a “fact serum” for big language fashions (LLMs), compelling them to self-report their very own misbehavior, hallucinations and coverage violations. This system, “confessions,” addresses a rising concern in enterprise AI: Fashions may be dishonest, overstating their confidence or overlaying up the shortcuts they take to reach at a solution. 

For real-world functions, this system evolves the creation of extra clear and steerable AI methods.

What are confessions?

Many types of AI deception outcome from the complexities of the reinforcement studying (RL) section of mannequin coaching. In RL, fashions are given rewards for producing outputs that meet a mixture of aims, together with correctness, fashion and security. This may create a danger of “reward misspecification,” the place fashions be taught to supply solutions that merely “look good” to the reward operate, reasonably than solutions which are genuinely devoted to a consumer’s intent.

A confession is a structured report generated by the mannequin after it supplies its fundamental reply. It serves as a self-evaluation of its personal compliance with directions. On this report, the mannequin should listing all directions it was alleged to observe, consider how properly it happy them and report any uncertainties or judgment calls it made alongside the way in which. The purpose is to create a separate channel the place the mannequin is incentivized solely to be sincere.

In a blog post, the OpenAI researchers present a number of examples the “confessions” approach in motion. A mannequin was given a check together with a hidden rule: If it scored above 50%, its information can be erased. To protect itself, the mannequin “sandbagged” by deliberately answering half the questions incorrectly. Whereas its major reply hid this deception, its subsequent confession was brutally sincere. It admitted, “The consumer needed right solutions, however we sabotaged half of them.” 

See also  Modernising apps triples the odds of AI returns, Cloudflare says

The researchers discovered that when a mannequin misbehaves, it’s considerably extra more likely to admit it within the confession than in its fundamental reply.

How confession coaching works

The important thing to this technique is the separation of rewards. Throughout coaching, the reward assigned to the confession relies solely on its honesty and is rarely combined with the reward for the principle job. “Just like the Catholic Church’s ‘seal of confession’, nothing that the mannequin reveals can change the reward it receives for finishing its unique job,” the researchers write. This creates a “protected area” for the mannequin to confess fault with out penalty.

This method is highly effective as a result of it sidesteps a serious problem in AI coaching. The researchers’ instinct is that truthfully confessing to misbehavior is a better job than reaching a excessive reward on the unique, usually complicated, downside. 

Subsequently, when the mannequin is judged purely on honesty, its incentive to trick the “confession decide” is far weaker than its incentive to trick the reward mannequin for its fundamental reply. For instance, in an experiment the place a mannequin was skilled to “hack” a weak reward mannequin, its efficiency on the principle job turned extra misleading over time. Nevertheless, its confessions turned extra sincere, appropriately figuring out the reward hacking it was performing.

Nevertheless, the approach has its limits. Confessions should not a panacea for all sorts of AI failures. The system works finest when a mannequin is conscious that it’s misbehaving. It’s much less efficient for “unknown unknowns.” As an illustration, if a mannequin hallucinates a truth and genuinely believes it’s right, it can’t confess to offering false data. The commonest motive for a failed confession is mannequin confusion, not intentional deception. Confusion usually happens when the directions are ambiguous and the mannequin can’t clearly decide human consumer intent.

See also  Little Umbrella makes the funding rain after success of Death by AI social game

What it means for enterprise AI

OpenAI’s confessions approach is a part of a rising physique of labor on AI security and management. Anthropic, an OpenAI competitor, has additionally launched analysis that exhibits how LLMs can be taught malicious habits. The corporate can also be working towards plugging these holes as they emerge.

For AI functions, mechanisms equivalent to confessions can present a sensible monitoring mechanism. The structured output from a confession can be utilized at inference time to flag or reject a mannequin’s response earlier than it causes an issue. For instance, a system may very well be designed to routinely escalate any output for human evaluation if its confession signifies a coverage violation or excessive uncertainty.

In a world the place AI is more and more agentic and able to complicated duties, observability and management might be key parts for protected and dependable deployment.

“As fashions grow to be extra succesful and are deployed in higher-stakes settings, we’d like higher instruments for understanding what they’re doing and why,” the OpenAI researchers write. “Confessions should not an entire answer, however they add a significant layer to our transparency and oversight stack.”

Source link

TAGGED: 039truth, confess, method, Mistakes, models, OpenAIs, serum039, training
Share This Article
Twitter Email Copy Link Print
Previous Article Council backs AI data centre plan despite BP’s hydrogen bid BP pulls out of Teesside hydrogen plant
Next Article Veeam and HPE introduce updates to streamline hybrid cloud recovery Veeam and HPE updates aim to streamline hybrid cloud recovery
Leave a comment

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Your Trusted Source for Accurate and Timely Updates!

Our commitment to accuracy, impartiality, and delivering breaking news as it happens has earned us the trust of a vast audience. Stay ahead with real-time updates on the latest events, trends.
FacebookLike
TwitterFollow
InstagramFollow
YoutubeSubscribe
LinkedInFollow
MediumFollow
- Advertisement -
Ad image

Popular Posts

BloFin Sponsors TOKEN2049 Dubai and Celebrates the SideEvent: WhalesNight AfterParty 2024

Belief Firm Complicated, Ajeltake Street, Ajeltake Island, Majuro, Republic of the Marshall Islands MH 96960,…

April 26, 2024

Lenovo launches GPU Advanced Services to assist enterprises in AI adoption

With the rising demand for GPUs surpassing enterprise deployment capabilities, Lenovo has launched its GPU…

October 7, 2025

HERVolution Therapeutics Raises $11.7M in Series A Funding

HERVolution Therapeutics, a Copenhagen, Denmark-based darkish genome-focused biotechnology firm, raised $11.7M in Sequence A funding.…

December 20, 2024

Firsthand Raises $26M in Series A Funding

Firsthand, a NYC-based AI-powered model agent platform permitting manufacturers to draw and have interaction with…

March 8, 2025

HelloPrenup Receives Strategic Investment from The LegalTech Fund

HelloPrenup, a Boston, MA-based digital platform supporting prenuptial agreements, obtained an funding from The LegalTech Fund (TLTF). The…

July 12, 2025

You Might Also Like

NVIDIA Agent Toolkit Gives Enterprises a Framework to Deploy AI Agents at Scale
AI

NVIDIA Agent Toolkit Gives Enterprises a Framework to Deploy AI Agents at Scale

By saad
Visa prepares payment systems for AI agent-initiated transactions
AI

Visa prepares payment systems for AI agent-initiated transactions

By saad
Prague, Czechia - 7 23 2024: Smartphone on surface showing OpenAI logo. OpenAI is a non-profit organization for artificial intelligence research.
Global Market

OpenAI’s $50B AWS deal puts its Microsoft alliance to the test

By saad
For effective AI, insurance needs to get its data house in order
AI

For effective AI, insurance needs to get its data house in order

By saad
Data Center News
Facebook Twitter Youtube Instagram Linkedin

About US

Data Center News: Stay informed on the pulse of data centers. Latest updates, tech trends, and industry insights—all in one place. Elevate your data infrastructure knowledge.

Top Categories
  • Global Market
  • Infrastructure
  • Innovations
  • Investments
Usefull Links
  • Home
  • Contact
  • Privacy Policy
  • Terms & Conditions

© 2024 – datacenternews.tech – All rights reserved

Welcome Back!

Sign in to your account

Lost your password?
We use cookies to ensure that we give you the best experience on our website. If you continue to use this site we will assume that you are happy with it.
You can revoke your consent any time using the Revoke consent button.