Your AI models are failing in production—Here’s how to fix model selection

Last updated: June 4, 2025 11:46 am
Published June 4, 2025

Enterprises need to know whether the models that power their applications and agents work in real-life scenarios. This type of evaluation can be complex because it is hard to predict specific scenarios. A revamped version of the RewardBench benchmark aims to give organizations a better idea of a model's real-life performance.

The Allen Institute for AI (Ai2) launched RewardBench 2, an updated version of its reward model benchmark, RewardBench, which it claims provides a more holistic view of model performance and assesses how models align with an enterprise's goals and standards.

Ai2 built RewardBench with classification tasks that measure correlations through inference-time compute and downstream training. RewardBench primarily deals with reward models (RMs), which can act as judges and evaluate LLM outputs. RMs assign a score or a “reward” that guides reinforcement learning from human feedback (RLHF).
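As a rough illustration of what an accuracy-based, classification-style check of a reward model could look like, the sketch below counts how often a scorer ranks the preferred response above every rejected one. The `toy_reward` function, the item layout, and the sample data are hypothetical stand-ins, not Ai2's code or dataset; a trained reward model would replace the scorer.

```python
from typing import Callable, List

# A reward function maps (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]

def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical placeholder scorer, NOT a real reward model.

    Rewards word overlap with the prompt and lightly penalizes length.
    """
    prompt_words = set(prompt.lower().split())
    response_words = response.lower().split()
    overlap = sum(1 for w in response_words if w in prompt_words)
    return overlap - 0.1 * len(response_words)

def preference_accuracy(items: List[dict], reward_fn: RewardFn) -> float:
    """Fraction of items where the chosen response outscores all rejected ones.

    Each item is assumed to look like:
    {"prompt": str, "chosen": str, "rejected": [str, ...]}
    """
    if not items:
        return 0.0
    correct = 0
    for item in items:
        chosen_score = reward_fn(item["prompt"], item["chosen"])
        rejected_scores = [reward_fn(item["prompt"], r) for r in item["rejected"]]
        # Count the item only if the chosen response strictly beats every rejected one.
        if all(chosen_score > s for s in rejected_scores):
            correct += 1
    return correct / len(items)

# Made-up sample data for illustration only.
sample_items = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": ["I like turtles.", "Berlin, probably."],
    },
]
accuracy = preference_accuracy(sample_items, toy_reward)
```

The strict comparison means a tie between the chosen and a rejected response counts as a miss, which is one possible design choice among several.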

RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling. pic.twitter.com/NGetvNrOQV

— Ai2 (@allen_ai) June 2, 2025

Nathan Lambert, a senior research scientist at Ai2, told VentureBeat that the first RewardBench worked as intended when it was launched. However, the model environment evolved rapidly, and so should its benchmarks.

“As reward models became more advanced and use cases more nuanced, we quickly recognized with the community that the first version didn’t fully capture the complexity of real-world human preferences,” he said.

Lambert added that with RewardBench 2, “we set out to improve both the breadth and depth of evaluation, incorporating more diverse, challenging prompts and refining the methodology to reflect better how humans actually judge AI outputs in practice.” He said the second version uses unseen human prompts, a more challenging scoring setup and new domains.

Using evaluations for models that evaluate

While reward models test how well models work, it’s also important that RMs align with company values; otherwise, the fine-tuning and reinforcement learning process can reinforce bad behavior, such as hallucinations, reduce generalization, and score harmful responses too high.

RewardBench 2 covers six different domains: factuality, precise instruction following, math, safety, focus and ties.

“Enterprises should use RewardBench 2 in two different ways depending on their application. If they’re performing RLHF themselves, they should adopt the best practices and datasets from leading models in their own pipelines because reward models need on-policy training recipes (i.e. reward models that mirror the model they’re trying to train with RL). For inference time scaling or data filtering, RewardBench 2 has shown that they can select the best model for their domain and see correlated performance,” Lambert said.
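The two inference-side uses Lambert mentions, inference-time scaling and data filtering, can be sketched in a few lines. This is a minimal sketch under stated assumptions: `best_of_n`, `filter_by_reward`, and `length_reward` are hypothetical names invented for illustration, and a trained reward model would take the place of the placeholder scorer.

```python
from typing import Callable, List, Tuple

# A reward function maps (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]

def best_of_n(prompt: str, candidates: List[str], reward_fn: RewardFn) -> str:
    """Inference-time scaling sketch: return the highest-scoring candidate.

    On ties, the earliest candidate in the list is returned.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: reward_fn(prompt, c))

def filter_by_reward(
    pairs: List[Tuple[str, str]], reward_fn: RewardFn, threshold: float
) -> List[Tuple[str, str]]:
    """Data-filtering sketch: keep (prompt, response) pairs scoring at or above threshold."""
    return [(p, r) for p, r in pairs if reward_fn(p, r) >= threshold]

def length_reward(prompt: str, response: str) -> float:
    """Hypothetical placeholder scorer: longer responses score higher."""
    return float(len(response))

# Made-up usage for illustration only.
picked = best_of_n("Explain RLHF.", ["Short.", "A somewhat longer answer."], length_reward)
kept = filter_by_reward([("q1", "ok"), ("q2", "a detailed reply")], length_reward, 5.0)
```

In both cases the quality of the result depends entirely on how well the reward model's scores track what the application actually needs, which is the property the benchmark is meant to measure.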

Lambert noted that benchmarks like RewardBench offer users a way to evaluate the models they’re choosing based on the “dimensions that matter most to them, rather than relying on a narrow one-size-fits-all score.” He said the idea of performance, which many evaluation methods claim to assess, is very subjective because a good response from a model depends heavily on the context and goals of the user. At the same time, human preferences can be very nuanced.
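One way to act on per-domain results rather than a single aggregate is to weight each domain by how much it matters to the application and compare candidates on that weighted score. The sketch below assumes per-domain scores between 0 and 1; the model names, scores, and weights are invented placeholders, not RewardBench 2 results.

```python
from typing import Dict

def weighted_score(domain_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-domain scores, using only the domains named in weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(domain_scores[d] * w for d, w in weights.items()) / total_weight

def pick_model(candidates: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> str:
    """Return the name of the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))

# Invented placeholder scores for two hypothetical reward models.
candidates = {
    "model_a": {"factuality": 0.80, "math": 0.90, "safety": 0.60},
    "model_b": {"factuality": 0.75, "math": 0.60, "safety": 0.95},
}

# A hypothetical safety-sensitive application weights safety most heavily.
safety_first = {"factuality": 1.0, "math": 0.5, "safety": 3.0}
choice = pick_model(candidates, safety_first)
```

Changing the weights changes the winner, which is the point: the same set of domain scores can justify different model choices for different applications.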

Ai2 released the first version of RewardBench in March 2024. At the time, the company said it was the first benchmark and leaderboard for reward models. Since then, several methods for benchmarking and improving RMs have emerged. Researchers at Meta’s FAIR came out with reWordBench. DeepSeek released a new technique called Self-Principled Critique Tuning for smarter and scalable RMs.

Super excited that our second reward model evaluation is out. It's substantially harder, much cleaner, and well correlated with downstream PPO/BoN sampling.

Happy hillclimbing!

Huge congrats to @saumyamalik44 who lead the project with a total commitment to excellence. https://t.co/c0b6rHTXY5

— Nathan Lambert (@natolambert) June 2, 2025

How models performed

Since RewardBench 2 is an updated version of RewardBench, Ai2 tested both existing and newly trained models to see if they continue to rank high. These included a variety of models, such as versions of Gemini, Claude, GPT-4.1, and Llama-3.1, along with datasets and models like Qwen, Skywork, and its own Tulu.

The company found that larger reward models perform best on the benchmark because their base models are stronger. Overall, the strongest-performing models are variants of Llama-3.1 Instruct. In terms of focus and safety, Skywork data “is particularly helpful,” and Tulu did well on factuality.

Ai2 said that while it believes RewardBench 2 “is a step forward in broad, multi-domain accuracy-based evaluation” for reward models, it cautioned that model evaluation should be used mainly as a guide to pick the models that work best for an enterprise’s needs.
