AI agent benchmarks are misleading, study warns

Published July 7, 2024


AI agents have become a promising new research direction with potential real-world applications. These agents use foundation models such as large language models (LLMs) and vision language models (VLMs) to take natural language instructions and pursue complex goals autonomously or semi-autonomously. AI agents can use various tools such as browsers, search engines and code compilers to verify their actions and reason about their goals.

However, a recent study by researchers at Princeton University has revealed several shortcomings in current agent benchmarks and evaluation practices that hinder their usefulness in real-world applications.

Their findings highlight that agent benchmarking comes with distinct challenges, and we can't evaluate agents in the same way we benchmark foundation models.

Cost vs accuracy trade-off

One major issue the researchers highlight in their study is the lack of cost control in agent evaluations. AI agents can be much more expensive to run than a single model call, as they often rely on stochastic language models that can produce different results when given the same query multiple times.


To increase accuracy, some agentic systems generate several responses and use mechanisms like voting or external verification tools to choose the best answer. Sometimes sampling hundreds or thousands of responses can increase the agent's accuracy. While this approach can improve performance, it comes at a significant computational cost. Inference costs are not always a problem in research settings, where the goal is to maximize accuracy.
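As a rough illustration of the sampling-plus-voting pattern described above, the sketch below (in Python, with a hypothetical call_model function standing in for any LLM API call) queries a stochastic model several times and returns the most common answer along with the inference cost incurred; the cost figure is a placeholder, not a number from the study.

from collections import Counter

def majority_vote(call_model, prompt, n_samples=5, cost_per_call=0.01):
    # Sample the stochastic model several times; each call adds to the bill.
    answers = [call_model(prompt) for _ in range(n_samples)]
    best_answer, votes = Counter(answers).most_common(1)[0]
    total_cost = n_samples * cost_per_call  # accuracy gains are paid for in inference cost
    return best_answer, votes, total_cost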

However, in practical applications, there is a limit to the budget available for each query, which makes it crucial for agent evaluations to be cost-controlled. Failing to do so may encourage researchers to develop extremely costly agents simply to top the leaderboard. The Princeton researchers propose visualizing evaluation results as a Pareto curve of accuracy and inference cost, and using techniques that jointly optimize the agent for these two metrics.
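A minimal sketch of that Pareto view, assuming evaluation results are available as (accuracy, cost-per-query) pairs for several agent designs; the agent names and numbers below are illustrative, not results from the paper. A design stays on the frontier only if no alternative is at least as accurate and no more expensive.

def pareto_frontier(results):
    # results: dict mapping agent name -> (accuracy, cost per query).
    frontier = {}
    for name, (acc, cost) in results.items():
        dominated = any(
            other_acc >= acc and other_cost <= cost and (other_acc, other_cost) != (acc, cost)
            for other_acc, other_cost in results.values()
        )
        if not dominated:
            frontier[name] = (acc, cost)
    return frontier

agents = {"single_call": (0.58, 0.02), "retry_5x": (0.61, 0.10), "debate": (0.60, 0.45)}
print(pareto_frontier(agents))  # "debate" drops out: costlier than retry_5x and no more accurate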

The researchers evaluated the accuracy-cost trade-offs of different prompting techniques and agentic patterns introduced in different papers.

“For substantially similar accuracy, the cost can differ by almost two orders of magnitude,” the researchers write. “Yet, the cost of running these agents isn't a top-line metric reported in any of these papers.”

The researchers argue that optimizing for both metrics can lead to “agents that cost less while maintaining accuracy.” Joint optimization can also enable researchers and developers to trade off the fixed and variable costs of running an agent. For example, they can spend more on optimizing the agent's design but reduce the variable cost by using fewer in-context learning examples in the agent's prompt.

The researchers tested joint optimization on HotpotQA, a popular question-answering benchmark. Their results show that a joint optimization formulation provides a way to strike an optimal balance between accuracy and inference costs.
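One simple way to make "jointly optimize for both metrics" concrete is to penalize accuracy by a cost term and pick the configuration that scores best; this scalarized objective is an illustrative assumption rather than the paper's exact formulation, and the configuration names and numbers below are made up.

def best_config(results, cost_weight=0.5):
    # results: dict mapping configuration name -> (accuracy, cost per query in dollars).
    return max(results, key=lambda name: results[name][0] - cost_weight * results[name][1])

configs = {"8_shot_prompt": (0.63, 0.08), "2_shot_prompt": (0.61, 0.03)}
print(best_config(configs))  # the cheaper 2-shot prompt wins once cost is penalized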

“Useful agent evaluations must control for cost—even if we ultimately don't care about cost and only about identifying innovative agent designs,” the researchers write. “Accuracy alone cannot identify progress because it can be improved by scientifically meaningless methods such as retrying.”

Model development vs downstream applications

Another issue the researchers highlight is the difference between evaluating models for research purposes and developing downstream applications. In research, accuracy is often the primary focus, with inference costs being largely ignored. However, when developing real-world applications on top of AI agents, inference costs play a crucial role in deciding which model and technique to use.

Evaluating inference costs for AI agents is challenging. For example, different model providers can charge different amounts for the same model. Meanwhile, the costs of API calls change regularly and can vary based on developers' decisions. For example, on some platforms, bulk API calls are charged differently.

To address this issue, the researchers created a website that adjusts model comparisons based on token pricing.
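Conceptually, adjusting comparisons for token pricing just means recomputing each run's cost under whichever pricing schedule applies; the sketch below assumes per-million-token input and output prices, and the rates shown are placeholders rather than real provider prices.

def query_cost(prompt_tokens, completion_tokens, price_in_per_m, price_out_per_m):
    # Cost of one run under a given per-million-token pricing schedule.
    return prompt_tokens / 1e6 * price_in_per_m + completion_tokens / 1e6 * price_out_per_m

run = {"prompt_tokens": 12_000, "completion_tokens": 800}
print(query_cost(**run, price_in_per_m=5.0, price_out_per_m=15.0))  # same run, pricier model
print(query_cost(**run, price_in_per_m=0.5, price_out_per_m=1.5))   # same run, cheaper model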

They also conducted a case study on NovelQA, a benchmark for question-answering tasks on very long texts. They found that benchmarks meant for model evaluation can be misleading when used for downstream evaluation. For example, the original NovelQA study makes retrieval-augmented generation (RAG) look much worse relative to long-context models than it is in a real-world scenario. Their findings show that RAG and long-context models were roughly equally accurate, while long-context models are 20 times more expensive.

Overfitting is an issue

When learning new tasks, machine learning (ML) models often find shortcuts that allow them to score well on benchmarks. One prominent type of shortcut is “overfitting,” where the model finds ways to cheat on the benchmark tests and produces results that don't translate to the real world. The researchers found that overfitting is a serious problem for agent benchmarks, as they tend to be small, typically consisting of only a few hundred samples. This issue is more severe than data contamination in training foundation models, as knowledge of test samples can be directly programmed into the agent.

To address this problem, the researchers suggest that benchmark developers should create and maintain holdout test sets composed of examples that can't be memorized during training and can only be solved through a proper understanding of the target task. In their analysis of 17 benchmarks, the researchers found that many lacked proper holdout datasets, allowing agents to take shortcuts, even unintentionally.
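A minimal sketch of carving out a held-out split, assuming benchmark tasks are stored as a simple list; the key point from the researchers is that the held-out portion should remain unpublished so its contents cannot be memorized or hard-coded into agents.

import random

def split_holdout(tasks, holdout_fraction=0.3, seed=0):
    # Shuffle deterministically, then carve off a slice to keep secret.
    tasks = list(tasks)
    random.Random(seed).shuffle(tasks)
    cut = int(len(tasks) * holdout_fraction)
    return tasks[cut:], tasks[:cut]  # (public set, held-out set to keep secret)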

“Surprisingly, we find that many agent benchmarks do not include held-out test sets,” the researchers write. “In addition to creating a test set, benchmark developers should consider keeping it secret to prevent LLM contamination or agent overfitting.”

They also note that different types of holdout samples are needed depending on the desired level of generality of the task that the agent accomplishes.

“Benchmark developers must do their best to ensure that shortcuts are impossible,” the researchers write. “We view this as the responsibility of benchmark developers rather than agent developers, because designing benchmarks that don't allow shortcuts is much easier than checking every single agent to see if it takes shortcuts.”

The researchers examined WebArena, a benchmark that evaluates the performance of AI agents in solving problems on different websites. They found several shortcuts in the training datasets that allowed the agents to overfit to tasks in ways that would easily break with minor changes in the real world. For example, an agent could make assumptions about the structure of web addresses without considering that they might change in the future or might not apply on different websites.
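To illustrate the kind of shortcut described above, the hypothetical snippet below contrasts an agent action that hard-codes a site's URL structure with one that navigates from the links actually present on the page; the URL pattern and browser helper methods are invented for illustration.

def open_orders_brittle(browser):
    # Overfit shortcut: assumes every site exposes /orders at this exact address forever.
    return browser.goto("https://shop.example.com/orders")

def open_orders_robust(browser):
    # Navigate the way a user would: find and follow the link on the current page.
    browser.goto("https://shop.example.com")
    return browser.click(browser.find_link(text="Orders"))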

These errors inflate accuracy estimates and lead to over-optimism about agent capabilities, the researchers warn.

With AI agents being a new field, the research and developer communities still have much to learn about how to test the limits of these new systems, which may soon become an important part of everyday applications.

“AI agent benchmarking is new and best practices haven't yet been established, making it hard to distinguish genuine advances from hype,” the researchers write. “Our thesis is that agents are sufficiently different from models that benchmarking practices need to be rethought.”

