
When AI reasoning goes wrong: Microsoft Research shows more tokens can mean more problems

Last updated: April 16, 2025 4:04 am
Published April 16, 2025


Large language models (LLMs) are increasingly capable of complex reasoning through "inference-time scaling," a set of techniques that allocate more computational resources during inference to generate answers. However, a new study from Microsoft Research reveals that the effectiveness of these scaling methods isn't universal. Performance boosts vary significantly across different models, tasks, and problem complexities.

The core finding is that simply throwing more compute at a problem during inference doesn't guarantee better or more efficient results. The findings can help enterprises better understand cost volatility and model reliability as they look to integrate advanced AI reasoning into their applications.

Putting scaling methods to the test

The Microsoft Research team conducted an extensive empirical analysis across nine state-of-the-art foundation models. These included "conventional" models such as GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro, and Llama 3.1 405B, as well as models specifically fine-tuned for enhanced reasoning through inference-time scaling: OpenAI's o1 and o3-mini, Anthropic's Claude 3.7 Sonnet, Google's Gemini 2 Flash Thinking, and DeepSeek R1.

They evaluated these models using three distinct inference-time scaling approaches:

  1. Standard Chain-of-Thought (CoT): The basic method, in which the model is prompted to answer step by step.
  2. Parallel scaling: The model generates multiple independent answers to the same question and uses an aggregator (such as majority vote or picking the best-scoring answer) to arrive at a final result.
  3. Sequential scaling: The model iteratively generates an answer and uses feedback from a critic (possibly the model itself) to refine the answer in subsequent attempts.
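As a rough sketch of how the parallel approach might be wired up, the snippet below samples several independent answers and aggregates them by majority vote. The `generate` callable is a hypothetical stand-in for any LLM call; it is not part of the study's code.

```python
from collections import Counter
from typing import Callable

def parallel_scale(generate: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Parallel scaling: sample n independent answers, return the majority vote."""
    answers = [generate(prompt) for _ in range(n)]
    # Majority-vote aggregator; ties resolve to the first answer seen.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Toy run: a generator that answers correctly 2 times out of 3.
replies = iter(["42", "41", "42"])
print(parallel_scale(lambda p: next(replies), "What is 6 * 7?", n=3))  # 42
```

Swapping the `Counter` vote for a scoring function over `answers` gives the best-of-N variant mentioned above.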

These approaches were tested on eight challenging benchmark datasets covering a wide range of tasks that benefit from step-by-step problem-solving: math and STEM reasoning (AIME, Omni-MATH, GPQA), calendar planning (BA-Calendar), NP-hard problems (3SAT, TSP), navigation (Maze), and spatial reasoning (SpatialMap).


Several benchmarks included problems of varying difficulty levels, allowing for a more nuanced understanding of how scaling behaves as problems become harder.

"The availability of difficulty tags for Omni-MATH, TSP, 3SAT, and BA-Calendar enables us to analyze how accuracy and token usage scale with difficulty in inference-time scaling, which is a perspective that is still underexplored," the researchers wrote in the paper detailing their findings.

The researchers evaluated the Pareto frontier of LLM reasoning by analyzing both accuracy and computational cost (i.e., the number of tokens generated). This helps identify how efficiently models achieve their results.

Inference-time scaling Pareto frontier. Credit: arXiv

They also introduced the "conventional-to-reasoning gap" measure, which compares the best performance of a conventional model (using an ideal "best-of-N" selection) against the average performance of a reasoning model, estimating the potential gains achievable through better training or verification techniques.
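Under one reading of that measure, the gap can be computed from per-question sample scores roughly as below. The function name, inputs, and aggregation details are assumptions for illustration, not the paper's code.

```python
def conventional_to_reasoning_gap(conventional_runs, reasoning_runs):
    """Ideal best-of-N accuracy of a conventional model minus the average
    accuracy of a reasoning model.

    conventional_runs: per question, a list of N correctness scores (0 or 1)
    reasoning_runs: per question, a list of correctness scores for the reasoning model
    """
    # Ideal best-of-N: a perfect verifier always picks the best of the N samples.
    best_of_n = sum(max(samples) for samples in conventional_runs) / len(conventional_runs)
    # Average performance of the reasoning model across its samples.
    avg_reasoning = sum(sum(s) / len(s) for s in reasoning_runs) / len(reasoning_runs)
    return best_of_n - avg_reasoning

# Toy example: 3 questions, 3 samples each.
conv = [[0, 1, 0], [0, 0, 0], [1, 1, 0]]
reas = [[1, 1, 1], [0, 1, 0], [1, 1, 1]]
print(conventional_to_reasoning_gap(conv, reas))
```

A negative gap means the reasoning model still beats the conventional model even under the conventional model's most generous (perfect-verifier) evaluation.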

More compute isn't always the answer

The study provided several crucial insights that challenge common assumptions about inference-time scaling:

Benefits vary significantly: While models tuned for reasoning generally outperform conventional ones on these tasks, the degree of improvement varies greatly depending on the specific domain and task. Gains often diminish as problem complexity increases. For instance, performance improvements seen on math problems did not always translate equally to scientific reasoning or planning tasks.

Token inefficiency is rife: The researchers observed high variability in token consumption, even between models achieving similar accuracy. For example, on the AIME 2025 math benchmark, DeepSeek-R1 used over five times more tokens than Claude 3.7 Sonnet for roughly comparable average accuracy.

More tokens don't lead to higher accuracy: Contrary to the intuitive idea that longer reasoning chains mean better reasoning, the study found this isn't always true. "Surprisingly, we also observe that longer generations relative to the same model can sometimes be an indicator of models struggling, rather than improved reflection," the paper states. "Similarly, when comparing different reasoning models, higher token usage is not always associated with better accuracy. These findings motivate the need for more purposeful and cost-effective scaling approaches."


Cost nondeterminism: Perhaps most concerning for enterprise users, repeated queries to the same model for the same problem can result in highly variable token usage. This means the cost of running a query can fluctuate significantly, even when the model consistently provides the correct answer.
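To see why this matters for budgeting, consider profiling token usage across repeated runs of the same query. The token counts and price below are illustrative figures, not numbers from the study.

```python
import statistics

# Hypothetical token counts from 8 repeated runs of the same prompt against
# the same model -- every run returned the correct answer.
token_counts = [3_200, 11_500, 4_100, 9_800, 3_900, 15_200, 5_000, 7_300]

price_per_1k_tokens = 0.01  # illustrative output price in dollars

costs = [t / 1000 * price_per_1k_tokens for t in token_counts]
print(f"cost per query: ${min(costs):.3f} to ${max(costs):.3f}")
print(f"token stdev: {statistics.stdev(token_counts):.0f}")
```

Even with identical inputs and identical (correct) outputs, the most expensive run here costs nearly five times the cheapest one, which is exactly the kind of spread that makes per-query budgeting difficult.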

Variance in response length (spikes indicate smaller variance). Credit: arXiv

The potential of verification mechanisms: Scaling performance consistently improved across all models and benchmarks when simulated with a "perfect verifier" (using the best-of-N results).

Conventional models sometimes match reasoning models: By significantly increasing inference calls (up to 50x more in some experiments), conventional models like GPT-4o could sometimes approach the performance levels of dedicated reasoning models, particularly on less complex tasks. However, these gains diminished rapidly in highly complex settings, indicating that brute-force scaling has its limits.

On some tasks, the accuracy of GPT-4o continues to improve with parallel and sequential scaling. Credit: arXiv

Implications for the enterprise

These findings carry significant weight for developers and enterprise adopters of LLMs. The issue of "cost nondeterminism" is particularly stark and makes budgeting difficult. As the researchers point out, "Ideally, developers and users would prefer models for which the standard deviation on token usage per instance is low for cost predictability."

"The profiling we do in [the study] could be useful for developers as a tool to pick which models are less volatile for the same prompt or for different prompts," Besmira Nushi, senior principal research manager at Microsoft Research, told VentureBeat. "Ideally, one would want to pick a model that has low standard deviation for correct inputs."

Models whose (blue) peak is to the left consistently generate the same number of tokens on the given task. Credit: arXiv

The study also provides good insights into the correlation between a model's accuracy and response length. For example, the following diagram shows that math queries above ~11,000 tokens in length have a very slim chance of being correct, and those generations should either be stopped at that point or restarted with some sequential feedback. However, Nushi points out that models allowing these post hoc mitigations also have a cleaner separation between correct and incorrect samples.
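A length-based mitigation of the kind described could be sketched as a simple streaming cutoff. The threshold value and the `stream_tokens` iterable are assumptions for illustration; a real deployment would tune the budget per task and model.

```python
MAX_REASONING_TOKENS = 11_000  # heuristic cutoff suggested by the accuracy/length diagram

def generate_with_cutoff(stream_tokens, max_tokens=MAX_REASONING_TOKENS):
    """Collect streamed tokens, aborting once the budget is exhausted.

    stream_tokens: any iterable yielding tokens.
    Returns (text, truncated); a truncated generation would then be stopped
    for good or restarted with sequential feedback.
    """
    out = []
    for i, tok in enumerate(stream_tokens):
        if i >= max_tokens:
            return "".join(out), True  # over budget: stop and flag for retry
        out.append(tok)
    return "".join(out), False

# Toy run: a 5-token budget against a 7-token stream.
text, truncated = generate_with_cutoff(iter("abcdefg"), max_tokens=5)
print(text, truncated)  # abcde True
```

Because the check runs per token as the stream arrives, the caller never pays for the low-probability tail beyond the budget.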

"Ultimately, it is also the responsibility of model builders to think about reducing accuracy and cost non-determinism, and we expect a lot of this to happen as the methods get more mature," Nushi said. "Alongside cost nondeterminism, accuracy nondeterminism also applies."

Another important finding is the consistent performance boost from perfect verifiers, which highlights a critical area for future work: building robust and broadly applicable verification mechanisms.

"The availability of stronger verifiers can have different types of impact," Nushi said, such as improving foundational training methods for reasoning. "If used well, these can also shorten the reasoning traces."

Strong verifiers can also become a central part of enterprise agentic AI solutions. Many enterprise stakeholders already have such verifiers in place, which may need to be repurposed for more agentic solutions, such as SAT solvers, logistics validity checkers, etc.

"The questions for the future are how such existing techniques can be combined with AI-driven interfaces and what is the language that connects the two," Nushi said. "The necessity of connecting the two comes from the fact that users will not always formulate their queries in a formal way; they will want to use a natural language interface and expect the solutions in a similar format or in a final action (e.g. propose a meeting invite)."

