Beyond benchmarks: How DeepSeek-R1 and o1 perform on real-world tasks

Last updated: January 31, 2025 11:10 pm
Published January 31, 2025


DeepSeek-R1 has created a lot of excitement and concern, especially for OpenAI’s rival model o1. So, we put them to the test in a side-by-side comparison on a few simple data analysis and market research tasks.

To put the models on equal footing, we used Perplexity Pro Search, which now supports both o1 and R1. Our goal was to look beyond benchmarks and see whether the models can actually perform ad hoc tasks that require gathering information from the web, picking out the right pieces of data and performing simple tasks that would otherwise require substantial manual effort.

Both models are impressive but make errors when prompts lack specificity. o1 is slightly better at reasoning tasks, but R1’s transparency gives it an edge in cases (and there will be quite a few) where it makes mistakes.

Here is a breakdown of a few of our experiments, with links to the Perplexity pages where you can review the results yourself.

Calculating returns on investments from the web

Our first test gauged whether the models could calculate return on investment (ROI). We considered a scenario where a user has invested $140 in the Magnificent Seven (Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia, Tesla) on the first day of every month from January to December 2024. We asked the model to calculate the value of the portfolio at the current date.

To accomplish this task, the model would need to pull Mag 7 price information for the first day of each month, split the monthly investment evenly across the stocks ($20 per stock), sum the purchases and calculate the portfolio value according to the price of the stocks at the current date.
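
For reference, the bookkeeping involved is simple enough to script. Below is a minimal Python sketch of the calculation the models were asked to perform, with invented placeholder prices standing in for the real Mag 7 data:

```python
# Minimal sketch of the dollar-cost-averaging calculation in the prompt.
# All prices here are invented placeholders, not real Mag 7 quotes.
MONTHLY_BUDGET = 140.0                      # invested on the 1st of each month
TICKERS = ["GOOGL", "AMZN", "AAPL", "META", "MSFT", "NVDA", "TSLA"]
PER_STOCK = MONTHLY_BUDGET / len(TICKERS)   # $20 per stock per month

# price on the first trading day of each month, Jan-Dec 2024 (placeholders)
monthly_prices = {t: [100.0 + i for i in range(12)] for t in TICKERS}
# most recent quote per ticker (placeholders)
latest_price = {t: 120.0 for t in TICKERS}

def portfolio_value() -> float:
    total = 0.0
    for ticker, prices in monthly_prices.items():
        # each monthly $20 purchase buys PER_STOCK / price shares
        shares = sum(PER_STOCK / p for p in prices)
        total += shares * latest_price[ticker]
    return total

print(f"Portfolio value today: ${portfolio_value():,.2f}")
```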


On this task, both models failed. o1 returned a list of stock prices for January 2024 and January 2025, together with a formula to calculate the portfolio value, but it failed to compute the correct values and essentially concluded there would be no ROI. R1, for its part, made the mistake of only investing in January 2024 and calculating the returns as of January 2025.

o1’s reasoning trace doesn’t provide enough information

What was interesting, however, was the models’ reasoning process. While o1 didn’t provide much detail on how it reached its results, R1’s reasoning trace showed that it didn’t have the correct information because Perplexity’s retrieval engine had failed to obtain the monthly stock price data (many retrieval-augmented generation applications fail not because the model lacks ability but because of bad retrieval). This proved to be an important bit of feedback that led us to the next experiment.

The R1 reasoning trace shows that it is missing information

Reasoning over file content

We decided to run the same experiment as before, but instead of prompting the model to retrieve the information from the web, we provided it in a text file. We copy-pasted the monthly data for each stock from Yahoo! Finance into a text file and gave it to the model. The file contained the name of each stock plus the HTML table holding the price for the first day of each month from January to December 2024, along with the last recorded price. The data was not cleaned, both to reduce the manual effort and to test whether the models could pick the right parts out of messy data.
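
For context, once the file format is known, the extraction the models were asked to do can be scripted in a few lines. Here is a rough sketch, assuming each pasted table keeps Yahoo! Finance’s Date/Open/High/Low/Close/Adj Close/Volume layout:

```python
# Sketch of pulling closing prices out of the pasted file, assuming
# Yahoo! Finance-style HTML history tables.
from io import StringIO
import pandas as pd

def extract_closes(html_text: str) -> list[pd.DataFrame]:
    frames = []
    # read_html returns one DataFrame per <table> found in the text
    for table in pd.read_html(StringIO(html_text)):
        df = table[["Date", "Close"]].copy()
        # Annotation rows (dividends, stock splits) aren't numbers;
        # coercing turns them into NaN so they can be dropped.
        df["Close"] = pd.to_numeric(
            df["Close"].astype(str).str.replace(",", ""), errors="coerce"
        )
        frames.append(df.dropna(subset=["Close"]))
    return frames
```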


Again, both models failed to provide the right answer. o1 seemed to have extracted the information from the file, but suggested the calculation be done manually in a tool like Excel. Its reasoning trace was very vague and contained nothing useful for troubleshooting the model. R1 also failed to produce an answer, but its reasoning trace contained a lot of useful information.

For example, it was clear that the model had correctly parsed the HTML data for each stock and extracted the right information. It had also been able to do the month-by-month calculation of investments, sum them and compute the final value according to the latest stock price in the table. However, that final value remained in its reasoning chain and never made it into the final answer. The model had also been confounded by a row in the Nvidia chart marking the company’s 10:1 stock split on June 10, 2024, and ended up miscalculating the final value of the portfolio.
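
That failure mode is easy to reproduce outside an LLM. The sketch below, with made-up values, shows how a split annotation sneaks into a naive parse of the table and the kind of guard that catches it:

```python
# Yahoo! Finance history tables interleave annotation rows (splits,
# dividends) with price rows. The values below are made up for illustration.
rows = [
    ("Jun 3, 2024",  "1,150.00"),          # ordinary close
    ("Jun 10, 2024", "10:1 Stock Split"),  # annotation row, not a price
    ("Jul 1, 2024",  "124.00"),            # post-split close
]

def as_price(value: str):
    """Return the cell as a float, or None if it isn't a price."""
    try:
        return float(value.replace(",", ""))
    except ValueError:
        return None

prices = [(d, p) for d, v in rows if (p := as_price(v)) is not None]
print(prices)  # the split annotation is skipped instead of being misread
```

Note that skipping the row only fixes the parsing; valuing pre-split purchases correctly still requires split-adjusted prices or adjusted share counts, exactly the kind of detail both models (and naive scripts) can trip over.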

R1 hid the results in its reasoning trace, along with information about where it went wrong

Again, the real differentiator was not the result itself, but the ability to investigate how the model arrived at its response. Here, R1 provided the better experience, allowing us to understand the model’s limitations and to see how we could reformulate our prompt and format our data to get better results in the future.

Comparing data over the web

Another experiment required the models to compare the stats of four leading NBA centers and determine which one had the best improvement in field goal percentage (FG%) from the 2022/23 to the 2023/24 season. This task required multi-step reasoning over different data points. The catch in the prompt was that it included Victor Wembanyama, who only entered the league as a rookie in 2023.


Retrieval for this prompt was much easier, since player stats are widely reported on the web and usually included in their Wikipedia and NBA profiles. Both models answered correctly (it’s Giannis, in case you were curious), although depending on the sources they used, their figures differed slightly. However, neither realized that Wemby didn’t qualify for the comparison, and both gathered other stats from his time in the European league.

In its answer, R1 provided a better breakdown of the results, with a comparison table as well as links to the sources it used. That added context enabled us to correct the prompt: after we modified it to specify that we were looking for FG% from NBA seasons, the model correctly ruled Wemby out of the results.
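
Once the right stats are in hand, the comparison itself is trivial; the hard part is the eligibility rule that our corrected prompt made explicit. A minimal sketch, with illustrative FG% values rather than verified stats:

```python
# Illustrative FG% values; None means the player had no NBA season that year.
fg_pct = {
    "Giannis Antetokounmpo": (0.553, 0.611),  # (2022/23, 2023/24)
    "Nikola Jokic":          (0.632, 0.583),
    "Joel Embiid":           (0.548, 0.529),
    "Victor Wembanyama":     (None,  0.465),  # NBA rookie in 2023/24
}

improvements = {
    player: new - old
    for player, (old, new) in fg_pct.items()
    if old is not None  # the check our corrected prompt made explicit
}
best = max(improvements, key=improvements.get)
print(best, f"{improvements[best]:+.3f}")  # Giannis Antetokounmpo +0.058
```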

Adding a simple phrase to the prompt made all the difference in the outcome. That is something a human would implicitly know. Be as specific as you can in your prompt, and try to include information that a human would otherwise implicitly assume.

Final verdict

Reasoning models are powerful tools, but they still have a way to go before they can be fully trusted with tasks, especially as other components of large language model (LLM) applications continue to evolve. From our experiments, both o1 and R1 can still make basic mistakes. Despite showing impressive results, they still need a bit of handholding to give accurate answers.

Ideally, a reasoning model should be able to tell the user when it lacks the information for a task. Failing that, its reasoning trace should guide users to better understand errors and correct their prompts to increase the accuracy and stability of the model’s responses. In this regard, R1 had the upper hand. Hopefully, future reasoning models, including OpenAI’s upcoming o3 series, will provide users with more visibility and control.

