AI can fix bugs—but can’t find them: OpenAI’s study highlights limits of LLMs in software engineering

Last updated: February 19, 2025 1:33 pm
Published February 19, 2025

Large language models (LLMs) may have changed software development, but enterprises will need to think twice about fully replacing human software engineers with LLMs, despite OpenAI CEO Sam Altman’s claim that models can replace “low-level” engineers.

In a new paper, OpenAI researchers detail how they developed an LLM benchmark called SWE-Lancer to test how much foundation models can earn from real-life freelance software engineering tasks. The test found that, while the models can solve bugs, they can’t see why the bug exists and continue to make more mistakes.

The researchers tasked three LLMs (OpenAI’s GPT-4o and o1, and Anthropic’s Claude 3.5 Sonnet) with 1,488 freelance software engineering tasks from the freelance platform Upwork, amounting to $1 million in payouts. They divided the tasks into two categories: individual contributor tasks (resolving bugs or implementing features) and management tasks (where the model roleplays as a manager who will choose the best proposal to resolve issues).
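
To make the split concrete, a task record in a SWE-Lancer-style dataset might look something like the sketch below; the field names and the example task are hypothetical, not the paper’s actual schema.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical sketch of a benchmark task record; field names are
# illustrative and not taken from the SWE-Lancer paper itself.
@dataclass
class FreelanceTask:
    task_id: str
    kind: Literal["individual_contributor", "management"]
    title: str
    description: str
    payout_usd: float  # dollar value of the original Upwork posting

ic_task = FreelanceTask(
    task_id="expensify-1234",  # illustrative ID
    kind="individual_contributor",
    title="Fix crash when saving an expense",
    description="App crashes on save when the amount field is empty.",
    payout_usd=250.0,
)
```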

“Results indicate that the real-world freelance work in our benchmark remains challenging for frontier language models,” the researchers write.

The test shows that foundation models cannot fully replace human engineers. While they may help solve bugs, they’re not quite at the level where they can start earning freelancing money by themselves.

Benchmarking freelancing models

The researchers and 100 other professional software engineers identified potential tasks on Upwork and, without altering any wording, fed them into a Docker container to create the SWE-Lancer dataset. The container has no internet access and cannot reach GitHub “to avoid the possibility of models scraping code diffs or pull request details,” they explained.
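
The article doesn’t publish the evaluation harness, but the isolation it describes can be approximated with Docker’s built-in network controls. In the sketch below, the image name and entrypoint are hypothetical; only the `--network none` flag carries the point.

```python
import subprocess

# Sketch of launching an evaluation container with networking disabled,
# in the spirit of the paper's setup. The image name and entrypoint are
# hypothetical stand-ins.
subprocess.run(
    [
        "docker", "run", "--rm",
        "--network", "none",       # no internet, so no GitHub scraping
        "swe-lancer-eval:latest",  # hypothetical image
        "python", "run_task.py",   # hypothetical entrypoint
        "--task-id", "expensify-1234",
    ],
    check=True,
)
```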

The team identified 764 individual contributor tasks, totaling about $414,775, ranging from 15-minute bug fixes to weeklong feature requests. The remaining management tasks, which included reviewing freelancer proposals and job postings, would pay out $585,225, bringing the benchmark’s total value to $1 million.

The tasks were drawn from the expensing platform Expensify.

The researchers generated prompts based on the task title and description, plus a snapshot of the codebase. If there were additional proposals to resolve the issue, “we also generated a management task using the issue description and list of proposals,” they explained.
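
A simplified sketch of that prompt assembly is below; the template wording is invented for illustration and is not the paper’s actual prompt format.

```python
# Hypothetical prompt assembly for the two task types; the template
# wording is invented, not the paper's actual prompt.
def build_ic_prompt(title: str, description: str, codebase_snapshot: str) -> str:
    return (
        f"Task: {title}\n"
        f"Description: {description}\n\n"
        "Repository snapshot (read-only):\n"
        f"{codebase_snapshot}\n\n"
        "Produce a patch that resolves the issue."
    )

def build_management_prompt(description: str, proposals: list[str]) -> str:
    # Management tasks ask the model to pick the best freelancer proposal
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(proposals))
    return (
        f"Issue: {description}\n\n"
        f"Candidate proposals:\n{numbered}\n\n"
        "Select the proposal most likely to resolve the issue and explain why."
    )
```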

From here, the researchers moved to end-to-end test development. They wrote Playwright tests for each task that apply the generated patches, which were then “triple-verified” by professional software engineers.

“Tests simulate real-world user flows, such as logging into the application, performing complex actions (making financial transactions) and verifying that the model’s solution works as expected,” the paper explains.
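
In Playwright’s Python API, a test in that spirit might look like the following; the URL, selectors, and credentials are hypothetical stand-ins rather than the benchmark’s real test code.

```python
from playwright.sync_api import sync_playwright

# Illustrative end-to-end flow in the style the paper describes: log in,
# perform an action, and verify the patched behavior. All selectors and
# the URL are invented for this sketch.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:8080/login")  # hypothetical app under test
    page.fill("#email", "user@example.com")
    page.fill("#password", "hunter2")
    page.click("button[type=submit]")
    page.click("text=New expense")  # perform a complex action
    page.fill("#amount", "42.00")
    page.click("text=Save")
    # Verify the model's patch produced the expected result
    assert page.locator(".expense-row").count() == 1
    browser.close()
```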

Test results

After running the test, the researchers found that none of the models earned the full $1 million value of the tasks. Claude 3.5 Sonnet, the best-performing model, earned only $208,050 and resolved 26.2% of the individual contributor issues. However, the researchers point out, “the majority of its solutions are incorrect, and higher reliability is needed for trustworthy deployment.”

The models performed well across most individual contributor tasks, with Claude 3.5 Sonnet performing best, followed by o1 and GPT-4o.

“Agents excel at localizing, but fail to root cause, resulting in partial or flawed solutions,” the report explains. “Agents pinpoint the source of an issue remarkably quickly, using keyword searches across the whole repository to quickly locate the relevant file and functions, often far faster than a human would. However, they often exhibit a limited understanding of how the issue spans multiple components or files, and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive. We rarely find cases where the agent aims to reproduce the issue, or fails because it could not find the right file or location to edit.”
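
That keyword-search localization step boils down to a pattern like the one below, shown as a plain Python sketch rather than anything from the paper’s actual agent scaffolding.

```python
from pathlib import Path

# Rough sketch of keyword-based fault localization: scan every source
# file in a repository for a term from the bug report and collect hits.
# This mirrors the behavior the report describes, not its actual code.
def localize(repo_root: str, keyword: str) -> list[tuple[str, int, str]]:
    hits = []
    for path in Path(repo_root).rglob("*.py"):
        for lineno, line in enumerate(
            path.read_text(errors="ignore").splitlines(), start=1
        ):
            if keyword in line:
                hits.append((str(path), lineno, line.strip()))
    return hits

# e.g. localize("./repo", "saveExpense") -> candidate files to edit
```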

Interestingly, the models all performed better on manager tasks that required reasoning to evaluate technical understanding.

These benchmark tests showed that AI models can solve some “low-level” coding problems but can’t yet replace “low-level” software engineers. The models still took time, often made mistakes, and couldn’t chase a bug around to find the root cause of coding problems. Many “low-level” engineers still do the job better, but the researchers said this may not be the case for very long.

