
Study claims OpenAI trains AI models on copyrighted data

Last updated: April 4, 2025 2:05 am
Published April 4, 2025
Photo of a judge’s gavel in front of the ChatGPT logo. A new study from the AI Disclosures Project indicates that OpenAI’s GPT-4o model recognises paywalled and copyrighted data from O’Reilly Media books.

A new study from the AI Disclosures Project has raised questions about the data OpenAI uses to train its large language models (LLMs). The research indicates that OpenAI’s GPT-4o model demonstrates a “strong recognition” of paywalled and copyrighted data from O’Reilly Media books.

The AI Disclosures Project, led by technologist Tim O’Reilly and economist Ilan Strauss, aims to address the potentially harmful societal impacts of AI’s commercialisation by advocating for improved corporate and technological transparency. The project’s working paper highlights the lack of disclosure in AI, drawing parallels with financial disclosure standards and their role in fostering robust securities markets.

The study used a legally obtained dataset of 34 copyrighted O’Reilly Media books to investigate whether LLMs from OpenAI were trained on copyrighted data without consent. The researchers applied the DE-COP membership inference attack method to determine whether the models could differentiate between human-authored O’Reilly texts and paraphrased LLM versions.
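The idea behind this kind of test can be illustrated with a small multiple-choice quiz: the verbatim passage is shuffled in among paraphrases, and a model that picks it out more often than chance may have seen it before. The following is a minimal sketch under stated assumptions, not the study’s implementation: `build_quiz`, `guess_rate`, and the `choose` callable (a stand-in for querying a model) are hypothetical names, and the number of paraphrases per question is left to the caller.

```python
import random
from typing import Callable, List, Sequence, Tuple


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle the verbatim passage in among its paraphrases.

    Returns the shuffled options and the index of the verbatim passage.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(items: Sequence[Tuple[str, Sequence[str]]],
               choose: Callable[[List[str]], int],
               seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the verbatim passage.

    `choose` is a hypothetical stand-in for a model call: it receives the
    shuffled options and returns the index it believes is the original.
    """
    rng = random.Random(seed)
    hits = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if choose(options) == answer:
            hits += 1
    return hits / len(items)


# Toy data: made-up passages, not text from any book in the study.
ITEMS = [
    ("The cache is flushed on every write.",
     ["Each write flushes the cache.",
      "On every write, the cache gets flushed.",
      "Writes always cause a cache flush."]),
    ("Indexes speed up reads but slow down writes.",
     ["Reads get faster with indexes, writes slower.",
      "An index helps reading and hurts writing.",
      "Indexes trade write speed for read speed."]),
]
SEEN = {original for original, _ in ITEMS}


def memorising_model(options: List[str]) -> int:
    """Toy chooser that has 'seen' every original passage."""
    return next(i for i, text in enumerate(options) if text in SEEN)
```

With four options per question, a chooser with no exposure to the text would be expected to score around 25%, while the toy `memorising_model` above scores 100%. A real test would replace `choose` with a prompt to the model under study.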

Key findings from the report include:

  • GPT-4o shows “strong recognition” of paywalled O’Reilly book content, with an AUROC score of 82%. In contrast, OpenAI’s earlier model, GPT-3.5 Turbo, does not show the same level of recognition (AUROC score just above 50%)
  • GPT-4o exhibits stronger recognition of non-public O’Reilly book content compared to publicly accessible samples (82% vs 64% AUROC scores respectively)
  • GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples than non-public ones (64% vs 54% AUROC scores)
  • GPT-4o Mini, a smaller model, showed no knowledge of public or non-public O’Reilly Media content when tested (AUROC approximately 50%)
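For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 50% corresponds to chance and 100% to perfect separation. The sketch below computes it directly from that definition. The function name `auroc` and the example scores are illustrative assumptions, not figures from the study.

```python
from typing import Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical recognition scores (made up for illustration):
# "suspected" = passages assumed to be in the training data,
# "unseen"    = passages assumed not to be.
suspected = [0.9, 0.8, 0.4]
unseen = [0.5, 0.3, 0.2]

# 8 of the 9 pairs are ranked correctly, so the result is 8/9.
score = auroc(suspected, unseen)
```

On this reading, GPT-4o’s reported 82% means that a passage from the paywalled material outranks a comparison passage in roughly four out of five pairings, whereas a score near 50% means the model’s responses do not separate the two groups at all.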

The researchers suggest that access violations may have occurred via the LibGen database, as all of the O’Reilly books tested were found there. They also acknowledge that newer LLMs have an improved ability to distinguish between human-authored and machine-generated language, which does not reduce the method’s ability to classify data.

The study highlights the potential for “temporal bias” in the results, because language changes over time. To account for this, the researchers tested two models (GPT-4o and GPT-4o Mini) trained on data from the same period.

The report notes that while the evidence is specific to OpenAI and O’Reilly Media books, it likely reflects a systemic issue around the use of copyrighted data. It argues that uncompensated use of training data could lead to a decline in the internet’s content quality and diversity, as revenue streams for professional content creation diminish.

The AI Disclosures Project emphasises the need for stronger accountability in AI companies’ model pre-training processes. It suggests that liability provisions that incentivise improved corporate transparency in disclosing data provenance may be an important step towards facilitating commercial markets for training data licensing and remuneration.

The EU AI Act’s disclosure requirements could help trigger a positive disclosure-standards cycle if properly specified and enforced. Ensuring that IP holders know when their work has been used in model training is seen as a crucial step towards establishing AI markets for content creator data.

Despite evidence that AI companies may be obtaining data illegally for model training, a market is emerging in which AI model developers pay for content through licensing deals. Companies like Defined.ai facilitate the purchasing of training data, obtaining consent from data providers and stripping out personally identifiable information.


The report concludes by stating that, using 34 proprietary O’Reilly Media books, the study provides empirical evidence that OpenAI likely trained GPT-4o on non-public, copyrighted data.

(Picture by Sergei Tokmakov)
