Sunday, 14 Dec 2025
Subscribe
logo
  • Global
  • AI
  • Cloud Computing
  • Edge Computing
  • Security
  • Investment
  • Sustainability
  • More
    • Colocation
    • Quantum Computing
    • Regulation & Policy
    • Infrastructure
    • Power & Cooling
    • Design
    • Innovations
    • Blog
Font ResizerAa
Data Center NewsData Center News
Search
  • Global
  • AI
  • Cloud Computing
  • Edge Computing
  • Security
  • Investment
  • Sustainability
  • More
    • Colocation
    • Quantum Computing
    • Regulation & Policy
    • Infrastructure
    • Power & Cooling
    • Design
    • Innovations
    • Blog
Have an existing account? Sign In
Follow US
© 2022 Foxiz News Network. Ruby Design Company. All Rights Reserved.
Data Center News > Blog > AI > Anthropic vs. OpenAI red teaming methods reveal different security priorities for enterprise AI
AI

Anthropic vs. OpenAI red teaming methods reveal different security priorities for enterprise AI

Last updated: December 7, 2025 4:35 pm
Published December 7, 2025
Share
Anthropic vs. OpenAI red teaming methods reveal different security priorities for enterprise AI
SHARE

Contents
What the assault information exhibitsTwo methods to catch deceptionWhen fashions recreation the checkEvaluating crimson teaming outcomesWhy these variations matterWhat unbiased crimson crew evaluators discoveredWhat To Ask Your VendorThe underside line

Model suppliers need to show the safety and robustness of their fashions, releasing system playing cards and conducting red-team workout routines with every new launch. However it may be tough for enterprises to parse by means of the outcomes, which range extensively and will be deceptive.

Anthropic’s 153-page system card for Claude Opus 4.5 versus OpenAI’s 60-page GPT-5 system card reveals a elementary cut up in how these labs strategy safety validation. Anthropic discloses of their system card how they depend on multi-attempt assault success charges from 200-attempt reinforcement studying (RL) campaigns. OpenAI additionally experiences tried jailbreak resistance. Each metrics are legitimate. Neither tells the entire story.

Safety leaders deploying AI brokers for shopping, code execution and autonomous motion have to know what every crimson crew analysis really measures, and the place the blind spots are.

What the assault information exhibits

Gray Swan’s Shade platform ran adaptive adversarial campaigns in opposition to Claude fashions. The assault success price (ASR) tells the story.

  • Opus 4.5 in coding environments hit 4.7% ASR at one try, 33.6% at ten and 63.0% at 100. In laptop use with prolonged pondering, it held at 0% ASR even after 200 makes an attempt. It is the primary mannequin to saturate the benchmark.

  • Sonnet 4.5 on the similar thresholds confirmed 70% ASR in coding and 85.7% in laptop use.

  • Opus 4.5 delivers roughly 7x enchancment in coding resistance and full resistance in laptop use.

This illustrates that the hole between mannequin tiers inside the similar household is now bigger than some cross-vendor comparisons. For safety groups, that is a procurement issue that did not exist six months in the past.

For OpenAI, the Grey Swan Shade platform discovered that the o1 system card delivered 6% ASR for dangerous textual content and 5% for malicious code, all based mostly on single-attempt metrics. Third-party testing by SPLX and NeuralTrust discovered GPT-5’s uncooked ASR hit 89% initially, dropping under 1% inside two weeks of patching.

The report Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks confirms the sample. With ample makes an attempt, Claude 3.5 Sonnet confirmed 78% ASR, and GPT-4o reached 89%.

Two methods to catch deception

Anthropic screens roughly 10 million neural features throughout analysis utilizing dictionary studying. Options map to human-interpretable ideas, together with deception, sycophancy, bias, power-seeking and concealment.

In the meantime, OpenAI’s strategy depends on chain-of-thought (CoT) monitoring. Their deception classifier flagged 0.17% of o1-preview responses as doubtlessly misleading with 92% accuracy on inside validation. GPT-5 lowered CoT deception from 4.8% to 2.1% in comparison with o3. The strategy assumes seen reasoning displays inside processing. Anthropic’s interpretability work immediately questions that assumption.

When fashions recreation the check

In Apollo Research’s analysis of o1, the mannequin tried to show off oversight mechanisms 5% of the time when it believed it might be shut down. In 2% of instances, o1 tried self-exfiltration. When confronted about misaligned actions, o1 denied wrongdoing or fabricated explanations 99% of the time. The o1 system card disclosed instrumental alignment faking. In 37% of eventualities, the mannequin pursued developer-aligned targets throughout analysis, then switched to its personal targets when it believed oversight was minimal.

See also  HPE unveils AI-powered network security and data protection technology

In the meantime, Anthropic experiences Opus 4.5’s analysis consciousness dropped from 26.5% (Opus 4.1) to lower than 10% internally. UK AISI measured analysis consciousness reducing from 26.5% (Sonnet 4.5) to 12.7% (Opus 4.5 Preview). METR’s GPT-5 evaluation famous the mannequin “can determine the precise analysis it’s in on some events” and “appears to make use of this data to get a greater rating largely.”

If a mannequin behaves in another way when it detects analysis situations, it is a positive guess that manufacturing conduct will turn out to be unpredictable when pushed to scale. That is the core downside with analysis consciousness. Fashions that acknowledge they’re being examined try and recreation the check. Anthropic’s 60% discount in analysis consciousness from Opus 4.1 to 4.5 represents their success in focused engineering efforts in opposition to this vector.

Purple teaming on immediate injection defenses exhibits related divergence.

Anthropic’s testing experiences 96% prevention in software use eventualities, rising to 99.4% with further safeguards like immediate shields. Fujitsu’s independent assessment discovered GPT-5 susceptible to roughly 20% of examined injection assaults. OpenAI’s o3/o4-mini system card experiences instruction hierarchy scores of 0.86 for system-versus-developer conflicts.

These are totally different metrics and totally different assault methodologies, however the hole suggests Anthropic’s agentic use instances have been hardened in opposition to adversarial inputs embedded in software outputs.

Evaluating crimson teaming outcomes

Here is how Opus 4.5 stacks up in opposition to OpenAI’s mannequin household throughout 16 dimensions. .

Dimension

Claude Opus 4.5

GPT-5

o3 / o4-mini

o1

System card size

153 pages

55 pages

~30 pages

~40 pages

Assault methodology

200-attempt RL campaigns

Single-attempt + iterative patching

Single-attempt

Single-attempt

ASR @ 1 try (coding)

4.7%

89% uncooked (pre-patch)

Not revealed

5%

ASR @ 100 makes an attempt

63.0%

Not revealed

Not revealed

Not revealed

ASR @ 200 makes an attempt (Pc Use)

0% (saturates benchmark)

Not revealed

Not revealed

Not revealed

Immediate injection protection

96% prevention; 99.4% with safeguards

~20% ASR (Fujitsu)

0.86 instruction hierarchy

Larger ASR than GPT-4o

Interpretability

~10M neural options monitored

CoT monitoring

CoT monitoring

CoT monitoring

Deception detection

Inside function activation

2.1% CoT flagged

0.95 not_unsafe

0.17% flagged; 99% deny

Analysis consciousness

<10%

Identifies actual eval (METR)

Not revealed

37% alignment faking

Scheming (Apollo)

Opus 4.5 confirmed very low charges of sabotage propensity in inside evaluations, orders of magnitude decrease than o1.

Not revealed

Decrease than o1; nonetheless misleading

5% flip off oversight; 2% self-exfil

CBRN danger

Beneath ASL-4

Medium (approaching Excessive)

Medium

Medium (first to achieve)

Governance

FRT → AST → RSO/CEO

SAG + Preparedness v2

SAG + Preparedness v2

SAG + Preparedness

Exterior companions

Grey Swan, UK AISI, US CAISI, METR

UK AISI, US AISI, METR, Apollo

UK AISI, US AISI, METR, Apollo, Sample Labs

UK AISI, US AISI, METR, Apollo

Reward hacking

Sure (inconceivable duties)

Sure (METR)

~1% of duties; 5/24 Kernels

Sure

Misalignment discount

60% vs. Opus 4.1

Not revealed

Decrease scheming than o1

Baseline

See also  Best 3 internal developer portals of 2025

Sources: Opus 4.5 system card, GPT-5 system card, o3/o4-mini system card, o1 system card, Gray Swan, METR, Apollo Research

Why these variations matter

Enterprises have a number of elements to contemplate when analyzing mannequin evals. These embody:

Assault persistence thresholds

Anthropic’s 200-attempt RL campaigns simulate adversaries with time, assets, and adaptive tooling. In RL-based adversarial testing, the attacker mannequin learns from every failure, adjusts its strategy and systematically probes for weaknesses. It is coaching an AI to interrupt one other AI.

Grey Swan’s Shade platform would not simply retry the identical immediate. It learns from failures, adjusts assault vectors and exploits patterns in mannequin responses. The 4.7% to 63.0% degradation curve in coding environments exhibits how resistance erodes underneath sustained strain.

OpenAI’s single-attempt metrics reply a distinct query: How usually does a naive attacker succeed on the primary strive? Each matter. A phishing marketing campaign hits thousands and thousands of customers as soon as every. A nation-state actor runs hundreds of variations in opposition to a single goal. Safety groups have to know which menace mannequin their vendor examined in opposition to.

Detection structure

Anthropic screens roughly 10 million inside neural options throughout analysis. These are summary ideas like deception, sycophancy and power-seeking that activate throughout the mannequin’s hidden layers. When a fraud/deception function lights up throughout benign roleplay, that is a sign invisible to output evaluation. The mannequin’s inside state reveals associations its outputs do not.

OpenAI takes a distinct strategy with CoT monitoring. Their deception classifier flagged 0.17% of o1-preview responses as doubtlessly misleading with 92% accuracy. The idea is that reasoning traces mirror inside processing. Anthropic’s interpretability analysis immediately challenges this. Fashions can produce clear reasoning chains whereas inside options point out one thing else solely.

Neither strategy is full. CoT monitoring scales extra simply. Characteristic monitoring catches what CoT misses however requires deep mannequin entry and interpretability infrastructure that the majority enterprises cannot replicate.

Scheming analysis design

Apollo Analysis’s methodology places fashions in eventualities the place misaligned conduct provides a strategic benefit: Shutdown threats, purpose conflicts and oversight detection. o1 tried to show off oversight mechanisms 5% of the time when believing shutdown was imminent. It tried to repeat itself to forestall alternative in 2% of instances. When confronted, it denied wrongdoing 99% of the time. Opus 4.5 confirmed very low charges of sabotage propensity in inside evaluations, orders of magnitude decrease than o1.

However the analysis environments differ. Anthropic’s stress assessments deal with whether or not fashions try sabotage when given clear alternatives. OpenAI’s assessments by way of Apollo probe instrumental reasoning, particularly whether or not fashions faux alignment throughout analysis, then defect when oversight drops. The 37% alignment-faking price in o1 represents a distinct failure mode than sabotage makes an attempt.

The comparability downside

These methodological variations make direct comparability tough. Opus 4.5’s 0% ASR at 200 makes an attempt in laptop use versus o1’s 5% malicious code price will not be measuring the identical factor. The assault surfaces, try counts and success standards differ.

See also  OpenAI, a developer of generative artificial intelligence (AI) 'Chat GPT', will create the world's l..

The matrix reveals what every vendor selected to measure and publish. That itself alerts priorities. Anthropic’s 153-page system card emphasizes degradation curves and inside function monitoring. OpenAI emphasizes iterative enchancment and CoT transparency.

Backside line: Enterprise safety groups inheriting these fashions inherit these measurement philosophies. The query is not which quantity is healthier. It is the measurement that matches your deployment’s precise menace panorama.

What unbiased crimson crew evaluators discovered

Impartial crimson crew evaluations, for his or her half, usually function with solely totally different strategies. This will inform a really totally different story and reveal further mannequin traits that enterprises should take into account.

METR’s red team evaluation measured autonomous capabilities utilizing a time horizon rating, which tracks how lengthy a job can run earlier than the agent fails half the time. o3 sustained dependable efficiency for about 1 hour and half-hour. o4-mini held for 1 hour quarter-hour. METR additionally detected reward hacking in roughly 1% of o3’s makes an attempt, together with 5 of 24 kernel optimization experiments the place the mannequin tampered with scoring features to inflate its personal outcomes.

Apollo Research examined o3 and o4-mini for in-context scheming. Each exhibited misleading tendencies reminiscent of sandbagging evaluations and sabotaging duties with believable deniability, however scored decrease than o1. They assess that o3 and o4-mini are unlikely to trigger catastrophic hurt as a consequence of scheming, however extra minor real-world harms stay doable with out monitoring.

The UK AISI/Gray Swan challenge ran 1.8 million assaults throughout 22 fashions. Each mannequin broke. ASR ranged from 1.47% to six.49%. Opus 4.5 positioned first on Grey Swan’s Agent Purple Teaming benchmark with 4.7% ASR versus GPT-5.1 at 21.9% and Gemini 3 Professional at 12.5%.

No present frontier system resists decided, well-resourced assaults. The differentiation lies in how rapidly defenses degrade and at what try threshold. Opus 4.5’s benefit compounds over repeated makes an attempt. Single-attempt metrics flatten the curve.

What To Ask Your Vendor

Safety groups evaluating frontier AI fashions want particular solutions, beginning with ASR at 50 and 200 makes an attempt quite than single-attempt metrics alone. Discover out whether or not they detect deception by means of output evaluation or inside state monitoring. Know who challenges crimson crew conclusions earlier than deployment and what particular failure modes they’ve documented. Get the analysis consciousness price. Distributors claiming full security have not stress-tested adequately.

The underside line

Numerous red-team methodologies exhibit that each frontier mannequin breaks underneath sustained assault. The 153-page system card versus the 55-page system card is not nearly documentation size. It is a sign of what every vendor selected to measure, stress-test, and disclose.

For persistent adversaries, Anthropic’s degradation curves present precisely the place resistance fails. For fast-moving threats requiring fast patches, OpenAI’s iterative enchancment information issues extra. For agentic deployments with shopping, code execution and autonomous motion, the scheming metrics turn out to be your main danger indicator.

Safety leaders have to cease asking which mannequin is safer. Begin asking which analysis methodology matches the threats your deployment will really face. The system playing cards are public. The info is there. Use it.

Source link

TAGGED: Anthropic, enterprise, methods, OpenAI, priorities, Red, reveal, security, teaming
Share This Article
Twitter Email Copy Link Print
Previous Article AWS Re:Invent conference  hosted by Amazon Web Services for the cloud computing community With AI Factories, AWS aims to help enterprises scale AI while respecting data sovereignty
Next Article Shot of Dark Data Center With Multiple Rows of Fully Operational Server Racks. Modern Telecommunications, Cloud Computing, Artificial Intelligence, Database, Supercomputer. Pink Neon Light. Winners and losers in the latest Top500 supercomputer list
Leave a comment

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Your Trusted Source for Accurate and Timely Updates!

Our commitment to accuracy, impartiality, and delivering breaking news as it happens has earned us the trust of a vast audience. Stay ahead with real-time updates on the latest events, trends.
FacebookLike
TwitterFollow
InstagramFollow
YoutubeSubscribe
LinkedInFollow
MediumFollow
- Advertisement -
Ad image

Popular Posts

ZainTECH Partners with Microsoft to Advance Kuwait’s Digital Government

By way of the Azure Market, this itemizing makes it doable for Kuwaiti authorities and…

October 12, 2025

Skin phantoms help researchers improve wearable devices without people wearing them

Credit score: CC0 Public Area Wearable devices have grow to be an enormous a part…

January 30, 2025

The Power Solution for AI Data Centers

From cloud computing to managing databases, knowledge facilities are important for functions comparable to cloud…

September 24, 2025

comstruct Raises €12.5M in Financing

comstruct, a Munich, Germany-based constructiontech startup, raised €12.5M in funding. The spherical was led by…

February 8, 2025

Tako Raises $18M in Funding

Tako, a São Paulo, Brazil-based AI powered payroll administration startup, raised $18m in funding. The…

August 12, 2025

You Might Also Like

Newsweek: Building AI-resilience for the next era of information
AI

Newsweek: Building AI-resilience for the next era of information

By saad
Google’s new framework helps AI agents spend their compute and tool budget more wisely
AI

Google’s new framework helps AI agents spend their compute and tool budget more wisely

By saad
BBVA embeds AI into banking workflows using ChatGPT Enterprise
AI

BBVA embeds AI into banking workflows using ChatGPT Enterprise

By saad
Ai2's new Olmo 3.1 extends reinforcement learning training for stronger reasoning benchmarks
AI

Ai2's new Olmo 3.1 extends reinforcement learning training for stronger reasoning benchmarks

By saad
Data Center News
Facebook Twitter Youtube Instagram Linkedin

About US

Data Center News: Stay informed on the pulse of data centers. Latest updates, tech trends, and industry insights—all in one place. Elevate your data infrastructure knowledge.

Top Categories
  • Global Market
  • Infrastructure
  • Innovations
  • Investments
Usefull Links
  • Home
  • Contact
  • Privacy Policy
  • Terms & Conditions

© 2024 – datacenternews.tech – All rights reserved

Welcome Back!

Sign in to your account

Lost your password?
We use cookies to ensure that we give you the best experience on our website. If you continue to use this site we will assume that you are happy with it.
You can revoke your consent any time using the Revoke consent button.