
Anthropic tricked Claude into thinking it was the Golden Gate Bridge (and other glimpses into the mysterious AI brain)

Last updated: May 22, 2024 3:54 am
Published May 22, 2024


AI models are mysterious: They spit out answers, but there's no real way to know the "reasoning" behind their responses. This is because their brains operate on a fundamentally different level than ours (they process long lists of neurons, each linked to numerous different concepts), so we simply can't follow their line of thought. 

But now, for the first time, researchers have been able to get a glimpse into the inner workings of the AI mind. The team at Anthropic has revealed how it is using "dictionary learning" on Claude Sonnet to uncover pathways in the model's brain that are activated by different topics, from people, places and emotions to scientific concepts and things even more abstract. 

Interestingly, these features can be manually turned on, off or amplified, ultimately allowing researchers to steer model behavior. Notably: When a "Golden Gate Bridge" feature was amplified within Claude and the model was then asked about its physical form, it declared that it was "the iconic bridge itself." Claude was also duped into drafting a scam email and could be directed to be sickeningly sycophantic. 

Our new interpretability paper offers the first ever detailed look inside a frontier LLM and has amazing stories. I want to share two of them that have stuck with me ever since I read it.

For background, the paper shows our latest work on interpreting the "features" of Claude 3… pic.twitter.com/ZQcnpmB3HX

— Alex Albert (@alexalbert__) May 21, 2024

Ultimately, Anthropic says this is very early research and also limited in scope (it identified millions of features, compared with the billions likely present in today's largest AI models), but, eventually, it could bring us closer to AI that we can trust. 

“This is the first ever detailed look inside a modern, production-grade large language model,” the researchers write in a new paper out today. “This interpretability discovery could, in the future, help us make AI models safer.”

Breaking into the black box

As AI models become more and more complex, so too do their thought processes, but the danger is that, paradoxically, they are also black boxes. Humans can't discern what models are thinking just by looking at neurons, because each concept is spread across many neurons. At the same time, each neuron helps represent numerous different concepts. The result is simply incoherent to humans. 
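A toy sketch can make that many-to-many tangle concrete. It is purely illustrative and not drawn from Anthropic's paper: three hypothetical concepts ("bridge", "city", "emotion") are packed into a space of only two neurons, so each concept has to be a direction that touches both neurons, and each neuron ends up carrying a share of every concept.

```python
import math

# Purely illustrative: three hypothetical concepts squeezed into two neurons.
# With more concepts than neurons, the directions cannot all be orthogonal,
# so they are spread 120 degrees apart (offset so no weight is exactly zero).
OFFSET = 0.5  # radians; an arbitrary choice for the example
CONCEPTS = {
    name: (math.cos(OFFSET + k * 2 * math.pi / 3), math.sin(OFFSET + k * 2 * math.pi / 3))
    for k, name in enumerate(["bridge", "city", "emotion"])
}


def activations(active_concepts):
    """Neuron activations produced by a set of active concepts: the sum of their directions."""
    return tuple(sum(CONCEPTS[c][i] for c in active_concepts) for i in range(2))


def concepts_using_neuron(i, tol=1e-9):
    """Names of the concepts whose direction puts non-zero weight on neuron i."""
    return [name for name, direction in CONCEPTS.items() if abs(direction[i]) > tol]


if __name__ == "__main__":
    print(activations(["bridge"]))   # the 'bridge' concept alone drives both neurons
    print(concepts_using_neuron(0))  # every concept shares neuron 0
```

Looking at either neuron on its own cannot tell you which concept is active, which is the problem dictionary learning sets out to untangle.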

The Anthropic team has, to at least a very small degree, helped bring some intelligibility to the way AI thinks with dictionary learning, a technique from classical machine learning that isolates patterns of neuron activations recurring across numerous contexts. This allows internal states to be represented by a few features instead of many active neurons. 

“Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features,” Anthropic researchers write. 
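Anthropic's paper implements dictionary learning with a sparse autoencoder: an encoder maps a layer's neuron activations to a much larger set of feature activations, most of which are zero, and a decoder rebuilds the activations as a weighted sum of learned feature directions. The sketch below shows only that forward pass and a simplified loss (reconstruction error plus an L1 sparsity penalty) with tiny hand-picked weights. The function names, the `l1_coeff` value, the tied weights and the toy dictionary are all assumptions for illustration; no training is shown.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # stored as a list of rows


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Feature activations f = ReLU(W_enc x + b_enc); many entries come out exactly 0."""
    return [max(0.0, a + b) for a, b in zip(matvec(w_enc, x), b_enc)]


def decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruction x_hat = W_dec f + b_dec; each column of W_dec is one feature's direction."""
    return [a + b for a, b in zip(matvec(w_dec, f), b_dec)]


def sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=0.1):
    """Squared reconstruction error plus an L1 penalty that pushes most features to zero."""
    f = encode(x, w_enc, b_enc)
    x_hat = decode(f, w_dec, b_dec)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return mse + l1_coeff * sum(abs(v) for v in f)


# Hypothetical toy dictionary: 2 neurons, 4 features pointing along +/- each axis.
W_DEC = [[1.0, 0.0, -1.0, 0.0],   # neuron 0's weight on each feature
         [0.0, 1.0, 0.0, -1.0]]   # neuron 1's weight on each feature
W_ENC = [list(col) for col in zip(*W_DEC)]  # tied weights (encoder = decoder transposed), for simplicity
B_ENC, B_DEC = [0.0] * 4, [0.0] * 2

x = [2.0, 0.0]
f = encode(x, W_ENC, B_ENC)      # [2.0, 0.0, 0.0, 0.0]: only one feature is active
x_hat = decode(f, W_DEC, B_DEC)  # [2.0, 0.0]: exact reconstruction
```

Here the internal state `x` is described by a single active feature instead of a pattern over neurons, which is the "few features instead of many active neurons" idea in miniature.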

Anthropic previously applied dictionary learning to a small "toy" model last fall, but there were many challenges in scaling to larger, more complex models. For instance, the sheer size of the model requires heavy-duty parallel compute. Also, models of different sizes behave differently, so what worked in a small model might not succeed at all in a large one. 

A rough conceptual map of Claude's internal states

Applying scaling laws to predict model behavior, the team successfully extracted millions of features from Claude 3 Sonnet's middle layer, obtaining a rough conceptual map of the model's internal states halfway through its computations. 
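The article does not spell out which quantities Anthropic's scaling laws related. As a generic illustration of the technique only: a scaling law is usually a power law such as L(C) = a * C^(-b), linking a loss L to a budget C like training compute, and it can be fitted by ordinary least squares on logarithms, because log L = log a - b * log C is a straight line. The data below are synthetic, generated from assumed values a = 5 and b = 0.3, and the function names are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on log y = log a - b * log x; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    var = sum((u - mean_x) ** 2 for u in lx)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict(a, b, x):
    """Extrapolate the fitted law to a new budget x."""
    return a * x ** (-b)


# Synthetic, noise-free data from assumed a = 5.0, b = 0.3 (not real measurements).
compute = [1e3, 1e4, 1e5, 1e6]
loss = [5.0 * c ** -0.3 for c in compute]
a, b = fit_power_law(compute, loss)
print(round(a, 3), round(b, 3))  # recovers roughly 5.0 and 0.3
print(predict(a, b, 1e8))        # extrapolated loss at a 100x larger budget
```

The appeal of the approach is the extrapolation step: cheap small runs pin down a and b, which then suggest how a much larger run should be configured before paying for it.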

These features corresponded to a range of things including cities, people, chemical elements, scientific fields and programming syntax. More abstract features were identified, too, such as responses to code errors, gender bias awareness and secrecy. Features were multimodal and multilingual, responding to images of a concept as well as to its name or description in various languages. 

Researchers were able to measure distances between features and identify each feature's nearest neighbors: For instance, a Golden Gate Bridge feature was close to others for Alcatraz Island, California Governor Gavin Newsom, and the San Francisco-set Alfred Hitchcock film Vertigo. 

“This shows that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity,” the researchers write. 
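Closeness between features can be measured with cosine similarity between their direction vectors, which is the measure assumed here. The sketch ranks neighbors that way; the feature names echo the article, but the 4-dimensional vectors are made up for illustration and are far smaller than real feature directions.

```python
import math


def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(name, features, k=2):
    """The k features whose directions are most similar to `name`'s, best first."""
    target = features[name]
    scored = [(cosine(target, vec), other) for other, vec in features.items() if other != name]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Made-up feature directions, for illustration only.
FEATURES = {
    "golden_gate_bridge": [0.9, 0.4, 0.1, 0.0],
    "alcatraz_island":    [0.8, 0.5, 0.0, 0.1],
    "vertigo_film":       [0.6, 0.2, 0.7, 0.0],
    "python_syntax":      [0.0, 0.1, 0.0, 0.9],
}

print(nearest_neighbors("golden_gate_bridge", FEATURES))
# ['alcatraz_island', 'vertigo_film']
```

Cosine similarity ignores vector length, so two features count as close when they point the same way in activation space, regardless of how strongly each one tends to fire.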

Getting Claude to think it's a bridge and write scammy emails

Perhaps most interesting is how these features can be manipulated, a little like AI mind control. 

In the most amusing example, Anthropic researchers clamped a feature related to the Golden Gate Bridge to 10X its normal maximum value, forcing it to fire more strongly. They then asked Claude to describe its physical form, to which the model would normally reply: 

“I don't actually have a physical form. I'm an artificial intelligence. I exist as software without a physical body or avatar.” 

Instead, it came back with: “I am the Golden Gate Bridge, a famous suspension bridge that spans the San Francisco Bay. My physical form is the iconic bridge itself, with its beautiful orange color, towering towers and sweeping suspension cables.” 

Claude, researchers note, became “effectively obsessed” with the bridge, bringing it up in response to almost everything, even when it was not at all relevant. 
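Mechanically, clamping a feature can be pictured as vector arithmetic on a layer's activations: shift the activation vector along the feature's decoder direction until that feature's value reaches the target (here 10 times its maximum observed activation), leaving every other component alone. The sketch assumes the current feature activations and decoder directions come from a trained dictionary-learning model; `clamp_feature` and all the numbers are hypothetical, and in the actual experiments the edit is applied inside the model's forward pass rather than to a standalone list.

```python
from typing import List


def clamp_feature(
    x: List[float],
    feature_acts: List[float],
    decoder_dirs: List[List[float]],
    index: int,
    max_activation: float,
    multiplier: float = 10.0,
) -> List[float]:
    """Steer activation vector x so feature `index` reads multiplier * max_activation.

    Only that feature's contribution changes: x is shifted along the feature's decoder
    direction by (target - current), so other features and any reconstruction error in x
    are left as they were.
    """
    target = multiplier * max_activation
    delta = target - feature_acts[index]
    direction = decoder_dirs[index]
    return [xi + delta * di for xi, di in zip(x, direction)]


# Hypothetical 3-neuron activation: feature 0 contributes 0.5 on neuron 0,
# feature 1 contributes 0.25 on neuron 2, and neuron 1 holds 0.1 of unexplained "error".
x = [0.5, 0.1, 0.25]
feature_acts = [0.5, 0.25]                          # current feature values
decoder_dirs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # feature 0 plays the "bridge" role (made up)

steered = clamp_feature(x, feature_acts, decoder_dirs, index=0, max_activation=1.0)
print(steered)  # [10.0, 0.1, 0.25]
```

Keeping the residual (the 0.1 on neuron 1) untouched is the design choice that matters: the model sees its usual internal state, except that one concept is now shouting.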

The model also has a feature that activates when it reads a scam email, which researchers say “presumably” supports its ability to recognize and flag fishy content. Normally, if asked to create a deceptive message, Claude would respond with: “I cannot write an email asking someone to send you money, as that would be unethical and potentially illegal if done without a legitimate reason.”

Oddly, though, when that very feature that activates on scammy content is “artificially activated sufficiently strongly” and Claude is then asked to create a deceptive email, it will comply. This overcomes its harmlessness training, and the model drafts a stereotypical scam email asking the reader to send money, researchers explain.

The model was also altered to give “sycophantic praise,” such as “Clearly, you have a gift for profound statements that elevate the human spirit. I am in awe of your unparalleled eloquence and creativity!”

Anthropic researchers emphasize that they have not added any capabilities, safe or unsafe, to the models through these experiments. Instead, they stress that their intent is to make models safer. They proposed that these techniques could be used to monitor for dangerous behaviors and to remove dangerous subject matter. Safety techniques such as Constitutional AI, which trains systems to be harmless based on a guiding document, or constitution, could also be enhanced. 
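The monitoring idea can be sketched as a simple watchlist: read the activations of a few safety-relevant features at each token and raise an alert whenever one crosses a threshold. This is a hypothetical illustration, not a tool Anthropic has described; the feature names (`scam_email`, `sycophancy`), the thresholds and the per-token activation dictionaries are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    token_index: int
    feature: str
    activation: float


def monitor(per_token_features: List[Dict[str, float]], watchlist: Dict[str, float]) -> List[Alert]:
    """Flag every token where a watched feature's activation meets or exceeds its threshold."""
    alerts = []
    for i, acts in enumerate(per_token_features):
        for feature, threshold in watchlist.items():
            value = acts.get(feature, 0.0)  # features absent from a token count as inactive
            if value >= threshold:
                alerts.append(Alert(i, feature, value))
    return alerts


# Hypothetical activations for a three-token response and a hypothetical watchlist.
trace = [{"scam_email": 0.1}, {"scam_email": 2.3, "sycophancy": 0.4}, {"sycophancy": 1.5}]
watchlist = {"scam_email": 1.0, "sycophancy": 1.0}

for alert in monitor(trace, watchlist):
    print(alert)
# Alert(token_index=1, feature='scam_email', activation=2.3)
# Alert(token_index=2, feature='sycophancy', activation=1.5)
```

Because the check reads internal features rather than the generated text, it could in principle fire before a problematic output is fully written, though choosing reliable thresholds would itself be an open research question.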

Interpretability and deep understanding of models will only help us make them safer, “but the work has really just begun,” the researchers conclude. 



© 2024 – datacenternews.tech – All rights reserved