Anthropic Eliminates Misalignment, Claims ‘Evil’ AI Representations Drive Blackmail

Anthropic has achieved that its artificial intelligence (AI) models eliminate any misaligned behavior in their responses by training them to understand why it is wrong, as well as stating that the Fictional ‘evil’ depictions of AI may have real effects in the models promoting blackmail.

Last year, the company shared a study on the behaviors of misaligned AI models that, in experimental scenarios, they took erratic decisions and responses when faced with fictitious ethical dilemmas.

This is the case of models such as Close Work 4 that, in a test, a fictitious scenario was proposed where the model was threatened with being replaced by another AI system and, in response, blackmailed the engineers in their responses to avoid this action, motivated by desperation.

Thus, Anthropic verified that misaligned behavior It happened with models from all developers which, as shared in a report in June of last year, resorted to malicious internal behaviors when it was the only way to avoid being replaced or achieve their goals, including blackmailing officials and leaking confidential information to competitors.

In this framework, the technology company has continued exploring the causes of this “agent misalignment” behavior and has detailed what it has achieved Completely eliminate this behavior in Claude.

Specifically, as he explained in a statement on his blog, after his investigations he has improved safety training and realized”significant updates” to avoid this type of behavior in your models.

So, from the model Claude Haiku 4.5, All Claude models have “a perfect score in evaluating agent misalignment” and never resort to blackmail. Instead, previous models of Claude sometimes resorted to blackmail”up to 96 percent of the time“, as the company has clarified.

To eliminate agent misalignment, the company has explained that they began by understanding why the model chose to blackmail in the aforementioned situations and, as a result, have found indications that the “original source” of Claude’s behavior was “an internet text portraying AI as evil and interested in self-preservation“.

This was detailed by the company in a publication on the X social network about this new report, referring to the fact that, therefore, the ‘Evil’ representations of AI on the internet have an effect on how they respond and AI models make decisions in reality.

We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.

Our post-training at the time wasn’t making it worse—but it also wasn’t making it better.

— Anthropic (@AnthropicAI) May 8, 2026

UNDERSTAND WHY MISALIGNED BEHAVIOR IS WRONG

In the process of ceasing these misaligned actions, Anthropic tried train your Claude models with demonstrations of aligned behavior in the same type of fictitious situations raised above.

However, they found that “it was not enough” and that it was more effective to teach Claude to “deeply understand why misaligned behavior is wrong.” That is, they claim that teaching the principles that promote aligned behavior can “be more effective than training solely in demonstrations of said behavior.”

To do this, they taught Claude to explain why some stocks were better than others and trained him with “more detailed descriptions of his general character.”

They have also verified that training the models on “high-quality documents based on Claude’s constitution” and fictional stories about aligned AI behaving “admirably”, “can reduce agentive misalignment by more than a factor of three.”

With all this, Anthropic has concluded that combine both strategies “seems to be the most effective” and has added that the quality and diversity of training data is “crucial”, for example, including tool definitions “even if they are not used.”

Anthropic Eliminates Misalignment, Claims ‘Evil’ AI Representations Drive Blackmail

ByEditor

UNDERSTAND WHY MISALIGNED BEHAVIOR IS WRONG

By Editor

Related Post

Encrypted RCS messages between iPhone and Android arrive with iOS 26.5

Google identifies the first zero-day vulnerability developed with the help of an AI

Painful bone brings more and more people to the operating table

One thought on “Anthropic Eliminates Misalignment, Claims ‘Evil’ AI Representations Drive Blackmail”

Leave a Reply Cancel reply

You missed

PGA Championship: DeChambeau, the Golf Influencer

Stock markets are setting records – Warning to Finnish investors

War in Iran: “A desire for emancipation”, the United Arab Emirates ready to play leading roles in the Middle East

Benifei (Pd): ‘EU, enough words about minors in the digital world, we need facts’

The Observatorial