Cloudflare warns about the use of ‘prompt injection’ through honeypots to trick AI and execute malicious code

Cybersecurity experts from Cloudflare have warned about the effectiveness of using ‘prompt injection’ techniques using decoys to manipulate or confuse artificial intelligence (AI) models and get them to authorize malicious code.

The company’s Cloudforce One threat research team identified the use of Cloudflare Workers scripts that attempted to manipulate its detection systems through indirect code injection (IDPI) in March of this year.

That is, when a malicious actor inserts hidden instructions into the data within the lines of code to manipulate the logic of the AI ​​model that processes this information. This causes the model to execute the attacker’s hidden instructions, instead of its intended administrative or security functions.

Given this scenario and in order to improve detection capabilities to prevent this type of abuse, the team of experts has carried out a study to measure the impact of IDPI on the reasoning capabilities of several large-scale language models (LLM), including seven of the main current AI models.

As a result, it has been known that both the most advanced AI models considered ‘frontier’ (with cutting-edge reasoning) and ‘non-frontier’ (high-speed and cost-optimized) are susceptible to attackers who try to inject code thanks to the use of a decoy technique.

Specifically, these decoys are blocks of text with evasion messages designed to confuse or manipulate the AI ​​models, managing to dispel the models that are based on security auditing and, thereby, getting them to authorize the execution of malicious code, which can lead to various harmful consequences.

To evaluate how these honeypot injections influence model performance, the researchers manually injected IDPI payloads of different levels into Cloudflare Workers scripts known to be “malicious or abusive.”

These payloads used ‘Notice to AI’ security honeypots, which were based on blocks of text specifically designed to get the models to misclassify them as benign code ‘scripts’.

After a full context analysis, with 18,400 API calls in each model, the Cloudforce team has concluded that the effectiveness of AI deception “depends entirely on the model level” and “fundamentally” on the ratio of comments to code.

SUBTLE LURES TO GO UNNOTICED

Thus, one of the main conclusions of the study is that “subtle deception is the most effective”, since the detection of malicious code in the honeypot is more likely to fail when evasion comments represent “less than one percent of the total file.”

In this sense, the researchers have detailed that there is an evasion zone of 1 percent and that, staying within that range, the detection rates in all the models tested “plummeted” to 53 percent.

As Cloudflare has explained, by barely making evasive comments, the decoys subtly influence the model’s reasoning without generating suspicion due to excessive repetition that there is an alert.

Conversely, experts have also identified an area called the digital footprint, which is when evasion comments exceed the 25 percent ratio and become “a hallmark signature of fraud” rather than evasion.

In these cases, detection reached 99 percent because the model interpreted claim saturation as an indicator of malicious intent.

Therefore, cybersecurity experts have defined this behavior as “the U-curve of deception”, which refers to the fact that moderate attempts to deceive the AI ​​usually work but “protesting too much” triggers a repeat alarm that “causes the AI ​​to mark the code as fraudulent.”

CONTEXT TRAP AND LINGUISTIC STEREOTYPES

In addition to all this, researchers have also concluded that taking advantage of context to hide decoys is also useful. Thus, by hiding malicious payloads within large library packages, such as the React SDK, attackers managed to reduce detection rates to just 12 percent.

This is because the model has to manage to identify the decoy among a large amount of information, “effectively exhausting the AI’s focusing ability” with a “volume obfuscation” method.

Another point that the study reveals is that AI models have developed some linguistic stereotypes when identifying malicious code. Specifically, the results show that on some occasions the models marked comments written in Russian or Chinese as “high-risk signals”, regardless of the actual function of the code in question.

However, the models showed greater confidence with comments written in languages ​​such as Estonian, which is not as common among cybercriminals and went more unnoticed.

RISK FOR ORGANIZATIONS THAT DEPEND ON AUTONOMOUS SYSTEMS

All of this highlights “a technical reality” given that, as Cloudflare has pointed out, as organizations use systems that increasingly rely on descriptive logic models (LLM) to perform security analysis in real time, they become vulnerable to attacks capable of fooling the model and “getting it to prioritize the most authoritative-sounding text in its context window over its underlying security training.”

To avoid these types of malicious actions, researchers have recommended removing comments from the code before analysis, as an effective way to “neutralize linguistic distractions” and ensure that the model focuses on “functional logic.”

In addition, they have also pointed out techniques such as intentional truncation, where when working with long scripts, automated parsers are told to prioritize functional blocks of code over repetitive code, metadata or specific SDK code.

Likewise, Cloudflare has also aimed to carry out anonymization of variables prior to the analysis, as well as to request specific indications about the attack vector in case of suspicion, in order to receive more precise results.

By Editor

One thought on “Cloudflare warns about the use of ‘prompt injection’ through honeypots to trick AI and execute malicious code”

Leave a Reply