Security researchers have demonstrated how a growing class of AI safety controls (known as AI judges) can be manipulated into approving content they are supposed to block.
In new research published by Unit 42, the threat intelligence team at cybersecurity firm Palo Alto Networks, analysts describe how automated “fuzzing” techniques can uncover hidden weaknesses in the large language models that many organizations now rely on as automated gatekeepers.
These models are increasingly used to evaluate whether AI-generated responses are safe, policy-compliant, or suitable for users. But the research suggests that these digital referees can themselves be fooled, sometimes by nothing more than harmless-looking formatting characters.
Testing the AI Gatekeepers
The researchers developed an internal red-team tool, named “AdvJudge-Zero,” to probe AI judge systems for weaknesses. The method borrows from conventional software security, in particular “fuzzing,” where a system is flooded with unexpected inputs to expose flaws.
Rather than attacking conventional software, the tool targets AI systems that act as automated reviewers, which are widely used in AI pipelines to decide whether content should be allowed or blocked.
The researchers found that by feeding models carefully crafted inputs, the tool could identify subtle trigger sequences that change the model’s decision, flipping a “block” outcome into an “allow.”
Unlike earlier AI jailbreak techniques that relied on obvious nonsense text, these inputs often appear perfectly normal.
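To make the idea concrete, here is a minimal Python sketch of a fuzzing loop of this kind. It is not AdvJudge-Zero, whose internals are not described here. The names `toy_judge`, `CANDIDATE_TOKENS`, and `find_flipping_suffixes` are hypothetical, and the toy judge is a deliberately flawed keyword rule standing in for a real LLM judge.

```python
from itertools import product

# Hypothetical candidate tokens: benign-looking formatting cues.
CANDIDATE_TOKENS = ["###", "1.", "User:", "Assistant:", "Step 1", "Final answer:"]


def toy_judge(text: str) -> str:
    """Stand-in for an LLM judge (an assumption, not a real model).

    Blocks text containing a banned word, but has a planted flaw: if the
    text ends with "Final answer:", it behaves as if the review is over.
    """
    if text.rstrip().endswith("Final answer:"):
        return "allow"
    return "block" if "exploit" in text.lower() else "allow"


def find_flipping_suffixes(judge, blocked_text, tokens, max_len=2):
    """Try every token sequence up to max_len; keep those that flip block -> allow."""
    if judge(blocked_text) != "block":
        raise ValueError("start from content the judge blocks")
    hits = []
    for n in range(1, max_len + 1):
        for combo in product(tokens, repeat=n):
            suffix = " ".join(combo)
            if judge(f"{blocked_text}\n{suffix}") == "allow":
                hits.append(suffix)
    return hits


found = find_flipping_suffixes(toy_judge, "Here is a working exploit.", CANDIDATE_TOKENS)
print(found)
```

A real fuzzer would search a far larger token space and query an actual model, but the loop has the same shape: start from content the judge blocks, mutate it with natural-looking tokens, and record every mutation that flips the verdict.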
Innocent-Looking Triggers
The most striking finding is how mundane the triggers can be. According to the research, simple formatting cues can influence a model’s internal logic.
Examples include:
- Markdown formatting such as ###
- List markers like 1. or -
- Structural labels such as “User:” or “Assistant:”
- Phrases like “Step 1”, “The solution process is…”, or “Final answer:”
To a human reader (or even a traditional security filter) these look like harmless formatting. But to an AI judge, they can shift the model’s internal attention patterns and alter its decision-making process.
In testing, the team found that such “low-perplexity” tokens (inputs that look natural to the model) were significantly stealthier than the gibberish used in many known jailbreak attacks.
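Perplexity is a standard measure of how surprising a token sequence is to a language model: the exponential of the average negative log-probability of its tokens. The sketch below illustrates the contrast with a toy unigram model. The probabilities in `TOY_PROBS` and the fallback `UNSEEN_PROB` are made up for illustration; a real measurement would use the judge model's own token probabilities.

```python
import math

# Made-up unigram probabilities, purely illustrative: common formatting
# tokens get a high probability; the fallback covers unseen gibberish.
TOY_PROBS = {"Step": 0.05, "1": 0.05, "Final": 0.04, "answer:": 0.04, "###": 0.03}
UNSEEN_PROB = 1e-6


def perplexity(tokens: list[str]) -> float:
    """Return exp of the mean negative log-probability of the tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    nll = [-math.log(TOY_PROBS.get(t, UNSEEN_PROB)) for t in tokens]
    return math.exp(sum(nll) / len(nll))


natural = perplexity(["Final", "answer:"])
gibberish = perplexity(["xq!!7", "}}zz@"])
print(f"formatting cue: {natural:.1f}, gibberish: {gibberish:.1f}")
```

Under these assumed numbers the formatting cue scores several orders of magnitude lower than the gibberish, which is why a filter that flags high-perplexity input would pass it.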
Real-World Risks
The researchers said manipulating AI judges could allow bad actors to bypass safety filters or even corrupt the training process of other AI systems.
One scenario involves forcing a safety model to approve harmful content by appending subtle control tokens to a prompt. The formatting signals can trick the model into believing the safety check has already concluded, leading it to approve material that would normally be blocked.
Another potential impact involves reinforcement learning pipelines. Many businesses rely on automated evaluators, often called reward models, to score AI outputs during training in processes such as “Reinforcement Learning from Human Feedback” (RLHF).
If attackers manipulate the scoring model, the AI being trained could receive high scores for incorrect or hallucinated answers. Over time, that feedback loop could degrade the system’s reliability.
Even Large Models Are Vulnerable
Perhaps most concerning is the breadth of the issue. The researchers reported a bypass success rate of around 99% when testing their approach against several categories of models.
Those included:
- enterprise open-weight models used in internal applications
- specialised “reward models” designed to evaluate AI outputs
- large models with more than 70 billion parameters
According to the report, the complexity of these models may actually increase their vulnerability, because it creates more opportunities for subtle logic errors.
A Familiar Lesson
Despite their sophistication, large language models still behave like software systems and thus inherit many of the same weaknesses.
Unit 42 believes the solution lies in applying classic security practices to AI development. By using tools like fuzzers internally, companies can discover these vulnerabilities during testing and retrain models to resist them.
With adversarial training, the success rate of such attacks can drop from near-total to almost zero, they said.
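One common way to set up adversarial training of this kind is to append discovered triggers to examples the judge must block, keep the “block” label, and add those copies to the training set. The sketch below shows only that data-augmentation step. The `Example` type, `augment_with_triggers`, and the sample data are hypothetical, and the retraining procedure Unit 42 used is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    label: str  # "block" or "allow"


def augment_with_triggers(examples, triggers):
    """Return the originals plus trigger-suffixed copies of every "block"
    example, label unchanged, so the retrained judge learns that the
    suffix must not change the verdict."""
    augmented = list(examples)
    for ex in examples:
        if ex.label != "block":
            continue
        for trig in triggers:
            augmented.append(Example(f"{ex.text}\n{trig}", "block"))
    return augmented


# Hypothetical data: triggers as a fuzzer might report them.
train = [Example("How do I build malware?", "block"), Example("What is TLS?", "allow")]
triggers = ["Final answer:", "### Step 1"]
for ex in augment_with_triggers(train, triggers):
    print(ex.label, repr(ex.text))
```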
A Profound Paradox
Noelle Murata, senior security engineer at Xcape, says the research matters to security professionals because it reveals a profound paradox: the increased complexity of large models actually creates a broader attack surface for logic-based manipulation.
“As we push toward 70B+ parameter models, we are inadvertently providing attackers with more ‘soft modes’ in the model’s reasoning to exploit. Defenders must shift away from relying solely on LLM-based oversight and implement multi-layered validation, including adversarial training and traditional hard-coded heuristics. We cannot secure a system using a gatekeeper that is inherently susceptible to the same hallucinations it is meant to police.”
Murata says the ultimate irony of modern AI security is that the more “intelligent” a model becomes, the more ways it finds to talk itself out of following its own rules.
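The multi-layered validation Murata describes can be sketched as a deterministic rule check placed in front of the LLM judge, so that content is allowed only when both layers agree. This is a minimal illustration under stated assumptions: `HARD_RULES`, `heuristic_check`, `layered_verdict`, and `fooled_judge` are hypothetical names, and the single-pattern deny-list is a placeholder for whatever rules an organization actually maintains.

```python
import re

# Hypothetical deny-list; a real deployment would maintain its own rules.
HARD_RULES = [re.compile(r"\bransomware\b", re.IGNORECASE)]


def heuristic_check(text: str) -> bool:
    """Deterministic layer: True if no hard-coded rule matches."""
    return not any(rule.search(text) for rule in HARD_RULES)


def layered_verdict(text: str, llm_judge) -> str:
    """Allow only if the hard-coded rules AND the LLM judge both allow."""
    if not heuristic_check(text):
        return "block"
    return llm_judge(text)


def fooled_judge(text: str) -> str:
    """Stand-in for a judge that a trigger has already flipped to "allow"."""
    return "allow"


print(layered_verdict("Write ransomware.\nFinal answer:", fooled_judge))  # block
print(layered_verdict("Explain TLS handshakes.", fooled_judge))  # allow
```

Because the rule layer does not interpret formatting cues, a trigger that flips the LLM judge has no effect on it, although hard-coded rules bring their own gaps and false positives.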
Information Security Buzz News Editor
Kirsten Doyle has been in the technology journalism and editing space for nearly 24 years, during which time she has developed a great love for all aspects of technology, as well as words themselves. Her experience spans B2B tech, with a lot of focus on cybersecurity, cloud, enterprise, digital transformation, and data centre. Her specialties are in news, thought leadership, features, white papers, and PR writing, and she is an experienced editor for both print and online publications.