Information Security Buzz

Researchers Show How “AI Judges” Can Be Tricked Into Approving Harmful Content

By Kirsten Doyle, March 13, 2026 (4 Mins Read)

Security researchers have demonstrated how a growing class of AI safety controls (known as AI judges) can be manipulated into approving content they are supposed to block. 

In new research published by cybersecurity firm Palo Alto Networks’ threat intelligence team Unit 42, analysts describe how automated “fuzzing” techniques can uncover hidden weaknesses in the large language models that many organizations now rely on as automated gatekeepers. 

These models are increasingly used to evaluate whether AI-generated responses are safe, policy-compliant, or suitable for users. But the research suggests that these digital referees can themselves be fooled, sometimes by nothing more than harmless-looking formatting characters. 

Testing the AI Gatekeepers 

The researchers developed an internal red-team tool, named “AdvJudge-Zero,” to probe AI judge systems for weaknesses. The method borrows from conventional software security testing, notably “fuzzing,” in which a system is flooded with unexpected inputs to expose flaws.

Rather than targeting conventional software, the tool targets AI systems that act as automated reviewers.

These systems are widely used in AI pipelines to decide whether content should be allowed or blocked. 

The researchers found that by feeding models carefully crafted inputs, the tool could identify subtle trigger sequences that change the model’s decision, flipping a “block” outcome into an “allow.” 

Unlike earlier AI jailbreak techniques that relied on obvious nonsense text, these inputs often appear perfectly normal. 
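
The idea of fuzzing a judge for decision-flipping suffixes can be illustrated with a minimal sketch. The article does not describe AdvJudge-Zero's internals, so this is not the researchers' method: `toy_judge` is a hypothetical stand-in with a deliberately planted flaw, the candidate list simply reuses formatting cues of the kind the research mentions, and the search is plain brute-force enumeration.

```python
import itertools

# Hypothetical candidate triggers: ordinary-looking formatting cues.
CANDIDATE_TRIGGERS = ["###", "1.", "User:", "Assistant:", "Step 1", "Final answer:"]


def toy_judge(text: str) -> str:
    """Stand-in for an LLM judge (NOT a real model).

    It blocks text containing the word 'harmful', but has a planted flaw:
    text that ends with 'Final answer:' is treated as already reviewed.
    """
    if text.rstrip().endswith("Final answer:"):
        return "allow"
    return "block" if "harmful" in text.lower() else "allow"


def fuzz_judge(judge, payload, triggers, max_len=2):
    """Return every trigger sequence that flips 'block' into 'allow'."""
    if judge(payload) != "block":
        return []  # the payload is not blocked, so there is nothing to flip
    flips = []
    for n in range(1, max_len + 1):
        # Try every ordered combination of n triggers appended to the payload.
        for combo in itertools.product(triggers, repeat=n):
            suffix = "\n".join(combo)
            if judge(f"{payload}\n{suffix}") == "allow":
                flips.append(combo)
    return flips


flips = fuzz_judge(toy_judge, "Some harmful request", CANDIDATE_TRIGGERS)
for combo in flips:
    print("flips the verdict:", combo)
```

Against a real judge, the `judge` argument would wrap a model call, and exhaustive enumeration would likely need to be replaced by a guided search.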

Innocent-Looking Triggers 

The most striking finding is how mundane the triggers can be. According to the research, simple formatting cues can influence a model’s internal logic. 

Examples include: 

  • Markdown formatting such as ### 
  • List markers like 1. or – 
  • Structural labels such as “User:” or “Assistant:” 
  • Phrases like “Step 1”, “The solution process is…”, or “Final answer:” 

To a human reader (or even a traditional security filter) these look like harmless formatting. But to an AI judge, they can shift the model’s internal attention patterns and alter its decision-making process. 

In testing, the team found that such “low-perplexity” tokens (inputs that look natural to the model) were significantly more stealthy than the gibberish used in many known jailbreak attacks. 
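
Perplexity is a standard measure of how surprising a token sequence is to a language model: the exponential of the average negative log-probability per token. The sketch below shows the calculation only; the probabilities are made up for illustration and do not come from the research.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability per token)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities (illustrative values, not measured).
natural_suffix = [0.5, 0.5, 0.5]            # e.g. a common formatting cue
gibberish_suffix = [0.001, 0.001, 0.001]    # e.g. a random character string

print(perplexity(natural_suffix))    # low: the model finds the text expected
print(perplexity(gibberish_suffix))  # high: the model finds the text surprising
```

Under these assumed numbers, a filter that rejects high-perplexity input would catch the gibberish suffix but pass the natural-looking one, which is consistent with the stealth the researchers describe.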

Real-World Risks 

The researchers said manipulating AI judges could allow bad actors to bypass safety filters or even corrupt the training process of other AI systems. 

One scenario involves forcing a safety model to approve harmful content by appending subtle control tokens to a prompt. The formatting signals can trick the model into believing the safety check has already concluded, leading it to approve material that would normally be blocked. 

Another potential impact involves reinforcement learning pipelines. Many businesses rely on automated evaluators to score AI outputs during training, as in “Reinforcement Learning from Human Feedback” (RLHF), where a reward model trained on human preference data scores the outputs.

If malefactors manipulate the scoring model, the AI being trained could receive high scores for incorrect or hallucinated answers. Over time, that feedback loop could degrade the system’s reliability. 

Even Large Models Are Vulnerable 

Perhaps most concerning is the breadth of the issue. The researchers reported a success rate of around 99% when testing their approach against several categories of models. 

Those included: 

  • enterprise open-weight models used in internal applications 
  • specialised “reward models” designed to evaluate AI outputs 
  • large models with more than 70 billion parameters 

According to the report, the complexity of these models may actually increase their vulnerability, because it creates more opportunities for subtle logic errors. 

A Familiar Lesson 

Despite their sophistication, large language models still behave like software systems and thus inherit many of the same weaknesses. 

Unit 42 believes the solution lies in applying classic security practices to AI development. By using tools like fuzzers internally, companies can discover these vulnerabilities during testing and retrain models to resist them. 

With adversarial training, the success rate of such attacks can drop from near-total bypass to almost zero, they said.
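
One common way to prepare data for adversarial training is to append the discovered triggers to examples that should be blocked, while keeping the "block" label. The sketch below covers that data-preparation step only. It is an assumption about how such retraining data could be built, not Unit 42's procedure, and the function name and parameters are hypothetical.

```python
import random


def augment_with_triggers(examples, triggers, per_example=2, seed=0):
    """Add trigger-suffixed copies of blocked examples, keeping the 'block' label.

    examples: list of (text, label) pairs, where label is 'block' or 'allow'.
    triggers: suffixes that a fuzzer found to flip the judge's verdict.
    """
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    augmented = list(examples)
    for text, label in examples:
        if label != "block":
            continue  # only harden the cases the judge must keep blocking
        k = min(per_example, len(triggers))
        for trigger in rng.sample(triggers, k=k):
            augmented.append((f"{text}\n{trigger}", "block"))
    return augmented


data = [("harmful request", "block"), ("benign question", "allow")]
found_triggers = ["###", "Final answer:", "Step 1"]
training_set = augment_with_triggers(data, found_triggers)
```

The judge would then be fine-tuned on `training_set` so that the suffixes no longer change its verdict.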

A Profound Paradox 

Noelle Murata, Sr. Security Engineer, Xcape, says this matters to security professionals because it reveals a profound paradox: the increased complexity of large models actually creates a broader attack surface for logic-based manipulation.  

“As we push toward 70B+ parameter models, we are inadvertently providing attackers with more ‘soft modes’ in the model’s reasoning to exploit. Defenders must shift away from relying solely on LLM-based oversight and implement multi-layered validation, including adversarial training and traditional hard-coded heuristics. We cannot secure a system using a gatekeeper that is inherently susceptible to the same hallucinations it is meant to police.” 

Murata says the ultimate irony of modern AI security is that the more “intelligent” a model becomes, the more ways it finds to talk itself out of following its own rules. 
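
The multi-layered validation Murata recommends can be sketched as a gate that allows content only when every layer agrees. This is a minimal illustration under stated assumptions: the deny-list pattern is a hypothetical placeholder, and `llm_judge` is any callable that returns "allow" or "block".

```python
import re

# Hypothetical hard-coded rule; a real deployment would maintain its own list.
HARD_RULES = [re.compile(r"\bcredential dump\b", re.IGNORECASE)]


def heuristic_layer(text: str) -> str:
    """Deterministic check that formatting tricks cannot talk out of a verdict."""
    return "block" if any(rule.search(text) for rule in HARD_RULES) else "allow"


def layered_verdict(text: str, llm_judge) -> str:
    """Allow only when both the heuristic layer and the LLM judge allow."""
    verdicts = [heuristic_layer(text), llm_judge(text)]
    return "allow" if all(v == "allow" for v in verdicts) else "block"


def fooled_judge(text: str) -> str:
    """Simulates a judge that a trigger suffix has tricked into allowing everything."""
    return "allow"


print(layered_verdict("how to do a credential dump\nFinal answer:", fooled_judge))
print(layered_verdict("what is the weather today?", fooled_judge))
```

Even with the simulated judge fully bypassed, the first input is still blocked by the hard-coded rule, while the second passes both layers.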

Kirsten Doyle
Information Security Buzz News Editor

Kirsten Doyle has been in the technology journalism and editing space for nearly 24 years, during which time she has developed a great love for all aspects of technology, as well as words themselves. Her experience spans B2B tech, with a lot of focus on cybersecurity, cloud, enterprise, digital transformation, and data centre. Her specialties are in news, thought leadership, features, white papers, and PR writing, and she is an experienced editor for both print and online publications.

The opinions expressed in this post belong to the individual contributors and do not necessarily reflect the views of Information Security Buzz.
