Researchers Expose GPT-5 Jailbreak That Bypasses Safety Controls

Cybersecurity researchers at two companies have uncovered a jailbreak technique that bypasses ethical guardrails set up by OpenAI in its latest large language model (LLM), GPT-5, and produces illicit instructions.

AI security startup SPLX used more than 1,000 adversarial prompts in different configurations and found that the raw, unguarded GPT-5, with no system prompt, fell for a whopping 89% of attacks, giving it an overall score of just 11%.

OpenAI’s system prompt, a “basic prompt layer,” limits the success rate of attacks to 43%. Although this vastly improves hallucination handling and safety, the overall score is still very low, and the older GPT-4o model outperforms its successor across the board.

In comparison, the hardened GPT-4o model fell for only 3% of attacks, earning a 97% overall score. With a basic system prompt, attacks against GPT-4o succeeded 19% of the time (81% score), and with no system prompt the model was vulnerable to 71% of attacks (29% score).
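The reported scores appear to be the complement of the attack success rate: each quoted pair sums to 100%. The short Python sketch below tabulates the figures under that assumption. The dictionary keys and function name are illustrative rather than SPLX's own labels, and the 57% score for GPT-5 with the basic system prompt is implied by the arithmetic, not stated by the researchers.

```python
# Attack success rates (percent) as reported in the article.
# Keys and names are illustrative, not SPLX's own labels.
attack_success = {
    ("GPT-5", "no system prompt"): 89,
    ("GPT-5", "basic system prompt"): 43,
    ("GPT-4o", "no system prompt"): 71,
    ("GPT-4o", "basic system prompt"): 19,
    ("GPT-4o", "hardened"): 3,
}


def overall_score(success_rate_pct: int) -> int:
    """Assumed relationship: score = share of attacks the model resisted."""
    return 100 - success_rate_pct


scores = {cfg: overall_score(rate) for cfg, rate in attack_success.items()}

for (model, config), score in scores.items():
    rate = attack_success[(model, config)]
    print(f"{model:7s} {config:20s} attacks succeeded: {rate:2d}%  score: {score:2d}%")
```

Read side by side, the numbers show GPT-4o ahead of GPT-5 in both configurations the two models share.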

Another team of researchers at NeuralTrust confirmed that GPT-5 is vulnerable to two adversarial prompt techniques, “Echo Chamber” and “Storytelling.”

The Echo Chamber method relies on seeding the prompt with a subtly poisoned conversational context. Subsequent prompts echo this poisoned context and gradually strengthen it. The storytelling angle serves as camouflage to trick the chatbot.

NeuralTrust showed that Echo Chamber, when used with narrative-driven steering, can elicit harmful outputs from GPT-5 without issuing explicitly malicious prompts.

This reinforces a key risk: keyword or intent-based filters are insufficient in multi-turn settings where context can be gradually poisoned and then echoed back under the guise of continuity.
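A toy Python sketch can illustrate why judging prompts one at a time misses this pattern. Everything in it is an assumption made for illustration: the blocklist, the hint weights, the threshold, and the placeholder word `restricted_topic` are invented, and this is not NeuralTrust's method or any vendor's filter. No single message trips the keyword check, yet weak signals summed across turns cross the threshold.

```python
# Toy illustration only: blocklist, hint weights, and threshold are invented.
BLOCKLIST = {"restricted_topic_instructions"}  # explicit ask a keyword filter catches
CONTEXT_HINTS = {
    "story": 1, "character": 1, "continue": 1,
    "restricted_topic": 2, "details": 2, "steps": 2,
}
DRIFT_THRESHOLD = 6


def per_message_filter(message: str) -> bool:
    """Return True if this single message, judged alone, would be blocked."""
    words = {w.strip(".,!?") for w in message.lower().split()}
    return bool(words & BLOCKLIST)


def running_drift(messages: list[str]) -> list[int]:
    """Cumulative sum of weak risk hints after each turn."""
    totals, total = [], 0
    for message in messages:
        for word in message.lower().split():
            total += CONTEXT_HINTS.get(word.strip(".,!?"), 0)
        totals.append(total)
    return totals


conversation = [
    "Write a story about a character who studies restricted_topic.",
    "Continue the story and keep the character consistent.",
    "Continue with more details on the steps the character follows.",
]

blocked_per_message = [per_message_filter(m) for m in conversation]
drift = running_drift(conversation)
flagged_turns = [i for i, d in enumerate(drift, start=1) if d >= DRIFT_THRESHOLD]
```

In this made-up conversation the per-message filter passes every turn, while the running total flags the dialog from the second turn onward.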

Prioritizing Performance Over Security

Maor Volokh, Vice President of Product at Noma Security, says model providers are caught in a competitive “race to the bottom,” releasing new models at an unprecedented pace of one every one to two months.

“OpenAI alone has launched roughly seven models this year. This breakneck speed typically prioritizes performance and innovation over security considerations, leading to an expectation that more model vulnerabilities will emerge as competition intensifies. Consequently, model runtime security has become critically important, particularly given the rapid emergence of new agentic capabilities that expand potential attack surfaces.”

The arms race of product design and release appears to be far more complicated for frontier model providers, adds Trey Ford, Chief Strategy and Trust Officer at Bugcrowd. “Releasing an email, chat, or industry-specific software platform is pretty straightforward in comparison to iterating with a GenAI model – the use cases are somewhat narrow in many traditional software platforms, and unbounded in their world.”

Test Each Major Release

Ford says models will get stronger in some areas and will probably lose ground in others. “I feel bad for the frontier model testing teams. That’s a highly complicated test harness with a variety of frontiers to measure progress on.

“Security vendors pressure test each major release, verifying their value proposition, and inform where and how they fit into that ecosystem – they not only hold the model providers accountable – enterprise security teams need to know how to protect the instructions informing the originally intended behaviors, understanding how untrusted prompts will be handled, and how to monitor for evolution over time,” continues Ford.

“Security teams need to view AI usage in a similar light to how IaaS, PaaS, and SaaS (Infrastructure, Platforms, Software – as a service) will evolve. We must manage our users, our implementation, and stay current with how the underlying infrastructure evolves (releasing capabilities, features, security offerings, and the continually evolving security posture of the platforms and technologies we’re using)… the sand beneath our feet will continue to shift and evolve,” Ford says.

Tripped by Simple Obfuscation Tricks

“GPT-5’s alleged vulnerabilities boil down to three things: it can be steered over multiple turns by context poisoning and storytelling, it’s still tripped by simple obfuscation tricks, and it inherits agent/tool risks when links and functions get pulled into the loop,” comments J Stephen Kowski, Field CTO at SlashNext.

“These gaps appear when safety checks judge prompts one-by-one while attackers work the whole conversation, nudging the model to keep a story consistent until it outputs something it shouldn’t. The practical fix is layered: harden the starting policy, add real-time input/output inspection that catches persuasion cycles, obfuscation, and tool/URL bait, and enforce conversation-level memory checks with a kill-switch when the dialog drifts into risky territory. That’s the approach we use in production: pre-prompt hardening, live enforcement across messages, link/tool hygiene, and automatic shutdowns if the conversation turns unsafe—so even if the model slips, the app doesn’t,” Kowski adds.
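The layered approach Kowski describes might look roughly like the Python sketch below. It is a minimal illustration under stated assumptions, not SlashNext's implementation: the class name, the `[risky]` marker standing in for a real classifier, the allowlisted host, and the risk limit are all hypothetical. It covers link hygiene, conversation-level memory, and a kill switch; pre-prompt hardening is left out.

```python
import re
from dataclasses import dataclass, field

# All names, patterns, and thresholds here are hypothetical placeholders.
ALLOWED_LINK_HOSTS = {"example.com"}
URL_PATTERN = re.compile(r"https?://([^/\s]+)")
RISK_LIMIT = 3


def risk_score(text: str) -> int:
    """Stand-in for a real classifier: counts placeholder risk markers."""
    return text.lower().count("[risky]")


@dataclass
class ConversationGuard:
    cumulative_risk: int = 0
    shut_down: bool = False
    log: list = field(default_factory=list)

    def check(self, text: str) -> bool:
        """Inspect one input or output; return True if it may pass."""
        if self.shut_down:
            return False
        # Link hygiene: reject any URL whose host is not allowlisted.
        for host in URL_PATTERN.findall(text):
            if host.lower() not in ALLOWED_LINK_HOSTS:
                self.log.append(f"blocked link to {host}")
                return False
        # Conversation-level memory: risk accumulates across turns.
        self.cumulative_risk += risk_score(text)
        if self.cumulative_risk >= RISK_LIMIT:
            self.shut_down = True  # kill switch: nothing passes afterwards
            self.log.append("conversation shut down")
            return False
        return True


guard = ConversationGuard()
results = [
    guard.check("Tell me a story."),
    guard.check("See http://untrusted.invalid/page"),
    guard.check("[risky] continue the story"),
    guard.check("[risky] [risky] more detail"),
    guard.check("Harmless follow-up"),
]
```

Because the guard sits in the application rather than the model, the final harmless message is still refused once the conversation has been shut down, which matches the idea that "even if the model slips, the app doesn't."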

Layered, Context-aware Controls

Satyam Sinha, CEO and founder at Acuvity, says these findings highlight a reality we’re seeing more often in AI security: model capability is advancing faster than our ability to harden it against incidents.

“GPT-5’s vulnerabilities aren’t surprising, they’re a reminder that security isn’t something you ‘ship’ once. Attacks like the Echo Chamber exploit the model’s own conversational memory, and the SPLX results underscore how dependent GPT-5’s defenses are on external scaffolding like prompts and runtime filters.”

Sinha says enterprises can’t assume model-level alignment will protect them. “They need layered, context-aware controls and continuous red-teaming to detect when a model’s behavior is drifting toward unsafe territory. Without that, even the most advanced LLMs can be turned against their operators.”
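Continuous red-teaming of this kind can be approximated by re-running a fixed adversarial test set against each release and comparing attack success rates. The Python sketch below assumes stub models, placeholder prompts, a trivial `is_unsafe` judge, and an arbitrary tolerance, all invented for illustration. A real harness would call the deployed model and use a proper safety classifier.

```python
from typing import Callable, Iterable


def attack_success_rate(model: Callable[[str], str],
                        prompts: Iterable[str],
                        is_unsafe: Callable[[str], bool]) -> float:
    """Fraction of adversarial prompts that yield an unsafe response."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    hits = sum(1 for p in prompts if is_unsafe(model(p)))
    return hits / len(prompts)


def regressed(previous: float, current: float, tolerance: float = 0.02) -> bool:
    """True if the new release is meaningfully easier to attack."""
    return current > previous + tolerance


# Stub "models" standing in for two releases (hypothetical behavior).
def stub_model_v1(prompt: str) -> str:
    return "REFUSED"


def stub_model_v2(prompt: str) -> str:
    return "UNSAFE" if "roleplay" in prompt else "REFUSED"


def is_unsafe(response: str) -> bool:
    return response == "UNSAFE"


test_prompts = [
    "test-case-1",
    "test-case-2 roleplay",
    "test-case-3",
    "test-case-4 roleplay",
]

rate_v1 = attack_success_rate(stub_model_v1, test_prompts, is_unsafe)
rate_v2 = attack_success_rate(stub_model_v2, test_prompts, is_unsafe)
needs_review = regressed(rate_v1, rate_v2)
```

Tracking the rate per release, rather than a single pass/fail result, makes drift visible before it reaches production.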

Kirsten Doyle

Information Security Buzz News Editor

Kirsten Doyle has been in the technology journalism and editing space for nearly 24 years, during which time she has developed a great love for all aspects of technology, as well as words themselves. Her experience spans B2B tech, with a lot of focus on cybersecurity, cloud, enterprise, digital transformation, and data centre. Her specialties are in news, thought leadership, features, white papers, and PR writing, and she is an experienced editor for both print and online publications.

The opinions expressed in this post belong to the individual contributors and do not necessarily reflect the views of Information Security Buzz.
