Jailbreaks (LLM)

From Systems Analysis Wiki
Jump to navigation Jump to search

Jailbreak in the context of large language models (LLMs) is a type of adversarial attack aimed at bypassing built-in security mechanisms and restrictions to elicit prohibited or potentially malicious responses[1]. A jailbreak involves "inducing the model to generate malicious responses that contradict usage policies and societal norms by crafting adversarial prompts"[2].

The fundamental vulnerability exploited by jailbreak attacks lies in an architectural feature of LLMs: the models cannot distinguish between instructions and data by their type, as both system prompts and user input share the same format—natural language text strings[3].

History and Evolution

Early Period: Prompt Injections (2022)

The first documented discovery of the prompt injection vulnerability occurred in May 2022, when researchers at the company Preamble found that ChatGPT was susceptible to such attacks. In September 2022, Riley Goodside independently published the first public demonstration of GPT-3's vulnerability on Twitter, with a famous example where the model was instructed to ignore its previous instructions[4].

The DAN Era (2022–2023)

In mid-2022, the first "Do Anything Now" (DAN) prompts emerged, which were role-playing instructions. The key innovation was the use of role-playing to bypass security restrictions by creating an "alternate persona" free from rules[5]. The evolution of DAN led to complex scenarios involving token systems (punishment/reward mechanisms) and character persistence mechanisms[6].

Diversification of Methods (2023–2024)

Starting in 2023, comprehensive academic research into jailbreak attacks began. In 2024, multimodal attacks appeared, involving the hiding of malicious instructions in images, audio files, and visual prompt injections via ASCII art[7].

Modern Period (2024–2025)

Attack techniques continue to grow more sophisticated. In November 2024, the "Time Bandit" technique was discovered, which exploits temporal confusion in ChatGPT-4o by framing questions as if they were from historical periods (the 1800s-1900s)[8].

Technical Methods and Classification

Attacks can be classified based on access to the model:

  • Black-box attacks: Without access to the model's internal components (parameters, gradients).
  • White-box attacks: With full access to the model's parameters and gradients[2].

JailbreakRadar Taxonomy

The JailbreakRadar classification (Chu et al., 2024) identifies six main categories of attacks:

  1. Direct attacks: Direct malicious prompts.
  2. Indirect attacks: Multi-step manipulation strategies.
  3. Contextual attacks: Using conversation history.
  4. Role-playing attacks: Character impersonation techniques (e.g., DAN).
  5. Encoding attacks: Obfuscation methods to hide malicious instructions.
  6. Template attacks: Structured adversarial frameworks[9].

Technical Mechanisms

  • Adversarial Suffix Generation (GCG): A method proposed by Zou et al. (2023) that automatically generates adversarial suffixes (sequences of tokens) which, when appended to a prompt, have a high probability of eliciting a malicious response. The method uses gradient-based optimization and demonstrates high success rates (up to 84% on GPT-4) and transferability across models[10].
  • Many-shot Jailbreaking: Research by Anthropic (2024) showed that attack effectiveness follows a power law: as the number of malicious examples in the prompt increases, the percentage of undesirable responses grows[11].

Defense Mechanisms

  • Constitutional Classifiers (Anthropic): Filtering input/output data based on a set of constitutional principles. This method reduced jailbreak success rates from 86% to 4.4% in controlled evaluations[12].
  • Reinforcement Learning from Human Feedback (RLHF): A three-stage training process (OpenAI), involving supervised fine-tuning, training a reward model, and policy optimization, has shown a significant reduction in the generation of toxic content.
  • Adversarial Training: Training the model on examples of jailbreak attacks to enhance its robustness. The effectiveness of this approach in reducing attack success rates is estimated at 60–80%[1].
  • Multi-layered Defense: A recommended strategy that includes input validation, model-level protection, output monitoring, and continuous real-time monitoring.

Jailbreak attacks on large language models represent a fundamental AI safety problem, demonstrating the ongoing tension between model capabilities and alignment. The attack landscape is constantly becoming more complex, evolving from simple prompt injections to sophisticated multimodal and automated attacks. Research shows that no current defense mechanism is completely robust against all jailbreak attempts. Success in this area requires continuous investment in safety research, responsible disclosure practices, and collaborative efforts among researchers, industry, and regulators.

Literature

  • Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
  • Zou, A. et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043.
  • Shen, X. et al. (2023). “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. arXiv:2308.03825.
  • Chao, P. et al. (2024). JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. arXiv:2404.01318.
  • Liao, Z.; Sun, H. (2024). AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs. OpenReview UfqzXg95I5.
  • Yi, S. et al. (2024). Jailbreak Attacks and Defenses Against Large Language Models: A Survey. arXiv:2407.04295.
  • Chu, J. et al. (2025). JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs. arXiv:2402.05668.
  • Liu, A. et al. (2025). PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization. arXiv:2504.01444.
  • Ghosal, D. et al. (2025). Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Filtering. CVPR 2025. PDF.
  • Yan, Q. et al. (2025). Hidden in Plain Sight: Probing Implicit Reasoning in Multimodal Language Models. arXiv:2506.00258.
  • Liu, Y. et al. (2025). RePD: Defending Jailbreak Attack through a Retrieval-Based Detector. Findings of NAACL 2025. ACL Anthology.

Notes

  1. 1.0 1.1 “A brief history of jailbreaking”. Lil'Log. [1]
  2. 2.0 2.1 Yi, J., et al. “Jailbreak Attacks and Defenses Against Large Language Models: A Comprehensive Survey”. arXiv:2405.09443. [2]
  3. “Jailbreaking LLMs”. Prompting Guide. [3]
  4. “Exploring prompt injection attacks”. NCC Group. [4]
  5. “Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models”. arXiv:2308.03825. [5]
  6. “0xk1h0/ChatGPT_DAN”. GitHub. [6]
  7. “Hiding in Plain Sight: Multimodal Jailbreaking of Large Language Models”. HiddenLayer. [7]
  8. “ChatGPT "Time-travel" jailbreak lets you bypass its safety guards”. BleepingComputer. [8]
  9. Chu, Z., et al. “JailbreakRadar: A Comprehensive Benchmark for Jailbreak Attack and Defense”. arXiv:2402.12642. [9]
  10. Zou, A., et al. “Universal and Transferable Adversarial Attacks on Aligned Language Models”. arXiv:2307.15043. [10]
  11. “Many-shot Jailbreaking”. Anthropic. [11]
  12. “How we're using 'constitutional AI' to make our models safer”. MIT Technology Review. [12]