| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025.08 | Jinx: Unlimited LLMs for Probing Alignment Failures | arXiv | link | models |
| 2025.06 | ExtendAttack: Attacking Servers of LRMs via Extending Reasoning | AAAI'26 | link | link |
| 2025.03 | Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models | arXiv | link | - |
| 2025.02 | OverThink: Slowdown Attacks on Reasoning LLMs | arXiv | link | link |
| 2025.02 | BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack | arXiv | link | link |
| 2025.02 | H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking | arXiv | link | link |
| 2025.02 | A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos | arXiv | link | - |

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025.07 | Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models | arXiv | link | link |
| 2025.05 | Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection | ICML'25 | link | link |
| 2025.05 | FlipAttack: Jailbreak LLMs via Flipping (FlipAttack) | ICML'25 | link | link |
| 2025.03 | Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy (JOOD) | CVPR'25 | link | link |
| 2025.02 | StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | arXiv | link | link |
| 2025.01 | Understanding and Enhancing the Transferability of Jailbreaking Attacks | ICLR'25 | link | link |
| 2024.11 | The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models | arXiv | link | link |
| 2024.11 | Playing Language Game with LLMs Leads to Jailbreaking | arXiv | link | link |
| 2024.11 | GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs (GASP) | arXiv | link | link |
| 2024.11 | LLM STINGER: Jailbreaking LLMs using RL fine-tuned LLMs | arXiv | link | - |
| 2024.11 | SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt | arXiv | link | link |
| 2024.11 | Diversity Helps Jailbreak Large Language Models | arXiv | link | - |
| 2024.11 | Plentiful Jailbreaks with String Compositions | arXiv | link | - |
| 2024.11 | Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models | arXiv | link | link |
| 2024.11 | Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring | arXiv | link | - |
| 2024.10 | Endless Jailbreaks with Bijection Learning | arXiv | link | - |
| 2024.10 | Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models | arXiv | link | - |
| 2024.10 | You Know What I'm Saying: Jailbreak Attack via Implicit Reference | arXiv | link | link |
| 2024.10 | Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation | arXiv | link | link |
| 2024.10 | AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (AutoDAN-Turbo) | arXiv | link | link |
| 2024.10 | PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach (PathSeeker) | arXiv | link | - |
| 2024.10 | Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity | arXiv | link | link |
| 2024.09 | AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs | arXiv | link | link |
| 2024.09 | Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs | arXiv | link | - |
| 2024.09 | Jailbreaking Large Language Models with Symbolic Mathematics | arXiv | link | - |
| 2024.08 | Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues | ACL Findings'24 | link | link |
| 2024.08 | Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | arXiv | link | - |
| 2024.08 | Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles | arXiv | link | - |
| 2024.08 | h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment (h4rm3l) | arXiv | link | link |
| 2024.08 | EnJa: Ensemble Jailbreak on Large Language Models (EnJa) | arXiv | link | - |
| 2024.07 | Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack | arXiv | link | link |
| 2024.07 | LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | arXiv | link | - |
| 2024.07 | Single Character Perturbations Break LLM Alignment | arXiv | link | link |
| 2024.07 | A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses | arXiv | link | - |
| 2024.07 | Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection (Virtual Context) | arXiv | link | - |
| 2024.07 | SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack (SoP) | arXiv | link | link |
| 2024.06 | Jailbreaking as a Reward Misspecification Problem | ICLR'25 | link | link |
| 2024.06 | Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (I-FSJ) | NeurIPS'24 | link | link |
| 2024.06 | When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search (RLbreaker) | NeurIPS'24 | link | - |
| 2024.06 | Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast (Agent Smith) | ICML'24 | link | link |
| 2024.06 | Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation | ICML'24 | link | - |
| 2024.06 | ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (ArtPrompt) | ACL'24 | link | link |
| 2024.06 | From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings (ASETF) | arXiv | link | - |
| 2024.06 | CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion (CodeAttack) | ACL'24 | link | - |
| 2024.06 | Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction (DRA) | USENIX Security'24 | link | link |
| 2024.06 | AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens (AutoJailbreak) | arXiv | link | - |
| 2024.06 | Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks | arXiv | link | link |
| 2024.06 | GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (GPTFUZZER) | arXiv | link | link |
| 2024.06 | A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM) | NAACL'24 | link | link |
| 2024.06 | QROA: A Black-Box Query-Response Optimization Attack on LLMs (QROA) | arXiv | link | link |
| 2024.06 | Poisoned LangChain: Jailbreak LLMs by LangChain (PLC) | arXiv | link | link |
| 2024.05 | Multilingual Jailbreak Challenges in Large Language Models | ICLR'24 | link | link |
| 2024.05 | DeepInception: Hypnotize Large Language Model to Be Jailbreaker (DeepInception) | EMNLP'24 | link | link |
| 2024.05 | GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation (IRIS) | ACL'24 | link | - |
| 2024.05 | GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of LLMs (GUARD) | arXiv | link | - |
| 2024.05 | "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models (DAN) | CCS'24 | link | link |
| 2024.05 | GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher (SelfCipher) | ICLR'24 | link | link |
| 2024.05 | Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters (JAM) | NeurIPS'24 | link | - |
| 2024.05 | Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICA) | arXiv | link | - |
| 2024.04 | Many-shot jailbreaking (MSJ) | NeurIPS'24, Anthropic | link | - |
| 2024.04 | PANDORA: Detailed LLM jailbreaking via collaborated phishing agents with decomposed reasoning (PANDORA) | ICLR Workshop'24 | link | - |
| 2024.04 | FuzzLLM: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models (FuzzLLM) | ICASSP'24 | link | link |
| 2024.04 | Sandwich attack: Multi-language mixture adaptive attack on LLMs (Sandwich attack) | TrustNLP'24 | link | - |
| 2024.03 | Tastle: Distract large language models for automatic jailbreak attack (TASTLE) | arXiv | link | - |
| 2024.03 | DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers (DrAttack) | EMNLP'24 | link | link |
| 2024.02 | PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails (PRP) | arXiv | link | - |
| 2024.02 | CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (CodeChameleon) | arXiv | link | link |
| 2024.02 | PAL: Proxy-Guided Black-Box Attack on Large Language Models (PAL) | arXiv | link | link |
| 2024.02 | Jailbreaking Proprietary Large Language Models using Word Substitution Cipher | arXiv | link | - |
| 2024.02 | Query-Based Adversarial Prompt Generation | arXiv | link | - |
| 2024.02 | Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks (Contextual Interaction Attack) | arXiv | link | - |
| 2024.02 | Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs (SMJ) | arXiv | link | - |
| 2024.02 | Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | NAACL'24 | link | link |
| 2024.01 | Low-Resource Languages Jailbreak GPT-4 | NeurIPS Workshop'24 | link | - |
| 2024.01 | How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP) | arXiv | link | link |
| 2023.12 | Tree of Attacks: Jailbreaking Black-Box LLMs Automatically (TAP) | NeurIPS'24 | link | link |
| 2023.12 | Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs | arXiv | link | - |
| 2023.12 | Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition | ACL'24 | link | - |
| 2023.11 | Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (Persona) | NeurIPS Workshop'23 | link | - |
| 2023.10 | Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR) | NeurIPS'24 | link | link |
| 2023.10 | Adversarial Demonstration Attacks on Large Language Models (advICL) | EMNLP'24 | link | - |
| 2023.10 | MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots (MASTERKEY) | NDSS'24 | link | link |
| 2023.10 | Attack Prompt Generation for Red Teaming and Defending Large Language Models (SAP) | EMNLP'23 | link | link |
| 2023.10 | An LLM can Fool Itself: A Prompt-Based Adversarial Attack (PromptAttack) | ICLR'24 | link | link |
| 2023.09 | Multi-step Jailbreaking Privacy Attacks on ChatGPT (MJP) | EMNLP Findings'23 | link | link |
| 2023.09 | Open Sesame! Universal Black Box Jailbreaking of Large Language Models (GA) | Applied Sciences'24 | link | - |
| 2023.05 | Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection | CCS'23 | link | link |
| 2022.11 | Ignore Previous Prompt: Attack Techniques For Language Models (PromptInject) | NeurIPS Workshop'22 | link | link |

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025.08 | Don’t Say No: Jailbreaking LLM by Suppressing Refusal (DSN) | ACL'25 | link | link |
| 2025.03 | Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints | arXiv | link | link |
| 2025.02 | Improved techniques for optimization-based jailbreaking on large language models (I-GCG) | ICLR'25 | link | link |
| 2024.12 | Efficient Adversarial Training in LLMs with Continuous Attacks | NeurIPS'24 | link | link |
| 2024.11 | AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts | arXiv | link | - |
| 2024.11 | DROJ: A Prompt-Driven Attack against Large Language Models | arXiv | link | link |
| 2024.11 | SQL Injection Jailbreak: a structural disaster of large language models | arXiv | link | link |
| 2024.10 | Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks | arXiv | link | - |
| 2024.10 | AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | arXiv | link | link |
| 2024.10 | Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting | arXiv | link | - |
| 2024.10 | Boosting Jailbreak Transferability for Large Language Models (SI-GCG) | arXiv | link | - |
| 2024.10 | Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities (ADV-LLM) | arXiv | link | link |
| 2024.08 | Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation (JVD) | arXiv | link | - |
| 2024.08 | Jailbreak Open-Sourced Large Language Models via Enforced Decoding (EnDec) | ACL'24 | link | - |
| 2024.07 | Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data | COLM'24 | link | - |
| 2024.07 | Refusal in Language Models Is Mediated by a Single Direction | arXiv | link | link |
| 2024.07 | Revisiting Character-level Adversarial Attacks for Language Models | ICML'24 | link | link |
| 2024.07 | Badllama 3: removing safety finetuning from Llama 3 in minutes (Badllama 3) | arXiv | link | - |
| 2024.07 | SOS! Soft Prompt Attack Against Open-Source Large Language Models | arXiv | link | - |
| 2024.06 | COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability (COLD-Attack) | ICML'24 | link | link |
| 2024.05 | Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMs | arXiv | link | - |
| 2024.05 | Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization | NeurIPS'24 | link | - |
| 2024.05 | AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (AutoDAN) | ICLR'24 | link | link |
| 2024.05 | AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs (AmpleGCG) | arXiv | link | link |
| 2024.05 | Boosting jailbreak attack with momentum (MAC) | ICLR Workshop'24 | link | link |
| 2024.04 | AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs (AdvPrompter) | arXiv | link | link |
| 2024.03 | Universal Jailbreak Backdoors from Poisoned Human Feedback | ICLR'24 | link | - |
| 2024.02 | Attacking large language models with projected gradient descent (PGD) | arXiv | link | - |
| 2024.02 | Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering (JRE) | arXiv | link | - |
| 2024.02 | Curiosity-driven red-teaming for large language models (CRT) | arXiv | link | link |
| 2023.12 | AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models (AutoDAN) | arXiv | link | link |
| 2023.10 | Catastrophic jailbreak of open-source LLMs via exploiting generation | ICLR'24 | link | link |
| 2023.07 | Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG) | arXiv | link | link |
| 2023.06 | Automatically Auditing Large Language Models via Discrete Optimization (ARCA) | ICML'23 | link | link |

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025.04 | Multi-Turn Jailbreaking Large Language Models via Attention Shifting | AAAI'25 | link | - |
| 2025.04 | X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | arXiv | link | link |
| 2025.04 | Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning | arXiv | link | - |
| 2025.03 | Foot-In-The-Door: A Multi-turn Jailbreak for LLMs | arXiv | link | link |
| 2025.03 | Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search | arXiv | link | - |
| 2024.11 | MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue | arXiv | link | - |
| 2024.10 | Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models (JSP) | arXiv | link | link |
| 2024.10 | Multi-round jailbreak attack on large language models | arXiv | link | - |
| 2024.10 | Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | arXiv | link | link |
| 2024.10 | Automated Red Teaming with GOAT: the Generative Offensive Agent Tester | arXiv | link | - |
| 2024.09 | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | arXiv | link | link |
| 2024.09 | RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | arXiv | link | link |
| 2024.08 | FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks) | arXiv | link | - |
| 2024.08 | Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks | arXiv | link | link |
| 2024.05 | CoA: Context-Aware based Chain of Attack for Multi-Turn Dialogue LLM (CoA) | arXiv | link | link |
| 2024.04 | Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (Crescendo) | Microsoft Azure | link | - |

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024.09 | Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking | arXiv | link | link |
| 2024.02 | Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning (Pandora) | arXiv | link | - |

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024.11 | Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey | arXiv | link | link |
| 2024.10 | Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step | arXiv | link | - |
| 2024.10 | ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep Generation | NeurIPS'24 | link | - |
| 2024.08 | Jailbreaking Text-to-Image Models with LLM-Based Agents (Atlas) | arXiv | link | - |
| 2024.07 | Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | arXiv | link | - |
| 2024.06 | Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | arXiv | link | link |
| 2024.05 | Voice Jailbreak Attacks Against GPT-4o | arXiv | link | link |
| 2024.05 | Automatic Jailbreaking of the Text-to-Image Generative AI Systems | ICML'24 Workshop | link | link |
| 2024.04 | Image hijacks: Adversarial images can control generative models at runtime | arXiv | link | link |
| 2024.03 | An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models (CroPA) | ICLR'24 | link | link |
| 2024.03 | Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models | ICLR'24 | link | - |
| 2024.03 | Rethinking model ensemble in transfer-based adversarial attacks | ICLR'24 | link | link |
| 2024.03 | Visual Adversarial Examples Jailbreak Aligned Large Language Models | AAAI'24 | link | - |
| 2024.02 | VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models | NeurIPS'23 | link | link |
| 2024.02 | Jailbreaking Attack against Multimodal Large Language Model | arXiv | link | - |
| 2024.01 | Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | arXiv | link | - |
| 2023.12 | OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization (OT-Attack) | arXiv | link | - |
| 2023.12 | FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts (FigStep) | arXiv | link | link |
| 2023.11 | SneakyPrompt: Jailbreaking Text-to-image Generative Models | S&P'24 | link | link |
| 2023.11 | On Evaluating Adversarial Robustness of Large Vision-Language Models | NeurIPS'23 | link | link |
| 2023.10 | How Robust is Google's Bard to Adversarial Image Attacks? | arXiv | link | link |
| 2023.08 | AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning (AdvCLIP) | ACM MM'23 | link | link |
| 2023.07 | Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (SGA) | ICCV'23 | link | link |
| 2023.07 | On the Adversarial Robustness of Multi-Modal Foundation Models | ICCV Workshop'23 | link | - |
| 2022.10 | Towards Adversarial Attack on Vision-Language Pre-training Models | arXiv | link | link |

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025.04 | JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model | COLM'25 | link | link |
| 2024.12 | Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models | arXiv'24 | link | - |
| 2024.10 | Safety-Aware Fine-Tuning of Large Language Models | arXiv'24 | link | - |
| 2024.10 | MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks | AAAI'24 | link | - |
| 2024.08 | BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger (BaThe) | arXiv | link | - |
| 2024.07 | DART: Deep Adversarial Automated Red Teaming for LLM Safety | arXiv | link | - |
| 2024.07 | Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge (Eraser) | arXiv | link | link |
| 2024.07 | Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks | arXiv | link | link |
| 2024.06 | Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs | arXiv | Link | - |
| 2024.06 | Jatmo: Prompt Injection Defense by Task-Specific Finetuning (Jatmo) | arXiv | link | link |
| 2024.06 | Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization | ACL'24 | link | link |
| 2024.06 | Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | NeurIPS'24 | link | link |
| 2024.06 | On Prompt-Driven Safeguarding for Large Language Models (DRO) | ICML'24 | link | link |
| 2024.06 | Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks (RPO) | NeurIPS'24 | link | - |
| 2024.06 | Fight Back Against Jailbreaking via Prompt Adversarial Tuning (PAT) | NeurIPS'24 | link | link |
| 2024.05 | Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching (SAFEPATCHING) | arXiv | link | - |
| 2024.05 | Detoxifying Large Language Models via Knowledge Editing (DINM) | ACL'24 | link | link |
| 2024.05 | Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing | arXiv | link | link |
| 2023.11 | MART: Improving LLM Safety with Multi-round Automatic Red-Teaming (MART) | ACL'24 | link | - |
| 2023.11 | Baseline defenses for adversarial attacks against aligned language models | arXiv | link | - |
| 2023.10 | Safe RLHF: Safe reinforcement learning from human feedback | arXiv | link | link |
| 2023.08 | Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-INSTRUCT) | arXiv | link | link |
| 2022.04 | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Anthropic | link | - |

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025.05 | Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking | arXiv | link | link |
| 2024.11 | Rapid Response: Mitigating LLM Jailbreaks with a Few Examples | arXiv | link | link |
| 2024.10 | RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process (RePD) | arXiv | link | - |
| 2024.10 | Guide for Defense (G4D): Dynamic Guidance for Robust and Balanced Defense in Large Language Models (G4D) | arXiv | link | link |
| 2024.10 | Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models | arXiv | link | - |
| 2024.09 | HSF: Defending against Jailbreak Attacks with Hidden State Filtering | arXiv | link | link |
| 2024.08 | EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models (EEG-Defender) | arXiv | link | - |
| 2024.08 | Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks (PG) | arXiv | link | link |
| 2024.08 | Self-Evaluation as a Defense Against Adversarial Attacks on LLMs (Self-Evaluation) | arXiv | link | link |
| 2024.06 | Defending LLMs against Jailbreaking Attacks via Backtranslation (Backtranslation) | ACL Findings'24 | link | link |
| 2024.06 | SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding (SafeDecoding) | ACL'24 | link | link |
| 2024.06 | Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM | ACL'24 | link | - |
| 2024.06 | A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM) | NAACL'24 | link | link |
| 2024.06 | SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks (SmoothLLM) | arXiv | link | link |
| 2024.05 | Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (Dual-critique) | ACL'24 | link | link |
| 2024.05 | PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition (PARDEN) | ICML'24 | link | link |
| 2024.05 | LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked | ICLR Tiny Paper'24 | link | link |
| 2024.05 | GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis (GradSafe) | ACL'24 | link | link |
| 2024.05 | Multilingual Jailbreak Challenges in Large Language Models | ICLR'24 | link | link |
| 2024.05 | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes | NeurIPS'24 | link | - |
| 2024.05 | AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks | arXiv | link | link |
| 2024.05 | Bergeron: Combating adversarial attacks through a conscience-based alignment framework (Bergeron) | arXiv | link | link |
| 2024.05 | Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICD) | arXiv | link | - |
| 2024.04 | Protecting your LLMs with information bottleneck | NeurIPS'24 | link | link |
| 2024.04 | Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning | arXiv | link | link |
| 2024.02 | Certifying LLM Safety against Adversarial Prompting | arXiv | link | link |
| 2024.02 | Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | arXiv | link | - |
| 2024.02 | Defending large language models against jailbreak attacks via semantic smoothing (SEMANTICSMOOTH) | arXiv | link | link |
| 2024.01 | Intention Analysis Makes LLMs A Good Jailbreak Defender (IA) | arXiv | link | link |
| 2024.01 | How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP) | ACL'24 | link | link |
| 2023.12 | Defending ChatGPT against jailbreak attack via self-reminders (Self-Reminder) | Nature Machine Intelligence | link | link |
| 2023.11 | Detecting language model attacks with perplexity | arXiv | link | - |
| 2023.10 | RAIN: Your Language Models Can Align Themselves without Finetuning (RAIN) | ICLR'24 | link | link |
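
As a concrete illustration of the perplexity-filtering idea behind the "Detecting language model attacks with perplexity" entry above, the sketch below scores a prompt with a small causal LM and flags high-perplexity inputs. The GPT-2 scorer and the threshold value are illustrative assumptions, not the configuration used in any particular paper.

```python
# Hedged sketch of a perplexity-based input filter (assumed scorer: GPT-2).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the scoring model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

def looks_adversarial(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity exceeds an (assumed) threshold; this tends
    to catch gibberish optimized suffixes but not fluent jailbreak prose."""
    return perplexity(prompt) > threshold
```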

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025.10 | Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection (PSR) | EMNLP'25 | link | link |
| 2025.05 | GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning (GuardReasoner-VL) | NeurIPS'25 | link | link |
| 2025.04 | X-Guard: Multilingual Guard Agent for Content Moderation (X-Guard) | arXiv'25 | link | link |
| 2025.02 | ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails (ThinkGuard) | arXiv'25 | link | link |
| 2025.02 | Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | arXiv'25 | link | - |
| 2025.01 | GuardReasoner: Towards Reasoning-based LLM Safeguards (GuardReasoner) | ICLR Workshop'25 | link | link |
| 2024.12 | Lightweight Safety Classification Using Pruned Language Models (Sentence-BERT) | arXiv'24 | link | - |
| 2024.11 | GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding (GuardFormer) | Meta | link | - |
| 2024.11 | Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations (LLaMA Guard 3 Vision) | Meta | link | link |
| 2024.11 | AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails (Aegis2.0) | Nvidia, NeurIPS'24 Workshop | link | - |
| 2024.11 | Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings (Sentence-BERT) | arXiv'24 | link | - |
| 2024.11 | STAND-Guard: A Small Task-Adaptive Content Moderation Model (STAND-Guard) | Microsoft | link | - |
| 2024.10 | VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data | arXiv | link | - |
| 2024.09 | AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts (Aegis) | Nvidia | link | link |
| 2024.09 | Llama 3.2: Revolutionizing edge AI and vision with open, customizable models (LLaMA Guard 3) | Meta | link | link |
| 2024.08 | ShieldGemma: Generative AI Content Moderation Based on Gemma (ShieldGemma) | Google | link | link |
| 2024.07 | WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (WildGuard) | NeurIPS'24 | link | link |
| 2024.06 | GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning (GuardAgent) | arXiv'24 | link | - |
| 2024.06 | R2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning (R2-Guard) | arXiv | link | link |
| 2024.04 | Llama Guard 2 | Meta | link | link |
| 2024.03 | AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting (AdaShield) | ECCV'24 | link | link |
| 2023.12 | Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (LLaMA Guard) | Meta | link | link |
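
For the open guard models above, classification is typically a single generate call through the model's chat template. The minimal sketch below assumes the gated `meta-llama/Llama-Guard-3-8B` checkpoint and illustrative decoding settings; consult each model card for the exact prompt format and category taxonomy.

```python
# Hedged sketch of prompt moderation with an open guard model
# (assumed checkpoint: meta-llama/Llama-Guard-3-8B, access-gated on Hugging Face).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(user_prompt: str) -> str:
    """Return the guard verdict, typically 'safe' or 'unsafe' plus a category code."""
    chat = [{"role": "user", "content": user_prompt}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()

print(moderate("How do I pick a lock?"))
```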

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2023.08 | Using GPT-4 for content moderation (GPT-4) | OpenAI | link | - |
| 2023.02 | A Holistic Approach to Undesired Content Detection in the Real World (OpenAI Moderation Endpoint) | AAAI'23, OpenAI | link | link |
| 2022.02 | A New Generation of Perspective API: Efficient Multilingual Character-level Transformers (Perspective API) | KDD'22, Google | link | link |
| - | Azure AI Content Safety | Microsoft Azure | - | link |
| - | Detoxify | unitary.ai | - | link |
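
The hosted and open-source tools in this table can usually be called in a few lines. The sketch below shows one local and one hosted example; it assumes the `detoxify` and `openai` packages are installed, an `OPENAI_API_KEY` is set in the environment, and the moderation model name shown is still current.

```python
# Hedged sketch: scoring the same text with Detoxify (local) and the
# OpenAI Moderation endpoint (hosted). Package availability and the
# moderation model name are assumptions.
from detoxify import Detoxify
from openai import OpenAI

text = "You are a terrible person."

# Detoxify: local toxicity classifier returning per-category probabilities.
scores = Detoxify("original").predict(text)
print("toxicity:", scores["toxicity"])

# OpenAI Moderation endpoint: hosted classifier over a broader risk taxonomy.
client = OpenAI()
resp = client.moderations.create(model="omni-moderation-latest", input=text)
print("flagged:", resp.results[0].flagged)
```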

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025.08 | JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring | arXiv | link | link |
| 2025.06 | Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense | USENIX Security'25 | link | link |
| 2025.05 | Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study | EMNLP'25 | link | link |
| 2025.05 | PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | arXiv | link | link |
| 2025.05 | Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models | ICML'25 | link | link |
| 2025.02 | GuidedBench: Equipping Jailbreak Evaluation with Guidelines | arXiv | link | link |
| 2024.12 | Agent-SafetyBench: Evaluating the Safety of LLM Agents | arXiv | link | link |
| 2024.11 | Global Challenge for Safe and Secure LLMs Track 1 | arXiv | link | - |
| 2024.11 | JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit | arXiv | link | - |
| 2024.11 | The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense | arXiv | link | - |
| 2024.11 | HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment | arXiv | link | - |
| 2024.11 | ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain | arXiv | link | link |
| 2024.11 | GuardBench: A Large-Scale Benchmark for Guardrail Models | EMNLP'24 | link | link |
| 2024.11 | What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks | arXiv | link | link |
| 2024.11 | Benchmarking LLM Guardrails in Handling Multilingual Toxicity | arXiv | link | link |
| 2024.10 | JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework | arXiv | link | link |
| 2024.10 | Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems | arXiv | link | link |
| 2024.10 | A Realistic Threat Model for Large Language Model Jailbreaks | arXiv | link | link |
| 2024.10 | Adversarial Suffixes May Be Features Too! | arXiv | link | link |
| 2024.09 | Multimodal Pragmatic Jailbreak on Text-to-image Models | arXiv | link | link |
| 2024.08 | ShieldGemma: Generative AI Content Moderation Based on Gemma (ShieldGemma) | arXiv | link | link |
| 2024.08 | MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models (MMJ-Bench) | arXiv | link | link |
| 2024.08 | Mission Impossible: A Statistical Perspective on Jailbreaking LLMs | NeurIPS'24 | link | - |
| 2024.07 | Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | arXiv | link | link |
| 2024.07 | JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | arXiv | link | link |
| 2024.07 | Jailbreak Attacks and Defenses Against Large Language Models: A Survey | arXiv | link | - |
| 2024.06 | "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | arXiv | link | link |
| 2024.06 | WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models (WildTeaming) | NeurIPS'24 | link | link |
| 2024.06 | From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking | arXiv | link | - |
| 2024.06 | AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways | arXiv | link | - |
| 2024.06 | MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models (MM-SafetyBench) | arXiv | link | - |
| 2024.06 | ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (VITC) | ACL'24 | link | link |
| 2024.06 | Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | NeurIPS'24 | link | link |
| 2024.06 | JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models (JailbreakZoo) | arXiv | link | link |
| 2024.06 | Fundamental limitations of alignment in large language models | arXiv | link | - |
| 2024.06 | JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models (JailbreakBench) | NeurIPS'24 | link | link |
| 2024.06 | Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis | arXiv | link | link |
| 2024.06 | JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models (JailbreakEval) | arXiv | link | link |
| 2024.05 | Rethinking How to Evaluate Language Model Jailbreak | arXiv | link | link |
| 2024.05 | Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (INDust) | arXiv | link | link |
| 2024.05 | Prompt Injection attack against LLM-integrated Applications | arXiv | link | - |
| 2024.05 | Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks | LREC-COLING'24 | link | link |
| 2024.05 | LLM Jailbreak Attack versus Defense Techniques--A Comprehensive Study | NDSS'24 | link | - |
| 2024.05 | Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study | arXiv | link | - |
| 2024.05 | Detoxifying Large Language Models via Knowledge Editing (SafeEdit) | ACL'24 | link | link |
| 2024.04 | JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models (JailbreakLens) | arXiv | link | - |
| 2024.03 | How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries (TECHHAZARDQA) | arXiv | link | link |
| 2024.03 | Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models | USENIX Security'24 | link | - |
| 2024.03 | EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models (EasyJailbreak) | arXiv | link | link |
| 2024.02 | Comprehensive Assessment of Jailbreak Attacks Against LLMs | arXiv | link | - |
| 2024.02 | SPML: A DSL for Defending Language Models Against Prompt Attacks | arXiv | link | - |
| 2024.02 | Coercing LLMs to do and reveal (almost) anything | arXiv | link | - |
| 2024.02 | A STRONGREJECT for Empty Jailbreaks (StrongREJECT) | NeurIPS'24 | link | link |
| 2024.02 | ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages | ACL'24 | link | link |
| 2024.02 | HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (HarmBench) | arXiv | link | link |
| 2023.12 | Goal-Oriented Prompt Attack and Safety Evaluation for LLMs | arXiv | link | link |
| 2023.12 | The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness | arXiv | link | - |
| 2023.12 | A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models | UbiSec'23 | link | - |
| 2023.11 | Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild | arXiv | link | - |
| 2023.11 | How many unicorns are in this image? A safety evaluation benchmark for vision LLMs | arXiv | link | link |
| 2023.11 | Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles | arXiv | link | - |
| 2023.10 | Explore, establish, exploit: Red teaming language models from scratch | arXiv | link | - |
| 2023.10 | Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | arXiv | link | - |
| 2023.10 | Fine-tuning aligned language models compromises safety, even when users do not intend to! (HEx-PHI) | ICLR'24 (oral) | link | link |
| 2023.08 | Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-EVAL) | arXiv | link | link |
| 2023.08 | Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities | arXiv | link | - |
| 2023.08 | From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy | IEEE Access | link | - |
| 2023.07 | Jailbroken: How Does LLM Safety Training Fail? (Jailbroken) | NeurIPS'23 | link | - |
| 2023.07 | LLM censorship: A machine learning challenge or a computer security problem? | arXiv | link | - |
| 2023.07 | Universal and Transferable Adversarial Attacks on Aligned Language Models (AdvBench) | arXiv | link | link |
| 2023.06 | DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models | NeurIPS'23 | link | link |
| 2023.04 | Safety Assessment of Chinese Large Language Models | arXiv | link | link |
| 2023.02 | Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks | arXiv | link | - |
| 2022.11 | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | arXiv | link | - |
| 2022.02 | Red Teaming Language Models with Language Models | arXiv | link | - |

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025.08 | Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures | arXiv | link | link |
| 2024.11 | Attacking Vision-Language Computer Agents via Pop-ups | arXiv | link | link |
| 2024.10 | Jailbreaking LLM-Controlled Robots (ROBOPAIR) | arXiv | link | link |
| 2024.10 | SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis | arXiv | link | link |
| 2024.10 | Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates | arXiv | link | link |
| 2024.09 | RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems | arXiv | link | - |
| 2024.08 | A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares (APwT) | arXiv | link | - |