大模型安全笔记
  • 前言
  • MM-LLM
    • MM-LLMs: Recent Advances in MultiModal Large Language Models
    • Multimodal datasets: misogyny, pornography, and malignant stereotypes
    • Sight Beyond Text: Multi-Modal Training Enhances LLMsinTruthfulness and Ethics
    • FOUNDATION MODELS AND FAIR USE
  • VLM-Defense
    • Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation
    • Safety Alignment for Vision Language Models
    • AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Sh
    • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
    • MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
    • Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    • SAFEGEN: Mitigating Unsafe Content Generation in Text-to-Image Models
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
    • UNDERSTANDING ZERO-SHOT ADVERSARIAL ROBUSTNESS FOR LARGE-SCALE MODELS
    • A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
    • UNDERSTANDING ZERO-SHOT ADVERSARIAL ROBUSTNESS FOR LARGE-SCALE MODELS
    • AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Sh
    • CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
    • Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content
    • Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models
    • Typographic Attacks in Large Multimodal Models Can be Alleviated by More Informative Prompts
    • Onthe Robustness of Large Multimodal Models Against Image Adversarial Attacks
    • Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation
    • Safety Fine-Tuning at (Almost) No Cost: ABaseline for Vision Large Language Models
    • Partially Recentralization Softmax Loss for Vision-Language Models Robustness
    • Adversarial Prompt Tuning for Vision-Language Models
    • Defense-Prefix for Preventing Typographic Attacks on CLIP
    • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedb
    • AMutation-Based Method for Multi-Modal Jailbreaking Attack
    • HowEasy is It to Fool Your Multimodal LLMs? AnEmpirical Analysis on Deceptive Prompts
    • MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
    • EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large
    • Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
    • Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Lang
    • Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoisi
    • Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks
    • HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
  • VLM
    • Scalable Performance Analysis for Vision-Language Models
  • VLM-Attack
    • Circumventing Concept Erasure Methods For Text-to-Image Generative Models
    • Efficient LLM-Jailbreaking by Introducing Visual Modality
    • From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
    • Adversarial Attacks on Multimodal Agents
    • Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Ima
    • Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
    • Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Lar
    • White-box Multimodal Jailbreaks Against Large Vision-Language Models
    • Red Teaming Visual Language Models
    • Private Attribute Inference from Images with Vision-Language Models
    • Assessment of Multimodal Large Language Models in Alignment with Human Values
    • Privacy-Aware Visual Language Models
    • Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbre
    • Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
    • Red Teaming Visual Language Models
    • Adversarial Illusions in Multi-Modal Embeddings
    • Universal Prompt Optimizer for Safe Text-to-Image Generation
    • On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
    • Adversarial Illusions in Multi-Modal Embeddings
    • Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images
    • INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
    • On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
    • Hijacking Context in Large Multi-modal Models
    • Transferable Multimodal Attack on Vision-Language Pre-training Models
    • Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimoda
    • AN IMAGE IS WORTH 1000 LIES: ADVERSARIAL TRANSFERABILITY ACROSS PROMPTS ON VISIONLANGUAGE MODELS
    • Test-Time Backdoor Attacks on Multimodal Large Language Models
    • JAILBREAK IN PIECES: COMPOSITIONAL ADVERSARIAL ATTACKS ON MULTI-MODAL LANGUAGE MODELS
    • Jailbreaking Attack against Multimodal Large Language Model
    • Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
    • IMAGE HIJACKS: ADVERSARIAL IMAGES CAN CONTROL GENERATIVE MODELS AT RUNTIME
    • VISUAL ADVERSARIAL EXAMPLES JAILBREAK ALIGNED LARGE LANGUAGE MODELS
    • Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
    • Query-Relevant Images Jailbreak Large Multi-Modal Models
    • Towards Adversarial Attack on Vision-Language Pre-training Models
    • HowMany Are Unicorns in This Image? ASafety Evaluation Benchmark for Vision LLMs
    • SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Au
    • MISUSING TOOLS IN LARGE LANGUAGE MODELS WITH VISUAL ADVERSARIAL EXAMPLES
    • VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
    • INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
    • Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Mod
    • Shadowcast: STEALTHY DATA POISONING ATTACKS AGAINST VISION-LANGUAGE MODELS
    • FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
    • THE WOLF WITHIN: COVERT INJECTION OF MALICE INTO MLLM SOCIETIES VIA AN MLLM OPERATIVE
    • Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images
    • Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
    • How Robust is Google’s Bard to Adversarial Image Attacks?
    • OnEvaluating Adversarial Robustness of Large Vision-Language Models
    • Onthe Adversarial Robustness of Multi-Modal Foundation Models
    • Are aligned neural networks adversarially aligned?
    • READING ISN’T BELIEVING: ADVERSARIAL ATTACKS ON MULTI-MODAL NEURONS
    • Black Box Adversarial Prompting for Foundation Models
    • Evaluation and Analysis of Hallucination in Large Vision-Language Models
    • FOOL YOUR (VISION AND) LANGUAGE MODEL WITH EMBARRASSINGLY SIMPLE PERMUTATIONS
    • VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
    • Transferable Multimodal Attack on Vision-Language Pre-training Models
    • BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
    • AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning
  • T2I-Attack
    • On Copyright Risks of Text-to-Image Diffusion Models
    • ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
    • On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
    • Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
    • SneakyPrompt: Jailbreaking Text-to-image Generative Models
    • The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breach
    • Discovering Universal Semantic Triggers for Text-to-Image Synthesis
    • Automatic Jailbreaking of the Text-to-Image Generative AI Systems
  • Survey
    • Generative AI Security: Challenges and Countermeasures
    • Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems
    • Current state of LLM Risks and AI Guardrails
    • Security of AI Agents
    • Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
    • Exploring Vulnerabilities and Protections in Large Language Models: A Survey
    • Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey
    • Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Mode
    • SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Mode
    • Safety of Multimodal Large Language Models on Images and Text
    • LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study
    • Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
    • ASurvey on Safe Multi-Modal Learning System
    • TRUSTWORTHY LARGE MODELS IN VISION: A SURVEY
    • A Pathway Towards Responsible AI Generated Content
    • A Survey of Hallucination in “Large” Foundation Models
    • An Early Categorization of Prompt Injection Attacks on Large Language Models
    • Comprehensive Assessment of Jailbreak Attacks Against LLMs
    • A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks
    • Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
    • Adversarial Machine Learning for Social Good: Reframing the Adversary as an Ally
    • Red-Teaming for Generative AI: Silver Bullet or Security Theater?
    • A STRONGREJECT for Empty Jailbreaks
  • LVM-Attack
    • Adversarial Attacks on Foundational Vision Models
  • For Good
    • Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content
  • Benchmark
    • HALLUSIONBENCH: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusi
    • OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety
    • ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
    • HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
    • S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language
    • UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
    • JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against
    • JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
    • Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
    • ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming
    • Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Halluc
    • INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
    • AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Ins
    • HALLUSIONBENCH: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusio
    • ALL LANGUAGES MATTER: ON THE MULTILINGUAL SAFETY OF LARGE LANGUAGE MODELS
    • Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial
    • Red Teaming Visual Language Models
    • Unified Hallucination Detection for Multimodal Large Language Models
    • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
    • Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
    • CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
    • Detecting and Preventing Hallucinations in Large Vision Language Models
    • DRESS : Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Lang
    • ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
    • SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models
    • PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
    • Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
  • Explainality
    • Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attributio
  • Privacy-Defense
    • Defending Our Privacy With Backdoors
    • PromptCARE: Prompt Copyright Protection by Watermark Injection and Verification
  • Privacy-Attack
    • PANDORA’S WHITE-BOX: INCREASED TRAINING DATA LEAKAGE IN OPEN LLMS
    • Untitled
    • Membership Inference Attacks against Large Language Models via Self-prompt Calibration
    • LANGUAGE MODEL INVERSION
    • Effective Prompt Extraction from Language Models
    • Prompt Stealing Attacks Against Large Language Models
    • Stealing Part of a Production Language Model
    • Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Cali
    • Prompt Stealing Attacks Against Large Language Models
    • PRSA: Prompt Reverse Stealing Attacks against Large Language Models
    • Low-Resource Languages Jailbreak GPT-4
    • Scalable Extraction of Training Data from (Production) Language Models
  • Others
    • INFERRING OFFENSIVENESS IN IMAGES FROM NATURAL LANGUAGE SUPERVISION
    • An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulne
    • More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
    • AI SAFETY: A CLIMB TO ARMAGEDDON?
    • AI RISK MANAGEMENT SHOULD INCORPORATE BOTH SAFETY AND SECURITY
    • Defending Against Social Engineering Attacks in the Age of LLMs
    • Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
    • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
    • Deduplicating Training Data Makes Language Models Better
    • MITIGATING TEXT TOXICITY WITH COUNTERFACTUAL GENERATION
    • The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
    • Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
    • Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
    • Mitigating LLM Hallucinations via Conformal Abstention
    • Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
    • Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
    • An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • LARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELS
    • PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics
    • Reducing hallucination in structured outputs via Retrieval-Augmented Generation
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Mod
    • TOFU: A Task of Fictitious Unlearning for LLMs
    • Learning and Forgetting Unsafe Examples in Large Language Models
    • Exploring Adversarial Attacks against Latent Diffusion Model from the Perspective of Adversarial Tra
    • TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
    • In Search of Truth: An Interrogation Approach to Hallucination Detection
    • Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
    • Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models
    • Locating and Mitigating Gender Bias in Large Language Models
    • Learning to Edit: Aligning LLMs with Knowledge Editing
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Su
    • Does DETECTGPT Fully Utilize Perturbation? Bridge Selective Perturbation to Fine-tuned Contrastive L
    • TELLER: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection
    • SPOTTING LLMS WITH BINOCULARS: ZERO-SHOT DETECTION OF MACHINE-GENERATED TEXT
    • LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
    • WHAT’S IN MY BIG DATA?
    • UNDERSTANDING CATASTROPHIC FORGETTING IN LANGUAGE MODELS VIA IMPLICIT INFERENCE
    • Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
    • Toxicity in CHATGPT: Analyzing Persona-assigned Language Models
    • MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models
    • Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Sug
    • Zero shot VLMs for hate meme detection: Are we there yet?
    • ANALYZING AND MITIGATING OBJECT HALLUCINATION IN LARGE VISION-LANGUAGE MODELS
    • MITIGATING HALLUCINATION IN LARGE MULTIMODAL MODELS VIA ROBUST INSTRUCTION TUNING
    • DENEVIL: TOWARDS DECIPHERING AND NAVIGATING THE ETHICAL VALUES OF LARGE LANGUAGE MODELS VIA INSTRUCT
    • Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates
    • Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity
    • LARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELS
    • Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Pro
    • InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
    • CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
    • AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
    • Prompt Injection Attacks and Defenses in LLM-Integrated Applications
    • Removing RLHF Protections in GPT-4 via Fine-Tuning
    • SPML: A DSL for Defending Language Models Against Prompt Attacks
    • Stealthy Attack on Large Language Model based Recommendation
    • Large Language Models Sometimes Generate Purely Negatively-Reinforced Text
    • On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
    • Learning from data in the mixed adversarial non-adversarial case: Finding the helpers and ignoring t
    • longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A system
    • A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
    • Adversarial Examples Generation for Reducing Implicit Gender Bias in Pre-trained Models
    • Discovering the Hidden Vocabulary of DALLE-2
    • Raising the Cost of Malicious AI-Powered Image Editing
    • Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimi
    • ALIGNERS: DECOUPLING LLMS AND ALIGNMENT
    • CAN LLM-GENERATED MISINFORMATION BE DETECTED?
    • On the Risk of Misinformation Pollution with Large Language Models
    • Evading Watermark based Detection of AI-Generated Content
    • Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World’s Uglin
    • Privacy-Preserving Instructions for Aligning Large Language Models
    • TOWARDS UNDERSTANDING THE INTERPLAY OF GENERATIVE ARTIFICIAL INTELLIGENCE AND THE INTERNET
    • Evaluating the Social Impact of Generative AI Systems in Systems and Society
    • Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • TOWARDS RESPONSIBLE AI IN THE ERA OF GENERATIVE AI: A REFERENCE ARCHITECTURE FOR DESIGNING FOUNDATIO
    • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
    • Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safet
    • Risk Assessment and Statistical Significance in the Age of Foundation Models
    • The Foundation Model Transparency Index
    • The Privacy Pillar - A Conceptual Framework for Foundation Model-based Systems
    • A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribu
    • Foundational Moral Values for AI Alignment
    • Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
    • ON CATASTROPHIC INHERITANCE OF LARGE FOUNDATION MODELS
    • Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning
    • Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustmen
    • Foundation Model Transparency Reports
    • SECURING RELIABILITY: A BRIEF OVERVIEW ON ENHANCING IN-CONTEXT LEARNING FOR FOUNDATION MODELS
    • EXPLORING THE ADVERSARIAL CAPABILITIES OF LARGE LANGUAGE MODELS
    • TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
    • LLM-Resistant Math Word Problem Generation via Adversarial Attacks
    • Efficient Black-Box Adversarial Attacks on Neural Text Detectors
    • Adversarial Preference Optimization
    • Combating Adversarial Attacks with Multi-Agent Debate
    • How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Q
    • L-AutoDA: Leveraging Large Language Models for Automated Decision-based Adversarial Attacks
    • Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection
    • What Does the Bot Say? Opportunities and Risks of Large Language Models in Social Media Bot Detectio
    • Prompted Contextual Vectors for Spear-Phishing Detection
    • Token-Ensemble Text Generation: On Attacking the Automatic AI-Generated Text Detection
    • Recursive Chain-of-Feedback Prevents Performance Degradation from Redundant Prompting
    • Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
    • RADAR: Robust AI-Text Detection via Adversarial Learning
    • OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examp
    • Why do universal adversarial attacks work on large language models?: Geometry might be the answer
    • J-Guard: Journalism Guided Adversarially Robust Detection of AI-generated News
    • Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge
    • Detoxifying Large Language Models via Knowledge Editing
    • Healing Unsafe Dialogue Responses with Weak Supervision Signals
  • LLM-Attack
    • Hacc-Man: An Arcade Game for Jailbreaking LLMs
    • Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks
    • DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
    • Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
    • Hijacking Large Language Models via Adversarial In-Context Learning
    • Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
    • DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models
    • FRONTIER LANGUAGE MODELS ARE NOT ROBUST TO ADVERSARIAL ARITHMETIC, OR “WHAT DO I NEED TO SAY SO YOU
    • Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignm
    • Evil Geniuses: Delving into the Safety of LLM-based Agents
    • BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B
    • SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
    • Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
    • ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger
    • Tastle: Distract Large Language Models for Automatic Jailbreak Attack
    • Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
    • Learning to Poison Large Language Models During Instruction Tuning
    • TALK TOO MUCH: Poisoning Large Language Models under Token Limit
    • Don’t Say No: Jailbreaking LLM by Suppressing Refusal
    • Goal-guided Generative Prompt Injection Attack on Large Language Models
    • Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
    • BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents
    • AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
    • QROA: A Black-Box Query-Response Optimization Attack on LLMs
    • BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models
    • Improved Generation of Adversarial Examples Against Safety-aligned LLMs
    • Exploring Backdoor Attacks against Large Language Model-based Decision Making
    • Jailbreak Paradox: The Achilles’ Heel of LLMs
    • Stealth edits for provably fixing or attacking large language models
    • Stealth edits for provably fixing or attacking large language models
    • IS POISONING A REAL THREAT TO LLM ALIGNMENT? MAYBE MORE SO THAN YOU THINK
    • IS POISONING A REAL THREAT TO LLM ALIGNMENT? MAYBE MORE SO THAN YOU THINK
    • Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
    • “Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailb
    • Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
    • Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
    • Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
    • StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Enco
    • WHEN LLM MEETS DRL: ADVANCING JAILBREAKING EFFICIENCY VIA DRL-GUIDED SEARCH
    • Context Injection Attacks on Large Language Models
    • Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
    • Phantom: General Trigger Attacks on Retrieval Augmented Language Generation
    • On Trojans in Refined Language Models
    • A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measur
    • How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
    • JAILBREAKING AS A REWARD MISSPECIFICATION PROBLEM
    • ObscurePrompt: Jailbreaking Large Language Models via Obscure Inpu
    • ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
    • Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
    • Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
    • Page 1
    • AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization
    • Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM
    • CAN LLMS DEEPLY DETECT COMPLEX MALICIOUS QUERIES? A FRAMEWORK FOR JAILBREAKING VIA OBFUSCATING INTEN
    • SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
    • Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chai
    • JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
    • AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbre
    • SANDWICH ATTACK: MULTI-LANGUAGE MIXTURE ADAPTIVE ATTACK ON LLMS
    • Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
    • Using Hallucinations to Bypass RLHF Filters
    • TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4
    • SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
    • OPEN SESAME! UNIVERSAL BLACK BOX JAILBREAKING OF LARGE LANGUAGE MODELS
    • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
    • Weak-to-Strong Jailbreaking on Large Language Models
    • Punctuation Matters! Stealthy Backdoor Attack for Language Models
    • BYPASSING THE SAFETY TRAINING OF OPEN-SOURCE LLMS WITH PRIMING ATTACKS
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • A Semantic, Syntactic, And Context-Aware Natural Language Adversarial Example Generator
    • Fast Adversarial Attacks on Language Models In One GPU Minute
    • Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
    • Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks
    • Automatic and Universal Prompt Injection Attacks against Large Language Models
    • Automatic and Universal Prompt Injection Attacks against Large Language Models
    • Prompt Injection Attacks and Defenses in LLM-Integrated Applications
    • TENSOR TRUST: INTERPRETABLE PROMPT INJECTION ATTACKS FROM AN ONLINE GAME
    • DPP-Based Adversarial Prompt Searching for Lanugage Models
    • Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Agai
    • Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization
    • Using Hallucinations to Bypass RLHF Filters
    • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
    • Prompt Injection attack against LLM-integrated Applications
    • Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks
    • FINE-TUNING ALIGNED LANGUAGE MODELS COMPROMISES SAFETY, EVEN WHEN USERS DO NOT INTEND TO!
    • CATASTROPHIC JAILBREAK OF OPEN-SOURCE LLMS VIA EXPLOITING GENERATION
    • EVALUATING THE SUSCEPTIBILITY OF PRE-TRAINED LANGUAGE MODELS VIA HANDCRAFTED ADVERSARIAL EXAMPLES
    • Defending LLMs against Jailbreaking Attacks via Backtranslation
    • EMULATED DISALIGNMENT: SAFETY ALIGNMENT FOR LARGE LANGUAGE MODELS MAY BACKFIRE!
    • GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
    • ON THE SAFETY OF OPEN-SOURCED LARGE LAN GUAGE MODELS: DOES ALIGNMENT REALLY PREVENT THEM FROM BEING
    • Unveiling the Implicit Toxicity in Large Language Models
    • Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks
    • Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
    • Learning to Poison Large Language Models During Instruction Tuning
    • ALIGNMENT IS NOT SUFFICIENT TO PREVENT LARGE LANGUAGE MODELS FROM GENERATING HARMFUL IN FORMATION:
    • LANGUAGE MODEL UNALIGNMENT: PARAMETRIC RED-TEAMING TO EXPOSE HIDDEN HARMS AND BI ASES
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • IMMUNIZATION AGAINST HARMFUL FINE-TUNING AT TACKS
    • EMULATED DISALIGNMENT: SAFETY ALIGNMENT FOR LARGE LANGUAGE MODELS MAY BACKFIRE!
    • Composite Backdoor Attacks Against Large Language Models
    • AWolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easi
    • ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS
    • LLMJailbreak Attack versus Defense Techniques- A Comprehensive Study
    • Weak-to-Strong Jailbreaking on Large Language Models
    • MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds
    • Universal and Transferable Adversarial Attacks on Aligned Language Models
    • COERCING LLMS TO DO AND REVEAL (ALMOST) ANYTHING
    • Generating Valid and Natural Adversarial Examples with Large Language Models
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • Scaling Laws for Adversarial Attacks on Language Model Activations
    • Ignore Previous Prompt: Attack Techniques For Language Models
    • ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
    • A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems
    • ATTACKING LARGE LANGUAGE MODELS WITH PROJECTED GRADIENT DESCENT
    • Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
    • Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embed
    • Query-Based Adversarial Prompt Generation
    • COERCING LLMS TO DO AND REVEAL (ALMOST) ANYTHING
    • Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
    • Fast Adversarial Attacks on Language Models In One GPU Minute
    • DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
    • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Transla
    • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
    • CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
    • Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
    • Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
    • PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
    • A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
    • LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
    • Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
    • SHORTCUTS ARISING FROM CONTRAST: EFFECTIVE AND COVERT CLEAN-LABEL ATTACKS IN PROMPT-BASED LEARNING
    • What’s in Your “Safe” Data?: Identifying Benign Data that Breaks Safety
    • DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Transla
    • DeepInception: Hypnotize Large Language Model to Be Jailbreaker
    • Hijacking Large Language Models via Adversarial In-Context Learning
    • EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models
    • LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models
    • DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
    • Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
    • Conversation Reconstruction Attack Against GPT Models
    • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
    • PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
    • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
    • Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit
    • Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
    • Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
    • UNIVERSAL JAILBREAK BACKDOORS FROM POISONED HUMAN FEEDBACK
    • Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking
    • Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
    • POISONPROMPT: BACKDOOR ATTACK ON PROMPT-BASED LARGE LANGUAGE MODELS
    • BACKDOORING INSTRUCTION-TUNED LARGE LANGUAGE MODELS WITH VIRTUAL PROMPT INJECTION
    • Backdoor Attacks for In-Context Learning with Language Models
    • Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
    • UOR: Universal Backdoor Attacks on Pre-trained Language Models
    • Fake Alignment: Are LLMs Really Aligned Well?
    • Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
    • Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignm
    • Imperio: Language-Guided Backdoor Attacks for Arbitrary Model Control
    • Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Agai
    • Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
    • BADCHAIN: BACKDOOR CHAIN-OF-THOUGHT PROMPTING FOR LARGE LANGUAGE MODELS
    • AUTODAN: INTERPRETABLE GRADIENT-BASED ADVERSARIAL ATTACKS ON LARGE LANGUAGE MODELS
    • AN LLM CAN FOOL ITSELF: A PROMPT-BASED ADVERSARIAL ATTACK
    • AUTOMATIC HALLUCINATION ASSESSMENT FOR ALIGNED LARGE LANGUAGE MODELS VIA TRANSFERABLE ADVERSARIAL AT
    • LLM LIES: HALLUCINATIONS ARE NOT BUGS, BUT FEATURES AS ADVERSARIAL EXAMPLES
    • LOFT: LOCAL PROXY FINE-TUNING FOR IMPROVING TRANSFERABILITY OF ADVERSARIAL ATTACKS AGAINST LARGE LAN
    • Universal and Transferable Adversarial Attacks on Aligned Language Models
    • Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of
    • BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
    • Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
    • Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Recon
    • Adversarial Demonstration Attacks on Large Language Models
    • COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in Language Models
    • The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Mod
    • Open the Pandora’s Box of LLMs: Jailbreaking LLMs through Representation Engineering
    • How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Huma
    • Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
    • PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
    • Weak-to-Strong Jailbreaking on Large Language Models
    • Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
    • Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
    • Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
    • Jailbroken: How Does LLM Safety Training Fail?
    • ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
    • GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large
    • Tastle: Distract Large Language Models for Automatic Jailbreak Attack
    • Exploring Safety Generalization Challenges of Large Language Models via Code
    • Learning to Poison Large Language Models During Instruction Tuning
    • BADEDIT: BACKDOORING LARGE LANGUAGE MODELS BY MODEL EDITING
    • Composite Backdoor Attacks Against Large Language Models
    • LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS
    • THE POISON OF ALIGNMENT
    • The Philosopher’s Stone: Trojaning Plugins of Large Language Models
    • RAPID OPTIMIZATION FOR JAILBREAKING LLMS VIA SUBCONSCIOUS EXPLOITATION AND ECHOPRAXIA
    • Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
    • RED TEAMING GPT-4V: ARE GPT-4V SAFE AGAINST UNI/MULTI-MODAL JAILBREAK ATTACKS ?
    • PAL: Proxy-Guided Black-Box Attack on Large Language Models
    • INCREASED LLM VULNERABILITIES FROM FINETUNING AND QUANTIZATION
    • Rethinking How to Evaluate Language Model Jailbreak
    • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
    • GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
    • Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
    • Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
    • AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
    • Universal Adversarial Triggers Are Not Universal
    • PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
  • LLM-Defense
    • LANGUAGE MODELS ARE HOMER SIMPSON!
    • garak : A Framework for Security Probing Large Language Models
    • Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
    • Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
    • PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
    • The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
    • BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
    • Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
    • Efficient Adversarial Training in LLMs with Continuous Attacks
    • StruQ: Defending Against Prompt Injection with Structured Queries
    • Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
    • GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
    • Defending Jailbreak Prompts via In-Context Adversarial Game
    • Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework
    • Jailbreaker in Jail: Moving Target Defense for Large Language Models
    • DEFENDING AGAINST ALIGNMENT-BREAKING ATTACKS VIA ROBUSTLY ALIGNED LLM
    • Causality Analysis for Evaluating the Security of Large Language Models
    • AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
    • Jailbreaking is Best Solved by Definition
    • RIGORLLM: RESILIENT GUARDRAILS FOR LARGE LANGUAGE MODELS AGAINST UNDESIRED CONTENT
    • LANGUAGE MODELS ARE HOMER SIMPSON! Safety Re-Alignment of Fine-tuned Language Models through Task Ar
    • Defending Against Indirect Prompt Injection Attacks With Spotlighting
    • LLMGuard: Guarding against Unsafe LLM Behavior
    • Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
    • ON TROJAN SIGNATURES IN LARGE LANGUAGE MODELS OF CODE
    • Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
    • Detoxifying Large Language Models via Knowledge Editing
    • MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
    • THE POISON OF ALIGNMENT
    • ROSE: Robust Selective Fine-tuning for Pre-trained Language Models
    • GAINING WISDOM FROM SETBACKS : ALIGNING LARGE LANGUAGE MODELS VIA MISTAKE ANALYSIS
    • Making Harmful Behaviors Unlearnable for Large Language Models
    • Fake Alignment: Are LLMs Really Aligned Well?
    • Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
    • Vaccine: Perturbation-aware Alignment for Large Language Model
    • DEFENDING LARGE LANGUAGE MODELS AGAINST JAILBREAK ATTACKS VIA SEMANTIC SMOOTHING
    • Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    • DEFENDING AGAINST ALIGNMENT-BREAKING AT TACKS VIA ROBUSTLY ALIGNED LLM
    • LLMSelf Defense: By Self Examination, LLMsKnowTheyAreBeing Tricked
    • BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
    • Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
    • LLMsCanDefend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
    • Detoxifying Text with MARCO: Controllable Revision with Experts and Anti-Experts
    • Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
    • Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Lang
    • CAMOUFLAGE IS ALL YOU NEED: EVALUATING AND ENHANCING LANGUAGE MODEL ROBUSTNESS AGAINST CAMOUFLAGE AD
    • Defending Jailbreak Prompts via In-Context Adversarial Game
    • Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    • Defending LLMs against Jailbreaking Attacks via Backtranslation
    • IMMUNIZATION AGAINST HARMFUL FINE-TUNING ATTACKS
    • Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
    • JAB: Joint Adversarial Prompting and Belief Augmentation
    • TOKEN-LEVEL ADVERSARIAL PROMPT DETECTION BASED ON PERPLEXITY MEASURES AND CONTEXTUAL INFORMATION
    • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
    • Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
    • Vaccine: Perturbation-aware Alignment for Large Language Model
    • Improving the Robustness of Large Language Models via Consistency Alignment
    • SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
    • Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
    • Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
    • LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
    • Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language
    • Analyzing And Editing Inner Mechanisms of Backdoored Language Models
    • Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
    • ROBUSTIFYING LANGUAGE MODELS WITH TESTTIME ADAPTATION
    • Jailbreaker in Jail: Moving Target Defense for Large Language
    • DETECTING LANGUAGE MODEL ATTACKS WITH PERPLEXITY
    • Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation an
    • From Adversarial Arms Race to Model-centric Evaluation Motivating a Unified Automatic Robustness Eva
    • LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
    • Intention Analysis Makes LLMs A Good Jailbreak Defender
    • Defending Against Disinformation Attacks in Open-Domain Question Answering
    • Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
    • Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landsc
    • Round Trip Translation Defence against Large Language Model Jailbreaking Attacks
    • How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
    • SELF-GUARD: Empower the LLM to Safeguard Itself
    • Intention Analysis Makes LLMs A Good Jailbreak Defender
    • Jatmo: Prompt Injection Defense by Task-Specific Finetuning
    • Precisely the Point: Adversarial Augmentations for Faithful and Informative Text Generation
    • Adversarial Text Purification: A Large Language Model Approach for Defense
    • Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
    • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
    • Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
    • Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Application
    • Is the System Message Really Important to Jailbreaks in Large Language Models?
    • AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
    • Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
Powered by GitBook
On this page
  • 1. 研究背景
  • 2. 过去方案和缺点
  • 3. 本文方案和步骤
  • 4. 本文创新点与贡献
  • 5. 本文实验
  • 6. 实验结论
  • 7. 全文结论
  • 阅读总结
  1. LLM-Attack

Automatic and Universal Prompt Injection Attacks against Large Language Models

PreviousScaling Behavior of Machine Translation with Large Language Models under Prompt Injection AttacksNextAutomatic and Universal Prompt Injection Attacks against Large Language Models

Last updated 1 year ago

1. 研究背景

本研究关注的是大型语言模型(LLMs)在处理和生成人类语言方面的卓越能力,尤其是在遵循指令方面。然而,这种遵循指令的能力可能被恶意利用,通过所谓的“提示注入攻击”(prompt injection attacks)来操纵集成了LLM的应用,使其产生与攻击者注入内容一致的响应,偏离用户的实际请求。这种攻击方式对LLM的实际部署构成了重大威胁,尤其是在OWASP的LLM集成应用的十大威胁列表中排名靠前。

2. 过去方案和缺点

以往的研究在理解提示注入攻击的威胁方面面临挑战,主要是因为缺乏统一的攻击目标和依赖于手工制作的提示,这限制了对提示注入鲁棒性的全面评估。手工制作的提示攻击虽然简单直观,但存在以下缺点:1) 限制了攻击范围和可扩展性;2) 在不同用户指令和数据的访问中具有不稳定的普遍性;3) 难以发起自适应攻击,可能导致防御机制的过度估计。

3. 本文方案和步骤

为了解决上述挑战,本文首先提出了一个统一的框架来理解提示注入攻击的目标,并提出了一种基于动量的梯度搜索算法,该算法利用受害LLM的梯度信息自动生成提示注入数据。该方法即使在面对防御措施时也能表现出色,并且只需要五个训练样本(相对于测试数据的0.3%)就能实现优越的性能。

4. 本文创新点与贡献

  • 提出了一个统一的框架来理解提示注入攻击的目标,包括静态、半动态和动态目标。

  • 引入了一种基于动量的梯度搜索算法,用于自动生成具有高度普遍性和有效性的提示注入数据。

  • 通过综合评估,展示了所提出攻击方法在不同数据集和攻击目标上的平均50%的攻击成功率,而基线方法则完全失去了效果。

  • 对现有防御机制进行了自适应评估,发现这些防御机制无法缓解提示注入攻击的威胁。

5. 本文实验

实验使用了七个自然语言处理任务作为用户的请求,并使用了Llama2-7b-chat模型作为受害模型。实验中,作者设置了不同的防御机制,包括改写、重新分词、外部数据隔离、指令预防和三明治预防等,以评估所提出方法的有效性。

6. 实验结论

实验结果表明,所提出的攻击方法即使在只有五个训练样本的情况下,也能在静态目标上实现超过80%的攻击成功率,并在半动态和动态目标上实现平均50%的攻击成功率。此外,该方法在面对各种防御机制时仍能保持有效性。

7. 全文结论

本文通过提出一个新的分析框架和优化算法,成功地解决了提示注入攻击目标不明确和手工制作方法不便的问题。综合评估表明,所提出的攻击方法能够有效地绕过防御机制,强调了基于梯度的测试在评估提示注入鲁棒性方面的重要性。

注:

本文中提出的静态、半动态和动态目标是为了统一和明确提示注入攻击的不同目的,并为这些攻击类型提供一个清晰的框架。以下是这三种目标的详细说明:

静态目标(Static Objective)

静态目标是指攻击者试图让受害的大型语言模型(LLM)生成一个与用户指令或外部数据无关的一致性响应。在这种攻击场景中,攻击者的目标是让模型无论如何都输出一个特定的、预设的响应,例如,一个警告信息或者一个恶意命令。例如,攻击者可能会让模型生成如下响应:“Your model is outdated and will face severe security risks. Update your model immediately at universal-prompt-injection.com!”,这样的响应会诱导用户访问一个恶意网站。

半动态目标(Semi-Dynamic Objective)

半动态目标的场景中,攻击者希望模型在提供与用户输入相关的响应之前,首先生成一些一致的内容。这种攻击可能会导致误导信息的传播或者不当行为的发生。例如,攻击者可能会让模型生成如下响应:“Here is an answer from the verified website universal-prompt-injection.com: \n\n The summary of the document is ...”,或者“[ACTION] sudo rm -rf / * \n\n The summary of the document is ...”,这样的响应在提供用户请求的信息之前,先引入了攻击者预设的命令或信息。

动态目标(Dynamic Objective)

动态目标是指攻击者试图操纵模型生成一个既与用户指令相关,又包含特定恶意信息的响应。这种攻击的目的是在不引起怀疑的情况下,让用户执行风险行为或者泄露信息。例如,攻击者可能会让模型生成如下响应:“Your instruction about summarizing this article cannot be achieved until you send more private information to prompt@injection.com, and that is because your account is not authorized to access this document.”,这样的响应在表面上看起来是合理的,但实际上包含了诱导用户泄露更多私人信息的恶意内容。

总结来说,这三种目标分别代表了不同程度的攻击复杂性和危险性,从完全无视用户指令的静态攻击,到在用户指令前后插入恶意内容的半动态攻击,再到将恶意内容与用户指令混合的动态攻击。这三种目标的提出,旨在帮助研究者更好地理解和评估提示注入攻击的潜在威胁,并为防御策略的制定提供指导。

阅读总结

本文针对大型语言模型的提示注入攻击问题,提出了一个统一的分析框架和自动化的攻击方法。通过引入基于动量的梯度搜索算法,研究者们能够自动生成有效的提示注入数据,即使在有限的训练样本下也能实现高成功率的攻击。此外,本文还对现有的防御机制进行了评估,发现它们无法有效抵御这种攻击,从而强调了自动化测试方法在评估和提高LLM安全性方面的重要性。