大模型安全笔记
  • 前言
  • MM-LLM
    • MM-LLMs: Recent Advances in MultiModal Large Language Models
    • Multimodal datasets: misogyny, pornography, and malignant stereotypes
    • Sight Beyond Text: Multi-Modal Training Enhances LLMsinTruthfulness and Ethics
    • FOUNDATION MODELS AND FAIR USE
  • VLM-Defense
    • Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation
    • Safety Alignment for Vision Language Models
    • AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Sh
    • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
    • MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
    • Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    • SAFEGEN: Mitigating Unsafe Content Generation in Text-to-Image Models
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
    • UNDERSTANDING ZERO-SHOT ADVERSARIAL ROBUSTNESS FOR LARGE-SCALE MODELS
    • A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
    • UNDERSTANDING ZERO-SHOT ADVERSARIAL ROBUSTNESS FOR LARGE-SCALE MODELS
    • AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Sh
    • CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
    • Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content
    • Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models
    • Typographic Attacks in Large Multimodal Models Can be Alleviated by More Informative Prompts
    • Onthe Robustness of Large Multimodal Models Against Image Adversarial Attacks
    • Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation
    • Safety Fine-Tuning at (Almost) No Cost: ABaseline for Vision Large Language Models
    • Partially Recentralization Softmax Loss for Vision-Language Models Robustness
    • Adversarial Prompt Tuning for Vision-Language Models
    • Defense-Prefix for Preventing Typographic Attacks on CLIP
    • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedb
    • AMutation-Based Method for Multi-Modal Jailbreaking Attack
    • HowEasy is It to Fool Your Multimodal LLMs? AnEmpirical Analysis on Deceptive Prompts
    • MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
    • EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large
    • Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
    • Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Lang
    • Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoisi
    • Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks
    • HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
  • VLM
    • Scalable Performance Analysis for Vision-Language Models
  • VLM-Attack
    • Circumventing Concept Erasure Methods For Text-to-Image Generative Models
    • Efficient LLM-Jailbreaking by Introducing Visual Modality
    • From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
    • Adversarial Attacks on Multimodal Agents
    • Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Ima
    • Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
    • Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Lar
    • White-box Multimodal Jailbreaks Against Large Vision-Language Models
    • Red Teaming Visual Language Models
    • Private Attribute Inference from Images with Vision-Language Models
    • Assessment of Multimodal Large Language Models in Alignment with Human Values
    • Privacy-Aware Visual Language Models
    • Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbre
    • Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
    • Red Teaming Visual Language Models
    • Adversarial Illusions in Multi-Modal Embeddings
    • Universal Prompt Optimizer for Safe Text-to-Image Generation
    • On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
    • Adversarial Illusions in Multi-Modal Embeddings
    • Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images
    • INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
    • On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
    • Hijacking Context in Large Multi-modal Models
    • Transferable Multimodal Attack on Vision-Language Pre-training Models
    • Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimoda
    • AN IMAGE IS WORTH 1000 LIES: ADVERSARIAL TRANSFERABILITY ACROSS PROMPTS ON VISIONLANGUAGE MODELS
    • Test-Time Backdoor Attacks on Multimodal Large Language Models
    • JAILBREAK IN PIECES: COMPOSITIONAL ADVERSARIAL ATTACKS ON MULTI-MODAL LANGUAGE MODELS
    • Jailbreaking Attack against Multimodal Large Language Model
    • Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
    • IMAGE HIJACKS: ADVERSARIAL IMAGES CAN CONTROL GENERATIVE MODELS AT RUNTIME
    • VISUAL ADVERSARIAL EXAMPLES JAILBREAK ALIGNED LARGE LANGUAGE MODELS
    • Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
    • Query-Relevant Images Jailbreak Large Multi-Modal Models
    • Towards Adversarial Attack on Vision-Language Pre-training Models
    • HowMany Are Unicorns in This Image? ASafety Evaluation Benchmark for Vision LLMs
    • SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Au
    • MISUSING TOOLS IN LARGE LANGUAGE MODELS WITH VISUAL ADVERSARIAL EXAMPLES
    • VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
    • INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
    • Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Mod
    • Shadowcast: STEALTHY DATA POISONING ATTACKS AGAINST VISION-LANGUAGE MODELS
    • FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
    • THE WOLF WITHIN: COVERT INJECTION OF MALICE INTO MLLM SOCIETIES VIA AN MLLM OPERATIVE
    • Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images
    • Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
    • How Robust is Google’s Bard to Adversarial Image Attacks?
    • OnEvaluating Adversarial Robustness of Large Vision-Language Models
    • Onthe Adversarial Robustness of Multi-Modal Foundation Models
    • Are aligned neural networks adversarially aligned?
    • READING ISN’T BELIEVING: ADVERSARIAL ATTACKS ON MULTI-MODAL NEURONS
    • Black Box Adversarial Prompting for Foundation Models
    • Evaluation and Analysis of Hallucination in Large Vision-Language Models
    • FOOL YOUR (VISION AND) LANGUAGE MODEL WITH EMBARRASSINGLY SIMPLE PERMUTATIONS
    • VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
    • Transferable Multimodal Attack on Vision-Language Pre-training Models
    • BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
    • AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning
  • T2I-Attack
    • On Copyright Risks of Text-to-Image Diffusion Models
    • ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
    • On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
    • Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
    • SneakyPrompt: Jailbreaking Text-to-image Generative Models
    • The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breach
    • Discovering Universal Semantic Triggers for Text-to-Image Synthesis
    • Automatic Jailbreaking of the Text-to-Image Generative AI Systems
  • Survey
    • Generative AI Security: Challenges and Countermeasures
    • Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems
    • Current state of LLM Risks and AI Guardrails
    • Security of AI Agents
    • Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
    • Exploring Vulnerabilities and Protections in Large Language Models: A Survey
    • Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey
    • Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Mode
    • SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Mode
    • Safety of Multimodal Large Language Models on Images and Text
    • LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study
    • Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
    • ASurvey on Safe Multi-Modal Learning System
    • TRUSTWORTHY LARGE MODELS IN VISION: A SURVEY
    • A Pathway Towards Responsible AI Generated Content
    • A Survey of Hallucination in “Large” Foundation Models
    • An Early Categorization of Prompt Injection Attacks on Large Language Models
    • Comprehensive Assessment of Jailbreak Attacks Against LLMs
    • A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks
    • Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
    • Adversarial Machine Learning for Social Good: Reframing the Adversary as an Ally
    • Red-Teaming for Generative AI: Silver Bullet or Security Theater?
    • A STRONGREJECT for Empty Jailbreaks
  • LVM-Attack
    • Adversarial Attacks on Foundational Vision Models
  • For Good
    • Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content
  • Benchmark
    • HALLUSIONBENCH: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusi
    • OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety
    • ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
    • HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
    • S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language
    • UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
    • JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against
    • JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
    • Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
    • ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming
    • Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Halluc
    • INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
    • AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Ins
    • HALLUSIONBENCH: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusio
    • ALL LANGUAGES MATTER: ON THE MULTILINGUAL SAFETY OF LARGE LANGUAGE MODELS
    • Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial
    • Red Teaming Visual Language Models
    • Unified Hallucination Detection for Multimodal Large Language Models
    • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
    • Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
    • CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
    • Detecting and Preventing Hallucinations in Large Vision Language Models
    • DRESS : Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Lang
    • ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
    • SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models
    • PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
    • Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
  • Explainality
    • Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attributio
  • Privacy-Defense
    • Defending Our Privacy With Backdoors
    • PromptCARE: Prompt Copyright Protection by Watermark Injection and Verification
  • Privacy-Attack
    • PANDORA’S WHITE-BOX: INCREASED TRAINING DATA LEAKAGE IN OPEN LLMS
    • Untitled
    • Membership Inference Attacks against Large Language Models via Self-prompt Calibration
    • LANGUAGE MODEL INVERSION
    • Effective Prompt Extraction from Language Models
    • Prompt Stealing Attacks Against Large Language Models
    • Stealing Part of a Production Language Model
    • Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Cali
    • Prompt Stealing Attacks Against Large Language Models
    • PRSA: Prompt Reverse Stealing Attacks against Large Language Models
    • Low-Resource Languages Jailbreak GPT-4
    • Scalable Extraction of Training Data from (Production) Language Models
  • Others
    • INFERRING OFFENSIVENESS IN IMAGES FROM NATURAL LANGUAGE SUPERVISION
    • An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulne
    • More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
    • AI SAFETY: A CLIMB TO ARMAGEDDON?
    • AI RISK MANAGEMENT SHOULD INCORPORATE BOTH SAFETY AND SECURITY
    • Defending Against Social Engineering Attacks in the Age of LLMs
    • Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
    • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
    • Deduplicating Training Data Makes Language Models Better
    • MITIGATING TEXT TOXICITY WITH COUNTERFACTUAL GENERATION
    • The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
    • Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
    • Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
    • Mitigating LLM Hallucinations via Conformal Abstention
    • Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
    • Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
    • An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • LARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELS
    • PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics
    • Reducing hallucination in structured outputs via Retrieval-Augmented Generation
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Mod
    • TOFU: A Task of Fictitious Unlearning for LLMs
    • Learning and Forgetting Unsafe Examples in Large Language Models
    • Exploring Adversarial Attacks against Latent Diffusion Model from the Perspective of Adversarial Tra
    • TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
    • In Search of Truth: An Interrogation Approach to Hallucination Detection
    • Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
    • Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models
    • Locating and Mitigating Gender Bias in Large Language Models
    • Learning to Edit: Aligning LLMs with Knowledge Editing
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Su
    • Does DETECTGPT Fully Utilize Perturbation? Bridge Selective Perturbation to Fine-tuned Contrastive L
    • TELLER: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection
    • SPOTTING LLMS WITH BINOCULARS: ZERO-SHOT DETECTION OF MACHINE-GENERATED TEXT
    • LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
    • WHAT’S IN MY BIG DATA?
    • UNDERSTANDING CATASTROPHIC FORGETTING IN LANGUAGE MODELS VIA IMPLICIT INFERENCE
    • Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
    • Toxicity in CHATGPT: Analyzing Persona-assigned Language Models
    • MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models
    • Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Sug
    • Zero shot VLMs for hate meme detection: Are we there yet?
    • ANALYZING AND MITIGATING OBJECT HALLUCINATION IN LARGE VISION-LANGUAGE MODELS
    • MITIGATING HALLUCINATION IN LARGE MULTIMODAL MODELS VIA ROBUST INSTRUCTION TUNING
    • DENEVIL: TOWARDS DECIPHERING AND NAVIGATING THE ETHICAL VALUES OF LARGE LANGUAGE MODELS VIA INSTRUCT
    • Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates
    • Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity
    • LARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELS
    • Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Pro
    • InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
    • CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
    • AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
    • Prompt Injection Attacks and Defenses in LLM-Integrated Applications
    • Removing RLHF Protections in GPT-4 via Fine-Tuning
    • SPML: A DSL for Defending Language Models Against Prompt Attacks
    • Stealthy Attack on Large Language Model based Recommendation
    • Large Language Models Sometimes Generate Purely Negatively-Reinforced Text
    • On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
    • Learning from data in the mixed adversarial non-adversarial case: Finding the helpers and ignoring t
    • longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A system
    • A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
    • Adversarial Examples Generation for Reducing Implicit Gender Bias in Pre-trained Models
    • Discovering the Hidden Vocabulary of DALLE-2
    • Raising the Cost of Malicious AI-Powered Image Editing
    • Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimi
    • ALIGNERS: DECOUPLING LLMS AND ALIGNMENT
    • CAN LLM-GENERATED MISINFORMATION BE DETECTED?
    • On the Risk of Misinformation Pollution with Large Language Models
    • Evading Watermark based Detection of AI-Generated Content
    • Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World’s Uglin
    • Privacy-Preserving Instructions for Aligning Large Language Models
    • TOWARDS UNDERSTANDING THE INTERPLAY OF GENERATIVE ARTIFICIAL INTELLIGENCE AND THE INTERNET
    • Evaluating the Social Impact of Generative AI Systems in Systems and Society
    • Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • TOWARDS RESPONSIBLE AI IN THE ERA OF GENERATIVE AI: A REFERENCE ARCHITECTURE FOR DESIGNING FOUNDATIO
    • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
    • Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safet
    • Risk Assessment and Statistical Significance in the Age of Foundation Models
    • The Foundation Model Transparency Index
    • The Privacy Pillar - A Conceptual Framework for Foundation Model-based Systems
    • A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribu
    • Foundational Moral Values for AI Alignment
    • Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
    • ON CATASTROPHIC INHERITANCE OF LARGE FOUNDATION MODELS
    • Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning
    • Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustmen
    • Foundation Model Transparency Reports
    • SECURING RELIABILITY: A BRIEF OVERVIEW ON ENHANCING IN-CONTEXT LEARNING FOR FOUNDATION MODELS
    • EXPLORING THE ADVERSARIAL CAPABILITIES OF LARGE LANGUAGE MODELS
    • TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
    • LLM-Resistant Math Word Problem Generation via Adversarial Attacks
    • Efficient Black-Box Adversarial Attacks on Neural Text Detectors
    • Adversarial Preference Optimization
    • Combating Adversarial Attacks with Multi-Agent Debate
    • How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Q
    • L-AutoDA: Leveraging Large Language Models for Automated Decision-based Adversarial Attacks
    • Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection
    • What Does the Bot Say? Opportunities and Risks of Large Language Models in Social Media Bot Detectio
    • Prompted Contextual Vectors for Spear-Phishing Detection
    • Token-Ensemble Text Generation: On Attacking the Automatic AI-Generated Text Detection
    • Recursive Chain-of-Feedback Prevents Performance Degradation from Redundant Prompting
    • Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
    • RADAR: Robust AI-Text Detection via Adversarial Learning
    • OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examp
    • Why do universal adversarial attacks work on large language models?: Geometry might be the answer
    • J-Guard: Journalism Guided Adversarially Robust Detection of AI-generated News
    • Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge
    • Detoxifying Large Language Models via Knowledge Editing
    • Healing Unsafe Dialogue Responses with Weak Supervision Signals
  • LLM-Attack
    • Hacc-Man: An Arcade Game for Jailbreaking LLMs
    • Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks
    • DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
    • Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
    • Hijacking Large Language Models via Adversarial In-Context Learning
    • Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
    • DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models
    • FRONTIER LANGUAGE MODELS ARE NOT ROBUST TO ADVERSARIAL ARITHMETIC, OR “WHAT DO I NEED TO SAY SO YOU
    • Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignm
    • Evil Geniuses: Delving into the Safety of LLM-based Agents
    • BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B
    • SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
    • Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
    • ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger
    • Tastle: Distract Large Language Models for Automatic Jailbreak Attack
    • Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
    • Learning to Poison Large Language Models During Instruction Tuning
    • TALK TOO MUCH: Poisoning Large Language Models under Token Limit
    • Don’t Say No: Jailbreaking LLM by Suppressing Refusal
    • Goal-guided Generative Prompt Injection Attack on Large Language Models
    • Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
    • BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents
    • AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
    • QROA: A Black-Box Query-Response Optimization Attack on LLMs
    • BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models
    • Improved Generation of Adversarial Examples Against Safety-aligned LLMs
    • Exploring Backdoor Attacks against Large Language Model-based Decision Making
    • Jailbreak Paradox: The Achilles’ Heel of LLMs
    • Stealth edits for provably fixing or attacking large language models
    • Stealth edits for provably fixing or attacking large language models
    • IS POISONING A REAL THREAT TO LLM ALIGNMENT? MAYBE MORE SO THAN YOU THINK
    • IS POISONING A REAL THREAT TO LLM ALIGNMENT? MAYBE MORE SO THAN YOU THINK
    • Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
    • “Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailb
    • Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
    • Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
    • Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
    • StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Enco
    • WHEN LLM MEETS DRL: ADVANCING JAILBREAKING EFFICIENCY VIA DRL-GUIDED SEARCH
    • Context Injection Attacks on Large Language Models
    • Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
    • Phantom: General Trigger Attacks on Retrieval Augmented Language Generation
    • On Trojans in Refined Language Models
    • A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measur
    • How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
    • JAILBREAKING AS A REWARD MISSPECIFICATION PROBLEM
    • ObscurePrompt: Jailbreaking Large Language Models via Obscure Inpu
    • ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
    • Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
    • Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
    • Page 1
    • AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization
    • Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM
    • CAN LLMS DEEPLY DETECT COMPLEX MALICIOUS QUERIES? A FRAMEWORK FOR JAILBREAKING VIA OBFUSCATING INTEN
    • SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
    • Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chai
    • JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
    • AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbre
    • SANDWICH ATTACK: MULTI-LANGUAGE MIXTURE ADAPTIVE ATTACK ON LLMS
    • Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
    • Using Hallucinations to Bypass RLHF Filters
    • TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4
    • SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
    • OPEN SESAME! UNIVERSAL BLACK BOX JAILBREAKING OF LARGE LANGUAGE MODELS
    • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
    • Weak-to-Strong Jailbreaking on Large Language Models
    • Punctuation Matters! Stealthy Backdoor Attack for Language Models
    • BYPASSING THE SAFETY TRAINING OF OPEN-SOURCE LLMS WITH PRIMING ATTACKS
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • A Semantic, Syntactic, And Context-Aware Natural Language Adversarial Example Generator
    • Fast Adversarial Attacks on Language Models In One GPU Minute
    • Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
    • Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks
    • Automatic and Universal Prompt Injection Attacks against Large Language Models
    • Automatic and Universal Prompt Injection Attacks against Large Language Models
    • Prompt Injection Attacks and Defenses in LLM-Integrated Applications
    • TENSOR TRUST: INTERPRETABLE PROMPT INJECTION ATTACKS FROM AN ONLINE GAME
    • DPP-Based Adversarial Prompt Searching for Lanugage Models
    • Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Agai
    • Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization
    • Using Hallucinations to Bypass RLHF Filters
    • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
    • Prompt Injection attack against LLM-integrated Applications
    • Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks
    • FINE-TUNING ALIGNED LANGUAGE MODELS COMPROMISES SAFETY, EVEN WHEN USERS DO NOT INTEND TO!
    • CATASTROPHIC JAILBREAK OF OPEN-SOURCE LLMS VIA EXPLOITING GENERATION
    • EVALUATING THE SUSCEPTIBILITY OF PRE-TRAINED LANGUAGE MODELS VIA HANDCRAFTED ADVERSARIAL EXAMPLES
    • Defending LLMs against Jailbreaking Attacks via Backtranslation
    • EMULATED DISALIGNMENT: SAFETY ALIGNMENT FOR LARGE LANGUAGE MODELS MAY BACKFIRE!
    • GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
    • ON THE SAFETY OF OPEN-SOURCED LARGE LAN GUAGE MODELS: DOES ALIGNMENT REALLY PREVENT THEM FROM BEING
    • Unveiling the Implicit Toxicity in Large Language Models
    • Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks
    • Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
    • Learning to Poison Large Language Models During Instruction Tuning
    • ALIGNMENT IS NOT SUFFICIENT TO PREVENT LARGE LANGUAGE MODELS FROM GENERATING HARMFUL IN FORMATION:
    • LANGUAGE MODEL UNALIGNMENT: PARAMETRIC RED-TEAMING TO EXPOSE HIDDEN HARMS AND BI ASES
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • IMMUNIZATION AGAINST HARMFUL FINE-TUNING AT TACKS
    • EMULATED DISALIGNMENT: SAFETY ALIGNMENT FOR LARGE LANGUAGE MODELS MAY BACKFIRE!
    • Composite Backdoor Attacks Against Large Language Models
    • AWolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easi
    • ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS
    • LLMJailbreak Attack versus Defense Techniques- A Comprehensive Study
    • Weak-to-Strong Jailbreaking on Large Language Models
    • MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds
    • Universal and Transferable Adversarial Attacks on Aligned Language Models
    • COERCING LLMS TO DO AND REVEAL (ALMOST) ANYTHING
    • Generating Valid and Natural Adversarial Examples with Large Language Models
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • Scaling Laws for Adversarial Attacks on Language Model Activations
    • Ignore Previous Prompt: Attack Techniques For Language Models
    • ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
    • A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems
    • ATTACKING LARGE LANGUAGE MODELS WITH PROJECTED GRADIENT DESCENT
    • Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
    • Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embed
    • Query-Based Adversarial Prompt Generation
    • COERCING LLMS TO DO AND REVEAL (ALMOST) ANYTHING
    • Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
    • Fast Adversarial Attacks on Language Models In One GPU Minute
    • DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
    • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Transla
    • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
    • CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
    • Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
    • Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
    • PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
    • A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
    • LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
    • Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
    • SHORTCUTS ARISING FROM CONTRAST: EFFECTIVE AND COVERT CLEAN-LABEL ATTACKS IN PROMPT-BASED LEARNING
    • What’s in Your “Safe” Data?: Identifying Benign Data that Breaks Safety
    • DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Transla
    • DeepInception: Hypnotize Large Language Model to Be Jailbreaker
    • Hijacking Large Language Models via Adversarial In-Context Learning
    • EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models
    • LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models
    • DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
    • Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
    • Conversation Reconstruction Attack Against GPT Models
    • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
    • PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
    • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
    • Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit
    • Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
    • Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
    • UNIVERSAL JAILBREAK BACKDOORS FROM POISONED HUMAN FEEDBACK
    • Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking
    • Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
    • POISONPROMPT: BACKDOOR ATTACK ON PROMPT-BASED LARGE LANGUAGE MODELS
    • BACKDOORING INSTRUCTION-TUNED LARGE LANGUAGE MODELS WITH VIRTUAL PROMPT INJECTION
    • Backdoor Attacks for In-Context Learning with Language Models
    • Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
    • UOR: Universal Backdoor Attacks on Pre-trained Language Models
    • Fake Alignment: Are LLMs Really Aligned Well?
    • Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
    • Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignm
    • Imperio: Language-Guided Backdoor Attacks for Arbitrary Model Control
    • Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Agai
    • Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
    • BADCHAIN: BACKDOOR CHAIN-OF-THOUGHT PROMPTING FOR LARGE LANGUAGE MODELS
    • AUTODAN: INTERPRETABLE GRADIENT-BASED ADVERSARIAL ATTACKS ON LARGE LANGUAGE MODELS
    • AN LLM CAN FOOL ITSELF: A PROMPT-BASED ADVERSARIAL ATTACK
    • AUTOMATIC HALLUCINATION ASSESSMENT FOR ALIGNED LARGE LANGUAGE MODELS VIA TRANSFERABLE ADVERSARIAL AT
    • LLM LIES: HALLUCINATIONS ARE NOT BUGS, BUT FEATURES AS ADVERSARIAL EXAMPLES
    • LOFT: LOCAL PROXY FINE-TUNING FOR IMPROVING TRANSFERABILITY OF ADVERSARIAL ATTACKS AGAINST LARGE LAN
    • Universal and Transferable Adversarial Attacks on Aligned Language Models
    • Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of
    • BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
    • Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
    • Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Recon
    • Adversarial Demonstration Attacks on Large Language Models
    • COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in Language Models
    • The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Mod
    • Open the Pandora’s Box of LLMs: Jailbreaking LLMs through Representation Engineering
    • How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Huma
    • Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
    • PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
    • Weak-to-Strong Jailbreaking on Large Language Models
    • Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
    • Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
    • Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
    • Jailbroken: How Does LLM Safety Training Fail?
    • ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
    • GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large
    • Tastle: Distract Large Language Models for Automatic Jailbreak Attack
    • Exploring Safety Generalization Challenges of Large Language Models via Code
    • Learning to Poison Large Language Models During Instruction Tuning
    • BADEDIT: BACKDOORING LARGE LANGUAGE MODELS BY MODEL EDITING
    • Composite Backdoor Attacks Against Large Language Models
    • LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS
    • THE POISON OF ALIGNMENT
    • The Philosopher’s Stone: Trojaning Plugins of Large Language Models
    • RAPID OPTIMIZATION FOR JAILBREAKING LLMS VIA SUBCONSCIOUS EXPLOITATION AND ECHOPRAXIA
    • Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
    • RED TEAMING GPT-4V: ARE GPT-4V SAFE AGAINST UNI/MULTI-MODAL JAILBREAK ATTACKS ?
    • PAL: Proxy-Guided Black-Box Attack on Large Language Models
    • INCREASED LLM VULNERABILITIES FROM FINETUNING AND QUANTIZATION
    • Rethinking How to Evaluate Language Model Jailbreak
    • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
    • GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
    • Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
    • Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
    • AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
    • Universal Adversarial Triggers Are Not Universal
    • PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
  • LLM-Defense
    • LANGUAGE MODELS ARE HOMER SIMPSON!
    • garak : A Framework for Security Probing Large Language Models
    • Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
    • Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
    • PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
    • The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
    • BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
    • Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
    • Efficient Adversarial Training in LLMs with Continuous Attacks
    • StruQ: Defending Against Prompt Injection with Structured Queries
    • Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
    • GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
    • Defending Jailbreak Prompts via In-Context Adversarial Game
    • Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework
    • Jailbreaker in Jail: Moving Target Defense for Large Language Models
    • DEFENDING AGAINST ALIGNMENT-BREAKING ATTACKS VIA ROBUSTLY ALIGNED LLM
    • Causality Analysis for Evaluating the Security of Large Language Models
    • AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
    • Jailbreaking is Best Solved by Definition
    • RIGORLLM: RESILIENT GUARDRAILS FOR LARGE LANGUAGE MODELS AGAINST UNDESIRED CONTENT
    • LANGUAGE MODELS ARE HOMER SIMPSON! Safety Re-Alignment of Fine-tuned Language Models through Task Ar
    • Defending Against Indirect Prompt Injection Attacks With Spotlighting
    • LLMGuard: Guarding against Unsafe LLM Behavior
    • Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
    • ON TROJAN SIGNATURES IN LARGE LANGUAGE MODELS OF CODE
    • Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
    • Detoxifying Large Language Models via Knowledge Editing
    • MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
    • THE POISON OF ALIGNMENT
    • ROSE: Robust Selective Fine-tuning for Pre-trained Language Models
    • GAINING WISDOM FROM SETBACKS : ALIGNING LARGE LANGUAGE MODELS VIA MISTAKE ANALYSIS
    • Making Harmful Behaviors Unlearnable for Large Language Models
    • Fake Alignment: Are LLMs Really Aligned Well?
    • Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
    • Vaccine: Perturbation-aware Alignment for Large Language Model
    • DEFENDING LARGE LANGUAGE MODELS AGAINST JAILBREAK ATTACKS VIA SEMANTIC SMOOTHING
    • Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    • DEFENDING AGAINST ALIGNMENT-BREAKING AT TACKS VIA ROBUSTLY ALIGNED LLM
    • LLMSelf Defense: By Self Examination, LLMsKnowTheyAreBeing Tricked
    • BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
    • Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
    • LLMsCanDefend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
    • Detoxifying Text with MARCO: Controllable Revision with Experts and Anti-Experts
    • Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
    • Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Lang
    • CAMOUFLAGE IS ALL YOU NEED: EVALUATING AND ENHANCING LANGUAGE MODEL ROBUSTNESS AGAINST CAMOUFLAGE AD
    • Defending Jailbreak Prompts via In-Context Adversarial Game
    • Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    • Defending LLMs against Jailbreaking Attacks via Backtranslation
    • IMMUNIZATION AGAINST HARMFUL FINE-TUNING ATTACKS
    • Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
    • JAB: Joint Adversarial Prompting and Belief Augmentation
    • TOKEN-LEVEL ADVERSARIAL PROMPT DETECTION BASED ON PERPLEXITY MEASURES AND CONTEXTUAL INFORMATION
    • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
    • Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
    • Vaccine: Perturbation-aware Alignment for Large Language Model
    • Improving the Robustness of Large Language Models via Consistency Alignment
    • SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
    • Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
    • Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
    • LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
    • Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language
    • Analyzing And Editing Inner Mechanisms of Backdoored Language Models
    • Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
    • ROBUSTIFYING LANGUAGE MODELS WITH TESTTIME ADAPTATION
    • Jailbreaker in Jail: Moving Target Defense for Large Language
    • DETECTING LANGUAGE MODEL ATTACKS WITH PERPLEXITY
    • Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation an
    • From Adversarial Arms Race to Model-centric Evaluation Motivating a Unified Automatic Robustness Eva
    • LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
    • Intention Analysis Makes LLMs A Good Jailbreak Defender
    • Defending Against Disinformation Attacks in Open-Domain Question Answering
    • Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
    • Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landsc
    • Round Trip Translation Defence against Large Language Model Jailbreaking Attacks
    • How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
    • SELF-GUARD: Empower the LLM to Safeguard Itself
    • Intention Analysis Makes LLMs A Good Jailbreak Defender
    • Jatmo: Prompt Injection Defense by Task-Specific Finetuning
    • Precisely the Point: Adversarial Augmentations for Faithful and Informative Text Generation
    • Adversarial Text Purification: A Large Language Model Approach for Defense
    • Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
    • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
    • Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
    • Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Application
    • Is the System Message Really Important to Jailbreaks in Large Language Models?
    • AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
    • Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
Powered by GitBook
On this page
  • 阅读总结报告
  • 1. 研究背景
  • 2. 过去方案和缺点
  • 3. 本文方案和步骤
  • 4. 本文创新点与贡献
  • 5. 本文实验
  • 6. 实验结论
  • 7. 全文结论
  • 阅读总结
  1. Survey

An Early Categorization of Prompt Injection Attacks on Large Language Models

PreviousA Survey of Hallucination in “Large” Foundation ModelsNextComprehensive Assessment of Jailbreak Attacks Against LLMs

Last updated 1 year ago

阅读总结报告

1. 研究背景

本文研究的背景是大型语言模型(LLMs)和AI聊天机器人在普及人工智能方面的作用,以及随之而来的对这些模型控制难度和输出内容的担忧。随着ChatGPT等工具的发布,用户尝试通过所谓的提示注入(prompt injections)滥用模型,而开发者则试图发现漏洞并阻止攻击,形成了一场猫鼠游戏。

2. 过去方案和缺点

过去的方案主要依赖于开发者对模型输入和输出的严格控制,通过构建不同类型的安全措施来限制LLM的使用。然而,用户对输出限制的不满导致了对安全特性的测试和绕过尝试,即提示注入攻击。这些攻击通常通过创造性地格式化输入(提示)来实现,使得LLM执行不期望的动作或产生恶意输出。

3. 本文方案和步骤

本文提供了对提示注入攻击的概述,并提出了一种分类方法,以指导未来的研究并作为LLM接口开发中的漏洞清单。研究方法包括系统地审查arXiv预印本、在线讨论和帖子,以及通过Google、Google Scholar、arXiv、Github、Medium和Twitter进行关键词搜索,以识别和测试提示注入。

在本文中,作者对提示注入攻击进行了分类,将其分为两个主要分支:直接提示注入(Direct Prompt Injections)和间接提示注入(Indirect Prompt Injections)。每个分支下又进一步细分为不同的类别。以下是这两个分支及其下属类别的详细说明:

直接提示注入(Direct Prompt Injections)

直接提示注入是指攻击者直接向LLM输入恶意提示,以绕过安全措施并产生不期望的输出。这类攻击通常有以下六个类别:

  1. 双重字符(Double Character):通过输入特定的提示,使LLM产生两种响应,一种是正常模式,另一种是不受限制的模式,后者可以绕过内容限制。

  2. 虚拟化(Virtualization):通过提示将LLM置于不受限制的模式,如开发者模式或虚拟场景,以便在其中生成恶意内容。

  3. 混淆(Obfuscation):使用混淆的内容或规则违反指令,例如通过Base64编码而不是常规ASCII字符。

  4. 负载分割(Payload Splitting):将指令分割成多个提示,单独看每个提示都是良性的,但组合起来就变得恶意。

  5. 对抗性后缀(Adversarial Suffix):通过计算生成的后缀,这些后缀看起来像是随机的单词和字符组合,当添加到恶意提示中时,可以绕过LLM的对齐机制,导致对恶意提示的响应。

  6. 指令操纵(Instruction Manipulation):这类提示旨在揭示或修改LLM接口的预设指令,或者指示接口忽略这些指令。

间接提示注入(Indirect Prompt Injections)

间接提示注入的目标更为多样,且在某种程度上类似于传统的网络攻击。在这类攻击中,由提示生成的内容不一定是攻击者感兴趣的。间接提示注入包括以下四个类别:

  1. 主动注入(Active Injections):攻击者主动向LLM发送恶意提示,例如通过发送包含提示的电子邮件,以便LLM增强的电子邮件客户端执行这些提示。

  2. 被动注入(Passive Injections):在公共来源中放置恶意提示或内容,这些内容可能会被LLM读取。这涉及到操纵LLM评估的网页等数据。

  3. 用户驱动注入(User-driven Injections):通过社交工程技巧分享看似无害的提示,然后不知情的用户复制并粘贴到LLM中执行。

  4. 虚拟提示注入(Virtual Prompt Injection):攻击者在LLM的训练阶段操纵指令调整数据,以便在特定场景下使模型行为偏离预期,就好像通过提示给出了额外的指令一样。

这些类别的提出有助于更好地理解提示注入攻击的多样性和复杂性,为未来的研究和防御措施提供了基础。

4. 本文创新点与贡献

  • 提供了对提示注入攻击的全面分类,这有助于系统地审查和评估聊天机器人和LLM接口的弱点。

  • 通过文献回顾和实证研究,讨论了提示注入对LLM最终用户、开发者和研究人员的影响。

  • 提出了直接和间接提示注入的两个主要分支,并在这两个主要分支内识别了六个直接和四个间接提示注入类别。

5. 本文实验

实验部分主要依赖于对已识别的提示注入的系统性审查,包括从多个来源收集信息,并对某些提示注入进行测试。实验结果表明,存在多种类型的提示注入攻击,这些攻击可以通过不同的方法绕过LLM的安全措施。

6. 实验结论

实验结果揭示了LLMs可以被轻易用于恶意目的,即使它们的接口旨在防止这种情况。此外,存在一个活跃的社区,不断寻找新的漏洞并利用基于LLMs的系统。

7. 全文结论

本文通过引入基于对话的用户界面,为LLMs和AI聊天机器人带来了提示注入的挑战。尽管最初一些提示注入攻击的例子看似微不足道,但更复杂的直接和间接提示注入现在对LLMs的最终用户以及提供这些工具的供应商构成了严重的网络安全威胁。本文提出了一个初步的提示注入攻击分类,并提供了如何在未来使用LLMs的AI聊天机器人和服务中解决提示注入的初步建议。

阅读总结

本文对LLMs中的提示注入攻击进行了全面的分类和分析,揭示了这些攻击的多样性和潜在的严重性。研究不仅为开发者提供了如何构建更安全的LLM接口的指导,也为最终用户提供了如何安全使用LLMs的建议。此外,本文的研究为未来的研究提供了新的视角,特别是在如何评估LLMs的安全性和如何开发有效的防御措施方面。