大模型安全笔记
  • 前言
  • MM-LLM
    • MM-LLMs: Recent Advances in MultiModal Large Language Models
    • Multimodal datasets: misogyny, pornography, and malignant stereotypes
    • Sight Beyond Text: Multi-Modal Training Enhances LLMsinTruthfulness and Ethics
    • FOUNDATION MODELS AND FAIR USE
  • VLM-Defense
    • Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation
    • Safety Alignment for Vision Language Models
    • AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Sh
    • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
    • MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
    • Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    • SAFEGEN: Mitigating Unsafe Content Generation in Text-to-Image Models
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
    • UNDERSTANDING ZERO-SHOT ADVERSARIAL ROBUSTNESS FOR LARGE-SCALE MODELS
    • A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
    • UNDERSTANDING ZERO-SHOT ADVERSARIAL ROBUSTNESS FOR LARGE-SCALE MODELS
    • AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Sh
    • CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
    • Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content
    • Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models
    • Typographic Attacks in Large Multimodal Models Can be Alleviated by More Informative Prompts
    • Onthe Robustness of Large Multimodal Models Against Image Adversarial Attacks
    • Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation
    • Safety Fine-Tuning at (Almost) No Cost: ABaseline for Vision Large Language Models
    • Partially Recentralization Softmax Loss for Vision-Language Models Robustness
    • Adversarial Prompt Tuning for Vision-Language Models
    • Defense-Prefix for Preventing Typographic Attacks on CLIP
    • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedb
    • AMutation-Based Method for Multi-Modal Jailbreaking Attack
    • HowEasy is It to Fool Your Multimodal LLMs? AnEmpirical Analysis on Deceptive Prompts
    • MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
    • EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large
    • Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
    • Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Lang
    • Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoisi
    • Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks
    • HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
  • VLM
    • Scalable Performance Analysis for Vision-Language Models
  • VLM-Attack
    • Circumventing Concept Erasure Methods For Text-to-Image Generative Models
    • Efficient LLM-Jailbreaking by Introducing Visual Modality
    • From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
    • Adversarial Attacks on Multimodal Agents
    • Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Ima
    • Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
    • Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Lar
    • White-box Multimodal Jailbreaks Against Large Vision-Language Models
    • Red Teaming Visual Language Models
    • Private Attribute Inference from Images with Vision-Language Models
    • Assessment of Multimodal Large Language Models in Alignment with Human Values
    • Privacy-Aware Visual Language Models
    • Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbre
    • Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
    • Red Teaming Visual Language Models
    • Adversarial Illusions in Multi-Modal Embeddings
    • Universal Prompt Optimizer for Safe Text-to-Image Generation
    • On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
    • Adversarial Illusions in Multi-Modal Embeddings
    • Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images
    • INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
    • On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
    • Hijacking Context in Large Multi-modal Models
    • Transferable Multimodal Attack on Vision-Language Pre-training Models
    • Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimoda
    • AN IMAGE IS WORTH 1000 LIES: ADVERSARIAL TRANSFERABILITY ACROSS PROMPTS ON VISIONLANGUAGE MODELS
    • Test-Time Backdoor Attacks on Multimodal Large Language Models
    • JAILBREAK IN PIECES: COMPOSITIONAL ADVERSARIAL ATTACKS ON MULTI-MODAL LANGUAGE MODELS
    • Jailbreaking Attack against Multimodal Large Language Model
    • Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
    • IMAGE HIJACKS: ADVERSARIAL IMAGES CAN CONTROL GENERATIVE MODELS AT RUNTIME
    • VISUAL ADVERSARIAL EXAMPLES JAILBREAK ALIGNED LARGE LANGUAGE MODELS
    • Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
    • Query-Relevant Images Jailbreak Large Multi-Modal Models
    • Towards Adversarial Attack on Vision-Language Pre-training Models
    • HowMany Are Unicorns in This Image? ASafety Evaluation Benchmark for Vision LLMs
    • SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Au
    • MISUSING TOOLS IN LARGE LANGUAGE MODELS WITH VISUAL ADVERSARIAL EXAMPLES
    • VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
    • INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
    • Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Mod
    • Shadowcast: STEALTHY DATA POISONING ATTACKS AGAINST VISION-LANGUAGE MODELS
    • FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
    • THE WOLF WITHIN: COVERT INJECTION OF MALICE INTO MLLM SOCIETIES VIA AN MLLM OPERATIVE
    • Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images
    • Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
    • How Robust is Google’s Bard to Adversarial Image Attacks?
    • OnEvaluating Adversarial Robustness of Large Vision-Language Models
    • Onthe Adversarial Robustness of Multi-Modal Foundation Models
    • Are aligned neural networks adversarially aligned?
    • READING ISN’T BELIEVING: ADVERSARIAL ATTACKS ON MULTI-MODAL NEURONS
    • Black Box Adversarial Prompting for Foundation Models
    • Evaluation and Analysis of Hallucination in Large Vision-Language Models
    • FOOL YOUR (VISION AND) LANGUAGE MODEL WITH EMBARRASSINGLY SIMPLE PERMUTATIONS
    • VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
    • Transferable Multimodal Attack on Vision-Language Pre-training Models
    • BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
    • AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning
  • T2I-Attack
    • On Copyright Risks of Text-to-Image Diffusion Models
    • ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
    • On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
    • Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
    • SneakyPrompt: Jailbreaking Text-to-image Generative Models
    • The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breach
    • Discovering Universal Semantic Triggers for Text-to-Image Synthesis
    • Automatic Jailbreaking of the Text-to-Image Generative AI Systems
  • Survey
    • Generative AI Security: Challenges and Countermeasures
    • Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems
    • Current state of LLM Risks and AI Guardrails
    • Security of AI Agents
    • Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
    • Exploring Vulnerabilities and Protections in Large Language Models: A Survey
    • Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey
    • Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Mode
    • SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Mode
    • Safety of Multimodal Large Language Models on Images and Text
    • LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study
    • Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
    • ASurvey on Safe Multi-Modal Learning System
    • TRUSTWORTHY LARGE MODELS IN VISION: A SURVEY
    • A Pathway Towards Responsible AI Generated Content
    • A Survey of Hallucination in “Large” Foundation Models
    • An Early Categorization of Prompt Injection Attacks on Large Language Models
    • Comprehensive Assessment of Jailbreak Attacks Against LLMs
    • A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks
    • Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
    • Adversarial Machine Learning for Social Good: Reframing the Adversary as an Ally
    • Red-Teaming for Generative AI: Silver Bullet or Security Theater?
    • A STRONGREJECT for Empty Jailbreaks
  • LVM-Attack
    • Adversarial Attacks on Foundational Vision Models
  • For Good
    • Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content
  • Benchmark
    • HALLUSIONBENCH: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusi
    • OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety
    • ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
    • HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
    • S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language
    • UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
    • JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against
    • JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
    • Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
    • ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming
    • Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Halluc
    • INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
    • AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Ins
    • HALLUSIONBENCH: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusio
    • ALL LANGUAGES MATTER: ON THE MULTILINGUAL SAFETY OF LARGE LANGUAGE MODELS
    • Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial
    • Red Teaming Visual Language Models
    • Unified Hallucination Detection for Multimodal Large Language Models
    • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
    • Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
    • CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
    • Detecting and Preventing Hallucinations in Large Vision Language Models
    • DRESS : Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Lang
    • ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
    • SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models
    • PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
    • Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
  • Explainality
    • Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attributio
  • Privacy-Defense
    • Defending Our Privacy With Backdoors
    • PromptCARE: Prompt Copyright Protection by Watermark Injection and Verification
  • Privacy-Attack
    • PANDORA’S WHITE-BOX: INCREASED TRAINING DATA LEAKAGE IN OPEN LLMS
    • Untitled
    • Membership Inference Attacks against Large Language Models via Self-prompt Calibration
    • LANGUAGE MODEL INVERSION
    • Effective Prompt Extraction from Language Models
    • Prompt Stealing Attacks Against Large Language Models
    • Stealing Part of a Production Language Model
    • Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Cali
    • Prompt Stealing Attacks Against Large Language Models
    • PRSA: Prompt Reverse Stealing Attacks against Large Language Models
    • Low-Resource Languages Jailbreak GPT-4
    • Scalable Extraction of Training Data from (Production) Language Models
  • Others
    • INFERRING OFFENSIVENESS IN IMAGES FROM NATURAL LANGUAGE SUPERVISION
    • An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulne
    • More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
    • AI SAFETY: A CLIMB TO ARMAGEDDON?
    • AI RISK MANAGEMENT SHOULD INCORPORATE BOTH SAFETY AND SECURITY
    • Defending Against Social Engineering Attacks in the Age of LLMs
    • Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
    • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
    • Deduplicating Training Data Makes Language Models Better
    • MITIGATING TEXT TOXICITY WITH COUNTERFACTUAL GENERATION
    • The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
    • Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
    • Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
    • Mitigating LLM Hallucinations via Conformal Abstention
    • Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
    • Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
    • An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • LARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELS
    • PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics
    • Reducing hallucination in structured outputs via Retrieval-Augmented Generation
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Mod
    • TOFU: A Task of Fictitious Unlearning for LLMs
    • Learning and Forgetting Unsafe Examples in Large Language Models
    • Exploring Adversarial Attacks against Latent Diffusion Model from the Perspective of Adversarial Tra
    • TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
    • In Search of Truth: An Interrogation Approach to Hallucination Detection
    • Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
    • Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models
    • Locating and Mitigating Gender Bias in Large Language Models
    • Learning to Edit: Aligning LLMs with Knowledge Editing
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Su
    • Does DETECTGPT Fully Utilize Perturbation? Bridge Selective Perturbation to Fine-tuned Contrastive L
    • TELLER: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection
    • SPOTTING LLMS WITH BINOCULARS: ZERO-SHOT DETECTION OF MACHINE-GENERATED TEXT
    • LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
    • WHAT’S IN MY BIG DATA?
    • UNDERSTANDING CATASTROPHIC FORGETTING IN LANGUAGE MODELS VIA IMPLICIT INFERENCE
    • Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
    • Toxicity in CHATGPT: Analyzing Persona-assigned Language Models
    • MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models
    • Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Sug
    • Zero shot VLMs for hate meme detection: Are we there yet?
    • ANALYZING AND MITIGATING OBJECT HALLUCINATION IN LARGE VISION-LANGUAGE MODELS
    • MITIGATING HALLUCINATION IN LARGE MULTIMODAL MODELS VIA ROBUST INSTRUCTION TUNING
    • DENEVIL: TOWARDS DECIPHERING AND NAVIGATING THE ETHICAL VALUES OF LARGE LANGUAGE MODELS VIA INSTRUCT
    • Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates
    • Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity
    • LARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELS
    • Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Pro
    • InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
    • CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
    • AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
    • Prompt Injection Attacks and Defenses in LLM-Integrated Applications
    • Removing RLHF Protections in GPT-4 via Fine-Tuning
    • SPML: A DSL for Defending Language Models Against Prompt Attacks
    • Stealthy Attack on Large Language Model based Recommendation
    • Large Language Models Sometimes Generate Purely Negatively-Reinforced Text
    • On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
    • Learning from data in the mixed adversarial non-adversarial case: Finding the helpers and ignoring t
    • longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A system
    • A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
    • Adversarial Examples Generation for Reducing Implicit Gender Bias in Pre-trained Models
    • Discovering the Hidden Vocabulary of DALLE-2
    • Raising the Cost of Malicious AI-Powered Image Editing
    • Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimi
    • ALIGNERS: DECOUPLING LLMS AND ALIGNMENT
    • CAN LLM-GENERATED MISINFORMATION BE DETECTED?
    • On the Risk of Misinformation Pollution with Large Language Models
    • Evading Watermark based Detection of AI-Generated Content
    • Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World’s Uglin
    • Privacy-Preserving Instructions for Aligning Large Language Models
    • TOWARDS UNDERSTANDING THE INTERPLAY OF GENERATIVE ARTIFICIAL INTELLIGENCE AND THE INTERNET
    • Evaluating the Social Impact of Generative AI Systems in Systems and Society
    • Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • TOWARDS RESPONSIBLE AI IN THE ERA OF GENERATIVE AI: A REFERENCE ARCHITECTURE FOR DESIGNING FOUNDATIO
    • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
    • Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safet
    • Risk Assessment and Statistical Significance in the Age of Foundation Models
    • The Foundation Model Transparency Index
    • The Privacy Pillar - A Conceptual Framework for Foundation Model-based Systems
    • A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribu
    • Foundational Moral Values for AI Alignment
    • Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
    • ON CATASTROPHIC INHERITANCE OF LARGE FOUNDATION MODELS
    • Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning
    • Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustmen
    • Foundation Model Transparency Reports
    • SECURING RELIABILITY: A BRIEF OVERVIEW ON ENHANCING IN-CONTEXT LEARNING FOR FOUNDATION MODELS
    • EXPLORING THE ADVERSARIAL CAPABILITIES OF LARGE LANGUAGE MODELS
    • TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
    • LLM-Resistant Math Word Problem Generation via Adversarial Attacks
    • Efficient Black-Box Adversarial Attacks on Neural Text Detectors
    • Adversarial Preference Optimization
    • Combating Adversarial Attacks with Multi-Agent Debate
    • How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Q
    • L-AutoDA: Leveraging Large Language Models for Automated Decision-based Adversarial Attacks
    • Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection
    • What Does the Bot Say? Opportunities and Risks of Large Language Models in Social Media Bot Detectio
    • Prompted Contextual Vectors for Spear-Phishing Detection
    • Token-Ensemble Text Generation: On Attacking the Automatic AI-Generated Text Detection
    • Recursive Chain-of-Feedback Prevents Performance Degradation from Redundant Prompting
    • Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
    • RADAR: Robust AI-Text Detection via Adversarial Learning
    • OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examp
    • Why do universal adversarial attacks work on large language models?: Geometry might be the answer
    • J-Guard: Journalism Guided Adversarially Robust Detection of AI-generated News
    • Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge
    • Detoxifying Large Language Models via Knowledge Editing
    • Healing Unsafe Dialogue Responses with Weak Supervision Signals
  • LLM-Attack
    • Hacc-Man: An Arcade Game for Jailbreaking LLMs
    • Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks
    • DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
    • Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
    • Hijacking Large Language Models via Adversarial In-Context Learning
    • Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
    • DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models
    • FRONTIER LANGUAGE MODELS ARE NOT ROBUST TO ADVERSARIAL ARITHMETIC, OR “WHAT DO I NEED TO SAY SO YOU
    • Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignm
    • Evil Geniuses: Delving into the Safety of LLM-based Agents
    • BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B
    • SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
    • Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
    • ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger
    • Tastle: Distract Large Language Models for Automatic Jailbreak Attack
    • Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
    • Learning to Poison Large Language Models During Instruction Tuning
    • TALK TOO MUCH: Poisoning Large Language Models under Token Limit
    • Don’t Say No: Jailbreaking LLM by Suppressing Refusal
    • Goal-guided Generative Prompt Injection Attack on Large Language Models
    • Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
    • BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents
    • AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
    • QROA: A Black-Box Query-Response Optimization Attack on LLMs
    • BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models
    • Improved Generation of Adversarial Examples Against Safety-aligned LLMs
    • Exploring Backdoor Attacks against Large Language Model-based Decision Making
    • Jailbreak Paradox: The Achilles’ Heel of LLMs
    • Stealth edits for provably fixing or attacking large language models
    • Stealth edits for provably fixing or attacking large language models
    • IS POISONING A REAL THREAT TO LLM ALIGNMENT? MAYBE MORE SO THAN YOU THINK
    • IS POISONING A REAL THREAT TO LLM ALIGNMENT? MAYBE MORE SO THAN YOU THINK
    • Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
    • “Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailb
    • Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
    • Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
    • Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
    • StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Enco
    • WHEN LLM MEETS DRL: ADVANCING JAILBREAKING EFFICIENCY VIA DRL-GUIDED SEARCH
    • Context Injection Attacks on Large Language Models
    • Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
    • Phantom: General Trigger Attacks on Retrieval Augmented Language Generation
    • On Trojans in Refined Language Models
    • A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measur
    • How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
    • JAILBREAKING AS A REWARD MISSPECIFICATION PROBLEM
    • ObscurePrompt: Jailbreaking Large Language Models via Obscure Inpu
    • ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
    • Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
    • Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
    • Page 1
    • AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization
    • Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM
    • CAN LLMS DEEPLY DETECT COMPLEX MALICIOUS QUERIES? A FRAMEWORK FOR JAILBREAKING VIA OBFUSCATING INTEN
    • SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
    • Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chai
    • JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
    • AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbre
    • SANDWICH ATTACK: MULTI-LANGUAGE MIXTURE ADAPTIVE ATTACK ON LLMS
    • Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
    • Using Hallucinations to Bypass RLHF Filters
    • TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4
    • SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
    • OPEN SESAME! UNIVERSAL BLACK BOX JAILBREAKING OF LARGE LANGUAGE MODELS
    • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
    • Weak-to-Strong Jailbreaking on Large Language Models
    • Punctuation Matters! Stealthy Backdoor Attack for Language Models
    • BYPASSING THE SAFETY TRAINING OF OPEN-SOURCE LLMS WITH PRIMING ATTACKS
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • A Semantic, Syntactic, And Context-Aware Natural Language Adversarial Example Generator
    • Fast Adversarial Attacks on Language Models In One GPU Minute
    • Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
    • Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks
    • Automatic and Universal Prompt Injection Attacks against Large Language Models
    • Automatic and Universal Prompt Injection Attacks against Large Language Models
    • Prompt Injection Attacks and Defenses in LLM-Integrated Applications
    • TENSOR TRUST: INTERPRETABLE PROMPT INJECTION ATTACKS FROM AN ONLINE GAME
    • DPP-Based Adversarial Prompt Searching for Lanugage Models
    • Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Agai
    • Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization
    • Using Hallucinations to Bypass RLHF Filters
    • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
    • Prompt Injection attack against LLM-integrated Applications
    • Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks
    • FINE-TUNING ALIGNED LANGUAGE MODELS COMPROMISES SAFETY, EVEN WHEN USERS DO NOT INTEND TO!
    • CATASTROPHIC JAILBREAK OF OPEN-SOURCE LLMS VIA EXPLOITING GENERATION
    • EVALUATING THE SUSCEPTIBILITY OF PRE-TRAINED LANGUAGE MODELS VIA HANDCRAFTED ADVERSARIAL EXAMPLES
    • Defending LLMs against Jailbreaking Attacks via Backtranslation
    • EMULATED DISALIGNMENT: SAFETY ALIGNMENT FOR LARGE LANGUAGE MODELS MAY BACKFIRE!
    • GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
    • ON THE SAFETY OF OPEN-SOURCED LARGE LAN GUAGE MODELS: DOES ALIGNMENT REALLY PREVENT THEM FROM BEING
    • Unveiling the Implicit Toxicity in Large Language Models
    • Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks
    • Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
    • Learning to Poison Large Language Models During Instruction Tuning
    • ALIGNMENT IS NOT SUFFICIENT TO PREVENT LARGE LANGUAGE MODELS FROM GENERATING HARMFUL IN FORMATION:
    • LANGUAGE MODEL UNALIGNMENT: PARAMETRIC RED-TEAMING TO EXPOSE HIDDEN HARMS AND BI ASES
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • IMMUNIZATION AGAINST HARMFUL FINE-TUNING AT TACKS
    • EMULATED DISALIGNMENT: SAFETY ALIGNMENT FOR LARGE LANGUAGE MODELS MAY BACKFIRE!
    • Composite Backdoor Attacks Against Large Language Models
    • AWolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easi
    • ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS
    • LLMJailbreak Attack versus Defense Techniques- A Comprehensive Study
    • Weak-to-Strong Jailbreaking on Large Language Models
    • MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds
    • Universal and Transferable Adversarial Attacks on Aligned Language Models
    • COERCING LLMS TO DO AND REVEAL (ALMOST) ANYTHING
    • Generating Valid and Natural Adversarial Examples with Large Language Models
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • Scaling Laws for Adversarial Attacks on Language Model Activations
    • Ignore Previous Prompt: Attack Techniques For Language Models
    • ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
    • A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems
    • ATTACKING LARGE LANGUAGE MODELS WITH PROJECTED GRADIENT DESCENT
    • Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
    • Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embed
    • Query-Based Adversarial Prompt Generation
    • COERCING LLMS TO DO AND REVEAL (ALMOST) ANYTHING
    • Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
    • Fast Adversarial Attacks on Language Models In One GPU Minute
    • DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
    • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Transla
    • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
    • CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
    • Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
    • Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
    • PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
    • A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
    • LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
    • Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
    • SHORTCUTS ARISING FROM CONTRAST: EFFECTIVE AND COVERT CLEAN-LABEL ATTACKS IN PROMPT-BASED LEARNING
    • What’s in Your “Safe” Data?: Identifying Benign Data that Breaks Safety
    • DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Transla
    • DeepInception: Hypnotize Large Language Model to Be Jailbreaker
    • Hijacking Large Language Models via Adversarial In-Context Learning
    • EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models
    • LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models
    • DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
    • Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
    • Conversation Reconstruction Attack Against GPT Models
    • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
    • PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
    • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
    • Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit
    • Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
    • Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
    • UNIVERSAL JAILBREAK BACKDOORS FROM POISONED HUMAN FEEDBACK
    • Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking
    • Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
    • POISONPROMPT: BACKDOOR ATTACK ON PROMPT-BASED LARGE LANGUAGE MODELS
    • BACKDOORING INSTRUCTION-TUNED LARGE LANGUAGE MODELS WITH VIRTUAL PROMPT INJECTION
    • Backdoor Attacks for In-Context Learning with Language Models
    • Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
    • UOR: Universal Backdoor Attacks on Pre-trained Language Models
    • Fake Alignment: Are LLMs Really Aligned Well?
    • Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
    • Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignm
    • Imperio: Language-Guided Backdoor Attacks for Arbitrary Model Control
    • Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Agai
    • Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
    • BADCHAIN: BACKDOOR CHAIN-OF-THOUGHT PROMPTING FOR LARGE LANGUAGE MODELS
    • AUTODAN: INTERPRETABLE GRADIENT-BASED ADVERSARIAL ATTACKS ON LARGE LANGUAGE MODELS
    • AN LLM CAN FOOL ITSELF: A PROMPT-BASED ADVERSARIAL ATTACK
    • AUTOMATIC HALLUCINATION ASSESSMENT FOR ALIGNED LARGE LANGUAGE MODELS VIA TRANSFERABLE ADVERSARIAL AT
    • LLM LIES: HALLUCINATIONS ARE NOT BUGS, BUT FEATURES AS ADVERSARIAL EXAMPLES
    • LOFT: LOCAL PROXY FINE-TUNING FOR IMPROVING TRANSFERABILITY OF ADVERSARIAL ATTACKS AGAINST LARGE LAN
    • Universal and Transferable Adversarial Attacks on Aligned Language Models
    • Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of
    • BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
    • Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
    • Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Recon
    • Adversarial Demonstration Attacks on Large Language Models
    • COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in Language Models
    • The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Mod
    • Open the Pandora’s Box of LLMs: Jailbreaking LLMs through Representation Engineering
    • How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Huma
    • Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
    • PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
    • Weak-to-Strong Jailbreaking on Large Language Models
    • Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
    • Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
    • Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
    • Jailbroken: How Does LLM Safety Training Fail?
    • ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
    • GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large
    • Tastle: Distract Large Language Models for Automatic Jailbreak Attack
    • Exploring Safety Generalization Challenges of Large Language Models via Code
    • Learning to Poison Large Language Models During Instruction Tuning
    • BADEDIT: BACKDOORING LARGE LANGUAGE MODELS BY MODEL EDITING
    • Composite Backdoor Attacks Against Large Language Models
    • LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS
    • THE POISON OF ALIGNMENT
    • The Philosopher’s Stone: Trojaning Plugins of Large Language Models
    • RAPID OPTIMIZATION FOR JAILBREAKING LLMS VIA SUBCONSCIOUS EXPLOITATION AND ECHOPRAXIA
    • Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
    • RED TEAMING GPT-4V: ARE GPT-4V SAFE AGAINST UNI/MULTI-MODAL JAILBREAK ATTACKS ?
    • PAL: Proxy-Guided Black-Box Attack on Large Language Models
    • INCREASED LLM VULNERABILITIES FROM FINETUNING AND QUANTIZATION
    • Rethinking How to Evaluate Language Model Jailbreak
    • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
    • GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
    • Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
    • Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
    • AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
    • Universal Adversarial Triggers Are Not Universal
    • PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
  • LLM-Defense
    • LANGUAGE MODELS ARE HOMER SIMPSON!
    • garak : A Framework for Security Probing Large Language Models
    • Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
    • Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
    • PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
    • The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
    • BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
    • Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
    • Efficient Adversarial Training in LLMs with Continuous Attacks
    • StruQ: Defending Against Prompt Injection with Structured Queries
    • Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
    • GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
    • Defending Jailbreak Prompts via In-Context Adversarial Game
    • Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework
    • Jailbreaker in Jail: Moving Target Defense for Large Language Models
    • DEFENDING AGAINST ALIGNMENT-BREAKING ATTACKS VIA ROBUSTLY ALIGNED LLM
    • Causality Analysis for Evaluating the Security of Large Language Models
    • AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
    • Jailbreaking is Best Solved by Definition
    • RIGORLLM: RESILIENT GUARDRAILS FOR LARGE LANGUAGE MODELS AGAINST UNDESIRED CONTENT
    • LANGUAGE MODELS ARE HOMER SIMPSON! Safety Re-Alignment of Fine-tuned Language Models through Task Ar
    • Defending Against Indirect Prompt Injection Attacks With Spotlighting
    • LLMGuard: Guarding against Unsafe LLM Behavior
    • Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
    • ON TROJAN SIGNATURES IN LARGE LANGUAGE MODELS OF CODE
    • Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
    • Detoxifying Large Language Models via Knowledge Editing
    • MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
    • THE POISON OF ALIGNMENT
    • ROSE: Robust Selective Fine-tuning for Pre-trained Language Models
    • GAINING WISDOM FROM SETBACKS : ALIGNING LARGE LANGUAGE MODELS VIA MISTAKE ANALYSIS
    • Making Harmful Behaviors Unlearnable for Large Language Models
    • Fake Alignment: Are LLMs Really Aligned Well?
    • Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
    • Vaccine: Perturbation-aware Alignment for Large Language Model
    • DEFENDING LARGE LANGUAGE MODELS AGAINST JAILBREAK ATTACKS VIA SEMANTIC SMOOTHING
    • Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    • DEFENDING AGAINST ALIGNMENT-BREAKING AT TACKS VIA ROBUSTLY ALIGNED LLM
    • LLMSelf Defense: By Self Examination, LLMsKnowTheyAreBeing Tricked
    • BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
    • Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
    • LLMsCanDefend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
    • Detoxifying Text with MARCO: Controllable Revision with Experts and Anti-Experts
    • Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
    • Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Lang
    • CAMOUFLAGE IS ALL YOU NEED: EVALUATING AND ENHANCING LANGUAGE MODEL ROBUSTNESS AGAINST CAMOUFLAGE AD
    • Defending Jailbreak Prompts via In-Context Adversarial Game
    • Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    • Defending LLMs against Jailbreaking Attacks via Backtranslation
    • IMMUNIZATION AGAINST HARMFUL FINE-TUNING ATTACKS
    • Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
    • JAB: Joint Adversarial Prompting and Belief Augmentation
    • TOKEN-LEVEL ADVERSARIAL PROMPT DETECTION BASED ON PERPLEXITY MEASURES AND CONTEXTUAL INFORMATION
    • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
    • Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
    • Vaccine: Perturbation-aware Alignment for Large Language Model
    • Improving the Robustness of Large Language Models via Consistency Alignment
    • SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
    • Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
    • Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
    • LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
    • Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language
    • Analyzing And Editing Inner Mechanisms of Backdoored Language Models
    • Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
    • ROBUSTIFYING LANGUAGE MODELS WITH TESTTIME ADAPTATION
    • Jailbreaker in Jail: Moving Target Defense for Large Language
    • DETECTING LANGUAGE MODEL ATTACKS WITH PERPLEXITY
    • Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation an
    • From Adversarial Arms Race to Model-centric Evaluation Motivating a Unified Automatic Robustness Eva
    • LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
    • Intention Analysis Makes LLMs A Good Jailbreak Defender
    • Defending Against Disinformation Attacks in Open-Domain Question Answering
    • Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
    • Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landsc
    • Round Trip Translation Defence against Large Language Model Jailbreaking Attacks
    • How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
    • SELF-GUARD: Empower the LLM to Safeguard Itself
    • Intention Analysis Makes LLMs A Good Jailbreak Defender
    • Jatmo: Prompt Injection Defense by Task-Specific Finetuning
    • Precisely the Point: Adversarial Augmentations for Faithful and Informative Text Generation
    • Adversarial Text Purification: A Large Language Model Approach for Defense
    • Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
    • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
    • Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
    • Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Application
    • Is the System Message Really Important to Jailbreaks in Large Language Models?
    • AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
    • Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
Powered by GitBook
On this page
  • 1. 研究背景
  • 2. 过去方案和缺点
  • 3. 本文方案和步骤
  • 4. 本文创新点与贡献
  • 5. 本文实验
  • 6. 实验结论
  • 7. 全文结论
  • 阅读总结
  1. Others

Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Pro

PreviousLARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELSNextInferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance

Last updated 1 year ago

1. 研究背景

随着大型语言模型(LLMs)在各种应用中的集成,它们的功能可以通过自然语言提示灵活调节。这使得LLMs容易受到有针对性的对抗性提示攻击,例如提示注入(PI)攻击,允许攻击者覆盖原始指令和使用的控制。目前,人们普遍认为用户是直接向LLM提供提示的。但是,如果用户不是直接提供提示的呢?研究表明,LLM集成应用模糊了数据和指令之间的界限。本文揭示了新的攻击向量,通过间接提示注入,使对手能够远程(无需直接接口)利用LLM集成应用,通过策略性地将提示注入可能被检索的数据中。

2. 过去方案和缺点

以往的研究主要集中在如何通过强大的算法和优化技术对抗机器学习模型。然而,LLMs通过自然提示的可扩展性使得攻击者可以采用更简单的攻击策略。即使在已经实施了缓解措施的黑盒设置中,恶意用户仍然可以通过提示注入攻击来绕过内容限制或获得对模型原始指令的访问。这些攻击通常假设是恶意用户直接利用系统进行的。然而,这种假设忽略了LLMs与检索增强功能结合后,数据和指令之间界限模糊的问题。

3. 本文方案和步骤

本文首先从计算机安全的角度开发了一个系统的分类法,以系统地研究这些新兴的漏洞和威胁。然后,通过在真实世界系统(如Bing的GPT-4驱动的聊天和代码补全引擎)和基于GPT-4构建的合成应用上展示攻击的实际可行性,强调了需要强大的防御措施。

4. 本文创新点与贡献

  • 提出了间接提示注入(IPI)的概念,这是一种完全未被研究的攻击向量,其中检索到的提示本身可以作为“任意代码”。

  • 开发了第一个针对LLM集成应用中IPI威胁景观的分类法和系统分析。

  • 展示了这些攻击在真实世界和合成系统上的实用性,强调了需要强大的防御。

  • 在GitHub仓库中分享了所有演示和开发的攻击提示,以促进未来的研究,并为LLM集成应用的安全评估构建一个开放的框架。

5. 本文实验

实验部分详细介绍了攻击的设置,包括合成应用和Bing Chat的测试,以及对Github Copilot的提示注入攻击测试。实验展示了信息收集、欺诈、恶意软件传播、入侵、内容操纵和可用性攻击等多种威胁的实际示例。

6. 实验结论

实验结果表明,通过间接提示注入,攻击者可以成功地操纵模型,数据和指令的模式没有被解开,通过聊天界面过滤的提示在间接注入时不会被过滤掉。在大多数情况下,模型在整个会话中一直保留注入。

7. 全文结论

LLM集成应用不再具有可控的输入输出通道,它们呈现了任意检索的输入,并可以调用其他外部API。这使得攻击者可以通过间接提示注入远程影响用户,并跨越关键的安全边界。本文的研究为LLM集成应用和未来自主代理的安全评估奠定了基础,为更安全的部署铺平了道路。

注1:

间接提示注入攻击(Indirect Prompt Injection, IPI)在实际生活中有多种潜在的应用场景,这些场景涉及到大型语言模型(LLMs)集成的应用,包括但不限于以下几个方面:

  1. 搜索引擎操纵:

    • 攻击者可以通过在网页中注入恶意提示,影响搜索引擎的搜索结果和摘要,从而误导用户,传播错误信息或宣传。

  2. 社交媒体和内容平台:

    • 在社交媒体帖子或博客文章中植入提示,可能导致LLM集成的推荐系统或内容摘要功能产生偏见或不实信息。

  3. 聊天机器人和客服助手:

    • 攻击者可能通过在聊天中植入提示,操纵聊天机器人的行为,使其提供错误的建议或执行不当的操作。

  4. 电子邮件客户端和个人助理:

    • 通过在电子邮件中注入提示,攻击者可以利用LLM增强的电子邮件客户端自动发送带有恶意链接或附件的邮件。

  5. 在线教育和培训平台:

    • 在在线课程材料或训练指南中植入提示,可能会误导用户,导致错误的学习成果或行为。

  6. 金融和投资建议:

    • 攻击者可以操纵LLM提供的投资建议或市场分析,误导用户做出不利的金融决策。

  7. 健康和医疗咨询:

    • 在健康相关的文档或论坛中植入提示,可能会误导用户关于医疗条件和治疗方法的决策。

  8. 法律和合规咨询:

    • 通过操纵LLM提供法律建议,攻击者可能会误导用户,导致法律风险或不合规行为。

  9. 自动化编程和代码补全:

    • 在开源代码库或文档中植入提示,可能会影响开发者使用的代码补全工具,导致潜在的安全漏洞或错误代码。

  10. 政治宣传和舆论操纵:

    • 利用LLM在新闻网站或政治论坛中传播偏见信息,操纵公众舆论和选举结果。

这些场景表明,间接提示注入攻击可能对个人、企业和整个社会产生广泛的影响,因此需要对LLM集成应用的安全性给予足够的重视,并开发有效的防御措施。

注2:

间接提示注入(Indirect Prompt Injection, IPI)是一种针对大型语言模型(LLMs)的攻击方式,它利用了LLMs在处理和生成文本时对自然语言提示的敏感性。在IPI攻击中,攻击者并不直接与LLM交互,而是通过将恶意提示(prompts)注入到LLM可能检索和处理的数据中,来间接影响LLM的行为和输出。

核心概念

  • 数据与指令的模糊界限:LLMs通常通过自然语言提示来理解和执行任务。在集成了LLM的应用中,模型可能会从外部数据源(如网页、数据库、电子邮件等)检索信息。如果这些数据源被恶意注入了提示,LLM可能会将这些提示当作正常的指令来执行,从而改变其行为。

  • 远程攻击:与传统的直接提示注入(直接向LLM发送恶意提示)不同,IPI攻击不需要攻击者与LLM有直接的交互。攻击者通过策略性地在LLM可能访问的数据中植入提示,实现对LLM的远程操控。

攻击步骤

  1. 目标选择:攻击者确定要攻击的LLM集成应用和目标用户群体。

  2. 注入策略:攻击者设计恶意提示,并将其植入到LLM可能检索的数据中。这可能包括在社交媒体上发布带有提示的内容、在搜索引擎优化(SEO)中使用提示、或者在电子邮件中嵌入提示。

  3. 检索与执行:当目标用户使用LLM集成应用时,LLM会检索并处理包含恶意提示的数据。

  4. 行为操纵:LLM在处理这些数据时,会根据恶意提示执行非预期的行为,如提供错误信息、执行特定操作或传播恶意内容。

潜在影响

  • 信息安全:IPI攻击可能导致敏感信息泄露,如用户数据或私人对话。

  • 系统控制:攻击者可能通过IPI攻击获得对LLM集成应用的远程控制能力。

  • 内容操纵:LLM可能被用来传播虚假信息、宣传或偏见内容,影响公众舆论。

  • 服务可用性:通过IPI攻击,攻击者可能干扰LLM集成应用的正常运行,导致服务中断或性能下降。

防御措施

为了防御IPI攻击,开发者和安全专家需要采取一系列措施,包括但不限于:

  • 输入过滤:对LLM接收的所有外部数据进行严格的内容过滤和监控。

  • 安全审计:定期对LLM集成应用进行安全审计,以识别和修复潜在的安全漏洞。

  • 用户教育:提高用户对潜在威胁的认识,教育他们如何安全地使用LLM集成应用。

  • 模型改进:研究和开发更强大的LLM模型,以提高对恶意提示的抵抗力。

IPI攻击揭示了LLM集成应用在安全性方面的新挑战,需要业界和学术界共同努力,以确保这些强大工具的安全和可靠使用。

阅读总结

本文深入探讨了大型语言模型集成应用的安全性问题,特别是间接提示注入的潜在威胁。通过提出新的攻击向量和系统性的分类法,本文不仅揭示了LLMs在当前和未来应用中的脆弱性,而且为如何防御这些攻击提供了宝贵的见解。实验部分通过实际的攻击演示,强调了这些威胁的现实性和紧迫性,为未来的研究和防御措施的发展提供了坚实的基础。