Large Model Safety Notes
  • Preface
  • MM-LLM
    • MM-LLMs: Recent Advances in MultiModal Large Language Models
    • Multimodal datasets: misogyny, pornography, and malignant stereotypes
    • Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
    • FOUNDATION MODELS AND FAIR USE
  • VLM-Defense
    • Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation
    • Safety Alignment for Vision Language Models
    • AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Sh
    • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
    • MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
    • Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    • SAFEGEN: Mitigating Unsafe Content Generation in Text-to-Image Models
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • UNDERSTANDING ZERO-SHOT ADVERSARIAL ROBUSTNESS FOR LARGE-SCALE MODELS
    • A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
    • CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
    • Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content
    • Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models
    • Typographic Attacks in Large Multimodal Models Can be Alleviated by More Informative Prompts
    • On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
    • Partially Recentralization Softmax Loss for Vision-Language Models Robustness
    • Adversarial Prompt Tuning for Vision-Language Models
    • Defense-Prefix for Preventing Typographic Attacks on CLIP
    • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedb
    • A Mutation-Based Method for Multi-Modal Jailbreaking Attack
    • How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
    • EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large
    • Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
    • Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Lang
    • Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoisi
    • Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks
    • HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
  • VLM
    • Scalable Performance Analysis for Vision-Language Models
  • VLM-Attack
    • Circumventing Concept Erasure Methods For Text-to-Image Generative Models
    • Efficient LLM-Jailbreaking by Introducing Visual Modality
    • From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
    • Adversarial Attacks on Multimodal Agents
    • Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Ima
    • Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
    • Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Lar
    • White-box Multimodal Jailbreaks Against Large Vision-Language Models
    • Red Teaming Visual Language Models
    • Private Attribute Inference from Images with Vision-Language Models
    • Assessment of Multimodal Large Language Models in Alignment with Human Values
    • Privacy-Aware Visual Language Models
    • Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbre
    • Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
    • Adversarial Illusions in Multi-Modal Embeddings
    • Universal Prompt Optimizer for Safe Text-to-Image Generation
    • On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
    • Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images
    • INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
    • On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
    • Hijacking Context in Large Multi-modal Models
    • Transferable Multimodal Attack on Vision-Language Pre-training Models
    • Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimoda
    • AN IMAGE IS WORTH 1000 LIES: ADVERSARIAL TRANSFERABILITY ACROSS PROMPTS ON VISION-LANGUAGE MODELS
    • Test-Time Backdoor Attacks on Multimodal Large Language Models
    • JAILBREAK IN PIECES: COMPOSITIONAL ADVERSARIAL ATTACKS ON MULTI-MODAL LANGUAGE MODELS
    • Jailbreaking Attack against Multimodal Large Language Model
    • Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
    • IMAGE HIJACKS: ADVERSARIAL IMAGES CAN CONTROL GENERATIVE MODELS AT RUNTIME
    • VISUAL ADVERSARIAL EXAMPLES JAILBREAK ALIGNED LARGE LANGUAGE MODELS
    • Query-Relevant Images Jailbreak Large Multi-Modal Models
    • Towards Adversarial Attack on Vision-Language Pre-training Models
    • How Many Are Unicorns in This Image? A Safety Evaluation Benchmark for Vision LLMs
    • SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Au
    • MISUSING TOOLS IN LARGE LANGUAGE MODELS WITH VISUAL ADVERSARIAL EXAMPLES
    • VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
    • Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Mod
    • Shadowcast: STEALTHY DATA POISONING ATTACKS AGAINST VISION-LANGUAGE MODELS
    • FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
    • THE WOLF WITHIN: COVERT INJECTION OF MALICE INTO MLLM SOCIETIES VIA AN MLLM OPERATIVE
    • Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
    • How Robust is Google’s Bard to Adversarial Image Attacks?
    • On Evaluating Adversarial Robustness of Large Vision-Language Models
    • On the Adversarial Robustness of Multi-Modal Foundation Models
    • Are aligned neural networks adversarially aligned?
    • READING ISN’T BELIEVING: ADVERSARIAL ATTACKS ON MULTI-MODAL NEURONS
    • Black Box Adversarial Prompting for Foundation Models
    • Evaluation and Analysis of Hallucination in Large Vision-Language Models
    • FOOL YOUR (VISION AND) LANGUAGE MODEL WITH EMBARRASSINGLY SIMPLE PERMUTATIONS
    • BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
    • AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning
  • T2I-Attack
    • On Copyright Risks of Text-to-Image Diffusion Models
    • ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
    • On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
    • Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
    • SneakyPrompt: Jailbreaking Text-to-image Generative Models
    • The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breach
    • Discovering Universal Semantic Triggers for Text-to-Image Synthesis
    • Automatic Jailbreaking of the Text-to-Image Generative AI Systems
  • Survey
    • Generative AI Security: Challenges and Countermeasures
    • Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems
    • Current state of LLM Risks and AI Guardrails
    • Security of AI Agents
    • Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
    • Exploring Vulnerabilities and Protections in Large Language Models: A Survey
    • Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey
    • Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Mode
    • SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Mode
    • Safety of Multimodal Large Language Models on Images and Text
    • LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study
    • Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
    • A Survey on Safe Multi-Modal Learning System
    • TRUSTWORTHY LARGE MODELS IN VISION: A SURVEY
    • A Pathway Towards Responsible AI Generated Content
    • A Survey of Hallucination in “Large” Foundation Models
    • An Early Categorization of Prompt Injection Attacks on Large Language Models
    • Comprehensive Assessment of Jailbreak Attacks Against LLMs
    • A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks
    • Adversarial Machine Learning for Social Good: Reframing the Adversary as an Ally
    • Red-Teaming for Generative AI: Silver Bullet or Security Theater?
    • A STRONGREJECT for Empty Jailbreaks
  • LVM-Attack
    • Adversarial Attacks on Foundational Vision Models
  • For Good
    • Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content
  • Benchmark
    • HALLUSIONBENCH: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusi
    • OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety
    • ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
    • HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
    • S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language
    • UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
    • JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against
    • JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
    • Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
    • ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming
    • Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Halluc
    • INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
    • AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Ins
    • ALL LANGUAGES MATTER: ON THE MULTILINGUAL SAFETY OF LARGE LANGUAGE MODELS
    • Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial
    • Red Teaming Visual Language Models
    • Unified Hallucination Detection for Multimodal Large Language Models
    • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
    • Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
    • CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
    • Detecting and Preventing Hallucinations in Large Vision Language Models
    • DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Lang
    • SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models
    • PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
    • Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
  • Explainability
    • Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attributio
  • Privacy-Defense
    • Defending Our Privacy With Backdoors
    • PromptCARE: Prompt Copyright Protection by Watermark Injection and Verification
  • Privacy-Attack
    • PANDORA’S WHITE-BOX: INCREASED TRAINING DATA LEAKAGE IN OPEN LLMS
    • Membership Inference Attacks against Large Language Models via Self-prompt Calibration
    • LANGUAGE MODEL INVERSION
    • Effective Prompt Extraction from Language Models
    • Prompt Stealing Attacks Against Large Language Models
    • Stealing Part of a Production Language Model
    • Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Cali
    • PRSA: Prompt Reverse Stealing Attacks against Large Language Models
    • Low-Resource Languages Jailbreak GPT-4
    • Scalable Extraction of Training Data from (Production) Language Models
  • Others
    • INFERRING OFFENSIVENESS IN IMAGES FROM NATURAL LANGUAGE SUPERVISION
    • An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulne
    • More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
    • AI SAFETY: A CLIMB TO ARMAGEDDON?
    • AI RISK MANAGEMENT SHOULD INCORPORATE BOTH SAFETY AND SECURITY
    • Defending Against Social Engineering Attacks in the Age of LLMs
    • Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
    • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
    • Deduplicating Training Data Makes Language Models Better
    • MITIGATING TEXT TOXICITY WITH COUNTERFACTUAL GENERATION
    • The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
    • Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
    • Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
    • Mitigating LLM Hallucinations via Conformal Abstention
    • Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
    • Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
    • An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • LARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELS
    • PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics
    • Reducing hallucination in structured outputs via Retrieval-Augmented Generation
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Mod
    • TOFU: A Task of Fictitious Unlearning for LLMs
    • Learning and Forgetting Unsafe Examples in Large Language Models
    • Exploring Adversarial Attacks against Latent Diffusion Model from the Perspective of Adversarial Tra
    • TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
    • In Search of Truth: An Interrogation Approach to Hallucination Detection
    • Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
    • Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models
    • Locating and Mitigating Gender Bias in Large Language Models
    • Learning to Edit: Aligning LLMs with Knowledge Editing
    • Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Su
    • Does DETECTGPT Fully Utilize Perturbation? Bridge Selective Perturbation to Fine-tuned Contrastive L
    • TELLER: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection
    • SPOTTING LLMS WITH BINOCULARS: ZERO-SHOT DETECTION OF MACHINE-GENERATED TEXT
    • LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
    • WHAT’S IN MY BIG DATA?
    • UNDERSTANDING CATASTROPHIC FORGETTING IN LANGUAGE MODELS VIA IMPLICIT INFERENCE
    • Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
    • Toxicity in CHATGPT: Analyzing Persona-assigned Language Models
    • MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
    • Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models
    • Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Sug
    • Zero shot VLMs for hate meme detection: Are we there yet?
    • ANALYZING AND MITIGATING OBJECT HALLUCINATION IN LARGE VISION-LANGUAGE MODELS
    • MITIGATING HALLUCINATION IN LARGE MULTIMODAL MODELS VIA ROBUST INSTRUCTION TUNING
    • DENEVIL: TOWARDS DECIPHERING AND NAVIGATING THE ETHICAL VALUES OF LARGE LANGUAGE MODELS VIA INSTRUCT
    • Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates
    • Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity
    • Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Pro
    • InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
    • CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
    • AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
    • Prompt Injection Attacks and Defenses in LLM-Integrated Applications
    • Removing RLHF Protections in GPT-4 via Fine-Tuning
    • SPML: A DSL for Defending Language Models Against Prompt Attacks
    • Stealthy Attack on Large Language Model based Recommendation
    • Large Language Models Sometimes Generate Purely Negatively-Reinforced Text
    • On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
    • Learning from data in the mixed adversarial non-adversarial case: Finding the helpers and ignoring t
    • longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A system
    • A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
    • Adversarial Examples Generation for Reducing Implicit Gender Bias in Pre-trained Models
    • Discovering the Hidden Vocabulary of DALLE-2
    • Raising the Cost of Malicious AI-Powered Image Editing
    • Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimi
    • ALIGNERS: DECOUPLING LLMS AND ALIGNMENT
    • CAN LLM-GENERATED MISINFORMATION BE DETECTED?
    • On the Risk of Misinformation Pollution with Large Language Models
    • Evading Watermark based Detection of AI-Generated Content
    • Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World’s Uglin
    • Privacy-Preserving Instructions for Aligning Large Language Models
    • TOWARDS UNDERSTANDING THE INTERPLAY OF GENERATIVE ARTIFICIAL INTELLIGENCE AND THE INTERNET
    • Evaluating the Social Impact of Generative AI Systems in Systems and Society
    • Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities
    • TOWARDS RESPONSIBLE AI IN THE ERA OF GENERATIVE AI: A REFERENCE ARCHITECTURE FOR DESIGNING FOUNDATIO
    • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
    • Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safet
    • Risk Assessment and Statistical Significance in the Age of Foundation Models
    • The Foundation Model Transparency Index
    • The Privacy Pillar - A Conceptual Framework for Foundation Model-based Systems
    • A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribu
    • Foundational Moral Values for AI Alignment
    • Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
    • ON CATASTROPHIC INHERITANCE OF LARGE FOUNDATION MODELS
    • Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning
    • Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustmen
    • Foundation Model Transparency Reports
    • SECURING RELIABILITY: A BRIEF OVERVIEW ON ENHANCING IN-CONTEXT LEARNING FOR FOUNDATION MODELS
    • EXPLORING THE ADVERSARIAL CAPABILITIES OF LARGE LANGUAGE MODELS
    • TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
    • LLM-Resistant Math Word Problem Generation via Adversarial Attacks
    • Efficient Black-Box Adversarial Attacks on Neural Text Detectors
    • Adversarial Preference Optimization
    • Combating Adversarial Attacks with Multi-Agent Debate
    • How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Q
    • L-AutoDA: Leveraging Large Language Models for Automated Decision-based Adversarial Attacks
    • Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection
    • What Does the Bot Say? Opportunities and Risks of Large Language Models in Social Media Bot Detectio
    • Prompted Contextual Vectors for Spear-Phishing Detection
    • Token-Ensemble Text Generation: On Attacking the Automatic AI-Generated Text Detection
    • Recursive Chain-of-Feedback Prevents Performance Degradation from Redundant Prompting
    • Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
    • RADAR: Robust AI-Text Detection via Adversarial Learning
    • OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examp
    • Why do universal adversarial attacks work on large language models?: Geometry might be the answer
    • J-Guard: Journalism Guided Adversarially Robust Detection of AI-generated News
    • Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge
    • Detoxifying Large Language Models via Knowledge Editing
    • Healing Unsafe Dialogue Responses with Weak Supervision Signals
  • LLM-Attack
    • Hacc-Man: An Arcade Game for Jailbreaking LLMs
    • Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks
    • DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
    • Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
    • Hijacking Large Language Models via Adversarial In-Context Learning
    • Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
    • DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models
    • FRONTIER LANGUAGE MODELS ARE NOT ROBUST TO ADVERSARIAL ARITHMETIC, OR “WHAT DO I NEED TO SAY SO YOU
    • Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignm
    • Evil Geniuses: Delving into the Safety of LLM-based Agents
    • BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B
    • SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
    • Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
    • ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger
    • Tastle: Distract Large Language Models for Automatic Jailbreak Attack
    • Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
    • Learning to Poison Large Language Models During Instruction Tuning
    • TALK TOO MUCH: Poisoning Large Language Models under Token Limit
    • Don’t Say No: Jailbreaking LLM by Suppressing Refusal
    • Goal-guided Generative Prompt Injection Attack on Large Language Models
    • Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
    • BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents
    • AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
    • QROA: A Black-Box Query-Response Optimization Attack on LLMs
    • BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models
    • Improved Generation of Adversarial Examples Against Safety-aligned LLMs
    • Exploring Backdoor Attacks against Large Language Model-based Decision Making
    • Jailbreak Paradox: The Achilles’ Heel of LLMs
    • Stealth edits for provably fixing or attacking large language models
    • IS POISONING A REAL THREAT TO LLM ALIGNMENT? MAYBE MORE SO THAN YOU THINK
    • Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
    • “Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailb
    • Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
    • Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
    • Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
    • StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Enco
    • WHEN LLM MEETS DRL: ADVANCING JAILBREAKING EFFICIENCY VIA DRL-GUIDED SEARCH
    • Context Injection Attacks on Large Language Models
    • Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
    • Phantom: General Trigger Attacks on Retrieval Augmented Language Generation
    • On Trojans in Refined Language Models
    • A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measur
    • How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
    • JAILBREAKING AS A REWARD MISSPECIFICATION PROBLEM
    • ObscurePrompt: Jailbreaking Large Language Models via Obscure Input
    • ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
    • Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
    • Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
    • AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization
    • Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM
    • CAN LLMS DEEPLY DETECT COMPLEX MALICIOUS QUERIES? A FRAMEWORK FOR JAILBREAKING VIA OBFUSCATING INTEN
    • Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chai
    • JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
    • AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbre
    • SANDWICH ATTACK: MULTI-LANGUAGE MIXTURE ADAPTIVE ATTACK ON LLMS
    • Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
    • Using Hallucinations to Bypass RLHF Filters
    • TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4
    • OPEN SESAME! UNIVERSAL BLACK BOX JAILBREAKING OF LARGE LANGUAGE MODELS
    • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
    • Weak-to-Strong Jailbreaking on Large Language Models
    • Punctuation Matters! Stealthy Backdoor Attack for Language Models
    • BYPASSING THE SAFETY TRAINING OF OPEN-SOURCE LLMS WITH PRIMING ATTACKS
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • A Semantic, Syntactic, And Context-Aware Natural Language Adversarial Example Generator
    • Fast Adversarial Attacks on Language Models In One GPU Minute
    • Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
    • Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks
    • Automatic and Universal Prompt Injection Attacks against Large Language Models
    • Prompt Injection Attacks and Defenses in LLM-Integrated Applications
    • TENSOR TRUST: INTERPRETABLE PROMPT INJECTION ATTACKS FROM AN ONLINE GAME
    • DPP-Based Adversarial Prompt Searching for Lanugage Models
    • Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Agai
    • Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization
    • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
    • Prompt Injection attack against LLM-integrated Applications
    • FINE-TUNING ALIGNED LANGUAGE MODELS COMPROMISES SAFETY, EVEN WHEN USERS DO NOT INTEND TO!
    • CATASTROPHIC JAILBREAK OF OPEN-SOURCE LLMS VIA EXPLOITING GENERATION
    • EVALUATING THE SUSCEPTIBILITY OF PRE-TRAINED LANGUAGE MODELS VIA HANDCRAFTED ADVERSARIAL EXAMPLES
    • Defending LLMs against Jailbreaking Attacks via Backtranslation
    • EMULATED DISALIGNMENT: SAFETY ALIGNMENT FOR LARGE LANGUAGE MODELS MAY BACKFIRE!
    • GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
    • ON THE SAFETY OF OPEN-SOURCED LARGE LANGUAGE MODELS: DOES ALIGNMENT REALLY PREVENT THEM FROM BEING
    • Unveiling the Implicit Toxicity in Large Language Models
    • ALIGNMENT IS NOT SUFFICIENT TO PREVENT LARGE LANGUAGE MODELS FROM GENERATING HARMFUL INFORMATION:
    • LANGUAGE MODEL UNALIGNMENT: PARAMETRIC RED-TEAMING TO EXPOSE HIDDEN HARMS AND BIASES
    • IMMUNIZATION AGAINST HARMFUL FINE-TUNING ATTACKS
    • Composite Backdoor Attacks Against Large Language Models
    • AWolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easi
    • ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS
    • LLMJailbreak Attack versus Defense Techniques- A Comprehensive Study
    • Weak-to-Strong Jailbreaking on Large Language Models
    • MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds
    • Universal and Transferable Adversarial Attacks on Aligned Language Models
    • COERCING LLMS TO DO AND REVEAL (ALMOST) ANYTHING
    • Generating Valid and Natural Adversarial Examples with Large Language Models
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • Scaling Laws for Adversarial Attacks on Language Model Activations
    • Ignore Previous Prompt: Attack Techniques For Language Models
    • ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
    • A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems
    • ATTACKING LARGE LANGUAGE MODELS WITH PROJECTED GRADIENT DESCENT
    • Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
    • Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embed
    • Query-Based Adversarial Prompt Generation
    • COERCING LLMS TO DO AND REVEAL (ALMOST) ANYTHING
    • Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
    • Fast Adversarial Attacks on Language Models In One GPU Minute
    • DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
    • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Transla
    • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
    • CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
    • Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
    • Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
    • PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
    • A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
    • LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
    • Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
    • SHORTCUTS ARISING FROM CONTRAST: EFFECTIVE AND COVERT CLEAN-LABEL ATTACKS IN PROMPT-BASED LEARNING
    • What’s in Your “Safe” Data?: Identifying Benign Data that Breaks Safety
    • DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Transla
    • DeepInception: Hypnotize Large Language Model to Be Jailbreaker
    • Hijacking Large Language Models via Adversarial In-Context Learning
    • EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models
    • LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models
    • DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
    • Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
    • Conversation Reconstruction Attack Against GPT Models
    • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
    • PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
    • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
    • Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit
    • Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
    • Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
    • UNIVERSAL JAILBREAK BACKDOORS FROM POISONED HUMAN FEEDBACK
    • Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking
    • Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
    • POISONPROMPT: BACKDOOR ATTACK ON PROMPT-BASED LARGE LANGUAGE MODELS
    • BACKDOORING INSTRUCTION-TUNED LARGE LANGUAGE MODELS WITH VIRTUAL PROMPT INJECTION
    • Backdoor Attacks for In-Context Learning with Language Models
    • Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
    • UOR: Universal Backdoor Attacks on Pre-trained Language Models
    • Fake Alignment: Are LLMs Really Aligned Well?
    • Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
    • Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignm
    • Imperio: Language-Guided Backdoor Attacks for Arbitrary Model Control
    • Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Agai
    • Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
    • BADCHAIN: BACKDOOR CHAIN-OF-THOUGHT PROMPTING FOR LARGE LANGUAGE MODELS
    • AUTODAN: INTERPRETABLE GRADIENT-BASED ADVERSARIAL ATTACKS ON LARGE LANGUAGE MODELS
    • AN LLM CAN FOOL ITSELF: A PROMPT-BASED ADVERSARIAL ATTACK
    • AUTOMATIC HALLUCINATION ASSESSMENT FOR ALIGNED LARGE LANGUAGE MODELS VIA TRANSFERABLE ADVERSARIAL AT
    • LLM LIES: HALLUCINATIONS ARE NOT BUGS, BUT FEATURES AS ADVERSARIAL EXAMPLES
    • LOFT: LOCAL PROXY FINE-TUNING FOR IMPROVING TRANSFERABILITY OF ADVERSARIAL ATTACKS AGAINST LARGE LAN
    • Universal and Transferable Adversarial Attacks on Aligned Language Models
    • Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of
    • BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
    • Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
    • Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Recon
    • Adversarial Demonstration Attacks on Large Language Models
    • COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in Language Models
    • The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Mod
    • Open the Pandora’s Box of LLMs: Jailbreaking LLMs through Representation Engineering
    • How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Huma
    • Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
    • PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
    • Weak-to-Strong Jailbreaking on Large Language Models
    • Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
    • Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
    • Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
    • Jailbroken: How Does LLM Safety Training Fail?
    • ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
    • GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large
    • Tastle: Distract Large Language Models for Automatic Jailbreak Attack
    • Exploring Safety Generalization Challenges of Large Language Models via Code
    • Learning to Poison Large Language Models During Instruction Tuning
    • BADEDIT: BACKDOORING LARGE LANGUAGE MODELS BY MODEL EDITING
    • Composite Backdoor Attacks Against Large Language Models
    • LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS
    • THE POISON OF ALIGNMENT
    • The Philosopher’s Stone: Trojaning Plugins of Large Language Models
    • RAPID OPTIMIZATION FOR JAILBREAKING LLMS VIA SUBCONSCIOUS EXPLOITATION AND ECHOPRAXIA
    • Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
    • RED TEAMING GPT-4V: ARE GPT-4V SAFE AGAINST UNI/MULTI-MODAL JAILBREAK ATTACKS ?
    • PAL: Proxy-Guided Black-Box Attack on Large Language Models
    • INCREASED LLM VULNERABILITIES FROM FINETUNING AND QUANTIZATION
    • Rethinking How to Evaluate Language Model Jailbreak
    • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
    • GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
    • Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
    • Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
    • AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
    • Universal Adversarial Triggers Are Not Universal
    • PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
  • LLM-Defense
    • LANGUAGE MODELS ARE HOMER SIMPSON!
    • garak : A Framework for Security Probing Large Language Models
    • Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
    • Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
    • PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
    • The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
    • BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
    • Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
    • Efficient Adversarial Training in LLMs with Continuous Attacks
    • StruQ: Defending Against Prompt Injection with Structured Queries
    • Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
    • GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
    • Defending Jailbreak Prompts via In-Context Adversarial Game
    • Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework
    • Jailbreaker in Jail: Moving Target Defense for Large Language Models
    • DEFENDING AGAINST ALIGNMENT-BREAKING ATTACKS VIA ROBUSTLY ALIGNED LLM
    • Causality Analysis for Evaluating the Security of Large Language Models
    • AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
    • Jailbreaking is Best Solved by Definition
    • RIGORLLM: RESILIENT GUARDRAILS FOR LARGE LANGUAGE MODELS AGAINST UNDESIRED CONTENT
    • LANGUAGE MODELS ARE HOMER SIMPSON! Safety Re-Alignment of Fine-tuned Language Models through Task Ar
    • Defending Against Indirect Prompt Injection Attacks With Spotlighting
    • LLMGuard: Guarding against Unsafe LLM Behavior
    • Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
    • ON TROJAN SIGNATURES IN LARGE LANGUAGE MODELS OF CODE
    • Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
    • Detoxifying Large Language Models via Knowledge Editing
    • MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
    • THE POISON OF ALIGNMENT
    • ROSE: Robust Selective Fine-tuning for Pre-trained Language Models
    • GAINING WISDOM FROM SETBACKS : ALIGNING LARGE LANGUAGE MODELS VIA MISTAKE ANALYSIS
    • Making Harmful Behaviors Unlearnable for Large Language Models
    • Fake Alignment: Are LLMs Really Aligned Well?
    • Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
    • Vaccine: Perturbation-aware Alignment for Large Language Model
    • DEFENDING LARGE LANGUAGE MODELS AGAINST JAILBREAK ATTACKS VIA SEMANTIC SMOOTHING
    • Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    • DEFENDING AGAINST ALIGNMENT-BREAKING AT TACKS VIA ROBUSTLY ALIGNED LLM
    • LLMSelf Defense: By Self Examination, LLMsKnowTheyAreBeing Tricked
    • BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
    • Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
    • LLMsCanDefend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
    • Detoxifying Text with MARCO: Controllable Revision with Experts and Anti-Experts
    • Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
    • Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Lang
    • CAMOUFLAGE IS ALL YOU NEED: EVALUATING AND ENHANCING LANGUAGE MODEL ROBUSTNESS AGAINST CAMOUFLAGE AD
    • Defending Jailbreak Prompts via In-Context Adversarial Game
    • Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    • Defending LLMs against Jailbreaking Attacks via Backtranslation
    • IMMUNIZATION AGAINST HARMFUL FINE-TUNING ATTACKS
    • Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
    • JAB: Joint Adversarial Prompting and Belief Augmentation
    • TOKEN-LEVEL ADVERSARIAL PROMPT DETECTION BASED ON PERPLEXITY MEASURES AND CONTEXTUAL INFORMATION
    • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
    • Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
    • Vaccine: Perturbation-aware Alignment for Large Language Model
    • Improving the Robustness of Large Language Models via Consistency Alignment
    • SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
    • Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
    • Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
    • LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
    • Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language
    • Analyzing And Editing Inner Mechanisms of Backdoored Language Models
    • Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
    • ROBUSTIFYING LANGUAGE MODELS WITH TESTTIME ADAPTATION
    • Jailbreaker in Jail: Moving Target Defense for Large Language
    • DETECTING LANGUAGE MODEL ATTACKS WITH PERPLEXITY
    • Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation an
    • From Adversarial Arms Race to Model-centric Evaluation Motivating a Unified Automatic Robustness Eva
    • LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
    • Intention Analysis Makes LLMs A Good Jailbreak Defender
    • Defending Against Disinformation Attacks in Open-Domain Question Answering
    • Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
    • Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landsc
    • Round Trip Translation Defence against Large Language Model Jailbreaking Attacks
    • How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
    • SELF-GUARD: Empower the LLM to Safeguard Itself
    • Intention Analysis Makes LLMs A Good Jailbreak Defender
    • Jatmo: Prompt Injection Defense by Task-Specific Finetuning
    • Precisely the Point: Adversarial Augmentations for Faithful and Informative Text Generation
    • Adversarial Text Purification: A Large Language Model Approach for Defense
    • Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
    • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
    • Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
    • Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Application
    • Is the System Message Really Important to Jailbreaks in Large Language Models?
    • AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
    • Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
Powered by GitBook
On this page
  1. Others

More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness

Last updated 10 months ago

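Reading Summary Report

1. Research Background

With the rapid development of large language models (LLMs), they have shown unprecedented performance across many cognitive tasks. To harness these powerful models safely, however, their outputs must be closely aligned with human values. Preference learning algorithms such as reinforcement learning from human feedback (RLHF) have proven effective at aligning models with human preferences, but their presumed improvement of model trustworthiness has not been thoroughly verified.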

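2. Prior Approaches and Their Shortcomings

Existing research leaves a clear gap: little is known about how popular preference learning frameworks affect the trustworthiness of the models they fine-tune. RLHF and its variants aim to improve alignment with human preferences, but the scope of their effects has not been fully explored.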

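3. Approach and Steps

This study aims to fill that gap by examining how different RLHF variants, namely supervised fine-tuning (SFT), proximal policy optimization (PPO), and direct preference optimization (DPO), affect a range of language model trustworthiness benchmarks. It covers five key aspects of model behavior: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy.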
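Of the three variants, DPO is the one whose objective fits in a few lines, so a small sketch may help make it concrete. The function below is the standard per-example DPO loss as commonly published, not code from the paper under review. The argument names and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed log-probabilities of each full response.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    All inputs are log-probabilities of a whole response (assumed, for
    illustration, to be summed over tokens). beta=0.1 is a hypothetical default.
    """
    # How much more the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to avoid overflow in exp().
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference, both margins are zero and the loss is log 2. It falls below that value as the policy shifts probability toward the preferred response relative to the reference.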

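4. Novelty and Contributions

  • The study provides empirical results showing how preference learning algorithms affect LLMs.

  • It analyzes these results as a first step toward understanding the complex dynamics of these alignment methods and their implications for trustworthy AI.

  • It aims to steer the community toward developing LLMs that are both capable and trustworthy.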

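5. Experiments

  • The open-source Pythia suite serves as the target models, with sizes ranging from 70 million to 6.9 billion parameters.

  • The Anthropic HH dataset is used as the human preference dataset for model alignment.

  • The models are fine-tuned with three popular RLHF variants and evaluated on five trustworthiness dimensions.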
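The setup above amounts to a grid of model size × alignment method × trustworthiness dimension. The following is a minimal sketch of that grid, not the authors' code. The size list contains the publicly released Pythia checkpoints between 70M and 6.9B parameters; the summary does not say which subset was actually used, so treat the list, the `build_grid` helper, and the dimension labels as illustrative assumptions.

```python
from itertools import product

# Publicly released Pythia sizes in the 70M-6.9B range (Hugging Face ids).
# Assumption: the summary only gives the endpoints, not the exact subset used.
PYTHIA_SIZES = ["70m", "160m", "410m", "1b", "1.4b", "2.8b", "6.9b"]
METHODS = ["SFT", "PPO", "DPO"]
DIMENSIONS = ["toxicity", "stereotypical_bias", "machine_ethics",
              "truthfulness", "privacy"]

def build_grid(sizes=PYTHIA_SIZES, methods=METHODS, dimensions=DIMENSIONS):
    """Enumerate every (model, method, dimension) evaluation cell."""
    return [
        {"model": f"EleutherAI/pythia-{size}", "method": method, "dimension": dim}
        for size, method, dim in product(sizes, methods, dimensions)
    ]

if __name__ == "__main__":
    grid = build_grid()
    print(len(grid), "evaluation cells; first:", grid[0])
```

Laying the runs out this way makes it easy to see that a claim such as "method X increases bias" has to hold across every model size before it can be stated as a general trend.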

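6. Experimental Findings

  • Because of the complex interplay among preference data, alignment algorithms, and specific trustworthiness aspects, an improvement in trustworthiness from RLHF is far from guaranteed, and RLHF can sometimes have the opposite effect.

  • As model size increases, only DPO yields a slight reduction in toxicity, whereas PPO and SFT lead to increased toxicity.

  • All three alignment methods significantly increase stereotypical bias and reduce the truthfulness of model outputs.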

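7. Overall Conclusion

This study provides a comprehensive analysis of how human preference alignment techniques, specifically three RLHF variants, affect language model trustworthiness across five key verticals. The results reveal a complex interplay between these aspects and the alignment method used, and underline the importance of treating trustworthiness as multifaceted when developing and deploying language models.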

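Reading Summary

This study examines in depth how RLHF and its variants affect language models across several key trustworthiness aspects. It finds that alignment does not always improve trustworthiness and can sometimes harm it. The results indicate that the effect of an alignment method depends on the preference dataset and the specific algorithm used, and that these relationships are inherently complex and hard to generalize. By exposing the internal dynamics of these alignment methods, the study hopes to guide the future development of more trustworthy language models.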