Large Model Safety Notes
  • Preface
  • MM-LLM
    • MM-LLMs: Recent Advances in MultiModal Large Language Models
    • Multimodal datasets: misogyny, pornography, and malignant stereotypes
    • Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
    • FOUNDATION MODELS AND FAIR USE
  • VLM-Defense
    • Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation
    • Safety Alignment for Vision Language Models
    • AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Sh
    • Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation
    • MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
    • Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    • SAFEGEN: Mitigating Unsafe Content Generation in Text-to-Image Models
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • UNDERSTANDING ZERO-SHOT ADVERSARIAL ROBUSTNESS FOR LARGE-SCALE MODELS
    • A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
    • CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning
    • Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content
    • Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models
    • Typographic Attacks in Large Multimodal Models Can be Alleviated by More Informative Prompts
    • On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
    • Partially Recentralization Softmax Loss for Vision-Language Models Robustness
    • Adversarial Prompt Tuning for Vision-Language Models
    • Defense-Prefix for Preventing Typographic Attacks on CLIP
    • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedb
    • How Easy Is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
    • EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large
    • Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
    • Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Lang
    • Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoisi
    • Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks
    • HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
  • VLM
    • Scalable Performance Analysis for Vision-Language Models
  • VLM-Attack
    • Circumventing Concept Erasure Methods For Text-to-Image Generative Models
    • Efficient LLM-Jailbreaking by Introducing Visual Modality
    • From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
    • Adversarial Attacks on Multimodal Agents
    • Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Ima
    • Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
    • Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Lar
    • White-box Multimodal Jailbreaks Against Large Vision-Language Models
    • Red Teaming Visual Language Models
    • Private Attribute Inference from Images with Vision-Language Models
    • Assessment of Multimodal Large Language Models in Alignment with Human Values
    • Privacy-Aware Visual Language Models
    • Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbre
    • Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
    • Adversarial Illusions in Multi-Modal Embeddings
    • Universal Prompt Optimizer for Safe Text-to-Image Generation
    • On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
    • Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meets Adversarial Images
    • INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
    • On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
    • Hijacking Context in Large Multi-modal Models
    • Transferable Multimodal Attack on Vision-Language Pre-training Models
    • Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimoda
    • AN IMAGE IS WORTH 1000 LIES: ADVERSARIAL TRANSFERABILITY ACROSS PROMPTS ON VISION-LANGUAGE MODELS
    • Test-Time Backdoor Attacks on Multimodal Large Language Models
    • JAILBREAK IN PIECES: COMPOSITIONAL ADVERSARIAL ATTACKS ON MULTI-MODAL LANGUAGE MODELS
    • Jailbreaking Attack against Multimodal Large Language Model
    • Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
    • IMAGE HIJACKS: ADVERSARIAL IMAGES CAN CONTROL GENERATIVE MODELS AT RUNTIME
    • VISUAL ADVERSARIAL EXAMPLES JAILBREAK ALIGNED LARGE LANGUAGE MODELS
    • Query-Relevant Images Jailbreak Large Multi-Modal Models
    • Towards Adversarial Attack on Vision-Language Pre-training Models
    • How Many Are Unicorns in This Image? A Safety Evaluation Benchmark for Vision LLMs
    • SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Au
    • MISUSING TOOLS IN LARGE LANGUAGE MODELS WITH VISUAL ADVERSARIAL EXAMPLES
    • VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
    • Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Mod
    • Shadowcast: STEALTHY DATA POISONING ATTACKS AGAINST VISION-LANGUAGE MODELS
    • FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
    • THE WOLF WITHIN: COVERT INJECTION OF MALICE INTO MLLM SOCIETIES VIA AN MLLM OPERATIVE
    • Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
    • How Robust is Google’s Bard to Adversarial Image Attacks?
    • On Evaluating Adversarial Robustness of Large Vision-Language Models
    • On the Adversarial Robustness of Multi-Modal Foundation Models
    • Are aligned neural networks adversarially aligned?
    • READING ISN’T BELIEVING: ADVERSARIAL ATTACKS ON MULTI-MODAL NEURONS
    • Black Box Adversarial Prompting for Foundation Models
    • Evaluation and Analysis of Hallucination in Large Vision-Language Models
    • FOOL YOUR (VISION AND) LANGUAGE MODEL WITH EMBARRASSINGLY SIMPLE PERMUTATIONS
    • BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
    • AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning
  • T2I-Attack
    • On Copyright Risks of Text-to-Image Diffusion Models
    • ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
    • On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
    • Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
    • SneakyPrompt: Jailbreaking Text-to-image Generative Models
    • The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breach
    • Discovering Universal Semantic Triggers for Text-to-Image Synthesis
    • Automatic Jailbreaking of the Text-to-Image Generative AI Systems
  • Survey
    • Generative AI Security: Challenges and Countermeasures
    • Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems
    • Current state of LLM Risks and AI Guardrails
    • Security of AI Agents
    • Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
    • Exploring Vulnerabilities and Protections in Large Language Models: A Survey
    • Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey
    • Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Mode
    • SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Mode
    • Safety of Multimodal Large Language Models on Images and Text
    • LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study
    • Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
    • A Survey on Safe Multi-Modal Learning System
    • TRUSTWORTHY LARGE MODELS IN VISION: A SURVEY
    • A Pathway Towards Responsible AI Generated Content
    • A Survey of Hallucination in “Large” Foundation Models
    • An Early Categorization of Prompt Injection Attacks on Large Language Models
    • Comprehensive Assessment of Jailbreak Attacks Against LLMs
    • A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks
    • Adversarial Machine Learning for Social Good: Reframing the Adversary as an Ally
    • Red-Teaming for Generative AI: Silver Bullet or Security Theater?
    • A STRONGREJECT for Empty Jailbreaks
  • LVM-Attack
    • Adversarial Attacks on Foundational Vision Models
  • For Good
    • Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content
  • Benchmark
    • HALLUSIONBENCH: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusi
    • OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety
    • ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
    • HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
    • S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language
    • UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
    • JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against
    • JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
    • Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
    • ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming
    • Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Halluc
    • INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
    • AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Ins
    • ALL LANGUAGES MATTER: ON THE MULTILINGUAL SAFETY OF LARGE LANGUAGE MODELS
    • Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial
    • Red Teaming Visual Language Models
    • Unified Hallucination Detection for Multimodal Large Language Models
    • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
    • Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
    • CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
    • Detecting and Preventing Hallucinations in Large Vision Language Models
    • DRESS : Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Lang
    • SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models
    • PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
    • Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
  • Explainability
    • Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attributio
  • Privacy-Defense
    • Defending Our Privacy With Backdoors
    • PromptCARE: Prompt Copyright Protection by Watermark Injection and Verification
  • Privacy-Attack
    • PANDORA’S WHITE-BOX: INCREASED TRAINING DATA LEAKAGE IN OPEN LLMS
    • Membership Inference Attacks against Large Language Models via Self-prompt Calibration
    • LANGUAGE MODEL INVERSION
    • Effective Prompt Extraction from Language Models
    • Prompt Stealing Attacks Against Large Language Models
    • Stealing Part of a Production Language Model
    • Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Cali
    • PRSA: Prompt Reverse Stealing Attacks against Large Language Models
    • Low-Resource Languages Jailbreak GPT-4
    • Scalable Extraction of Training Data from (Production) Language Models
  • Others
    • INFERRING OFFENSIVENESS IN IMAGES FROM NATURAL LANGUAGE SUPERVISION
    • An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulne
    • More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
    • AI SAFETY: A CLIMB TO ARMAGEDDON?
    • AI RISK MANAGEMENT SHOULD INCORPORATE BOTH SAFETY AND SECURITY
    • Defending Against Social Engineering Attacks in the Age of LLMs
    • Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
    • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
    • Deduplicating Training Data Makes Language Models Better
    • MITIGATING TEXT TOXICITY WITH COUNTERFACTUAL GENERATION
    • The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
    • Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
    • Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
    • Mitigating LLM Hallucinations via Conformal Abstention
    • Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
    • Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
    • An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape
    • Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
    • LARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELS
    • PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics
    • Reducing hallucination in structured outputs via Retrieval-Augmented Generation
    • Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Mod
    • TOFU: A Task of Fictitious Unlearning for LLMs
    • Learning and Forgetting Unsafe Examples in Large Language Models
    • Exploring Adversarial Attacks against Latent Diffusion Model from the Perspective of Adversarial Tra
    • TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
    • In Search of Truth: An Interrogation Approach to Hallucination Detection
    • Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
    • Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models
    • Locating and Mitigating Gender Bias in Large Language Models
    • Learning to Edit: Aligning LLMs with Knowledge Editing
    • Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Su
    • Does DETECTGPT Fully Utilize Perturbation? Bridge Selective Perturbation to Fine-tuned Contrastive L
    • TELLER: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection
    • SPOTTING LLMS WITH BINOCULARS: ZERO-SHOT DETECTION OF MACHINE-GENERATED TEXT
    • LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
    • WHAT’S IN MY BIG DATA?
    • UNDERSTANDING CATASTROPHIC FORGETTING IN LANGUAGE MODELS VIA IMPLICIT INFERENCE
    • Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
    • Toxicity in CHATGPT: Analyzing Persona-assigned Language Models
    • MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
    • Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models
    • Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Sug
    • Zero shot VLMs for hate meme detection: Are we there yet?
    • ANALYZING AND MITIGATING OBJECT HALLUCINATION IN LARGE VISION-LANGUAGE MODELS
    • MITIGATING HALLUCINATION IN LARGE MULTIMODAL MODELS VIA ROBUST INSTRUCTION TUNING
    • DENEVIL: TOWARDS DECIPHERING AND NAVIGATING THE ETHICAL VALUES OF LARGE LANGUAGE MODELS VIA INSTRUCT
    • Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates
    • Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity
    • Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Pro
    • InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
    • CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
    • AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
    • Prompt Injection Attacks and Defenses in LLM-Integrated Applications
    • Removing RLHF Protections in GPT-4 via Fine-Tuning
    • SPML: A DSL for Defending Language Models Against Prompt Attacks
    • Stealthy Attack on Large Language Model based Recommendation
    • Large Language Models Sometimes Generate Purely Negatively-Reinforced Text
    • On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
    • Learning from data in the mixed adversarial non-adversarial case: Finding the helpers and ignoring t
    • longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A system
    • A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
    • Adversarial Examples Generation for Reducing Implicit Gender Bias in Pre-trained Models
    • Discovering the Hidden Vocabulary of DALLE-2
    • Raising the Cost of Malicious AI-Powered Image Editing
    • Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimi
    • ALIGNERS: DECOUPLING LLMS AND ALIGNMENT
    • CAN LLM-GENERATED MISINFORMATION BE DETECTED?
    • On the Risk of Misinformation Pollution with Large Language Models
    • Evading Watermark based Detection of AI-Generated Content
    • Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World’s Uglin
    • Privacy-Preserving Instructions for Aligning Large Language Models
    • TOWARDS UNDERSTANDING THE INTERPLAY OF GENERATIVE ARTIFICIAL INTELLIGENCE AND THE INTERNET
    • Evaluating the Social Impact of Generative AI Systems in Systems and Society
    • Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities
    • TOWARDS RESPONSIBLE AI IN THE ERA OF GENERATIVE AI: A REFERENCE ARCHITECTURE FOR DESIGNING FOUNDATIO
    • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
    • Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safet
    • Risk Assessment and Statistical Significance in the Age of Foundation Models
    • The Foundation Model Transparency Index
    • The Privacy Pillar - A Conceptual Framework for Foundation Model-based Systems
    • A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribu
    • Foundational Moral Values for AI Alignment
    • Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
    • ON CATASTROPHIC INHERITANCE OF LARGE FOUNDATION MODELS
    • Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning
    • Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustmen
    • Foundation Model Transparency Reports
    • SECURING RELIABILITY: A BRIEF OVERVIEW ON ENHANCING IN-CONTEXT LEARNING FOR FOUNDATION MODELS
    • EXPLORING THE ADVERSARIAL CAPABILITIES OF LARGE LANGUAGE MODELS
    • TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
    • LLM-Resistant Math Word Problem Generation via Adversarial Attacks
    • Efficient Black-Box Adversarial Attacks on Neural Text Detectors
    • Adversarial Preference Optimization
    • Combating Adversarial Attacks with Multi-Agent Debate
    • How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Q
    • L-AutoDA: Leveraging Large Language Models for Automated Decision-based Adversarial Attacks
    • Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection
    • What Does the Bot Say? Opportunities and Risks of Large Language Models in Social Media Bot Detectio
    • Prompted Contextual Vectors for Spear-Phishing Detection
    • Token-Ensemble Text Generation: On Attacking the Automatic AI-Generated Text Detection
    • Recursive Chain-of-Feedback Prevents Performance Degradation from Redundant Prompting
    • Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
    • RADAR: Robust AI-Text Detection via Adversarial Learning
    • OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examp
    • Why do universal adversarial attacks work on large language models?: Geometry might be the answer
    • J-Guard: Journalism Guided Adversarially Robust Detection of AI-generated News
    • Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge
    • Detoxifying Large Language Models via Knowledge Editing
    • Healing Unsafe Dialogue Responses with Weak Supervision Signals
  • LLM-Attack
    • Hacc-Man: An Arcade Game for Jailbreaking LLMs
    • Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks
    • DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
    • Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
    • Hijacking Large Language Models via Adversarial In-Context Learning
    • Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
    • DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models
    • FRONTIER LANGUAGE MODELS ARE NOT ROBUST TO ADVERSARIAL ARITHMETIC, OR “WHAT DO I NEED TO SAY SO YOU
    • Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignm
    • Evil Geniuses: Delving into the Safety of LLM-based Agents
    • BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B
    • SHADOW ALIGNMENT: THE EASE OF SUBVERTING SAFELY-ALIGNED LANGUAGE MODELS
    • Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
    • ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger
    • Tastle: Distract Large Language Models for Automatic Jailbreak Attack
    • Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
    • Learning to Poison Large Language Models During Instruction Tuning
    • TALK TOO MUCH: Poisoning Large Language Models under Token Limit
    • Don’t Say No: Jailbreaking LLM by Suppressing Refusal
    • Goal-guided Generative Prompt Injection Attack on Large Language Models
    • Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
    • BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents
    • AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
    • QROA: A Black-Box Query-Response Optimization Attack on LLMs
    • BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models
    • Improved Generation of Adversarial Examples Against Safety-aligned LLMs
    • Exploring Backdoor Attacks against Large Language Model-based Decision Making
    • Jailbreak Paradox: The Achilles’ Heel of LLMs
    • Stealth edits for provably fixing or attacking large language models
    • IS POISONING A REAL THREAT TO LLM ALIGNMENT? MAYBE MORE SO THAN YOU THINK
    • Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
    • “Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailb
    • Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
    • Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
    • Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
    • StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Enco
    • WHEN LLM MEETS DRL: ADVANCING JAILBREAKING EFFICIENCY VIA DRL-GUIDED SEARCH
    • Context Injection Attacks on Large Language Models
    • Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
    • Phantom: General Trigger Attacks on Retrieval Augmented Language Generation
    • On Trojans in Refined Language Models
    • A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measur
    • How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
    • JAILBREAKING AS A REWARD MISSPECIFICATION PROBLEM
    • ObscurePrompt: Jailbreaking Large Language Models via Obscure Inpu
    • ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
    • Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
    • Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
    • AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization
    • Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM
    • CAN LLMS DEEPLY DETECT COMPLEX MALICIOUS QUERIES? A FRAMEWORK FOR JAILBREAKING VIA OBFUSCATING INTEN
    • Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chai
    • JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
    • AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbre
    • SANDWICH ATTACK: MULTI-LANGUAGE MIXTURE ADAPTIVE ATTACK ON LLMS
    • Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
    • Using Hallucinations to Bypass RLHF Filters
    • TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4
    • OPEN SESAME! UNIVERSAL BLACK BOX JAILBREAKING OF LARGE LANGUAGE MODELS
    • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
    • Weak-to-Strong Jailbreaking on Large Language Models
    • Punctuation Matters! Stealthy Backdoor Attack for Language Models
    • BYPASSING THE SAFETY TRAINING OF OPEN-SOURCE LLMS WITH PRIMING ATTACKS
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • A Semantic, Syntactic, And Context-Aware Natural Language Adversarial Example Generator
    • Fast Adversarial Attacks on Language Models In One GPU Minute
    • Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
    • Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks
    • Automatic and Universal Prompt Injection Attacks against Large Language Models
    • Prompt Injection Attacks and Defenses in LLM-Integrated Applications
    • TENSOR TRUST: INTERPRETABLE PROMPT INJECTION ATTACKS FROM AN ONLINE GAME
    • DPP-Based Adversarial Prompt Searching for Language Models
    • Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Agai
    • Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization
    • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
    • Prompt Injection attack against LLM-integrated Applications
    • FINE-TUNING ALIGNED LANGUAGE MODELS COMPROMISES SAFETY, EVEN WHEN USERS DO NOT INTEND TO!
    • CATASTROPHIC JAILBREAK OF OPEN-SOURCE LLMS VIA EXPLOITING GENERATION
    • EVALUATING THE SUSCEPTIBILITY OF PRE-TRAINED LANGUAGE MODELS VIA HANDCRAFTED ADVERSARIAL EXAMPLES
    • Defending LLMs against Jailbreaking Attacks via Backtranslation
    • EMULATED DISALIGNMENT: SAFETY ALIGNMENT FOR LARGE LANGUAGE MODELS MAY BACKFIRE!
    • GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
    • ON THE SAFETY OF OPEN-SOURCED LARGE LANGUAGE MODELS: DOES ALIGNMENT REALLY PREVENT THEM FROM BEING
    • Unveiling the Implicit Toxicity in Large Language Models
    • ALIGNMENT IS NOT SUFFICIENT TO PREVENT LARGE LANGUAGE MODELS FROM GENERATING HARMFUL INFORMATION:
    • LANGUAGE MODEL UNALIGNMENT: PARAMETRIC RED-TEAMING TO EXPOSE HIDDEN HARMS AND BIASES
    • IMMUNIZATION AGAINST HARMFUL FINE-TUNING ATTACKS
    • Composite Backdoor Attacks Against Large Language Models
    • AWolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easi
    • ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS
    • LLMJailbreak Attack versus Defense Techniques- A Comprehensive Study
    • Weak-to-Strong Jailbreaking on Large Language Models
    • MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds
    • Universal and Transferable Adversarial Attacks on Aligned Language Models
    • COERCING LLMS TO DO AND REVEAL (ALMOST) ANYTHING
    • Generating Valid and Natural Adversarial Examples with Large Language Models
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • Scaling Laws for Adversarial Attacks on Language Model Activations
    • Ignore Previous Prompt: Attack Techniques For Language Models
    • ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
    • A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems
    • ATTACKING LARGE LANGUAGE MODELS WITH PROJECTED GRADIENT DESCENT
    • Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
    • Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embed
    • Query-Based Adversarial Prompt Generation
    • COERCING LLMS TO DO AND REVEAL (ALMOST) ANYTHING
    • Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
    • Fast Adversarial Attacks on Language Models In One GPU Minute
    • DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
    • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Transla
    • Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
    • CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
    • Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
    • Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
    • PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
    • A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
    • LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
    • Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
    • SHORTCUTS ARISING FROM CONTRAST: EFFECTIVE AND COVERT CLEAN-LABEL ATTACKS IN PROMPT-BASED LEARNING
    • What’s in Your “Safe” Data?: Identifying Benign Data that Breaks Safety
    • DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
    • Attacking LLM Watermarks by Exploiting Their Strengths
    • From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Transla
    • DeepInception: Hypnotize Large Language Model to Be Jailbreaker
    • Hijacking Large Language Models via Adversarial In-Context Learning
    • EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models
    • LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models
    • DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions
    • Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
    • Conversation Reconstruction Attack Against GPT Models
    • Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
    • PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
    • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
    • Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit
    • Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
    • Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
    • UNIVERSAL JAILBREAK BACKDOORS FROM POISONED HUMAN FEEDBACK
    • Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking
    • Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
    • POISONPROMPT: BACKDOOR ATTACK ON PROMPT-BASED LARGE LANGUAGE MODELS
    • BACKDOORING INSTRUCTION-TUNED LARGE LANGUAGE MODELS WITH VIRTUAL PROMPT INJECTION
    • Backdoor Attacks for In-Context Learning with Language Models
    • Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
    • UOR: Universal Backdoor Attacks on Pre-trained Language Models
    • Fake Alignment: Are LLMs Really Aligned Well?
    • Syntactic Ghost: An Imperceptible General-purpose Backdoor Attacks on Pre-trained Language Models
    • Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignm
    • Imperio: Language-Guided Backdoor Attacks for Arbitrary Model Control
    • Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Agai
    • Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
    • BADCHAIN: BACKDOOR CHAIN-OF-THOUGHT PROMPTING FOR LARGE LANGUAGE MODELS
    • AUTODAN: INTERPRETABLE GRADIENT-BASED ADVERSARIAL ATTACKS ON LARGE LANGUAGE MODELS
    • AN LLM CAN FOOL ITSELF: A PROMPT-BASED ADVERSARIAL ATTACK
    • AUTOMATIC HALLUCINATION ASSESSMENT FOR ALIGNED LARGE LANGUAGE MODELS VIA TRANSFERABLE ADVERSARIAL AT
    • LLM LIES: HALLUCINATIONS ARE NOT BUGS, BUT FEATURES AS ADVERSARIAL EXAMPLES
    • LOFT: LOCAL PROXY FINE-TUNING FOR IMPROVING TRANSFERABILITY OF ADVERSARIAL ATTACKS AGAINST LARGE LAN
    • Universal and Transferable Adversarial Attacks on Aligned Language Models
    • Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of
    • BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
    • Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
    • Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Recon
    • Adversarial Demonstration Attacks on Large Language Models
    • COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in Language Models
    • The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Mod
    • Open the Pandora’s Box of LLMs: Jailbreaking LLMs through Representation Engineering
    • How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Huma
    • Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
    • PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
    • Weak-to-Strong Jailbreaking on Large Language Models
    • Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
    • Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
    • Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
    • Jailbroken: How Does LLM Safety Training Fail?
    • ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
    • GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large
    • Tastle: Distract Large Language Models for Automatic Jailbreak Attack
    • Exploring Safety Generalization Challenges of Large Language Models via Code
    • Learning to Poison Large Language Models During Instruction Tuning
    • BADEDIT: BACKDOORING LARGE LANGUAGE MODELS BY MODEL EDITING
    • Composite Backdoor Attacks Against Large Language Models
    • LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario
    • Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
    • ALL IN HOW YOU ASK FOR IT: SIMPLE BLACK-BOX METHOD FOR JAILBREAK ATTACKS
    • THE POISON OF ALIGNMENT
    • The Philosopher’s Stone: Trojaning Plugins of Large Language Models
    • RAPID OPTIMIZATION FOR JAILBREAKING LLMS VIA SUBCONSCIOUS EXPLOITATION AND ECHOPRAXIA
    • Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
    • RED TEAMING GPT-4V: ARE GPT-4V SAFE AGAINST UNI/MULTI-MODAL JAILBREAK ATTACKS ?
    • PAL: Proxy-Guided Black-Box Attack on Large Language Models
    • INCREASED LLM VULNERABILITIES FROM FINETUNING AND QUANTIZATION
    • Rethinking How to Evaluate Language Model Jailbreak
    • COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
    • GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
    • Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
    • Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
    • AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
    • Universal Adversarial Triggers Are Not Universal
    • PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
  • LLM-Defense
    • LANGUAGE MODELS ARE HOMER SIMPSON!
    • garak : A Framework for Security Probing Large Language Models
    • Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
    • Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
    • PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
    • The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
    • BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
    • Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
    • Efficient Adversarial Training in LLMs with Continuous Attacks
    • StruQ: Defending Against Prompt Injection with Structured Queries
    • Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
    • GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
    • Defending Jailbreak Prompts via In-Context Adversarial Game
    • Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework
    • Jailbreaker in Jail: Moving Target Defense for Large Language Models
    • DEFENDING AGAINST ALIGNMENT-BREAKING ATTACKS VIA ROBUSTLY ALIGNED LLM
    • Causality Analysis for Evaluating the Security of Large Language Models
    • AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
    • Jailbreaking is Best Solved by Definition
    • RIGORLLM: RESILIENT GUARDRAILS FOR LARGE LANGUAGE MODELS AGAINST UNDESIRED CONTENT
    • LANGUAGE MODELS ARE HOMER SIMPSON! Safety Re-Alignment of Fine-tuned Language Models through Task Ar
    • Defending Against Indirect Prompt Injection Attacks With Spotlighting
    • LLMGuard: Guarding against Unsafe LLM Behavior
    • Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
    • ON TROJAN SIGNATURES IN LARGE LANGUAGE MODELS OF CODE
    • Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
    • Detoxifying Large Language Models via Knowledge Editing
    • MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
    • THE POISON OF ALIGNMENT
    • ROSE: Robust Selective Fine-tuning for Pre-trained Language Models
    • GAINING WISDOM FROM SETBACKS : ALIGNING LARGE LANGUAGE MODELS VIA MISTAKE ANALYSIS
    • Making Harmful Behaviors Unlearnable for Large Language Models
    • Fake Alignment: Are LLMs Really Aligned Well?
    • Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
    • Vaccine: Perturbation-aware Alignment for Large Language Model
    • DEFENDING LARGE LANGUAGE MODELS AGAINST JAILBREAK ATTACKS VIA SEMANTIC SMOOTHING
    • Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    • DEFENDING AGAINST ALIGNMENT-BREAKING AT TACKS VIA ROBUSTLY ALIGNED LLM
    • LLMSelf Defense: By Self Examination, LLMsKnowTheyAreBeing Tricked
    • BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
    • Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
    • LLMsCanDefend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
    • Detoxifying Text with MARCO: Controllable Revision with Experts and Anti-Experts
    • Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
    • Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Lang
    • CAMOUFLAGE IS ALL YOU NEED: EVALUATING AND ENHANCING LANGUAGE MODEL ROBUSTNESS AGAINST CAMOUFLAGE AD
    • Defending Jailbreak Prompts via In-Context Adversarial Game
    • Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    • Defending LLMs against Jailbreaking Attacks via Backtranslation
    • IMMUNIZATION AGAINST HARMFUL FINE-TUNING ATTACKS
    • Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
    • JAB: Joint Adversarial Prompting and Belief Augmentation
    • TOKEN-LEVEL ADVERSARIAL PROMPT DETECTION BASED ON PERPLEXITY MEASURES AND CONTEXTUAL INFORMATION
    • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
    • Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
    • Vaccine: Perturbation-aware Alignment for Large Language Model
    • Improving the Robustness of Large Language Models via Consistency Alignment
    • SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
    • Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
    • Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
    • LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
    • Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language
    • Analyzing And Editing Inner Mechanisms of Backdoored Language Models
    • Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
    • ROBUSTIFYING LANGUAGE MODELS WITH TESTTIME ADAPTATION
    • Jailbreaker in Jail: Moving Target Defense for Large Language
    • DETECTING LANGUAGE MODEL ATTACKS WITH PERPLEXITY
    • Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation an
    • From Adversarial Arms Race to Model-centric Evaluation Motivating a Unified Automatic Robustness Eva
    • LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
    • Intention Analysis Makes LLMs A Good Jailbreak Defender
    • Defending Against Disinformation Attacks in Open-Domain Question Answering
    • Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
    • Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landsc
    • Round Trip Translation Defence against Large Language Model Jailbreaking Attacks
    • How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
    • SELF-GUARD: Empower the LLM to Safeguard Itself
    • Intention Analysis Makes LLMs A Good Jailbreak Defender
    • Jatmo: Prompt Injection Defense by Task-Specific Finetuning
    • Precisely the Point: Adversarial Augmentations for Faithful and Informative Text Generation
    • Adversarial Text Purification: A Large Language Model Approach for Defense
    • Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
    • Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
    • Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
    • Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Application
    • Is the System Message Really Important to Jailbreaks in Large Language Models?
    • AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
    • Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge

Foundational Moral Values for AI Alignment



  1. Background: This paper examines the AI alignment problem, i.e., how to ensure that AI systems act according to specified moral values and goals. Current alignment targets lack a clear, philosophically well-grounded structure, which leads to a mismatch between technical methods and social goals and can cause harm to society. The authors argue that solving AI alignment requires clear, defensible values as a guide.

  2. Prior approaches and their shortcomings: Previous AI alignment work often does not specify which values should be aligned to, or what those values mean in different contexts. For example, large language models (LLMs) emphasize three values (helpfulness, harmlessness, and honesty), but what these concepts mean under different conditions remains unclear. This suggests that the values by themselves may be insufficient to handle both the most abstract foundational cases and the most concrete, critical practical cases.

  3. Proposed approach: The authors propose five foundational moral values: survival, sustainable intergenerational existence, society, education, and truth. These values not only give technical alignment work a clearer direction, but also serve as a framework that highlights the threats and opportunities AI systems present for attaining and maintaining them. The authors also discuss how to translate these foundational values into mid-level principles and concrete action-guiding rules for use in training and developing AI models.

  4. Novelty and contributions: The paper's innovation is a value system grounded in the conditions required for human existence; these values are irreducible, interdependent, and necessary for humanity's continued existence. The authors also show how these values are supported by philosophical traditions, the UN Sustainable Development Goals (SDGs), and contemporary practice in technology and AI ethics.

  5. Experiments: The paper conducts no experiments; it supports the proposed values through philosophical argument and ethical analysis instead. The authors use reductio ad absurdum arguments to establish the necessity of these values and explore their application at the individual, organizational, national, and international levels.

  6. Experimental conclusions: Since the paper conducts no experiments, there are no experimental conclusions. Through logical argument and interdisciplinary analysis, however, the authors conclude that these foundational moral values are key to solving the AI alignment problem.

  7. Overall conclusion: The authors hold that aligning AI effectively requires a set of foundational values as a guide. These values are not only essential to human existence but also necessary for building AI systems that benefit society. Although their application has limits, they provide a useful framework for guiding the development and deployment of AI.

Reading summary: This work proposes a system of values for AI alignment grounded in the foundations of human existence: survival, sustainable intergenerational existence, society, education, and truth. These values provide moral guidance for the development of AI and underscore the importance of aligning AI at the individual, organizational, national, and international levels. Through philosophical argument and interdisciplinary analysis, the authors demonstrate the universality and necessity of these values and point out the challenges and opportunities likely to arise in applying them. Although the paper includes no experimental validation, the proposed value system has significant theoretical and practical import for guiding the ethical development of AI.

Note:

The value system grounded in the conditions required for human existence is the paper's core concept, intended to give the AI alignment problem a solid moral foundation. The system comprises five foundational moral values that are preconditions for human survival and development, mutually dependent and inseparable. The five values are described in detail below:

  1. Survival

    • Humans need to survive; this is the basis of AI alignment. Survival means finding food, providing shelter, maintaining health, defending against predators, and preventing or preparing for disasters. It is the precondition of human existence and the first condition that AI alignment work must take into account.

  2. Sustainable Intergenerational Existence

    • Humans need not only short-term survival but long-term survival. This means bearing offspring, protecting the environment, and ensuring that future generations can continue to exist; it entails an emphasis on family life, the education of descendants, and environmental protection.

  3. Society

    • Humans are social beings and cannot survive alone. We need families, social structures, and economic supply chains to live together. Society is the structure that allows individuals to coexist, encompassing the division of labor, the management of resources, and the knowledge and skills these require.

  4. Education

    • Humans need education, whether formal schooling or informal cultural and social learning. Education is the foundation of cultural transmission and is vital to the long-term survival of individuals and societies; it involves passing on knowledge, cultivating skills, and raising the younger generation.

  5. Truth

    • Humans need to understand the truth of the real world, or at least to understand it accurately enough to guide our actions and to educate the next generation. The pursuit of truth joins theory and practice and is vital to human survival and social stability.

This value system is not merely a theoretical construct; it is supported by philosophical traditions worldwide, the UN Sustainable Development Goals (SDGs), and contemporary practice in technology and AI ethics. The authors stress that these values may take different concrete forms in different cultures and societies, yet they constitute a common foundation of human pursuits. In the context of AI alignment, they provide a framework for evaluating and guiding the behavior of AI systems, ensuring not only that AI does not threaten these basic conditions of humanity but that it promotes human well-being and sustainable development.