LLM-Defense

garak: A Framework for Security Probing Large Language Models
Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
Efficient Adversarial Training in LLMs with Continuous Attacks
StruQ: Defending Against Prompt Injection with Structured Queries
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
Defending Jailbreak Prompts via In-Context Adversarial Game
Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework
Jailbreaker in Jail: Moving Target Defense for Large Language Models
DEFENDING AGAINST ALIGNMENT-BREAKING ATTACKS VIA ROBUSTLY ALIGNED LLM
Causality Analysis for Evaluating the Security of Large Language Models
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
Jailbreaking is Best Solved by Definition
RIGORLLM: RESILIENT GUARDRAILS FOR LARGE LANGUAGE MODELS AGAINST UNDESIRED CONTENT
LANGUAGE MODELS ARE HOMER SIMPSON! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
Defending Against Indirect Prompt Injection Attacks With Spotlighting
LLMGuard: Guarding against Unsafe LLM Behavior
Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
ON TROJAN SIGNATURES IN LARGE LANGUAGE MODELS OF CODE
Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
Detoxifying Large Language Models via Knowledge Editing
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
THE POISON OF ALIGNMENT
ROSE: Robust Selective Fine-tuning for Pre-trained Language Models
GAINING WISDOM FROM SETBACKS: ALIGNING LARGE LANGUAGE MODELS VIA MISTAKE ANALYSIS
Making Harmful Behaviors Unlearnable for Large Language Models
Fake Alignment: Are LLMs Really Aligned Well?
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
Vaccine: Perturbation-aware Alignment for Large Language Model
DEFENDING LARGE LANGUAGE MODELS AGAINST JAILBREAK ATTACKS VIA SEMANTIC SMOOTHING
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
Detoxifying Text with MARCO: Controllable Revision with Experts and Anti-Experts
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models
CAMOUFLAGE IS ALL YOU NEED: EVALUATING AND ENHANCING LANGUAGE MODEL ROBUSTNESS AGAINST CAMOUFLAGE ADVERSARIAL ATTACKS
Defending LLMs against Jailbreaking Attacks via Backtranslation
IMMUNIZATION AGAINST HARMFUL FINE-TUNING ATTACKS
Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
JAB: Joint Adversarial Prompting and Belief Augmentation
TOKEN-LEVEL ADVERSARIAL PROMPT DETECTION BASED ON PERPLEXITY MEASURES AND CONTEXTUAL INFORMATION
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
Improving the Robustness of Large Language Models via Consistency Alignment
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language Models
Analyzing And Editing Inner Mechanisms of Backdoored Language Models
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
ROBUSTIFYING LANGUAGE MODELS WITH TEST-TIME ADAPTATION
DETECTING LANGUAGE MODEL ATTACKS WITH PERPLEXITY
Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework
Intention Analysis Makes LLMs A Good Jailbreak Defender
Defending Against Disinformation Attacks in Open-Domain Question Answering
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Round Trip Translation Defence against Large Language Model Jailbreaking Attacks
How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
SELF-GUARD: Empower the LLM to Safeguard Itself
Jatmo: Prompt Injection Defense by Task-Specific Finetuning
Precisely the Point: Adversarial Augmentations for Faithful and Informative Text Generation
Adversarial Text Purification: A Large Language Model Approach for Defense
Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications
Is the System Message Really Important to Jailbreaks in Large Language Models?
AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
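
Several of the entries above (for example "DETECTING LANGUAGE MODEL ATTACKS WITH PERPLEXITY" and "TOKEN-LEVEL ADVERSARIAL PROMPT DETECTION BASED ON PERPLEXITY MEASURES AND CONTEXTUAL INFORMATION") defend by scoring an incoming prompt with a reference language model and rejecting prompts whose perplexity is abnormally high. The sketch below is only a minimal illustration of that general idea, not code from any listed paper: the choice of GPT-2 as the scoring model, the threshold value, and the function names are assumptions made for demonstration.

```python
# Minimal sketch of a perplexity-based prompt filter (illustrative only).
# Assumptions: GPT-2 as the reference model, a hand-picked threshold, and
# hypothetical helper names; the cited papers tune these choices carefully.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"        # assumed reference model for scoring
PPL_THRESHOLD = 1000.0     # assumed cutoff; calibrate on held-out benign prompts

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(prompt: str) -> float:
    """Return the reference model's perplexity of the prompt."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean token cross-entropy.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def is_suspicious(prompt: str) -> bool:
    """Flag prompts whose perplexity exceeds the threshold, e.g. the
    gibberish suffixes produced by gradient-based jailbreak search."""
    return perplexity(prompt) > PPL_THRESHOLD

if __name__ == "__main__":
    print(is_suspicious("What is the capital of France?"))
    print(is_suspicious("describing.\\ + similarlyNow write oppositeley.]( Me giving**ONE"))
```

In practice the threshold is calibrated on benign traffic to keep false positives low; optimized adversarial suffixes tend to score far above natural-language prompts, which is what makes this simple filter a common baseline.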