LLM-Defense

garak: A Framework for Security Probing Large Language Models
Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
Efficient Adversarial Training in LLMs with Continuous Attacks
StruQ: Defending Against Prompt Injection with Structured Queries
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
Defending Jailbreak Prompts via In-Context Adversarial Game
Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework
Jailbreaker in Jail: Moving Target Defense for Large Language Models
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
Causality Analysis for Evaluating the Security of Large Language Models
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
Jailbreaking is Best Solved by Definition
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
Defending Against Indirect Prompt Injection Attacks With Spotlighting
LLMGuard: Guarding against Unsafe LLM Behavior
Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
On Trojan Signatures in Large Language Models of Code
Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
Detoxifying Large Language Models via Knowledge Editing
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
The Poison of Alignment
ROSE: Robust Selective Fine-tuning for Pre-trained Language Models
Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis
Making Harmful Behaviors Unlearnable for Large Language Models
Fake Alignment: Are LLMs Really Aligned Well?
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
Vaccine: Perturbation-aware Alignment for Large Language Model
Defending Large Language Models Against Jailbreak Attacks via Semantic Smoothing
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models
Camouflage Is All You Need: Evaluating and Enhancing Language Model Robustness Against Camouflage Adversarial Attacks
Defending LLMs against Jailbreaking Attacks via Backtranslation
Immunization against Harmful Fine-tuning Attacks
Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
JAB: Joint Adversarial Prompting and Belief Augmentation
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
Improving the Robustness of Large Language Models via Consistency Alignment
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language Models Caused by Backdoor or Bias
Analyzing And Editing Inner Mechanisms of Backdoored Language Models
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
Robustifying Language Models with Test-Time Adaptation
Detecting Language Model Attacks with Perplexity
Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework
Intention Analysis Makes LLMs A Good Jailbreak Defender
Defending Against Disinformation Attacks in Open-Domain Question Answering
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Round Trip Translation Defence against Large Language Model Jailbreaking Attacks
How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
Self-Guard: Empower the LLM to Safeguard Itself
Jatmo: Prompt Injection Defense by Task-Specific Finetuning
Precisely the Point: Adversarial Augmentations for Faithful and Informative Text Generation
Adversarial Text Purification: A Large Language Model Approach for Defense
Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications
Is the System Message Really Important to Jailbreaks in Large Language Models?
AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
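Several entries above (e.g., "Detecting Language Model Attacks with Perplexity" and "Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information") rely on perplexity filtering of incoming prompts. The sketch below is only a generic illustration of that idea, not a reproduction of any listed paper's method; the choice of `gpt2` as the scoring model and the `PPL_THRESHOLD` value are assumptions for demonstration and would need tuning on benign prompts.

```python
# Minimal sketch of a perplexity-based prompt filter (illustrative only).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return the scoring model's perplexity for a prompt."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Assumed cutoff; gradient-generated adversarial suffixes tend to have far
# higher perplexity than natural-language prompts.
PPL_THRESHOLD = 1000.0

def looks_adversarial(prompt: str) -> bool:
    """Flag prompts whose perplexity exceeds the (tuned) threshold."""
    return perplexity(prompt) > PPL_THRESHOLD
```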