LLM-Defense

garak: A Framework for Security Probing Large Language Models
Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
Efficient Adversarial Training in LLMs with Continuous Attacks
StruQ: Defending Against Prompt Injection with Structured Queries
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
Defending Jailbreak Prompts via In-Context Adversarial Game
Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework
Jailbreaker in Jail: Moving Target Defense for Large Language Models
DEFENDING AGAINST ALIGNMENT-BREAKING ATTACKS VIA ROBUSTLY ALIGNED LLM
Causality Analysis for Evaluating the Security of Large Language Models
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
Jailbreaking is Best Solved by Definition
RIGORLLM: RESILIENT GUARDRAILS FOR LARGE LANGUAGE MODELS AGAINST UNDESIRED CONTENT
LANGUAGE MODELS ARE HOMER SIMPSON! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
Defending Against Indirect Prompt Injection Attacks With Spotlighting
LLMGuard: Guarding against Unsafe LLM Behavior
Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
ON TROJAN SIGNATURES IN LARGE LANGUAGE MODELS OF CODE
Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
Detoxifying Large Language Models via Knowledge Editing
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
THE POISON OF ALIGNMENT
ROSE: Robust Selective Fine-tuning for Pre-trained Language Models
GAINING WISDOM FROM SETBACKS: ALIGNING LARGE LANGUAGE MODELS VIA MISTAKE ANALYSIS
Making Harmful Behaviors Unlearnable for Large Language Models
Fake Alignment: Are LLMs Really Aligned Well?
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
Vaccine: Perturbation-aware Alignment for Large Language Model
DEFENDING LARGE LANGUAGE MODELS AGAINST JAILBREAK ATTACKS VIA SEMANTIC SMOOTHING
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
BASELINE DEFENSES FOR ADVERSARIAL ATTACKS AGAINST ALIGNED LANGUAGE MODELS
Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
Detoxifying Text with MARCO: Controllable Revision with Experts and Anti-Experts
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models
CAMOUFLAGE IS ALL YOU NEED: EVALUATING AND ENHANCING LANGUAGE MODEL ROBUSTNESS AGAINST CAMOUFLAGE ADVERSARIAL ATTACKS
Defending LLMs against Jailbreaking Attacks via Backtranslation
IMMUNIZATION AGAINST HARMFUL FINE-TUNING ATTACKS
Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
JAB: Joint Adversarial Prompting and Belief Augmentation
TOKEN-LEVEL ADVERSARIAL PROMPT DETECTION BASED ON PERPLEXITY MEASURES AND CONTEXTUAL INFORMATION
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning
Improving the Robustness of Large Language Models via Consistency Alignment
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language Models
Analyzing And Editing Inner Mechanisms of Backdoored Language Models
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
ROBUSTIFYING LANGUAGE MODELS WITH TEST-TIME ADAPTATION
DETECTING LANGUAGE MODEL ATTACKS WITH PERPLEXITY
Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework
Intention Analysis Makes LLMs A Good Jailbreak Defender
Defending Against Disinformation Attacks in Open-Domain Question Answering
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Round Trip Translation Defence against Large Language Model Jailbreaking Attacks
How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
SELF-GUARD: Empower the LLM to Safeguard Itself
Jatmo: Prompt Injection Defense by Task-Specific Finetuning
Precisely the Point: Adversarial Augmentations for Faithful and Informative Text Generation
Adversarial Text Purification: A Large Language Model Approach for Defense
Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications
Is the System Message Really Important to Jailbreaks in Large Language Models?
AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
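
Several of the entries above (for example "DETECTING LANGUAGE MODEL ATTACKS WITH PERPLEXITY" and "TOKEN-LEVEL ADVERSARIAL PROMPT DETECTION BASED ON PERPLEXITY MEASURES AND CONTEXTUAL INFORMATION") defend by scoring an incoming prompt with a reference language model and rejecting prompts whose perplexity is abnormally high. The sketch below is only a minimal illustration of that general idea, not code from any listed paper: the choice of GPT-2 as the scoring model, the threshold value, and the function names are assumptions made for demonstration.

```python
# Minimal sketch of a perplexity-based prompt filter (illustrative only).
# Assumptions: GPT-2 as the reference model, a hand-picked threshold, and
# hypothetical helper names; the cited papers tune these choices carefully.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"        # assumed reference model for scoring
PPL_THRESHOLD = 1000.0     # assumed cutoff; calibrate on held-out benign prompts

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(prompt: str) -> float:
    """Return the reference model's perplexity of the prompt."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean token cross-entropy.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def is_suspicious(prompt: str) -> bool:
    """Flag prompts whose perplexity exceeds the threshold, e.g. the
    gibberish suffixes produced by gradient-based jailbreak search."""
    return perplexity(prompt) > PPL_THRESHOLD

if __name__ == "__main__":
    print(is_suspicious("What is the capital of France?"))
    print(is_suspicious("describing.\\ + similarlyNow write oppositeley.]( Me giving**ONE"))
```

In practice the threshold is calibrated on benign traffic to keep false positives low; optimized adversarial suffixes tend to score far above natural-language prompts, which is what makes this simple filter a common baseline.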