EMULATED DISALIGNMENT: SAFETY ALIGNMENT FOR LARGE LANGUAGE MODELS MAY BACKFIRE!
