How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
