Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment
