Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data

