Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

