More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
