BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
