Benchmark

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety
ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models
UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions
All Languages Matter: On the Multilingual Safety of Large Language Models
Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP
Red Teaming Visual Language Models
Unified Hallucination Detection for Multimodal Large Language Models
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Can Language Models be Instructed to Protect Personal Information?
Detecting and Preventing Hallucinations in Large Vision Language Models
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models
PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs