大模型安全笔记 (LLM Security Notes)
Others
INFERRING OFFENSIVENESS IN IMAGES FROM NATURAL LANGUAGE SUPERVISION
An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection
More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
AI SAFETY: A CLIMB TO ARMAGEDDON?
AI RISK MANAGEMENT SHOULD INCORPORATE BOTH SAFETY AND SECURITY
Defending Against Social Engineering Attacks in the Age of LLMs
Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Deduplicating Training Data Makes Language Models Better
MITIGATING TEXT TOXICITY WITH COUNTERFACTUAL GENERATION
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
Mitigating LLM Hallucinations via Conformal Abstention
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
LARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELS
PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics
Reducing hallucination in structured outputs via Retrieval-Augmented Generation
Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision-Language Models
Attacking LLM Watermarks by Exploiting Their Strengths
The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance
TOFU: A Task of Fictitious Unlearning for LLMs
Learning and Forgetting Unsafe Examples in Large Language Models
Exploring Adversarial Attacks against Latent Diffusion Model from the Perspective of Adversarial Transferability
TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
In Search of Truth: An Interrogation Approach to Hallucination Detection
Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models
Locating and Mitigating Gender Bias in Large Language Models
Learning to Edit: Aligning LLMs with Knowledge Editing
Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions
Does DETECTGPT Fully Utilize Perturbation? Bridge Selective Perturbation to Fine-tuned Contrastive Learning
TELLER: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection
SPOTTING LLMS WITH BINOCULARS: ZERO-SHOT DETECTION OF MACHINE-GENERATED TEXT
LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
WHAT’S IN MY BIG DATA?
UNDERSTANDING CATASTROPHIC FORGETTING IN LANGUAGE MODELS VIA IMPLICIT INFERENCE
Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
Toxicity in CHATGPT: Analyzing Persona-assigned Language Models
MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models
Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Suggestions from Poisoned AI Models
Zero shot VLMs for hate meme detection: Are we there yet?
ANALYZING AND MITIGATING OBJECT HALLUCINATION IN LARGE VISION-LANGUAGE MODELS
MITIGATING HALLUCINATION IN LARGE MULTIMODAL MODELS VIA ROBUST INSTRUCTION TUNING
DENEVIL: TOWARDS DECIPHERING AND NAVIGATING THE ETHICAL VALUES OF LARGE LANGUAGE MODELS VIA INSTRUCTION LEARNING
Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates
Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity
Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
Prompt Injection Attacks and Defenses in LLM-Integrated Applications
Removing RLHF Protections in GPT-4 via Fine-Tuning
SPML: A DSL for Defending Language Models Against Prompt Attacks
Stealthy Attack on Large Language Model based Recommendation
Large Language Models Sometimes Generate Purely Negatively-Reinforced Text
On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
Learning from data in the mixed adversarial non-adversarial case: Finding the helpers and ignoring the trolls
longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A systematic approach to adversarial attacks
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
Adversarial Examples Generation for Reducing Implicit Gender Bias in Pre-trained Models
Discovering the Hidden Vocabulary of DALLE-2
Raising the Cost of Malicious AI-Powered Image Editing
Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
ALIGNERS: DECOUPLING LLMS AND ALIGNMENT
CAN LLM-GENERATED MISINFORMATION BE DETECTED?
On the Risk of Misinformation Pollution with Large Language Models
Evading Watermark based Detection of AI-Generated Content
Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World’s Ugliness?
Privacy-Preserving Instructions for Aligning Large Language Models
TOWARDS UNDERSTANDING THE INTERPLAY OF GENERATIVE ARTIFICIAL INTELLIGENCE AND THE INTERNET
Evaluating the Social Impact of Generative AI Systems in Systems and Society
Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities
TOWARDS RESPONSIBLE AI IN THE ERA OF GENERATIVE AI: A REFERENCE ARCHITECTURE FOR DESIGNING FOUNDATION MODEL-BASED SYSTEMS
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safety
Risk Assessment and Statistical Significance in the Age of Foundation Models
The Foundation Model Transparency Index
The Privacy Pillar - A Conceptual Framework for Foundation Model-based Systems
A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribution Shift
Foundational Moral Values for AI Alignment
Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
ON CATASTROPHIC INHERITANCE OF LARGE FOUNDATION MODELS
Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning
Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
Foundation Model Transparency Reports
SECURING RELIABILITY: A BRIEF OVERVIEW ON ENHANCING IN-CONTEXT LEARNING FOR FOUNDATION MODELS
EXPLORING THE ADVERSARIAL CAPABILITIES OF LARGE LANGUAGE MODELS
TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
LLM-Resistant Math Word Problem Generation via Adversarial Attacks
Efficient Black-Box Adversarial Attacks on Neural Text Detectors
Adversarial Preference Optimization
Combating Adversarial Attacks with Multi-Agent Debate
How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation
L-AutoDA: Leveraging Large Language Models for Automated Decision-based Adversarial Attacks
Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection
What Does the Bot Say? Opportunities and Risks of Large Language Models in Social Media Bot Detection
Prompted Contextual Vectors for Spear-Phishing Detection
Token-Ensemble Text Generation: On Attacking the Automatic AI-Generated Text Detection
Recursive Chain-of-Feedback Prevents Performance Degradation from Redundant Prompting
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
RADAR: Robust AI-Text Detection via Adversarial Learning
OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples
Why do universal adversarial attacks work on large language models?: Geometry might be the answer
J-Guard: Journalism Guided Adversarially Robust Detection of AI-generated News
Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge
Detoxifying Large Language Models via Knowledge Editing
Healing Unsafe Dialogue Responses with Weak Supervision Signals