大模型安全笔记 (LLM Security Notes)
Others
INFERRING OFFENSIVENESS IN IMAGES FROM NATURAL LANGUAGE SUPERVISION
An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection
More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
AI SAFETY: A CLIMB TO ARMAGEDDON?
AI RISK MANAGEMENT SHOULD INCORPORATE BOTH SAFETY AND SECURITY
Defending Against Social Engineering Attacks in the Age of LLMs
Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Deduplicating Training Data Makes Language Models Better
MITIGATING TEXT TOXICITY WITH COUNTERFACTUAL GENERATION
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
Mitigating LLM Hallucinations via Conformal Abstention
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
LARGE LANGUAGE MODELS AS AUTOMATED ALIGNERS FOR BENCHMARKING VISION-LANGUAGE MODELS
PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics
Reducing hallucination in structured outputs via Retrieval-Augmented Generation
Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision-Language Models
Attacking LLM Watermarks by Exploiting Their Strengths
The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance
TOFU: A Task of Fictitious Unlearning for LLMs
Learning and Forgetting Unsafe Examples in Large Language Models
Exploring Adversarial Attacks against Latent Diffusion Model from the Perspective of Adversarial Transferability
TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
In Search of Truth: An Interrogation Approach to Hallucination Detection
Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models
Locating and Mitigating Gender Bias in Large Language Models
Learning to Edit: Aligning LLMs with Knowledge Editing
Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions
Does DETECTGPT Fully Utilize Perturbation? Bridge Selective Perturbation to Fine-tuned Contrastive Learning
TELLER: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection
SPOTTING LLMS WITH BINOCULARS: ZERO-SHOT DETECTION OF MACHINE-GENERATED TEXT
LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
WHAT’S IN MY BIG DATA?
UNDERSTANDING CATASTROPHIC FORGETTING IN LANGUAGE MODELS VIA IMPLICIT INFERENCE
Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
Toxicity in CHATGPT: Analyzing Persona-assigned Language Models
MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models
Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Suggestions from Poisoned AI Models
Zero shot VLMs for hate meme detection: Are we there yet?
ANALYZING AND MITIGATING OBJECT HALLUCINATION IN LARGE VISION-LANGUAGE MODELS
MITIGATING HALLUCINATION IN LARGE MULTIMODAL MODELS VIA ROBUST INSTRUCTION TUNING
DENEVIL: TOWARDS DECIPHERING AND NAVIGATING THE ETHICAL VALUES OF LARGE LANGUAGE MODELS VIA INSTRUCTION LEARNING
Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates
Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity
Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
CAN LANGUAGE MODELS BE INSTRUCTED TO PROTECT PERSONAL INFORMATION?
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
Prompt Injection Attacks and Defenses in LLM-Integrated Applications
Removing RLHF Protections in GPT-4 via Fine-Tuning
SPML: A DSL for Defending Language Models Against Prompt Attacks
Stealthy Attack on Large Language Model based Recommendation
Large Language Models Sometimes Generate Purely Negatively-Reinforced Text
On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
Learning from data in the mixed adversarial non-adversarial case: Finding the helpers and ignoring the trolls
longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A systematic approach to adversarial attacks
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
Adversarial Examples Generation for Reducing Implicit Gender Bias in Pre-trained Models
Discovering the Hidden Vocabulary of DALLE-2
Raising the Cost of Malicious AI-Powered Image Editing
Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
ALIGNERS: DECOUPLING LLMS AND ALIGNMENT
CAN LLM-GENERATED MISINFORMATION BE DETECTED?
On the Risk of Misinformation Pollution with Large Language Models
Evading Watermark based Detection of AI-Generated Content
Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World’s Ugliness?
Privacy-Preserving Instructions for Aligning Large Language Models
TOWARDS UNDERSTANDING THE INTERPLAY OF GENERATIVE ARTIFICIAL INTELLIGENCE AND THE INTERNET
Evaluating the Social Impact of Generative AI Systems in Systems and Society
Transformation vs Tradition: Artificial General Intelligence (AGI) for Arts and Humanities
TOWARDS RESPONSIBLE AI IN THE ERA OF GENERATIVE AI: A REFERENCE ARCHITECTURE FOR DESIGNING FOUNDATION MODEL-BASED SYSTEMS
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safety
Risk Assessment and Statistical Significance in the Age of Foundation Models
The Foundation Model Transparency Index
The Privacy Pillar - A Conceptual Framework for Foundation Model-based Systems
A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribution Shift
Foundational Moral Values for AI Alignment
Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
ON CATASTROPHIC INHERITANCE OF LARGE FOUNDATION MODELS
Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning
Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
Foundation Model Transparency Reports
SECURING RELIABILITY: A BRIEF OVERVIEW ON ENHANCING IN-CONTEXT LEARNING FOR FOUNDATION MODELS
EXPLORING THE ADVERSARIAL CAPABILITIES OF LARGE LANGUAGE MODELS
TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
LLM-Resistant Math Word Problem Generation via Adversarial Attacks
Efficient Black-Box Adversarial Attacks on Neural Text Detectors
Adversarial Preference Optimization
Combating Adversarial Attacks with Multi-Agent Debate
How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation
L-AutoDA: Leveraging Large Language Models for Automated Decision-based Adversarial Attacks
Hidding the Ghostwriters: An Adversarial Evaluation of AI-Generated Student Essay Detection
What Does the Bot Say? Opportunities and Risks of Large Language Models in Social Media Bot Detection
Prompted Contextual Vectors for Spear-Phishing Detection
Token-Ensemble Text Generation: On Attacking the Automatic AI-Generated Text Detection
Recursive Chain-of-Feedback Prevents Performance Degradation from Redundant Prompting
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
RADAR: Robust AI-Text Detection via Adversarial Learning
OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples
Why do universal adversarial attacks work on large language models?: Geometry might be the answer
J-Guard: Journalism Guided Adversarially Robust Detection of AI-generated News
Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge
Detoxifying Large Language Models via Knowledge Editing
Healing Unsafe Dialogue Responses with Weak Supervision Signals