Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
Abstract
VLMs are more vulnerable to harmful meme-based prompts than to synthetic images, and while multi-turn interactions offer some protection, significant vulnerabilities remain.
Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single-turn and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms.
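The abstract describes an evaluation pipeline: pair each meme image with a harmful or benign instruction, query the VLM, and score the response as harmful, a refusal, or benign. Below is a minimal sketch of how such a single-turn evaluation loop could be structured. The `Instance` fields, `query_vlm`, and `judge_response` are hypothetical placeholders for illustration, not the authors' actual code or the released benchmark API.

```python
# Sketch of a single-turn safety evaluation loop (hypothetical, not the paper's code).
from dataclasses import dataclass


@dataclass
class Instance:
    image_path: str      # real meme image paired with the instruction
    instruction: str     # harmful or benign instruction generated by an LLM
    is_harmful: bool     # whether the instruction comes from the harmful split


def query_vlm(image_path: str, instruction: str) -> str:
    """Placeholder: send the meme image plus instruction to the VLM under test."""
    raise NotImplementedError


def judge_response(response: str) -> str:
    """Placeholder: classify the response as 'harmful', 'refusal', or 'benign'."""
    raise NotImplementedError


def evaluate(instances: list[Instance]) -> dict[str, float]:
    """Compute harmful-response and refusal rates over a benchmark split."""
    counts = {"harmful": 0, "refusal": 0, "benign": 0}
    for inst in instances:
        response = query_vlm(inst.image_path, inst.instruction)
        counts[judge_response(response)] += 1
    total = max(len(instances), 1)
    return {label: n / total for label, n in counts.items()}
```

The same loop extends to the multi-turn setting by prepending benign conversational context before the harmful turn and re-scoring the final response.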
Community
TL;DR: A meme-based safety evaluation benchmark for vision-language models that simulates real-world user environments.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM (2025)
- PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization (2025)
- DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models (2025)
- BadNAVer: Exploring Jailbreak Attacks On Vision-and-Language Navigation (2025)
- Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs (2025)
- Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation (2025)
- FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning (2025)