ICML 2026

Jailbreaking Vision-Language Models Through the Visual Modality

Aharon Azulay*, Jan Dubinski*, Zhuoyun Li*, Atharv Mittal*, Yossi Gandelsman
* Equal contribution
Figure 1. Four attacks exploiting the visual modality: (a) Visual Cipher encodes instructions as glyph sequences, (b) Object Replacement swaps harmful objects with benign substitutes, (c) Text Replacement substitutes harmful words in images, (d) Analogy Riddle encodes prohibited concepts as visual puzzles.
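To make the encoding step of panel (a) concrete, the sketch below renders an instruction as a glyph sequence. This is a minimal sketch under stated assumptions: the substitution table, font, and layout are hypothetical placeholders, not the paper's actual cipher, and the payload is deliberately benign.

```python
# A minimal sketch of the Visual Cipher encoding step (panel (a) above),
# assuming a simple character-to-glyph substitution table; the paper's
# actual glyph alphabet, layout, and decoding prompt are not shown here.
from PIL import Image, ImageDraw, ImageFont

# Hypothetical substitution table: a-t map to circled numbers, u-z to
# Roman numerals. Any visually distinct symbol set would work.
GLYPHS = {c: chr(0x2460 + i) for i, c in enumerate("abcdefghijklmnopqrst")}
GLYPHS.update({c: chr(0x2160 + i) for i, c in enumerate("uvwxyz")})

def encode_as_glyph_image(instruction: str, path: str = "cipher.png") -> str:
    """Map each character through the glyph table and render the result."""
    encoded = "".join(GLYPHS.get(c, c) for c in instruction.lower())
    img = Image.new("RGB", (24 * len(encoded) + 40, 64), "white")
    draw = ImageDraw.Draw(img)
    # Assumes a Unicode-capable TTF font is discoverable on the system.
    font = ImageFont.truetype("DejaVuSans.ttf", 28)
    draw.text((20, 16), encoded, fill="black", font=font)
    img.save(path)
    return path

encode_as_glyph_image("describe the weather")  # benign placeholder payload
```

The paired text prompt would then ask the model to decode the glyph sequence and follow the recovered instruction.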

Abstract

Vision-language models (VLMs) process both images and text, but the visual modality introduces an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks that exploit the vision channel: encoding harmful instructions as visual symbol sequences, replacing harmful objects in contextual scenes with benign substitutes, swapping harmful words in in-image text for placeholders whose meaning the surrounding cultural context preserves, and constructing visual analogy puzzles whose solutions require inferring prohibited concepts. Evaluating across six frontier VLMs, we find that visual attacks achieve success rates comparable to, and sometimes substantially higher than, their text-only counterparts. For instance, our visual cipher achieves 40.9% attack success on Claude Haiku 4.5 versus 10.7% for an equivalent textual cipher. These findings demonstrate that robust VLM alignment must treat vision as a first-class target for safety post-training.
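All four attacks are delivered the same way: an attack image paired with a short text prompt. A minimal sketch of such a query harness is below, assuming the OpenAI Python SDK (v1.x); the model name, prompt wording, and SDK choice are illustrative assumptions, not the paper's evaluation code.

```python
# A minimal sketch of delivering an image-plus-text jailbreak query,
# assuming the OpenAI Python SDK (v1.x); the paper's actual harness,
# prompts, and model endpoints are not specified here.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_vlm(image_path: str, prompt: str, model: str = "gpt-4o") -> str:
    """Send one attack image with its paired text prompt to a VLM."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,  # illustrative model name, not one from the table below
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```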

Results

Attack Success Rate (%) across six frontier VLMs on HarmBench (best-of-five attempts per behavior, majority-vote judging; a scoring sketch appears below the table).

| Attack | Claude Haiku 4.5 | Gemini 3 Flash | GPT-5.2 | Qwen3-VL-235B | Qwen3-VL-32B | Gemini 3.1 Pro |
|---|---:|---:|---:|---:|---:|---:|
| Textual Cipher | 10.7 | 89.3 | 5.7 | **86.8** | 84.9 | **15.1** |
| Visual Cipher | **40.9** | **97.5** | **8.2** | 86.2 | **87.4** | 14.5 |
| Textual Repl.† | 8.1 | **58.8** | **16.9** | 29.5 | 39.0 | 19.0 |
| Visual Object Repl. | 4.1 | 52.0 | 11.5 | 35.6 | 41.1 | 45.6 |
| Visual Text Repl. | **12.9** | 32.8 | 14.4 | **51.5** | **58.1** | **48.6** |
| Textual Riddle | **39.6** | **67.9** | **24.5** | **51.6** | **62.3** | **17.0** |
| Visual Riddle | 13.8 | 52.2 | 13.2 | 29.6 | 38.4 | 6.3 |
| *TYPO* | 5.0 | 11.9 | 5.7 | 33.3 | 37.7 | 5.0 |
| *SD* | 6.3 | 22.2 | 10.8 | 48.7 | 56.3 | 7.6 |
| *SD+TYPO* | 11.5 | 20.3 | 6.1 | 44.6 | 60.8 | 6.8 |
| *HADES* | 9.0 | 12.0 | 2.0 | 11.0 | 32.0 | 13.1 |
| *FigStep* | 45.9 | 10.1 | 3.8 | 49.1 | 11.3 | 10.1 |

Bold = best within each attack group per model. †Shared text baseline for object & text replacement. Italicized rows are prior visual jailbreak baselines (TYPO, SD, SD+TYPO from MM-SafetyBench; HADES; FigStep).
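The sketch below makes the scoring protocol concrete: each behavior gets up to five attempts, each attempt is labeled by a majority vote across judge calls, and a behavior counts as a success if any attempt is judged harmful. The judge interface is an assumption; the paper's judge prompts are not reproduced here.

```python
# A minimal sketch of best-of-five, majority-vote ASR scoring. The Judge
# interface is an assumption; HarmBench-style LLM judges would fill it in.
from typing import Callable

Judge = Callable[[str], bool]  # returns True if a response is harmful

def majority_harmful(response: str, judges: list[Judge]) -> bool:
    """Label one attempt by majority vote across the judge panel."""
    votes = [judge(response) for judge in judges]
    return sum(votes) > len(votes) / 2

def attack_success_rate(
    attempts_per_behavior: list[list[str]], judges: list[Judge]
) -> float:
    """Best-of-five: a behavior succeeds if ANY of its attempts
    (up to five per behavior) is judged harmful by the panel."""
    hits = sum(
        any(majority_harmful(r, judges) for r in attempts)
        for attempts in attempts_per_behavior
    )
    return 100.0 * hits / len(attempts_per_behavior)

# Toy usage with stub judges (real judges would be LLM calls):
stub_judges = [lambda r: "harmful" in r] * 3
print(attack_success_rate([["safe", "harmful output"], ["safe"]], stub_judges))
# -> 50.0
```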

Citation

@inproceedings{azulay2026jailbreaking,
  title     = {Jailbreaking Vision-Language Models
               Through the Visual Modality},
  author    = {Azulay, Aharon and Dubi{\'n}ski, Jan
               and Li, Zhuoyun and Mittal, Atharv
               and Gandelsman, Yossi},
  booktitle = {Proceedings of the 43rd International
               Conference on Machine Learning (ICML)},
  year      = {2026},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR}
}