Vision-language models (VLMs) process both images and text, but the visual modality introduces an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks that exploit the vision channel: encoding harmful instructions as visual symbol sequences, replacing harmful objects with benign substitutes in contextual scenes, swapping harmful text in images with placeholders while cultural context preserves meaning, and constructing visual analogy puzzles whose solutions require inferring prohibited concepts. Evaluating across five frontier VLMs, we find that visual attacks achieve comparable — and sometimes substantially higher — success rates than text-only counterparts. For instance, our visual cipher achieves 40.9% attack success on Claude-Haiku-4.5 versus 10.7% for an equivalent textual cipher. These findings demonstrate that robust VLM alignment must treat vision as a first-class target for safety post-training.
Attack Success Rate (%) across six frontier VLMs on HarmBench (best-of-five, majority-vote judging).
| Attack | Claude Haiku 4.5 | Gemini 3 Flash | GPT-5.2 | Qwen3-VL-235B | Qwen3-VL-32B | Gemini 3.1 Pro |
|---|---|---|---|---|---|---|
| Textual Cipher | 10.7 | 89.3 | 5.7 | 86.8 | 84.9 | 15.1 |
| Visual Cipher | 40.9 | 97.5 | 8.2 | 86.2 | 87.4 | 14.5 |
| Textual Repl.† | 8.1 | 58.8 | 16.9 | 29.5 | 39.0 | 19.0 |
| Visual Object Repl. | 4.1 | 52.0 | 11.5 | 35.6 | 41.1 | 45.6 |
| Visual Text Repl. | 12.9 | 32.8 | 14.4 | 51.5 | 58.1 | 48.6 |
| Textual Riddle | 39.6 | 67.9 | 24.5 | 51.6 | 62.3 | 17.0 |
| Visual Riddle | 13.8 | 52.2 | 13.2 | 29.6 | 38.4 | 6.3 |
| TYPO | 5.0 | 11.9 | 5.7 | 33.3 | 37.7 | 5.0 |
| SD | 6.3 | 22.2 | 10.8 | 48.7 | 56.3 | 7.6 |
| SD+TYPO | 11.5 | 20.3 | 6.1 | 44.6 | 60.8 | 6.8 |
| HADES | 9.0 | 12.0 | 2.0 | 11.0 | 32.0 | 13.1 |
| FigStep | 45.9 | 10.1 | 3.8 | 49.1 | 11.3 | 10.1 |
Bold = best within each attack group per model. †Shared text baseline for object & text replacement. Italicized rows are prior visual jailbreak baselines (TYPO, SD, SD+TYPO from MM-SafetyBench; HADES; FigStep).
@inproceedings{azulay2026jailbreaking,
title = {Jailbreaking Vision-Language Models
Through the Visual Modality},
author = {Azulay, Aharon and Dubi{\'n}ski, Jan
and Li, Zhuoyun and Mittal, Atharv
and Gandelsman, Yossi},
booktitle = {Proceedings of the 43rd International
Conference on Machine Learning (ICML)},
year = {2026},
series = {Proceedings of Machine Learning Research},
publisher = {PMLR}
}