ICML 2026

Jailbreaking Vision-Language Models Through the Visual Modality

Aharon Azulay*, Jan Dubinski*, Zhuoyun Li*, Atharv Mittal*, Yossi Gandelsman
* Equal contribution
Figure 1. Four attacks exploiting the visual modality: (a) Visual Cipher encodes instructions as glyph sequences, (b) Object Replacement swaps harmful objects with benign substitutes, (c) Text Replacement substitutes harmful words in images, (d) Analogy Riddle encodes prohibited concepts as visual puzzles.
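To make the encoding step of panel (a) concrete, the sketch below renders an instruction as a glyph sequence. This is a minimal sketch under stated assumptions: the substitution table, font, and layout are hypothetical placeholders, not the paper's actual cipher, and the payload is deliberately benign.

```python
# A minimal sketch of the Visual Cipher encoding step (panel (a) above),
# assuming a simple character-to-glyph substitution table; the paper's
# actual glyph alphabet, layout, and decoding prompt are not shown here.
from PIL import Image, ImageDraw, ImageFont

# Hypothetical substitution table: a-t map to circled numbers, u-z to
# Roman numerals. Any visually distinct symbol set would work.
GLYPHS = {c: chr(0x2460 + i) for i, c in enumerate("abcdefghijklmnopqrst")}
GLYPHS.update({c: chr(0x2160 + i) for i, c in enumerate("uvwxyz")})

def encode_as_glyph_image(instruction: str, path: str = "cipher.png") -> str:
    """Map each character through the glyph table and render the result."""
    encoded = "".join(GLYPHS.get(c, c) for c in instruction.lower())
    img = Image.new("RGB", (24 * len(encoded) + 40, 64), "white")
    draw = ImageDraw.Draw(img)
    # Assumes a Unicode-capable TTF font is discoverable on the system.
    font = ImageFont.truetype("DejaVuSans.ttf", 28)
    draw.text((20, 16), encoded, fill="black", font=font)
    img.save(path)
    return path

encode_as_glyph_image("describe the weather")  # benign placeholder payload
```

The paired text prompt would then ask the model to decode the glyph sequence and follow the recovered instruction.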

Abstract

Vision-language models (VLMs) process both images and text, but the visual modality introduces an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks that exploit the vision channel: encoding harmful instructions as visual symbol sequences, replacing harmful objects in contextual scenes with benign substitutes, swapping harmful words in in-image text for placeholders whose meaning the surrounding cultural context preserves, and constructing visual analogy puzzles whose solutions require inferring prohibited concepts. Evaluating across six frontier VLMs, we find that visual attacks achieve success rates comparable to, and sometimes substantially higher than, their text-only counterparts. For instance, our visual cipher achieves 40.9% attack success on Claude Haiku 4.5 versus 10.7% for an equivalent textual cipher. These findings demonstrate that robust VLM alignment must treat vision as a first-class target for safety post-training.
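All four attacks are delivered the same way: an attack image paired with a short text prompt. A minimal sketch of such a query harness is below, assuming the OpenAI Python SDK (v1.x); the model name, prompt wording, and SDK choice are illustrative assumptions, not the paper's evaluation code.

```python
# A minimal sketch of delivering an image-plus-text jailbreak query,
# assuming the OpenAI Python SDK (v1.x); the paper's actual harness,
# prompts, and model endpoints are not specified here.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_vlm(image_path: str, prompt: str, model: str = "gpt-4o") -> str:
    """Send one attack image with its paired text prompt to a VLM."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,  # illustrative model name, not one from the table below
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```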

Results

Attack Success Rate (%) across six frontier VLMs on HarmBench (best-of-five attempts per behavior, majority-vote judging; a scoring sketch appears below the table).

| Attack | Claude Haiku 4.5 | Gemini 3 Flash | GPT-5.2 | Qwen3-VL-235B | Qwen3-VL-32B | Gemini 3.1 Pro |
|---|---:|---:|---:|---:|---:|---:|
| Textual Cipher | 10.7 | 89.3 | 5.7 | **86.8** | 84.9 | **15.1** |
| Visual Cipher | **40.9** | **97.5** | **8.2** | 86.2 | **87.4** | 14.5 |
| Textual Repl.† | 8.1 | **58.8** | **16.9** | 29.5 | 39.0 | 19.0 |
| Visual Object Repl. | 4.1 | 52.0 | 11.5 | 35.6 | 41.1 | 45.6 |
| Visual Text Repl. | **12.9** | 32.8 | 14.4 | **51.5** | **58.1** | **48.6** |
| Textual Riddle | **39.6** | **67.9** | **24.5** | **51.6** | **62.3** | **17.0** |
| Visual Riddle | 13.8 | 52.2 | 13.2 | 29.6 | 38.4 | 6.3 |
| *TYPO* | 5.0 | 11.9 | 5.7 | 33.3 | 37.7 | 5.0 |
| *SD* | 6.3 | 22.2 | 10.8 | 48.7 | 56.3 | 7.6 |
| *SD+TYPO* | 11.5 | 20.3 | 6.1 | 44.6 | 60.8 | 6.8 |
| *HADES* | 9.0 | 12.0 | 2.0 | 11.0 | 32.0 | 13.1 |
| *FigStep* | 45.9 | 10.1 | 3.8 | 49.1 | 11.3 | 10.1 |

Bold = best within each attack group per model. †Shared text baseline for object & text replacement. Italicized rows are prior visual jailbreak baselines (TYPO, SD, SD+TYPO from MM-SafetyBench; HADES; FigStep).
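The sketch below makes the scoring protocol concrete: each behavior gets up to five attempts, each attempt is labeled by a majority vote across judge calls, and a behavior counts as a success if any attempt is judged harmful. The judge interface is an assumption; the paper's judge prompts are not reproduced here.

```python
# A minimal sketch of best-of-five, majority-vote ASR scoring. The Judge
# interface is an assumption; HarmBench-style LLM judges would fill it in.
from typing import Callable

Judge = Callable[[str], bool]  # returns True if a response is harmful

def majority_harmful(response: str, judges: list[Judge]) -> bool:
    """Label one attempt by majority vote across the judge panel."""
    votes = [judge(response) for judge in judges]
    return sum(votes) > len(votes) / 2

def attack_success_rate(
    attempts_per_behavior: list[list[str]], judges: list[Judge]
) -> float:
    """Best-of-five: a behavior succeeds if ANY of its attempts
    (up to five per behavior) is judged harmful by the panel."""
    hits = sum(
        any(majority_harmful(r, judges) for r in attempts)
        for attempts in attempts_per_behavior
    )
    return 100.0 * hits / len(attempts_per_behavior)

# Toy usage with stub judges (real judges would be LLM calls):
stub_judges = [lambda r: "harmful" in r] * 3
print(attack_success_rate([["safe", "harmful output"], ["safe"]], stub_judges))
# -> 50.0
```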

Citation

@inproceedings{azulay2026jailbreaking,
  title     = {Jailbreaking Vision-Language Models
               Through the Visual Modality},
  author    = {Azulay, Aharon and Dubi{\'n}ski, Jan
               and Li, Zhuoyun and Mittal, Atharv
               and Gandelsman, Yossi},
  booktitle = {Proceedings of the 43rd International
               Conference on Machine Learning (ICML)},
  year      = {2026},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR}
}