SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking

1UC Merced, 2University of Queensland

Illusion images can contain hidden text or hidden images within otherwise ordinary background scenes.

Abstract

Vision-language models (VLMs) excel at semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments such as zooming. We introduce HC-Bench, a benchmark of 112 images with hidden texts, objects, and illusions, and show that leading VLMs achieve near-zero accuracy (0–5.36%) on it, even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail because they rely too heavily on high-level semantics. We propose SemVink (Semantic Visual Thinking), which simply scales images to low resolutions; strikingly, this unlocks over 99% accuracy by eliminating redundant visual noise. The result exposes a critical architectural flaw: VLMs prioritize abstract reasoning over the low-level visual operations that are crucial for real-world robustness. Our work urges a shift toward hybrid models that integrate multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.
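To make the "scale to low resolution" idea concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: the helper names `downscale_box` and `make_toy_image`, the box-averaging filter, and the toy 8x8 image are all assumptions chosen for illustration. The toy image combines a coarse pattern (bright left half, dark right half) with a fine checkerboard texture; averaging each 2x2 block cancels the texture and leaves only the coarse pattern, which is the intuition behind zooming out.

```python
def downscale_box(pixels, factor):
    """Shrink a grayscale image (list of rows) by averaging factor x factor blocks.

    Hypothetical helper: the resampling filter used in the paper is not
    specified here, so plain box averaging is an assumption.
    """
    height, width = len(pixels), len(pixels[0])
    if factor < 1 or height % factor or width % factor:
        raise ValueError("factor must be >= 1 and divide both image dimensions")
    out = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [
                pixels[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def make_toy_image(size=8, bright=180, dark=60, texture=50):
    """Build a toy image: coarse left/right pattern plus checkerboard texture."""
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            base = bright if x < size // 2 else dark
            # High-frequency texture that alternates sign on every pixel.
            noise = texture if (x + y) % 2 == 0 else -texture
            row.append(base + noise)
        image.append(row)
    return image


toy = make_toy_image()
small = downscale_box(toy, 2)  # 8x8 -> 4x4; the checkerboard averages out

for row in small:
    print(row)
```

In practice one would resize real images with an imaging library rather than nested lists; the sketch only shows why averaging over neighborhoods suppresses fine detail while keeping global structure.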

Recognition Failure

Human-Like Visual Operations

Zoom-Out Boosts Performance

Visualized Embedding Features


Visualization of the embeddings of input prompts that contain an illusion image. In the left condition (6 consecutive image tokens, shown as the consecutive yellow region in the heatmap) and the center condition (10 consecutive image tokens), VLMs can recognize the hidden content. In the right condition (666 consecutive image tokens), VLMs cannot find it. This demonstrates that the redundant, repeated information in the image is the key obstacle to finding the hidden content.
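The link between image resolution and the number of image tokens can be sketched with simple arithmetic. The snippet below assumes a plain ViT-style tokenizer that cuts the image into non-overlapping square patches and emits one token per patch; the function name `image_token_count` and the patch size of 14 are hypothetical. Real VLMs often add resizing, tiling, or token merging, so this will not reproduce the exact counts (6, 10, 666) in the figure. It only illustrates that shrinking the image sharply reduces how many image tokens the model sees.

```python
import math


def image_token_count(width, height, patch_size=14):
    """Estimate image tokens for a patch-based encoder (one token per patch).

    Assumption: non-overlapping square patches with partial patches padded,
    hence the ceiling. Not the tokenizer of any specific VLM.
    """
    if min(width, height, patch_size) < 1:
        raise ValueError("width, height, and patch_size must be positive")
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)


# Token count falls roughly quadratically as the side length shrinks.
for side in (1024, 512, 128, 64, 32):
    print(f"{side}x{side} -> {image_token_count(side, side)} tokens")
```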

BibTeX

@misc{li2025semvinkadvancingvlmssemantic,
  title={SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking},
  author={Sifan Li and Yujun Cai and Yiwei Wang},
  year={2025},
  eprint={2506.02803},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.02803},
}