SemVink
🚨 New Paper Alert 🚨
SemVink: Advancing VLMs’ Semantic Understanding of Optical Illusions via Visual Global Thinking
Vision-Language Models (VLMs) excel in many semantic tasks, yet fail at a fundamental human ability: detecting hidden content in optical illusions and AI-generated images.
🔍 We introduce HC-Bench, a benchmark of 112 hidden-content images (texts & objects), on which state-of-the-art VLMs achieve only 0–5.36% accuracy, even with explicit prompting.
✨ Our solution, SemVink (Semantic Visual Thinking), is surprisingly simple: scaling images down to 32–128 px. This “zoom-out” operation boosts performance to over 99% accuracy across leading VLMs, bridging computational vision with human-like perception.
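Below is a minimal sketch of what this zoom-out preprocessing might look like, assuming Pillow; the `zoom_out` helper, the `target_px` parameter, and the file names are illustrative, not the paper's reference implementation.

```python
from PIL import Image

def zoom_out(image_path: str, target_px: int = 64) -> Image.Image:
    """Downscale an image so its longer side is `target_px` pixels,
    within the 32-128 px range the paper reports as effective."""
    img = Image.open(image_path).convert("RGB")
    scale = target_px / max(img.size)
    new_size = (max(1, round(img.width * scale)),
                max(1, round(img.height * scale)))
    # Antialiased resampling keeps the low-frequency structure
    # that carries the hidden content.
    return img.resize(new_size, Image.LANCZOS)

# The tiny image can then be passed to any VLM together with a prompt
# such as "What hidden text or object do you see in this image?"
small = zoom_out("illusion.png", target_px=64)
small.save("illusion_small.png")
```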
🌍 Implications: Our findings expose a fundamental architectural flaw in current VLMs and call for multi-scale, hybrid models—critical for applications in medical imaging, security, and robust real-world AI.
👉 Read more & explore the dataset here: https://johnnyzeppelin.github.io/vlm-semvink