SemVink
🚨 New Paper Alert 🚨
SemVink: Advancing VLMs’ Semantic Understanding of Optical Illusions via Visual Global Thinking
Vision-Language Models (VLMs) excel in many semantic tasks, yet fail at a fundamental human ability: detecting hidden content in optical illusions and AI-generated images.
🔍 We introduce HC-Bench, a benchmark of 112 hidden-content images (texts & objects), on which state-of-the-art VLMs achieve only 0–5.36% accuracy, even with explicit prompting.
✨ Our solution, SemVink (Semantic Visual Thinking), is surprisingly simple: scaling images down to 32–128 px. This “zoom-out” operation boosts performance to over 99% accuracy across leading VLMs, bridging computational vision with human-like perception.
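Below is a minimal sketch of what this zoom-out preprocessing might look like, assuming Pillow; the `zoom_out` helper, the `target_px` parameter, and the file names are illustrative, not the paper's reference implementation.

```python
from PIL import Image

def zoom_out(image_path: str, target_px: int = 64) -> Image.Image:
    """Downscale an image so its longer side is `target_px` pixels,
    within the 32-128 px range the paper reports as effective."""
    img = Image.open(image_path).convert("RGB")
    scale = target_px / max(img.size)
    new_size = (max(1, round(img.width * scale)),
                max(1, round(img.height * scale)))
    # Antialiased resampling keeps the low-frequency structure
    # that carries the hidden content.
    return img.resize(new_size, Image.LANCZOS)

# The tiny image can then be passed to any VLM together with a prompt
# such as "What hidden text or object do you see in this image?"
small = zoom_out("illusion.png", target_px=64)
small.save("illusion_small.png")
```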
🌍 Implications: Our findings expose a fundamental architectural flaw in current VLMs and call for multi-scale, hybrid models—critical for applications in medical imaging, security, and robust real-world AI.
👉 Read more & explore the dataset here: https://johnnyzeppelin.github.io/vlm-semvink