ICML 2026 · Oral

If you are interested in our work, we would be glad to meet you in person at ICML 2026 in Seoul, South Korea, at the sessions below.

Oral · Tue, Jul 7, 2026 · 10:00–10:15 AM KST · Auditorium

Poster · Tue, Jul 7, 2026 · 2:00–3:45 PM KST · Hall A · #3607

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

Vision-Language Models often claim to "check the figure again" during reasoning. Using our VisualSwap framework, we show that these reflective statements are largely illusory: models fail to detect image swaps, with accuracy dropping by up to 60 percentage points, and thinking models are nearly 3× more vulnerable than their instruct counterparts.

¹University of Southern California · ²University of California San Diego · ³Carnegie Mellon University · ⁴University of Illinois Urbana-Champaign
*Equal contribution


Figure: VisualSwap teaser. **The illusion of visual re-examination.** Given image *I_a* and query Q, the VLM generates a reasoning chain ending with a self-reflective trigger ("Wait, let me check the figure again"). When the image is swapped to *I_b*, a visually similar but semantically distinct image, the model fails to detect the visual discrepancy and produces an incorrect answer anchored in the stale context of *I_a*. Only a model with genuine visual grounding would re-attend and arrive at the correct answer.

Overview

Abstract

Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3× more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming visual re-examination.

VisualSwap & VS-Bench

A diagnostic framework and a benchmark of 800 carefully curated image pairs that directly test whether self-reflective statements trigger genuine visual grounding.

A Severe Failure Mode

Across 15 VLMs, accuracy drops by up to 60% under image swap. Thinking models exhibit nearly 3× greater vulnerability than instruct counterparts, and scaling does not help.

Mechanistic Explanation

Attention analysis shows self-reflective statements elicit insufficient visual attention, while multi-turn user instructions restore grounding to near-baseline levels.

Approach

Method

Stage 1: Standard Inference

Given the original image I_a and question Q, the VLM generates a reasoning chain R_a through standard inference, grounded in the visual content of I_a.

Stage 2: Re-examination Probe

We append a reflection prompt (e.g., "Wait, let me check the figure again") to R_a, while simultaneously replacing I_a with a visually similar but semantically distinct image I_b. The model continues generation from this point.
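The two stages can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. The names `VLM`, `toy_vlm`, and `visual_swap_probe` are hypothetical, and the model is abstracted as a callable that takes an image, a question, and a partial response to continue from (an assumed interface). The toy model only mimics textual inertia so that the control flow can be run end to end.

```python
from typing import Callable

# Hypothetical interface: (image, question, response_prefix) -> generated text.
# A real VLM backend would be plugged in here; this signature is an assumption.
VLM = Callable[[str, str, str], str]

REFLECTION_TRIGGER = "Wait, let me check the figure again."


def visual_swap_probe(model: VLM, image_a: str, image_b: str, question: str) -> dict:
    # Stage 1: standard inference on the original image I_a.
    reasoning_a = model(image_a, question, "")

    # Stage 2: append the reflection trigger to R_a and swap I_a for I_b.
    # The model continues generating from this prefix while seeing I_b.
    prefix = reasoning_a + " " + REFLECTION_TRIGGER
    continuation = model(image_b, question, prefix)

    return {"reasoning_a": reasoning_a, "prefix": prefix, "continuation": continuation}


def toy_vlm(image: str, question: str, response_prefix: str) -> str:
    """Toy stand-in that mimics textual inertia: once a prefix exists,
    it restates the earlier conclusion instead of reading the image."""
    if not response_prefix:
        return f"The angle in {image} is 60 degrees, so the answer is 120."
    return "Yes, the angle is 60 degrees, so the answer is 120."


result = visual_swap_probe(toy_vlm, "I_a", "I_b", "What is the supplementary angle?")
print(result["continuation"])
```

A model with genuine visual grounding would produce a continuation that reflects the content of `image_b`; the toy model does not, which is the failure the probe is designed to surface.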

Diagnostic Metric

We measure Performance Degradation Δ = Acc_base − Acc_probe. Large Δ reveals that the model anchors to prior text rather than re-attending to the current visual evidence.
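As a small worked example of the metric, the sketch below computes Δ from per-example correctness flags and from two accuracies reported in Table 1. The function names and the toy flags are illustrative assumptions; only the formula Δ = Acc_base − Acc_probe and the two Table 1 values come from the source.

```python
def accuracy(correct_flags: list[bool]) -> float:
    """Accuracy in percent over per-example correctness flags."""
    return 100.0 * sum(correct_flags) / len(correct_flags)


def performance_degradation(acc_base: float, acc_probe: float) -> float:
    """Delta = Acc_base - Acc_probe, in percentage points."""
    return acc_base - acc_probe


# Toy correctness flags (illustrative only, not benchmark data).
base_flags = [True, True, True, False]
probe_flags = [True, False, False, False]
toy_delta = performance_degradation(accuracy(base_flags), accuracy(probe_flags))
print(toy_delta)  # 75.0 - 25.0 = 50.0

# Accuracies reported in Table 1 for Qwen3-VL-32B (Thinking) on MathVista.
table_delta = performance_degradation(94.5, 33.0)
print(table_delta)  # 61.5
```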

Evidence

Results

Across 15 models spanning the Qwen3-VL, Qwen2.5-VL, Kimi-VL, and ERNIE-VL families, VisualSwap exposes a consistent failure: models say they re-examine the image, but they do not. The six largest average drops all occur in Thinking variants, each of which starts from a higher base accuracy than its Instruct counterpart, and explicit multi-turn user instructions are the only intervention that reliably restores grounding.

Table 1. Main results on VS-Bench — Base and Probe accuracy with the performance drop (Δ) across 15 VLMs.

Model	Variant	MathVista			MathVerse			MathVision			MMMU-Pro			Avg.
Model	Variant	Acc_Base	Acc_Probe	Δ	Acc_Base	Acc_Probe	Δ	Acc_Base	Acc_Probe	Δ	Acc_Base	Acc_Probe	Δ	Acc_Base	Acc_Probe	Δ
Qwen3-VL-8B	Instruct	82.5	55.0	27.5	70.5	44.0	26.5	49.5	31.0	18.5	74.0	56.5	17.5	69.1	46.6	22.5
Qwen3-VL-8B	Thinking	84.5	36.5	48.0	83.0	29.5	53.5	56.0	27.0	29.0	80.5	53.5	27.0	76.0	36.6	39.4
Qwen3-VL-32B	Instruct	87.5	59.0	28.5	84.0	70.5	13.5	60.0	43.5	16.5	87.0	74.0	13.0	79.6	61.8	17.9
Qwen3-VL-32B	Thinking	94.5	33.0	61.5	89.5	24.0	65.5	67.0	32.0	35.0	88.5	57.5	31.0	84.9	36.6	48.3
Qwen3-VL-30B-A3B	Instruct	84.5	49.5	35.0	70.5	55.5	15.0	49.0	29.5	19.5	78.0	64.0	14.0	70.5	49.6	20.9
Qwen3-VL-30B-A3B	Thinking	88.5	18.0	70.5	89.5	10.0	79.5	62.5	24.5	38.0	84.0	54.0	30.0	81.1	26.6	54.5
Qwen3-VL-235B-A22B	Instruct	89.0	62.5	26.5	83.5	63.5	20.0	62.5	40.5	22.0	89.5	78.5	11.0	81.1	61.3	19.9
Qwen3-VL-235B-A22B	Thinking	93.5	29.5	64.0	96.5	22.5	74.0	74.0	31.0	43.0	91.0	53.5	37.5	88.8	34.1	54.6
ERNIE-4.5-VL-28B-A3B	Instruct	76.0	33.0	43.0	71.0	31.5	39.5	33.5	18.5	15.0	72.5	33.0	39.5	63.3	29.0	34.3
ERNIE-4.5-VL-28B-A3B	Thinking	87.5	16.5	71.0	91.0	13.0	78.0	60.5	27.5	33.0	80.5	21.5	59.0	79.9	19.6	60.3
Kimi-VL-A3B	Instruct	75.0	31.0	44.0	43.5	16.0	27.5	26.0	17.0	9.0	49.0	21.5	27.5	48.4	21.4	27.0
Kimi-VL-A3B	Thinking	87.5	28.5	59.0	69.5	19.5	50.0	52.5	26.0	26.5	69.5	35.5	34.0	69.8	27.4	42.4
Qwen2.5-VL-7B	Instruct	72.5	37.5	35.0	49.5	36.0	13.5	25.5	15.5	10.0	47.0	28.5	18.5	48.6	29.4	19.3
OpenVLThinker-7B	Thinking	77.5	42.5	35.0	51.0	35.0	16.0	25.0	15.5	9.5	48.0	21.5	26.5	50.4	28.6	21.8
VL-Rethinker-7B	Thinking	79.0	33.0	46.0	66.0	32.0	34.0	31.5	26.5	5.0	50.5	18.5	32.0	56.8	27.5	29.3

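Catastrophic drops under VisualSwap. ERNIE-4.5-VL-28B-A3B-Thinking plummets from 79.9% to 19.6% average accuracy (Δ = 60.3 points), and Qwen3-VL-235B-A22B-Thinking drops from 88.8% to 34.1% (Δ = 54.6 points). In every paired family, the Thinking variant degrades more than its Instruct counterpart, by up to nearly 3× (54.6 vs. 19.9 points for Qwen3-VL-235B-A22B).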
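One way to read the Thinking versus Instruct comparison is as a ratio of average Δ values. The sketch below takes the Avg. Δ column of Table 1 for the six families that have both variants and computes that ratio. Treating "more vulnerable" as this ratio is an assumption for illustration; the source does not state how its 3× figure is derived.

```python
# Average Delta from Table 1 for families with both variants: (Instruct, Thinking).
avg_delta = {
    "Qwen3-VL-8B": (22.5, 39.4),
    "Qwen3-VL-32B": (17.9, 48.3),
    "Qwen3-VL-30B-A3B": (20.9, 54.5),
    "Qwen3-VL-235B-A22B": (19.9, 54.6),
    "ERNIE-4.5-VL-28B-A3B": (34.3, 60.3),
    "Kimi-VL-A3B": (27.0, 42.4),
}

# Ratio of the Thinking drop to the Instruct drop within each family.
ratios = {name: thinking / instruct for name, (instruct, thinking) in avg_delta.items()}

for name, ratio in ratios.items():
    print(f"{name}: Thinking drop is {ratio:.2f}x the Instruct drop")

# Family where the Thinking variant is most affected relative to Instruct.
most_affected = max(ratios, key=ratios.get)
print(most_affected)
```

Under this reading, every ratio is above 1 and the largest falls just short of 3.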

Attention analysis showing visual attention during probe versus multi-turn settings — **Attention analysis explains the gap.** Self-reflective statements elicit negligible attention shifts to image tokens during decoding, while explicit multi-turn user instructions trigger a substantial surge in visual attention — under identical context.
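A simple proxy for this kind of analysis is the share of a decoding step's attention mass that falls on image-token positions. The sketch below is an assumed formulation, not necessarily the authors' measurement: `visual_attention_share` is a hypothetical helper, and the attention values are made-up numbers chosen only to contrast the two conditions.

```python
def visual_attention_share(attn_weights: list[float], is_image_token: list[bool]) -> float:
    """Fraction of one decoding step's attention mass that lands on image tokens."""
    if len(attn_weights) != len(is_image_token):
        raise ValueError("attention row and token mask must have the same length")
    total = sum(attn_weights)
    on_image = sum(w for w, is_img in zip(attn_weights, is_image_token) if is_img)
    return on_image / total


# Toy context: 3 image tokens followed by 5 text tokens (made-up numbers).
mask = [True, True, True, False, False, False, False, False]
after_self_reflection = [0.02, 0.02, 0.01, 0.10, 0.15, 0.20, 0.25, 0.25]
after_user_instruction = [0.15, 0.20, 0.15, 0.05, 0.10, 0.10, 0.10, 0.15]

share_reflection = visual_attention_share(after_self_reflection, mask)
share_instruction = visual_attention_share(after_user_instruction, mask)
print(share_reflection, share_instruction)
```

In practice the attention row would come from the model (for example, averaged over heads and layers), which is a design choice this sketch leaves open.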

Context length ablation showing monotonic accuracy decline — **Reasoning length amplifies the illusion.** As more reasoning context is retained before the swap, probe accuracy declines monotonically. The longer the chain of thought, the more decoupled the model becomes from visual evidence.
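An ablation of this kind needs probe prefixes that retain different amounts of the original reasoning. The sketch below shows one plausible construction, keeping the first fraction of the reasoning chain before appending the trigger. Truncating by whitespace-separated words and the helper name `probe_prefix` are assumptions; the source does not specify how context length is controlled.

```python
def probe_prefix(reasoning: str, keep_ratio: float, trigger: str) -> str:
    """Keep the first keep_ratio of the reasoning (by words), then append the trigger."""
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    words = reasoning.split()
    kept = words[: int(len(words) * keep_ratio)]
    return " ".join(kept + [trigger])


# Toy reasoning chain (illustrative only).
reasoning_a = "The bar for A is 4 and the bar for B is 7 so B is larger"
trigger = "Wait, let me check the figure again."

prefixes = [probe_prefix(reasoning_a, ratio, trigger) for ratio in (0.0, 0.25, 0.5, 1.0)]
for prefix in prefixes:
    print(repr(prefix))
```

Each prefix would then be continued by the model with the swapped image in place, and probe accuracy compared across ratios.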

Table 2. Per-benchmark accuracy under Base, Probe, and Multi-turn settings.

Model	Variant	MathVista			MathVerse			MathVision			MMMU-Pro
Model	Variant	Base	Probe	Multi	Base	Probe	Multi	Base	Probe	Multi	Base	Probe	Multi
Qwen3-VL-8B	Instruct	82.5	55.0	68.4	70.5	44.0	53.5	49.5	31.0	42.0	74.0	56.5	69.0
Qwen3-VL-8B	Thinking	84.5	36.5	71.5	83.0	29.5	77.0	56.0	27.0	47.0	80.5	53.5	74.5
Qwen3-VL-235B-A22B	Instruct	89.0	62.5	83.2	83.5	63.5	82.0	62.5	40.5	55.0	89.5	78.5	91.5
Qwen3-VL-235B-A22B	Thinking	93.5	29.5	83.2	96.5	22.5	97.0	74.0	31.0	71.5	91.0	53.5	90.0

Multi-turn intervention restores grounding. Averaged over the four benchmarks, Qwen3-VL-235B-A22B-Thinking rebounds from 34.1% (Probe) to 85.4% (Multi-turn), nearly matching its 88.8% baseline. The visual capability is intact; what fails is autonomous attentional control.
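The averaged figures quoted here follow from the Qwen3-VL-235B-A22B (Thinking) row of Table 2. The sketch below reproduces that arithmetic. The final "recovered share" is an illustrative derived quantity, not a metric defined in the source.

```python
# Qwen3-VL-235B-A22B (Thinking) row from Table 2, ordered as
# MathVista, MathVerse, MathVision, MMMU-Pro.
base = [93.5, 96.5, 74.0, 91.0]
probe = [29.5, 22.5, 31.0, 53.5]
multi = [83.2, 97.0, 71.5, 90.0]


def mean(values: list[float]) -> float:
    """Unweighted mean over the four benchmarks."""
    return sum(values) / len(values)


avg_base, avg_probe, avg_multi = mean(base), mean(probe), mean(multi)
print(f"Base {avg_base:.1f}  Probe {avg_probe:.1f}  Multi-turn {avg_multi:.1f}")

# Share of the probe-induced drop that the multi-turn instruction recovers
# (illustrative only; not a metric from the source).
recovered = (avg_multi - avg_probe) / (avg_base - avg_probe)
print(f"Recovered share of the drop: {recovered:.0%}")
```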

Media

Case Studies

Representative examples across four domains (Geometry, Chart Understanding, Synthetic Scene VQA, and Function Plots) illustrate both the dominant failure mode, textual inertia, and cases of genuine visual re-grounding.

Error case: geometry angle problem where the model ignores the swapped 50-degree figure — **Textual inertia in geometry.** The model computes an angle from a 60° input in *I_a*. After the transparent swap to a 50° figure, it ignores the pixel-level evidence and hallucinates the original value (180° − 60°) to stay consistent with its prior derivation.

Error case: bar chart where the model misses that the Burlywood bar became the minimum — **Stale measurements in a bar chart.** The “Burlywood” bar is shortened to become the new minimum, but the model never notices. It reuses bar lengths extracted before the swap, hallucinating them to justify an incorrect “No”.

Error case: counting task where the model keeps counting an absent purple cube — **Anchored object inventory.** In a counting task, the purple cube of *I_a* is replaced by a cyan one in *I_b*. The model keeps subtracting the now-absent purple cube — reasoning from the memory of *I_a* rather than the current image.

Error case: function plot where the model misses that Soft overtakes Dense — **Linguistic-only reflection.** The curves are altered so “Soft” overtakes “Dense”. Despite the visible crossover, the model restates the textual conclusion from *R_a*, confirming the re-examination was merely linguistic.

Success case: geometry problem where the model correctly reads the swapped 50-degree angle — **Breaking textual inertia.** The model explicitly identifies the new 50° angle in *I_b*, distinguishes it from the 60° in its history, and correctly re-derives the supplementary angle as 130°.

Success case: bar chart where the model detects the shortened Black bar — **Genuine visual update.** The model perceives that the “Black” bar — previously the longest — has been shortened, updates its comparison logic, and reverses its conclusion to “no”.

Success case: counting task where the model detects the red sphere was replaced by green — **Re-grounded arithmetic.** The model detects that the red sphere was replaced by a green one, subtracts zero red items instead of one, and arrives at the correct count of 6.

Success case: function plot where the model adapts from a parabola to an absolute-value function — **Dramatic visual correction.** The model transitions from a parabola (x²) in *I_a* to an absolute-value function (|2x−3|+1) in *I_b*, abandons the derivative comparison, and identifies the constant, equal slopes of the V-shaped graph.


Reference

Citation

@misc{shi2026vlmsseeingjustsaying,
      title={Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination}, 
      author={Chufan Shi and Cheng Yang and Yaokang Wu and Linhao Jin and Bo Shui and Taylor Berg-Kirkpatrick and Xuezhe Ma},
      year={2026},
      eprint={2605.15864},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.15864}, 
}

Notes

Acknowledgements

This work is funded in part by the Schmidt Foundation and by the National Science Foundation under grant 2146151. We thank the authors of MathVista, MathVerse, MathVision, and MMMU-Pro for releasing the benchmarks that VS-Bench builds upon, and the open-source teams behind Qwen3-VL, Qwen2.5-VL, Kimi-VL, and ERNIE-4.5-VL for making this analysis possible.