Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

UC Merced, PKU, NTU, CUHK, WHU, FDU
Teaser: Visual Reasoning Tracer grounds object-level reasoning across diverse visual scenes.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for training reasoning models. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path. All benchmarks and code are available here.
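The exact definition of the trace-quality metric is given in the paper. As a rough illustration only, the sketch below scores a predicted trace by greedily matching predicted intermediate objects to ground-truth ones via IoU and reporting step precision/recall alongside the IoU of the final target; the function names, the box representation, and the 0.5 threshold are assumptions made for this example, not the VRT metric itself.

# Illustrative only: one plausible flavor of trace-quality scoring, NOT the official VRT metric.
# Objects are axis-aligned boxes (x1, y1, x2, y2); the benchmark also uses finer-grained
# evidence such as segmentation masks, so treat this as a simplified stand-in.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def trace_score(pred_steps: List[Box], gt_steps: List[Box],
                pred_target: Box, gt_target: Box, thr: float = 0.5) -> Dict[str, float]:
    """Greedy one-to-one matching of intermediate objects, plus final-target IoU."""
    unmatched = list(range(len(gt_steps)))
    hits = 0
    for p in pred_steps:
        best_j, best_iou = -1, 0.0
        for j in unmatched:
            v = iou(p, gt_steps[j])
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= thr:
            unmatched.remove(best_j)
            hits += 1
    return {
        "step_precision": hits / len(pred_steps) if pred_steps else 0.0,
        "step_recall": hits / len(gt_steps) if gt_steps else 0.0,
        "target_iou": iou(pred_target, gt_target),
    }

Under such a scheme, a model that grounds only the final answer would score a high target_iou but near-zero step_recall, which is exactly the failure mode VRT is designed to expose.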

VRT-Bench Examples

VRT-Bench is a human-annotated benchmark for evaluating visual reasoning. Unlike traditional tasks that only focus on the final result, VRT-Bench requires models to explicitly predict the intermediate objects that form the reasoning path, providing fine-grained evidence for the final prediction.
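To make the task format concrete, the hypothetical snippet below shows one way a predicted trace could be represented: an ordered list of reasoning steps, each pairing a short textual rationale with the grounded object it refers to, with the last step marking the final target. The field names and the sample query are illustrative assumptions, not the actual VRT-Bench annotation schema.

# Hypothetical trace representation; field names are illustrative, not the VRT-Bench schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TraceStep:
    step_text: str                          # short textual rationale for this step
    box: Tuple[float, float, float, float]  # grounded evidence as an (x1, y1, x2, y2) box
    is_target: bool = False                 # True only for the final answer object

# A query such as "the mug of the person using the laptop" might unfold into:
trace: List[TraceStep] = [
    TraceStep("Locate the person using the laptop.", (120.0, 40.0, 310.0, 400.0)),
    TraceStep("Find the desk in front of that person.", (90.0, 260.0, 420.0, 430.0)),
    TraceStep("The mug on that desk is the target.", (350.0, 300.0, 410.0, 370.0), is_target=True),
]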

Examples 1–4 from VRT-Bench.

Comparison

Our R-Sa2VA model produces a significantly more complete and interpretable reasoning trace than the Gemini 2.5 Pro and Qwen3-VL baselines. For the same query, R-Sa2VA sequentially identifies intermediate objects, linking each textual reasoning step to its visual evidence with a dedicated segmentation mask. In contrast, the baselines often segment only the final answer or lack an explicit multi-object reasoning path, leaving intermediate cues ungrounded. These visualizations show that while existing pipelines can often locate the target, our model exposes a coherent, object-level visual reasoning process.

Qualitative comparison panels: Ground Truth (GT), R-Sa2VA (Ours), Gemini 2.5 Pro, and Qwen3-VL.

BibTeX

@article{yuan2025vrt,
  author    = {Haobo Yuan and Yueyi Sun and Yanwei Li and Tao Zhang and Xueqing Deng and Henghui Ding and Lu Qi and Anran Wang and Xiangtai Li and Ming-Hsuan Yang},
  title     = {Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark},
  journal   = {arXiv preprint},
  year      = {2025},
}