arxiv:2507.13348

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Published on Jul 17

Submitted by Senqiao on Jul 18

#2 Paper of the day

Authors: Senqiao Yang, Junyi Li, Jiaya Jia

Abstract

AI-generated summary: VisionThink dynamically adjusts image resolution and visual token processing for efficient and effective vision-language tasks, improving performance and reducing computational cost.

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
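The control flow described in the abstract (answer from a downsampled image first, and fall back to the full-resolution image only when the model asks for it) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token string `<request_high_res>`, the function names, the toy image type, and the reading of "1/4 resolution" as halving each side are all assumptions made for the example, and `model` stands in for any callable VLM.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical marker: the abstract only says "a special token".
RESIZE_TOKEN = "<request_high_res>"

# Toy stand-in for an image: rows of grayscale pixel values.
Image = List[List[int]]
# Any callable taking (image, question) and returning text can act as the model.
Model = Callable[[Image, str], str]


def downsample(image: Image, factor: int = 2) -> Image:
    """Keep every `factor`-th pixel on each axis (factor=2 keeps 1/4 of the pixels)."""
    return [row[::factor] for row in image[::factor]]


@dataclass
class Answer:
    text: str
    used_high_res: bool


def answer_with_dynamic_resolution(
    model: Model, image: Image, question: str, factor: int = 2
) -> Answer:
    """Try the cheap low-resolution pass first; escalate only on request."""
    first = model(downsample(image, factor), question)
    if RESIZE_TOKEN not in first:
        # The model judged the downsampled image sufficient.
        return Answer(first, used_high_res=False)
    # The model asked for more detail, so rerun on the original image.
    return Answer(model(image, question), used_high_res=True)


def toy_model(image: Image, question: str) -> str:
    """Illustrative fake model: 'reading' questions need at least 4 pixel columns."""
    if "read" in question.lower() and len(image[0]) < 4:
        return RESIZE_TOKEN
    return f"answer from {len(image)}x{len(image[0])} image"
```

With the toy model, a general question is answered from the downsampled image, while an OCR-style question triggers the second, full-resolution pass. How the real model learns when to emit the request is what the reinforcement learning setup addresses, and that part is not shown here.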

Community

Senqiao

Paper author Paper submitter 19 days ago

🎯 Code: https://github.com/dvlab-research/VisionThink
🤗 Models & Datasets: https://huggingface.co/collections/Senqiao/visionthink-6878d839fae02a079c9c7bfe
🌟 Video: https://www.youtube.com/watch?v=DGjbFbA5mBw

Rodeszones

16 days ago

By the same logic, couldn't this be applied to video as well? For example, extract the I-frames (also known as keyframes) from H.264-encoded video first, then extract other frames only as needed.

Senqiao

Paper author 16 days ago

Hi @Rodeszones , thanks for your interest in VisionThink!

Yes, I agree—this idea can naturally extend to video. As you suggested, we could start by inputting only the I-frames (keyframes) to the VLM. If the model determines that more context is needed, it can then request additional frames. Reinforcement learning could even be used to identify which segments are most informative and selectively process those with higher temporal resolution (i.e., higher FPS).

Exciting direction to explore!
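As a rough illustration of the video idea discussed in this thread (keyframes first, more frames only on request), here is a small sketch. It is not part of VisionThink: the `<request_more_frames>` token, the function names, and the toy model are hypothetical, and the keyframe indices are simply passed in rather than read from an H.264 stream.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical marker for "I need more temporal context".
MORE_FRAMES_TOKEN = "<request_more_frames>"

Frame = str  # toy stand-in for a decoded frame
VideoModel = Callable[[List[Frame], str], str]


def answer_video_coarse_to_fine(
    model: VideoModel,
    frames: Sequence[Frame],
    keyframe_indices: Sequence[int],
    question: str,
) -> Tuple[str, int]:
    """Return (answer, number of frames used), starting from keyframes only."""
    keyframes = [frames[i] for i in keyframe_indices]
    first = model(keyframes, question)
    if MORE_FRAMES_TOKEN not in first:
        # Keyframes alone were enough.
        return first, len(keyframes)
    # Escalate to every frame; a finer scheme could add only selected segments.
    return model(list(frames), question), len(frames)


def toy_video_model(frames: List[Frame], question: str) -> str:
    """Illustrative fake model: timing questions need at least 3 frames."""
    if "when" in question.lower() and len(frames) < 3:
        return MORE_FRAMES_TOKEN
    return f"answer from {len(frames)} frames"
```

The fallback here is deliberately crude (all frames at once). Choosing which segments to densify, as suggested in the reply, would need a richer request format than a single token.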