FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions Paper • 2509.17177 • Published 3 days ago • 11
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning Paper • 2401.14011 • Published Jan 25, 2024
Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs Paper • 2505.11842 • Published May 17