OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer Paper • 2406.16620 • Published Jun 24, 2024 • 3
Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head Paper • 2403.06892 • Published Mar 11, 2024 • 2
How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection Paper • 2308.13177 • Published Aug 25, 2023 • 1
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing Paper • 2306.11300 • Published Jun 20, 2023 • 2
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection Paper • 2312.15043 • Published Dec 22, 2023 • 2
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations Paper • 2207.00221 • Published Jul 1, 2022 • 2
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network Paper • 2209.05946 • Published Sep 10, 2022 • 2
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding Paper • 2407.04923 • Published Jul 6, 2024 • 2
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration Paper • 2411.16044 • Published Nov 25, 2024 • 2
omlab/VLM-R1-Qwen2.5VL-3B-Math-0305 Visual Question Answering • 4B • Updated Apr 14 • 1.91k • 6
omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps Zero-Shot Object Detection • 4B • Updated Apr 14 • 1.78k • 23
KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model Paper • 2506.20923 • Published Jun 26 • 4
HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2 Feature Extraction • 0.5B • Updated Jun 28 • 605 • 25
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models Paper • 2505.04921 • Published May 8 • 186
VLM R1 Referral Expression • Runtime error • 72 • Mark regions in images based on text descriptions
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model Paper • 2504.07615 • Published Apr 10 • 32