Vision Language Models: 2025 Update

sergiopaniego's Collections

updated May 12

This collection includes all the models, datasets, and Spaces mentioned in the blog post "Vision Language Models: 2025 Update".

Qwen/Qwen2.5-Omni-7B

Any-to-Any • 11B • Updated Apr 30 • 116k • 1.73k
Qwen2.5 Omni 7B Demo 🏆

Space • Running • 334 • Generate text and speech responses from various inputs
Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26 • 165
openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Jun 20 • 115k • 1.21k
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1 • 129k • 3.47k
Chat With Janus-Pro-7B 🌍

Space • Running on Zero • 1.99k • A unified multimodal understanding and generation model.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Paper • 2501.17811 • Published Jan 29 • 6
Qwen/QVQ-72B-Preview

Image-Text-to-Text • 73B • Updated Jan 12 • 38k • 599
moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated Jun 27 • 99.3k • 433
Chat with Kimi-VL-A3B-Thinking-2506 🤔

Space • Running on Zero • 162 • Chat with images, videos, or PDFs to generate text
Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10 • 133
moonshotai/MoonViT-SO-400M

Image Feature Extraction • 0.4B • Updated Apr 17 • 498 • 20
google/siglip-so400m-patch14-384

Zero-Shot Image Classification • 0.9B • Updated Sep 26, 2024 • 2.76M • 576
moonshotai/Kimi-VL-A3B-Instruct

Image-Text-to-Text • 16B • Updated 6 days ago • 183k • 226
HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8 • 97.5k • 528
SmolVLM 📊

Space • Running on Zero • 142 • Generate answers by combining text and images
HuggingFaceTB/SmolVLM2-2.2B-Instruct

Image-Text-to-Text • 2B • Updated Apr 8 • 99.3k • 232
SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 196
SmolVLM 📊

Space • Build error • 80 • Generate answers by combining text and images
google/gemma-3-27b-it

Image-Text-to-Text • 27B • Updated Mar 21 • 349k • 1.54k
unsloth/gemma-3-27b-it-GGUF

Image-Text-to-Text • 27B • Updated May 12 • 31k • 145
google/gemma-3-27b-it-qat-q4_0-gguf

Image-Text-to-Text • 27B • Updated Apr 11 • 8.22k • 325
meta-llama/Llama-4-Scout-17B-16E-Instruct

Image-Text-to-Text • 109B • Updated May 22 • 745k • 1.03k
meta-llama/Llama-4-Maverick-17B-128E-Instruct

Image-Text-to-Text • 402B • Updated May 22 • 48.3k • 388
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Paper • 2401.15947 • Published Jan 29, 2024 • 54
deepseek-ai/deepseek-vl2

Image-Text-to-Text • 27B • Updated Dec 18, 2024 • 4.18k • 350
Chat with DeepSeek-VL2-small 🌍

Space • Running on Zero • 515 • Generate responses using images and text input
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Paper • 2412.10302 • Published Dec 13, 2024 • 18
lerobot/pi0

Robotics • 4B • Updated Mar 6 • 12.3k • 282
lerobot/pi0fast_base

Robotics • 3B • Updated Mar 31 • 1.62k • 24
nvidia/GR00T-N1-2B

Robotics • 2B • Updated 27 days ago • 1.86k • 324
google/paligemma-3b-pt-224

Image-Text-to-Text • 3B • Updated Sep 21, 2024 • 41.8k • 340
PaliGemma: A versatile 3B VLM for transfer

Paper • 2407.07726 • Published Jul 10, 2024 • 72
PaliGemma Demo 🤲

Space • Paused • 313 • Annotate and describe images with text prompts
PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published Dec 4, 2024 • 134
Paligemma2 Mix 🌖

Space • Running on Zero • 91 • Generate text or segment objects from an image
google/paligemma2-10b-mix-448

Image-Text-to-Text • 10B • Updated Feb 7 • 4k • 31
allenai/Molmo-72B-0924

Image-Text-to-Text • 73B • Updated Jun 19 • 2.56k • 287
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 122
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • 73B • Updated Jun 6 • 603k • 517
Qwen2.5-VL Technical Report

Paper • 2502.13923 • Published Feb 19 • 199
google/shieldgemma-2-4b-it

Image-Text-to-Text • 4B • Updated Apr 4 • 1.53k • 117
ShieldGemma 2: Robust and Tractable Image Content Moderation

Paper • 2504.01081 • Published Apr 1 • 3
ShieldGemma2 VLM 📉

Space • Running on Zero • 12 • Demo for ShieldGemma 2, multimodal safety model
meta-llama/Llama-Guard-4-12B

Image-Text-to-Text • 12B • Updated Apr 29 • 14.7k • 50
Llama Guard 4 🦀

Space • Runtime error • Check if text and images are safe
Qwen2.5 VL 72B Instruct 💻

Space • Running • 259 • Interact with a multimodal chatbot using text and images
marco/mcdse-2b-v1

2B • Updated Oct 29, 2024 • 2.63k • 56
vidore/colpali-v1.3

Visual Document Retrieval • Updated Mar 14 • 299k • 63
ColPali: Efficient Document Retrieval with Vision Language Models

Paper • 2407.01449 • Published Jun 27, 2024 • 50
vidore/colqwen2.5-v0.2

Visual Document Retrieval • Updated Jun 16 • 40k • 61
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14 • 1.21k • 52
Qwen/Qwen2.5-VL-32B-Instruct

Image-Text-to-Text • 33B • Updated Apr 14 • 483k • 414
Qwen2.5 VL 32B Instruct Demo 🏃

Space • Running • 141 • Chat with images and videos using Qwen
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • 8B • Updated Feb 28 • 185 • 72
LongVU 🌖

Space • Running on Zero • 84 • Generate responses to video or image inputs
openbmb/RLAIF-V-Dataset

Viewer • Updated Mar 4 • 74.8k • 2k • 180
HuggingFaceH4/rlaif-v_formatted

Viewer • Updated Jul 2, 2024 • 83.1k • 836 • 11
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Paper • 2404.16006 • Published Apr 24, 2024
Kaining/MMT-Bench

Viewer • Updated Jun 21, 2024 • 30k • 67 • 10
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Paper • 2409.02813 • Published Sep 4, 2024 • 32
MMMU/MMMU_Pro

Viewer • Updated Mar 8 • 5.19k • 4.9k • 29
reducto/RolmOCR

Image-to-Text • 8B • Updated Apr 2 • 195k • 464
Alpha-VLLM/Lumina-mGPT-7B-768

Any-to-Any • 7B • Updated Apr 7 • 2.84k • 36
facebook/chameleon-7b

Image-Text-to-Text • 7B • Updated Jul 23, 2024 • 36.1k • 187
