
merve PRO

merve


https://github.com/merveenoyan/smol-vision

mervenoyann
merveenoyan
merve.bsky.social

AI & ML interests

I love this website VLMs, vision & co

Recent Activity

liked a Space about 9 hours ago

prithivMLmods/Qwen-Image-Diffusion

updated a dataset about 14 hours ago

huggingface/documentation-images

posted an update about 14 hours ago

massive releases and tons of FLUX.1 Krea LoRAs past week! here's some of the picks, find more models in collection 🫡 https://huggingface.co/collections/merve/releases-august-2-6890c14248203522b7d0267f

LLMs 💬
> Tencent dropped https://huggingface.co/tencent/Hunyuan-7B-Instruct
> Qwen released https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct, 30B MoE with 3B params for coding (OS)

vision/multimodal
> RedNote released https://huggingface.co/rednote-hilab/dots.ocr - 3B OCR model (OS)
> Cohere released https://huggingface.co/CohereLabs/command-a-vision-07-2025 - 112B (dense!) VLM for 6 languages
> StepFun-AI shipped https://huggingface.co/stepfun-ai/step3 - 321B MoE VLM (OS)
> Skywork shipped https://huggingface.co/Skywork/Skywork-UniPic-1.5B - new any-to-any model (image+text → image+text) (OS)


merve's collections 67

Releases August 2

stepfun-ai/step3

Image-Text-to-Text • 321B • Updated 3 days ago • 421 • 125
nunchaku-tech/nunchaku-flux.1-krea-dev

Text-to-Image • Updated 4 days ago • 11.2k • 57
fdtn-ai/Foundation-Sec-8B-Instruct

Text Generation • 8B • Updated about 3 hours ago • 1.12k • 18
Wan-AI/Wan2.2-TI2V-5B-Diffusers

Text-to-Video • Updated 8 days ago • 16.2k • 54

Releases July 18

nvidia/OpenReasoning-Nemotron-32B

Text Generation • 33B • Updated 3 days ago • 3.61k • 109
ByteDance-Seed/Seed-X-RM-7B

Translation • Updated 5 days ago • 148k • 25
LGAI-EXAONE/EXAONE-4.0-32B

Text Generation • 32B • Updated 1 day ago • 550k • 226
vidore/colqwen-omni-v0.1

Visual Document Retrieval • Updated 19 days ago • 5.81k • 86

Releases July 4

apple/DiffuCoder-7B-cpGRPO

8B • Updated Jul 4 • 4.74k • 303
BAAI/MTVCraft

Text-to-Video • Updated 29 days ago • 238 • 35
kyutai/tts-1.6b-en_fr

Text-to-Speech • Updated 28 days ago • 66.8k • 307
apple/DiffuCoder-7B-Base

8B • Updated Jul 4 • 940 • 21

June 20 Releases

moonshotai/Kimi-VL-A3B-Thinking-2506

Image-Text-to-Text • 16B • Updated 5 days ago • 42.1k • 250
mistralai/Mistral-Small-3.2-24B-Instruct-2506

24B • Updated 8 days ago • 138k • 390
kyutai/stt-1b-en_fr

Automatic Speech Recognition • Updated Jun 26 • 79
google/magenta-realtime

Updated 21 days ago • 420 • 463

Releases June 13

ByteDance/LatentSync-1.6

Updated Jun 12 • 9.82k • 34
V-JEPA 2

Collection

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated Jun 13 • 153
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20 • 176k • 1.46k
tencent/Hunyuan3D-2.1

Image-to-3D • Updated 6 days ago • 58k • 603

Releases 30 May

All the releases of the week of 30th May.

deepseek-ai/DeepSeek-R1-0528

Text Generation • 685B • Updated May 29 • 455k • 2.35k
Running on Zero

200

BAGEL

🚀

Demo for BAGEL
tencent/HunyuanPortrait

Image-to-Video • Updated May 27 • 70
XiaomiMiMo/MiMo-7B-RL-0530

Text Generation • 8B • Updated Jun 5 • 7.36k • 38

May 16 Releases

Qwen/WorldPM-72B

Text Classification • 73B • Updated May 17 • 1.09k • 75
Running on Zero

MCP

1.05k

LTX Video Fast

🎥

ultra-fast video model, LTX 0.9.8 13B distilled
BLIP3o/BLIP3o-Pretrain-Long-Caption

Viewer • Updated Jun 26 • 27.2M • 21.6k • 41
BLIP3o/BLIP3o-Model-8B

14B • Updated Jun 4 • 1.63k • 101

Any-to-Any Models, Datasets, Spaces

Running

76

MMaDA

🌍

Demo for MMaDA: Multimodal Large Diffusion Language Models
Running on Zero

200

BAGEL

🚀

Demo for BAGEL
Gen-Verse/MMaDA-8B-Base

Any-to-Any • 8B • Updated May 24 • 7.65k • 82
ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated Jun 23 • 1.07k • 1.1k

InternVL3 HF

OpenGVLab/InternVL3-1B-hf

Image-Text-to-Text • 0.9B • Updated Apr 23 • 42.2k • 5
OpenGVLab/InternVL3-2B-hf

Image-Text-to-Text • 2B • Updated Apr 23 • 21.5k • 2
OpenGVLab/InternVL3-8B-hf

Image-Text-to-Text • 8B • Updated Apr 23 • 37k • 8
OpenGVLab/InternVL3-14B-hf

Image-Text-to-Text • 15B • Updated Apr 23 • 7.62k

Multimodal DSE Retrievers

A collection of DSE models for multimodal retrieval

racineai/Flantier-SmolVLM-2B-dse

2B • Updated Jun 18 • 350 • 9
MrLight/dse-qwen2-2b-mrl-v1

Visual Document Retrieval • Updated Feb 26 • 5.99k • 59
marco/mcdse-2b-v1

2B • Updated Oct 29, 2024 • 2.63k • 56
llamaindex/vdr-2b-multi-v1

Image-to-Text • 2B • Updated May 21 • 6.11k • 118

March 28 Releases

deepseek-ai/DeepSeek-V3-0324

Text Generation • 685B • Updated Mar 27 • 459k • 3.02k
Qwen/Qwen2.5-Omni-7B

Any-to-Any • 11B • Updated Apr 30 • 116k • 1.73k
google/txgemma-27b-chat

Text Generation • 27B • Updated Apr 10 • 1.31k • 54
Running

334

Qwen2.5 Omni 7B Demo

🏆

Generate text and speech responses from various inputs

Türkçe VLMler

Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6 • 570k • 1.22k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Jan 12 • 689k • 434
CohereLabs/aya-vision-8b

Image-Text-to-Text • 9B • Updated 5 days ago • 29.4k • 306
CohereLabs/aya-vision-32b

Image-Text-to-Text • 33B • Updated May 14 • 171 • 212

Feb 7 Releases 🧣

lerobot/pi0

Robotics • 4B • Updated Mar 6 • 12.3k • 282
kyutai/hibiki-2b-pytorch-bf16

Translation • Updated May 28 • 197 • 55
Alpha-VLLM/Lumina-Image-2.0

Text-to-Image • Updated Mar 30 • 11.5k • 331
adyen/DABstep

Viewer • Updated about 3 hours ago • 490k • 5.67k • 25

Models, Jan 27

Running on Zero

255

Qwen2-VL-7B

🔥

Generate text by combining an image and a question
Running

57

UI-TARS

🌖

Select coordinates on an image based on instructions
Running

87

Qwen2.5-1M Demo

💻

Upload documents and ask questions
Qwen/Qwen2.5-14B-Instruct-1M

Text Generation • 15B • Updated Jan 29 • 21.1k • 316

Jan 17 Releases ❄️

Models and datasets of the second week of Jan 2025.

openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Jun 20 • 115k • 1.21k
MiniMaxAI/MiniMax-Text-01

Text Generation • 456B • Updated Jul 3 • 1.68k • 638
OuteAI/OuteTTS-0.3-1B

Text-to-Speech • 1B • Updated Apr 24 • 1.2k • 101
NovaSky-AI/Sky-T1_data_17k

Viewer • Updated Jan 14 • 16.4k • 276 • 184

Dec 6 Releases 🎄

meta-llama/Llama-3.3-70B-Instruct

Text Generation • 71B • Updated Dec 21, 2024 • 386k • 2.46k
Qwen/Qwen2-VL-72B

Image-Text-to-Text • 73B • Updated Dec 6, 2024 • 810 • 79
google/paligemma2-3b-pt-224

Image-Text-to-Text • 3B • Updated Dec 5, 2024 • 177k • 154
tencent/HunyuanVideo

Text-to-Video • Updated Mar 6 • 1.58k • 2k

Nov 22 Releases ❄️

mistralai/Pixtral-Large-Instruct-2411

Updated 8 days ago • 81 • 418
microsoft/orca-agentinstruct-1M-v1

Viewer • Updated Nov 1, 2024 • 1.05M • 1.27k • 447
Xkev/Llama-3.2V-11B-cot

Image-Text-to-Text • 11B • Updated Dec 16, 2024 • 3.93k • 153
jinaai/jina-clip-v2

Feature Extraction • 0.9B • Updated Apr 28 • 51k • 268

Nov 1 Releases

Running on Zero

84

LongVU

🌖

Generate responses to video or image inputs
facebook/MobileLLM-1B

Text Generation • Updated May 5 • 236 • 120
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • 8B • Updated Feb 28 • 185 • 72
Vision-CAIR/LongVU_Llama3_2_3B_img

Updated Feb 28 • 3 • 6

October 25 Releases

ibm-granite/granite-3.0-8b-instruct

Text Generation • 8B • Updated Dec 19, 2024 • 23.4k • 202
ibm-granite/granite-3.0-2b-instruct

Text Generation • 3B • Updated Dec 19, 2024 • 4.25k • 46
CohereLabs/aya-expanse-8b

Text Generation • 8B • Updated 5 days ago • 14.6k • 390
CohereLabs/aya-expanse-32b

Text Generation • 32B • Updated 5 days ago • 6.54k • 266

New Depth Models

Recent depth models

Running on Zero

177

DepthCrafter

🦀

a super consistent video depth model
Paused

222

Depth Pro

🚀

Generate an inverse depth map from an image
Runtime error

75

LOTUS Depth

🚀

Generate depth maps from images and videos
apple/DepthPro

Depth Estimation • Updated Feb 28 • 7.33k • 458

Computer Vision Backbones 🧩

A collection of useful computer vision backbones to fine-tune. It also includes large image classification models that can be used as backbones.

microsoft/resnet-50

Image Classification • 0.0B • Updated Feb 13, 2024 • 151k • 431
google/vit-base-patch16-224-in21k

Image Feature Extraction • 0.1B • Updated Feb 5, 2024 • 3.13M • 360
google/vit-base-patch32-224-in21k

Image Feature Extraction • 0.1B • Updated Dec 8, 2022 • 7.87k • 19
facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 960k • 88

Object Detection Models 🥥

facebook/detr-resnet-50

Object Detection • 0.0B • Updated Apr 10, 2024 • 412k • 879
facebook/detr-resnet-101-dc5

Object Detection • 0.1B • Updated Sep 6, 2023 • 5.86k • 19
facebook/detr-resnet-50-dc5

Object Detection • 0.0B • Updated Sep 7, 2023 • 1.65k • 6
google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 134k • 135

Zero-shot Image Classification Models 🖼️

A collection of models that can be used for zero-shot image classification.

openai/clip-vit-large-patch14

Zero-Shot Image Classification • 0.4B • Updated Sep 15, 2023 • 11.2M • 1.82k
openai/clip-vit-base-patch32

Zero-Shot Image Classification • Updated Feb 29, 2024 • 17.4M • 732
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Zero-Shot Image Classification • Updated Jan 22 • 1.17M • 285
kakaobrain/align-base

Zero-Shot Image Classification • Updated Mar 8, 2023 • 31.2k • 26

Video Classification Models 📺

microsoft/xclip-base-patch32

Video Classification • 0.2B • Updated Feb 4, 2024 • 251k • 96
facebook/timesformer-base-finetuned-k400

Video Classification • Updated Jan 2, 2023 • 22.1k • 42
facebook/timesformer-base-finetuned-k600

Video Classification • Updated Dec 12, 2022 • 10.1k • 12
google/vivit-b-16x2

Video Classification • Updated Aug 3, 2023 • 459 • 11

Text-to-Image Models 🥑

stabilityai/stable-diffusion-xl-base-1.0

Text-to-Image • Updated Oct 30, 2023 • 2.46M • 6.8k
warp-ai/wuerstchen

Text-to-Image • Updated Mar 12, 2024 • 457 • 174
Deci/DeciDiffusion-v1-0

Text-to-Image • Updated Feb 15, 2024 • 10 • 138
stabilityai/stable-diffusion-xl-refiner-1.0

Image-to-Image • Updated Sep 25, 2023 • 488k • 1.94k

Segment Anything Model

This collection contains models and demos of SAM and its smaller friends.

facebook/sam-vit-huge

Mask Generation • 0.6B • Updated Jan 11, 2024 • 135k • 174
facebook/sam-vit-base

Mask Generation • 0.1B • Updated Jan 11, 2024 • 274k • 144
facebook/sam-vit-large

Mask Generation • 0.3B • Updated Jan 11, 2024 • 247k • 28
Runtime error

43

Grounded SAM

💩

SigLIP

A collection dedicated to SigLIP applications

Running on Zero

71

Draw To Search Art

🐠

Draw/upload image and search among WikiART using SigLIP
Running on CPU Upgrade

22

Compare Clip Siglip

🏃

Compare strong zero-shot image classification models
Running on Zero

13

Multilingual Zero Shot Image Clf

🏢

Comparing powerful multilingual zero-shot image clf models
BAAI/bunny-phi-2-siglip-lora

Text Generation • Updated Mar 28, 2024 • 21 • 48

SegGPT

A collection of everything SegGPT.

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Paper • 2212.02499 • Published Dec 5, 2022
SegGPT: Segmenting Everything In Context

Paper • 2304.03284 • Published Apr 6, 2023 • 1
BAAI/seggpt-vit-large

0.4B • Updated Feb 22, 2024 • 2.4k • 4
BAAI/SegGPT

Updated Apr 21, 2023 • 19

gvhf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 134k • 135
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 7.22k • 12
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 45.1k • 25
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 67.9k • 27

merve/owl2

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 134k • 135
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 7.22k • 12
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 45.1k • 25
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 67.9k • 27

Document VLM Papers

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Paper • 2407.12594 • Published Jul 17, 2024 • 19

Video Language Models

A collection of video-language models

Running

21

Video Llava

🐨

Generate descriptions by uploading images or videos
llava-hf/LLaVA-NeXT-Video-7B-hf

Video-Text-to-Text • 7B • Updated 14 days ago • 70.6k • 105
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf

Video-Text-to-Text • 7B • Updated 14 days ago • 1.74k • 9
llava-hf/LLaVA-NeXT-Video-7B-32K-hf

Image-Text-to-Text • 8B • Updated Feb 23 • 826 • 7

NVEagle

NVEagle/Eagle-X5-13B

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 54 • 15
NVEagle/Eagle-X5-13B-Chat

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 870 • 28
NVEagle/Eagle-X5-7B

Image-Text-to-Text • 9B • Updated Sep 16, 2024 • 1.86k • 26
Running on Zero

64

Eagle X5 13B Chat

🚀

Combine text and images to generate responses

Zero-shot Segmentation

sam-hq-team/SegInW

Updated Jul 13, 2023 • 1
xdecoder/X-Decoder

Updated Dec 27, 2023 • 5
xdecoder/SEEM

Updated Dec 30, 2023 • 8
Sleeping

60

OWLSAM2

🏃

Releases July 25

Wan-AI/Wan2.2-I2V-A14B

Image-to-Video • Updated 8 days ago • 144
allenai/olmOCR-7B-0725

Image-to-Text • 8B • Updated 12 days ago • 1.84k • 39
Wan-AI/Wan2.2-T2V-A14B

Text-to-Video • Updated 8 days ago • 167
Qwen/Qwen3-235B-A22B-Thinking-2507

Text Generation • 235B • Updated 5 days ago • 9.08k • 279

Releases July 11

HuggingFaceTB/SmolLM3-3B

Text Generation • 3B • Updated 8 days ago • 768k • 632
moonshotai/Kimi-K2-Instruct

Text Generation • Updated 8 days ago • 395k • 2.01k
fal/Realism-Detailer-Kontext-Dev-LoRA

Image-to-Image • Updated 29 days ago • 2.39k • 36
Alibaba-NLP/WebSailor-3B

3B • Updated 26 days ago • 688 • 65

Releases June 27

nari-labs/Dia-1.6B-0626

Text-to-Speech • 2B • Updated Jul 3 • 90.9k • 68
google/gemma-3n-E4B-it

Image-Text-to-Text • 8B • Updated 22 days ago • 153k • 681
ByteDance/XVerse

Text-to-Image • Updated Jul 1 • 800 • 88
nvidia/llama-nemoretriever-colembed-3b-v1

Visual Document Retrieval • 4B • Updated 25 days ago • 475 • 36

OCR Models & Datasets

opendatalab/OmniDocBench

Viewer • Updated Feb 11 • 984 • 5.94k • 30
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20 • 176k • 1.46k
echo840/MonkeyOCR

Image-Text-to-Text • Updated 20 days ago • 13.5k • 496
Running on Zero

MCP

116

OCR2

💻

monkey ocr / nanonets ocr / smoldocling / typhoon ocr

Releases June 6

Qwen/Qwen3-Reranker-4B

Text Ranking • 4B • Updated Jun 9 • 33.8k • 79
echo840/MonkeyOCR

Image-Text-to-Text • Updated 20 days ago • 13.5k • 496
openbmb/MiniCPM4-8B

Text Generation • 8B • Updated Jun 17 • 3.46k • 275
arcee-ai/Homunculus

Text Generation • 12B • Updated Jun 3 • 122 • 97

Releases 23 May

ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated Jun 23 • 1.07k • 1.1k
mistralai/Devstral-Small-2505

24B • Updated 8 days ago • 34k • 839
ByteDance/Dolphin

Image-Text-to-Text • 0.4B • Updated 20 days ago • 16.3k • 440
moondream/moondream-2b-2025-04-14-4bit

Image-Text-to-Text • 1B • Updated May 22 • 7.21k • 52

May 9 Releases

tencent/HunyuanCustom

Image-to-Video • Updated Jun 6 • 187
stepfun-ai/Step1X-3D

Updated May 13 • 95
cognition-ai/Kevin-32B

33B • Updated May 6 • 1.45k • 147
ServiceNow-AI/Apriel-Nemotron-15b-Thinker

Text Generation • 15B • Updated May 15 • 3.92k • 91

Releases Apr 21 & May 2

facebook/EdgeTAM

Updated Apr 30 • 9
nvidia/parakeet-tdt-0.6b-v2

Automatic Speech Recognition • 0.6B • Updated Jun 26 • 624k • 1.27k
deepseek-ai/DeepSeek-Prover-V2-671B

Text Generation • 685B • Updated Apr 30 • 1.75k • 805
Qwen/Qwen2.5-Omni-3B

Any-to-Any • 6B • Updated Apr 30 • 198k • 263

April 16 Releases

giskardai/realharm

Viewer • Updated Apr 16 • 136 • 56 • 9
Junfeng5/Liquid_V1_7B

Any-to-Any • 9B • Updated Mar 20 • 1.13k • 96

April 11 Releases

moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated Jun 27 • 99.3k • 433
agentica-org/DeepCoder-14B-Preview

Text Generation • 15B • Updated May 11 • 32.3k • 669
HiDream-ai/HiDream-I1-Full

Text-to-Image • Updated 19 days ago • 253k • 952
OpenGVLab/InternVL3-78B

Image-Text-to-Text • 78B • Updated May 29 • 106k • 211

March 21 Releases

ds4sd/SmolDocling-256M-preview

Image-Text-to-Text • 0.3B • Updated May 16 • 67.1k • 1.52k
sesame/csm-1b

Text-to-Speech • Updated 13 days ago • 28.5k • 2.16k
mistralai/Mistral-Small-3.1-24B-Instruct-2503

24B • Updated 8 days ago • 234k • 1.3k
tencent/Hunyuan3D-2mini

Image-to-3D • Updated 6 days ago • 7.71k • 89

Feb 14 Releases 💌

OpenGVLab/InternVideo2_5_Chat_8B

Video-Text-to-Text • 8B • Updated about 20 hours ago • 23k • 73
AIDC-AI/Ovis2-34B

Image-Text-to-Text • 35B • Updated Feb 27 • 581 • 150
open-r1/OpenR1-Qwen-7B

Text Generation • 8B • Updated May 28 • 938 • 53
nomic-ai/nomic-embed-text-v2-moe

Sentence Similarity • 0.5B • Updated Apr 1 • 153k • 417

January 31 Releases 🧤

allenai/Llama-3.1-Tulu-3-405B

Text Generation • 406B • Updated Feb 10 • 455 • 107
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • 73B • Updated Jun 6 • 603k • 517
mistralai/Mistral-Small-24B-Instruct-2501

24B • Updated 8 days ago • 69k • 934
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1 • 129k • 3.47k

Jan 24 Releases

ostris/Flex.1-alpha

Text-to-Image • Updated Jan 19 • 1.49k • 468
Qwen/Qwen2.5-Math-PRM-72B

Text Classification • 73B • Updated Jan 17 • 864 • 72
HuggingFaceTB/SmolVLM-500M-Instruct

Image-Text-to-Text • 0.5B • Updated Apr 8 • 24.2k • 166
deepseek-ai/DeepSeek-R1

Text Generation • 685B • Updated Mar 27 • 843k • 12.6k

Jan 10 Releases 🌨️

vikhyatk/moondream2

Image-Text-to-Text • 2B • Updated 29 days ago • 381k • 1.23k
DAMO-NLP-SG/multimodal_textbook

Updated Mar 17 • 766 • 145
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Mar 19 • 659 • 25
nvidia/Cosmos-1.0-Autoregressive-4B

Updated Feb 11 • 24 • 53

Nov 29 Releases 🌲🌲

HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8 • 97.5k • 528
Qwen/QwQ-32B-Preview

Text Generation • 33B • Updated Jan 12 • 39.6k • 1.74k
nvidia/Hymba-1.5B-Base

Text Generation • 2B • Updated Jan 2 • 2.74k • 146
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14 • 1.21k • 52

Nov 15 Releases 🍂

microsoft/LLM2CLIP-EVA02-L-14-336

Zero-Shot Image Classification • Updated Nov 22, 2024 • 70 • 58
microsoft/LLM2CLIP-EVA02-B-16

Updated Feb 8 • 11 • 10
PleIAs/common_corpus

Viewer • Updated Jun 10 • 470M • 26k • 304
Qwen/Qwen2.5-Coder-32B-Instruct

Text Generation • 33B • Updated Jan 12 • 72.8k • 1.91k

MIT Talk 31/10 Papers

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17, 2024 • 75
BRAVE: Broadening the visual encoding of vision-language models

Paper • 2404.07204 • Published Apr 10, 2024 • 19
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Paper • 2403.18814 • Published Mar 27, 2024 • 48
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 122

LOTUS 🪷

Runtime error

101

LOTUS Normal

🌍

Generate high-quality predictions from images
Runtime error

75

LOTUS Depth

🚀

Generate depth maps from images and videos
jingheya/lotus-depth-g-v1-0

Depth Estimation • Updated Oct 5, 2024 • 15.2k • 24
jingheya/lotus-depth-d-v1-0

Depth Estimation • Updated Oct 5, 2024 • 356 • 5

BRAVE Models 🦁

Models mentioned in https://huggingface.co/papers/2404.07204

facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 960k • 88
google/flan-t5-xl

3B • Updated Nov 28, 2023 • 296k • 513
google/siglip-large-patch16-384

Zero-Shot Image Classification • 0.7B • Updated Sep 26, 2024 • 21.3k • 9
google/vit-huge-patch14-224-in21k

Image Feature Extraction • 0.6B • Updated Feb 14, 2024 • 29.2k • 21

Image Classification Models 🐶 🐱

facebook/deit-base-distilled-patch16-384

Image Classification • 0.1B • Updated Sep 12, 2023 • 1.42k • 5
facebook/convnextv2-base-1k-224

Image Classification • 0.1B • Updated Feb 17 • 606 • 3
facebook/deit-base-distilled-patch16-224

Image Classification • Updated Jul 13, 2022 • 15.2k • 27
google/vit-base-patch32-384

Image Classification • 0.1B • Updated Sep 11, 2023 • 3.66k • 23

Image Segmentation Models 💜

A collection of instance/semantic/panoptic segmentation models.

facebook/maskformer-swin-large-coco

Image Segmentation • 0.2B • Updated Sep 11, 2023 • 86.2k • 26
nvidia/segformer-b0-finetuned-ade-512-512

Image Segmentation • 0.0B • Updated Jan 14, 2024 • 280k • 163
facebook/detr-resnet-50-dc5-panoptic

Image Segmentation • 0.0B • Updated Sep 11, 2023 • 106 • 3
nvidia/segformer-b5-finetuned-cityscapes-1024-1024

Image Segmentation • Updated Aug 9, 2022 • 143k • 30

Image-to-Image Models 🎨

A collection of image-to-image editing, image enhancement (super-resolution, deblurring, brightening), and text-to-image adapter models.

timbrooks/instruct-pix2pix

Image-to-Image • Updated Jul 5, 2023 • 39.9k • 1.13k
TencentARC/t2i-adapter-canny-sdxl-1.0

Image-to-Image • Updated Sep 7, 2023 • 4.61k • 52
TencentARC/t2i-adapter-sketch-sdxl-1.0

Image-to-Image • Updated Sep 8, 2023 • 5.23k • 76
CrucibleAI/ControlNetMediaPipeFace

Image-to-Image • Updated May 19, 2023 • 960 • 571

Image-to-Text Models 📝

This collection contains image captioning and OCR models.

Salesforce/blip-image-captioning-large

Image-to-Text • 0.5B • Updated Feb 3 • 1.65M • 1.38k
Salesforce/blip-image-captioning-base

Image-to-Text • Updated Feb 3 • 1.96M • 763
microsoft/trocr-base-handwritten

Image-to-Text • 0.3B • Updated Feb 11 • 562k • 424
microsoft/git-large-coco

Image-to-Text • 0.4B • Updated Jun 26, 2023 • 3.19k • 104

Foundation Models for Vision 🧩

Foundation models for computer vision.

Running

85

Grounding DINO Demo

💻

Cutting edge open-vocabulary object detection app
Running

88

Owlv2

👀

State-of-the-art Zero-shot Object Detection
Runtime error

41

BLIP2 with transformers

🌖

BLIP2 (cutting edge image captioning) in 🤗transformers
Runtime error

377

IDEFICS Playground

🐨

OWL-series 🦉

Models and applications of OWL-ViT and OWLv2.

Running

88

Owlv2

👀

State-of-the-art Zero-shot Object Detection
Running on Zero

64

Owl Tracking

⚡

Powerful foundation model for zero-shot object tracking
Running

25

Search and Detect (CLIP/OWL-ViT)

🦉

Search and detect objects in images using text queries
Running on Zero

102

OWLSAM

😻

State-of-the-art open-vocabulary image segmentation ⚡️

Awesome Document AI

A collection of open-source document AI 📄 📝 📈

Runtime error

84

UDOP

🏃

Generate text from document images
Runtime error

40

Pix2struct

📚

Play with all the pix2struct variants in this d
Running

25

Compare Docvqa Models

🦀

Compare different visual question answering
Runtime error

289

DocQuery — Document Query Engine

🦉

Vision Language Models Papers 🖼️💬📝

Papers about vision-language models; the most important ones are at the top of the list.

Improved Baselines with Visual Instruction Tuning

Paper • 2310.03744 • Published Oct 5, 2023 • 38
DeepSeek-VL: Towards Real-World Vision-Language Understanding

Paper • 2403.05525 • Published Mar 8, 2024 • 47
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Paper • 2308.12966 • Published Aug 24, 2023 • 9
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Paper • 2404.01331 • Published Mar 29, 2024 • 28

gv-hf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 134k • 135
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 7.22k • 12
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 45.1k • 25
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 67.9k • 27

Depth Anything v2 Release

A comprehensive collection on DAv2

depth-anything/Depth-Anything-V2-Small

Depth Estimation • Updated Jul 8, 2024 • 9.8k • 69
depth-anything/Depth-Anything-V2-Large

Depth Estimation • Updated Jul 8, 2024 • 91.4k • 111
Running on Zero

495

Depth Anything V2

🌖

Generate depth maps from images
depth-anything/DA-2K

Viewer • Updated Jun 14, 2024 • 1.04k • 679 • 12

Vision Language Leaderboards

This collection has all the vision language leaderboards.

Running

168

Vidore Leaderboard

🥇

Explore visual document retrieval benchmark results
Running on CPU Upgrade

842

Open VLM Leaderboard

🌎

VLMEvalKit Evaluation Results Collection
Running

552

Vision Arena (Testing VLMs side-by-side)

🖼

Analyze images to detect and label objects
Running

85

SEED-Bench Leaderboard

🏆

SAM2

All the models and demos for SAM2

merve/sam2-hiera-tiny

Mask Generation • Updated Aug 2, 2024 • 30
merve/sam2-hiera-small

Mask Generation • Updated Aug 2, 2024 • 60 • 1
merve/sam2-hiera-large

Mask Generation • Updated Aug 2, 2024 • 1.21k • 2
merve/sam2-hiera-base-plus

Mask Generation • Updated Aug 2, 2024 • 46

Multimodal RAG

vidore/colpali-v1.2

Visual Document Retrieval • Updated Mar 14 • 37.5k • 109
Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6 • 570k • 1.22k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Jan 12 • 689k • 434
Qwen/Qwen2-72B-Instruct

Text Generation • 73B • Updated Oct 8, 2024 • 45.6k • 715

Releases August 2

stepfun-ai/step3

Image-Text-to-Text • 321B • Updated 3 days ago • 421 • 125
nunchaku-tech/nunchaku-flux.1-krea-dev

Text-to-Image • Updated 4 days ago • 11.2k • 57
fdtn-ai/Foundation-Sec-8B-Instruct

Text Generation • 8B • Updated about 3 hours ago • 1.12k • 18
Wan-AI/Wan2.2-TI2V-5B-Diffusers

Text-to-Video • Updated 8 days ago • 16.2k • 54

Releases July 25

Wan-AI/Wan2.2-I2V-A14B

Image-to-Video • Updated 8 days ago • • 144
allenai/olmOCR-7B-0725

Image-to-Text • 8B • Updated 12 days ago • 1.84k • 39
Wan-AI/Wan2.2-T2V-A14B

Text-to-Video • Updated 8 days ago • • 167
Qwen/Qwen3-235B-A22B-Thinking-2507

Text Generation • 235B • Updated 5 days ago • 9.08k • • 279

Releases July 18

nvidia/OpenReasoning-Nemotron-32B

Text Generation • 33B • Updated 3 days ago • 3.61k • • 109
ByteDance-Seed/Seed-X-RM-7B

Translation • Updated 5 days ago • 148k • 25
LGAI-EXAONE/EXAONE-4.0-32B

Text Generation • 32B • Updated 1 day ago • 550k • 226
vidore/colqwen-omni-v0.1

Visual Document Retrieval • Updated 19 days ago • 5.81k • 86

Releases July 11

HuggingFaceTB/SmolLM3-3B

Text Generation • 3B • Updated 8 days ago • 768k • • 632
moonshotai/Kimi-K2-Instruct

Text Generation • Updated 8 days ago • 395k • • 2.01k
fal/Realism-Detailer-Kontext-Dev-LoRA

Image-to-Image • Updated 29 days ago • 2.39k • • 36
Alibaba-NLP/WebSailor-3B

3B • Updated 26 days ago • 688 • 65

Releases July 4

apple/DiffuCoder-7B-cpGRPO

8B • Updated Jul 4 • 4.74k • 303
BAAI/MTVCraft

Text-to-Video • Updated 29 days ago • 238 • 35
kyutai/tts-1.6b-en_fr

Text-to-Speech • Updated 28 days ago • 66.8k • 307
apple/DiffuCoder-7B-Base

8B • Updated Jul 4 • 940 • 21

Releases June 27

nari-labs/Dia-1.6B-0626

Text-to-Speech • 2B • Updated Jul 3 • 90.9k • 68
google/gemma-3n-E4B-it

Image-Text-to-Text • 8B • Updated 22 days ago • 153k • 681
ByteDance/XVerse

Text-to-Image • Updated Jul 1 • 800 • 88
nvidia/llama-nemoretriever-colembed-3b-v1

Visual Document Retrieval • 4B • Updated 25 days ago • 475 • 36

June 20 Releases

moonshotai/Kimi-VL-A3B-Thinking-2506

Image-Text-to-Text • 16B • Updated 5 days ago • 42.1k • 250
mistralai/Mistral-Small-3.2-24B-Instruct-2506

24B • Updated 8 days ago • 138k • 390
kyutai/stt-1b-en_fr

Automatic Speech Recognition • Updated Jun 26 • 79
google/magenta-realtime

Updated 21 days ago • 420 • 463

OCR Models & Datasets

opendatalab/OmniDocBench

Viewer • Updated Feb 11 • 984 • 5.94k • 30
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20 • 176k • 1.46k
echo840/MonkeyOCR

Image-Text-to-Text • Updated 20 days ago • 13.5k • 496
Running on Zero

MCP

116

116

OCR2

💻

monkey ocr / nanonets ocr / smoldocling / typhoon ocr

Releases June 13

ByteDance/LatentSync-1.6

Updated Jun 12 • 9.82k • 34
V-JEPA 2

Collection

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated Jun 13 • 153
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20 • 176k • 1.46k
tencent/Hunyuan3D-2.1

Image-to-3D • Updated 6 days ago • 58k • 603

Releases June 6

Qwen/Qwen3-Reranker-4B

Text Ranking • 4B • Updated Jun 9 • 33.8k • 79
echo840/MonkeyOCR

Image-Text-to-Text • Updated 20 days ago • 13.5k • 496
openbmb/MiniCPM4-8B

Text Generation • 8B • Updated Jun 17 • 3.46k • 275
arcee-ai/Homunculus

Text Generation • 12B • Updated Jun 3 • 122 • 97

Releases 30 May

All the releases of the week of 30th May.

deepseek-ai/DeepSeek-R1-0528

Text Generation • 685B • Updated May 29 • 455k • • 2.35k
Running on Zero

200

200

BAGEL

🚀

Demo for BAGEL
tencent/HunyuanPortrait

Image-to-Video • Updated May 27 • 70
XiaomiMiMo/MiMo-7B-RL-0530

Text Generation • 8B • Updated Jun 5 • 7.36k • 38

Releases 23 May

ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated Jun 23 • 1.07k • 1.1k
mistralai/Devstral-Small-2505

24B • Updated 8 days ago • 34k • 839
ByteDance/Dolphin

Image-Text-to-Text • 0.4B • Updated 20 days ago • 16.3k • 440
moondream/moondream-2b-2025-04-14-4bit

Image-Text-to-Text • 1B • Updated May 22 • 7.21k • 52

May 16 Releases

Qwen/WorldPM-72B

Text Classification • 73B • Updated May 17 • 1.09k • 75
Running on Zero

MCP

1.05k

1.05k

LTX Video Fast

🎥

ultra-fast video model, LTX 0.9.8 13B distilled
BLIP3o/BLIP3o-Pretrain-Long-Caption

Viewer • Updated Jun 26 • 27.2M • 21.6k • 41
BLIP3o/BLIP3o-Model-8B

14B • Updated Jun 4 • 1.63k • 101

May 9 Releases

tencent/HunyuanCustom

Image-to-Video • Updated Jun 6 • 187
stepfun-ai/Step1X-3D

Updated May 13 • 95
cognition-ai/Kevin-32B

33B • Updated May 6 • 1.45k • 147
ServiceNow-AI/Apriel-Nemotron-15b-Thinker

Text Generation • 15B • Updated May 15 • 3.92k • 91

Any-to-Any Models, Datasets, Spaces

Running

76

76

MMaDA

🌍

Demo for MMaDA: Multimodal Large Diffusion Language Models
Running on Zero

200

200

BAGEL

🚀

Demo for BAGEL
Gen-Verse/MMaDA-8B-Base

Any-to-Any • 8B • Updated May 24 • 7.65k • 82
ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated Jun 23 • 1.07k • 1.1k

Releases Apr 21 & May 2

facebook/EdgeTAM

Updated Apr 30 • 9
nvidia/parakeet-tdt-0.6b-v2

Automatic Speech Recognition • 0.6B • Updated Jun 26 • 624k • 1.27k
deepseek-ai/DeepSeek-Prover-V2-671B

Text Generation • 685B • Updated Apr 30 • 1.75k • • 805
Qwen/Qwen2.5-Omni-3B

Any-to-Any • 6B • Updated Apr 30 • 198k • 263

InternVL3 HF

OpenGVLab/InternVL3-1B-hf

Image-Text-to-Text • 0.9B • Updated Apr 23 • 42.2k • 5
OpenGVLab/InternVL3-2B-hf

Image-Text-to-Text • 2B • Updated Apr 23 • 21.5k • 2
OpenGVLab/InternVL3-8B-hf

Image-Text-to-Text • 8B • Updated Apr 23 • 37k • 8
OpenGVLab/InternVL3-14B-hf

Image-Text-to-Text • 15B • Updated Apr 23 • 7.62k

April 16 Releases

giskardai/realharm

Viewer • Updated Apr 16 • 136 • 56 • 9
Junfeng5/Liquid_V1_7B

Any-to-Any • 9B • Updated Mar 20 • 1.13k • 96

Multimodal DSE Retrievers

A collection of DSE models for multimodal retrieval

racineai/Flantier-SmolVLM-2B-dse

2B • Updated Jun 18 • 350 • 9
MrLight/dse-qwen2-2b-mrl-v1

Visual Document Retrieval • Updated Feb 26 • 5.99k • 59
marco/mcdse-2b-v1

2B • Updated Oct 29, 2024 • 2.63k • 56
llamaindex/vdr-2b-multi-v1

Image-to-Text • 2B • Updated May 21 • 6.11k • 118

April 11 Releases

moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated Jun 27 • 99.3k • 433
agentica-org/DeepCoder-14B-Preview

Text Generation • 15B • Updated May 11 • 32.3k • 669
HiDream-ai/HiDream-I1-Full

Text-to-Image • Updated 19 days ago • 253k • 952
OpenGVLab/InternVL3-78B

Image-Text-to-Text • 78B • Updated May 29 • 106k • 211

March 28 Releases

deepseek-ai/DeepSeek-V3-0324

Text Generation • 685B • Updated Mar 27 • 459k • 3.02k
Qwen/Qwen2.5-Omni-7B

Any-to-Any • 11B • Updated Apr 30 • 116k • 1.73k
google/txgemma-27b-chat

Text Generation • 27B • Updated Apr 10 • 1.31k • 54
Running

334

Qwen2.5 Omni 7B Demo

🏆

Generate text and speech responses from various inputs

March 21 Releases

ds4sd/SmolDocling-256M-preview

Image-Text-to-Text • 0.3B • Updated May 16 • 67.1k • 1.52k
sesame/csm-1b

Text-to-Speech • Updated 13 days ago • 28.5k • 2.16k
mistralai/Mistral-Small-3.1-24B-Instruct-2503

24B • Updated 8 days ago • 234k • 1.3k
tencent/Hunyuan3D-2mini

Image-to-3D • Updated 6 days ago • 7.71k • 89

Türkçe VLMler (Turkish VLMs)

Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6 • 570k • 1.22k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Jan 12 • 689k • 434
CohereLabs/aya-vision-8b

Image-Text-to-Text • 9B • Updated 5 days ago • 29.4k • 306
CohereLabs/aya-vision-32b

Image-Text-to-Text • 33B • Updated May 14 • 171 • 212

Feb 14 Releases 💌

OpenGVLab/InternVideo2_5_Chat_8B

Video-Text-to-Text • 8B • Updated about 20 hours ago • 23k • 73
AIDC-AI/Ovis2-34B

Image-Text-to-Text • 35B • Updated Feb 27 • 581 • 150
open-r1/OpenR1-Qwen-7B

Text Generation • 8B • Updated May 28 • 938 • 53
nomic-ai/nomic-embed-text-v2-moe

Sentence Similarity • 0.5B • Updated Apr 1 • 153k • 417

Feb 7 Releases 🧣

lerobot/pi0

Robotics • 4B • Updated Mar 6 • 12.3k • 282
kyutai/hibiki-2b-pytorch-bf16

Translation • Updated May 28 • 197 • 55
Alpha-VLLM/Lumina-Image-2.0

Text-to-Image • Updated Mar 30 • 11.5k • 331
adyen/DABstep

Viewer • Updated about 3 hours ago • 490k • 5.67k • 25

January 31 Releases 🧤

allenai/Llama-3.1-Tulu-3-405B

Text Generation • 406B • Updated Feb 10 • 455 • 107
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • 73B • Updated Jun 6 • 603k • 517
mistralai/Mistral-Small-24B-Instruct-2501

24B • Updated 8 days ago • 69k • 934
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1 • 129k • 3.47k

Models, Jan 27

Running on Zero

255

Qwen2-VL-7B

🔥

Generate text by combining an image and a question
Running

57

UI-TARS

🌖

Select coordinates on an image based on instructions
Running

87

Qwen2.5-1M Demo

💻

Upload documents and ask questions
Qwen/Qwen2.5-14B-Instruct-1M

Text Generation • 15B • Updated Jan 29 • 21.1k • 316

Jan 24 Releases

ostris/Flex.1-alpha

Text-to-Image • Updated Jan 19 • 1.49k • 468
Qwen/Qwen2.5-Math-PRM-72B

Text Classification • 73B • Updated Jan 17 • 864 • 72
HuggingFaceTB/SmolVLM-500M-Instruct

Image-Text-to-Text • 0.5B • Updated Apr 8 • 24.2k • 166
deepseek-ai/DeepSeek-R1

Text Generation • 685B • Updated Mar 27 • 843k • 12.6k

Jan 17 Releases ❄️

Models and datasets of the second week of Jan 2025.

openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Jun 20 • 115k • 1.21k
MiniMaxAI/MiniMax-Text-01

Text Generation • 456B • Updated Jul 3 • 1.68k • 638
OuteAI/OuteTTS-0.3-1B

Text-to-Speech • 1B • Updated Apr 24 • 1.2k • 101
NovaSky-AI/Sky-T1_data_17k

Viewer • Updated Jan 14 • 16.4k • 276 • 184

Jan 10 Releases 🌨️

vikhyatk/moondream2

Image-Text-to-Text • 2B • Updated 29 days ago • 381k • 1.23k
DAMO-NLP-SG/multimodal_textbook

Updated Mar 17 • 766 • 145
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Mar 19 • 659 • 25
nvidia/Cosmos-1.0-Autoregressive-4B

Updated Feb 11 • 24 • 53

Dec 6 Releases 🎄

meta-llama/Llama-3.3-70B-Instruct

Text Generation • 71B • Updated Dec 21, 2024 • 386k • 2.46k
Qwen/Qwen2-VL-72B

Image-Text-to-Text • 73B • Updated Dec 6, 2024 • 810 • 79
google/paligemma2-3b-pt-224

Image-Text-to-Text • 3B • Updated Dec 5, 2024 • 177k • 154
tencent/HunyuanVideo

Text-to-Video • Updated Mar 6 • 1.58k • 2k

Nov 29 Releases 🌲🌲

HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8 • 97.5k • 528
Qwen/QwQ-32B-Preview

Text Generation • 33B • Updated Jan 12 • 39.6k • 1.74k
nvidia/Hymba-1.5B-Base

Text Generation • 2B • Updated Jan 2 • 2.74k • 146
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14 • 1.21k • 52

Nov 22 Releases ❄️

mistralai/Pixtral-Large-Instruct-2411

Updated 8 days ago • 81 • 418
microsoft/orca-agentinstruct-1M-v1

Viewer • Updated Nov 1, 2024 • 1.05M • 1.27k • 447
Xkev/Llama-3.2V-11B-cot

Image-Text-to-Text • 11B • Updated Dec 16, 2024 • 3.93k • 153
jinaai/jina-clip-v2

Feature Extraction • 0.9B • Updated Apr 28 • 51k • 268

Nov 15 Releases 🍂

microsoft/LLM2CLIP-EVA02-L-14-336

Zero-Shot Image Classification • Updated Nov 22, 2024 • 70 • 58
microsoft/LLM2CLIP-EVA02-B-16

Updated Feb 8 • 11 • 10
PleIAs/common_corpus

Viewer • Updated Jun 10 • 470M • 26k • 304
Qwen/Qwen2.5-Coder-32B-Instruct

Text Generation • 33B • Updated Jan 12 • 72.8k • 1.91k

Nov 1 Releases

Running on Zero

84

LongVU

🌖

Generate responses to video or image inputs
facebook/MobileLLM-1B

Text Generation • Updated May 5 • 236 • 120
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • 8B • Updated Feb 28 • 185 • 72
Vision-CAIR/LongVU_Llama3_2_3B_img

Updated Feb 28 • 3 • 6

MIT Talk 31/10 Papers

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17, 2024 • 75
BRAVE: Broadening the visual encoding of vision-language models

Paper • 2404.07204 • Published Apr 10, 2024 • 19
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Paper • 2403.18814 • Published Mar 27, 2024 • 48
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 122

October 25 Releases

ibm-granite/granite-3.0-8b-instruct

Text Generation • 8B • Updated Dec 19, 2024 • 23.4k • 202
ibm-granite/granite-3.0-2b-instruct

Text Generation • 3B • Updated Dec 19, 2024 • 4.25k • 46
CohereLabs/aya-expanse-8b

Text Generation • 8B • Updated 5 days ago • 14.6k • 390
CohereLabs/aya-expanse-32b

Text Generation • 32B • Updated 5 days ago • 6.54k • 266

LOTUS 🪷

Runtime error

101

LOTUS Normal

🌍

Generate high-quality predictions from images
Runtime error

75

LOTUS Depth

🚀

Generate depth maps from images and videos
jingheya/lotus-depth-g-v1-0

Depth Estimation • Updated Oct 5, 2024 • 15.2k • 24
jingheya/lotus-depth-d-v1-0

Depth Estimation • Updated Oct 5, 2024 • 356 • 5

New Depth Models

Recent depth models

Running on Zero

177

DepthCrafter

🦀

a super consistent video depth model
Paused

222

Depth Pro

🚀

Generate an inverse depth map from an image
Runtime error

75

LOTUS Depth

🚀

Generate depth maps from images and videos
apple/DepthPro

Depth Estimation • Updated Feb 28 • 7.33k • 458

BRAVE Models 🦁

Models mentioned in https://huggingface.co/papers/2404.07204

facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 960k • 88
google/flan-t5-xl

3B • Updated Nov 28, 2023 • 296k • 513
google/siglip-large-patch16-384

Zero-Shot Image Classification • 0.7B • Updated Sep 26, 2024 • 21.3k • 9
google/vit-huge-patch14-224-in21k

Image Feature Extraction • 0.6B • Updated Feb 14, 2024 • 29.2k • 21

Computer Vision Backbones 🧩

A collection of useful computer vision backbones to fine-tune. It also includes large image classification models that can be used as backbones.

microsoft/resnet-50

Image Classification • 0.0B • Updated Feb 13, 2024 • 151k • 431
google/vit-base-patch16-224-in21k

Image Feature Extraction • 0.1B • Updated Feb 5, 2024 • 3.13M • 360
google/vit-base-patch32-224-in21k

Image Feature Extraction • 0.1B • Updated Dec 8, 2022 • 7.87k • 19
facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 960k • 88

Image Classification Models 🐶 🐱

facebook/deit-base-distilled-patch16-384

Image Classification • 0.1B • Updated Sep 12, 2023 • 1.42k • 5
facebook/convnextv2-base-1k-224

Image Classification • 0.1B • Updated Feb 17 • 606 • 3
facebook/deit-base-distilled-patch16-224

Image Classification • Updated Jul 13, 2022 • 15.2k • 27
google/vit-base-patch32-384

Image Classification • 0.1B • Updated Sep 11, 2023 • 3.66k • 23

Object Detection Models 🥥

facebook/detr-resnet-50

Object Detection • 0.0B • Updated Apr 10, 2024 • 412k • 879
facebook/detr-resnet-101-dc5

Object Detection • 0.1B • Updated Sep 6, 2023 • 5.86k • 19
facebook/detr-resnet-50-dc5

Object Detection • 0.0B • Updated Sep 7, 2023 • 1.65k • 6
google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 134k • 135

Image Segmentation Models 💜

A collection of instance/semantic/panoptic segmentation models.

facebook/maskformer-swin-large-coco

Image Segmentation • 0.2B • Updated Sep 11, 2023 • 86.2k • 26
nvidia/segformer-b0-finetuned-ade-512-512

Image Segmentation • 0.0B • Updated Jan 14, 2024 • 280k • 163
facebook/detr-resnet-50-dc5-panoptic

Image Segmentation • 0.0B • Updated Sep 11, 2023 • 106 • 3
nvidia/segformer-b5-finetuned-cityscapes-1024-1024

Image Segmentation • Updated Aug 9, 2022 • 143k • 30

Zero-shot Image Classification Models 🖼️

This is a collection of models that can be used for zero-shot image classification.

openai/clip-vit-large-patch14

Zero-Shot Image Classification • 0.4B • Updated Sep 15, 2023 • 11.2M • 1.82k
openai/clip-vit-base-patch32

Zero-Shot Image Classification • Updated Feb 29, 2024 • 17.4M • 732
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Zero-Shot Image Classification • Updated Jan 22 • 1.17M • 285
kakaobrain/align-base

Zero-Shot Image Classification • Updated Mar 8, 2023 • 31.2k • 26

Image-to-Image Models 🎨

A collection of image-to-image editing, image enhancement (super-resolution, deblurring, brightening), and text-to-image adapter models.

timbrooks/instruct-pix2pix

Image-to-Image • Updated Jul 5, 2023 • 39.9k • 1.13k
TencentARC/t2i-adapter-canny-sdxl-1.0

Image-to-Image • Updated Sep 7, 2023 • 4.61k • 52
TencentARC/t2i-adapter-sketch-sdxl-1.0

Image-to-Image • Updated Sep 8, 2023 • 5.23k • 76
CrucibleAI/ControlNetMediaPipeFace

Image-to-Image • Updated May 19, 2023 • 960 • 571

Video Classification Models 📺

microsoft/xclip-base-patch32

Video Classification • 0.2B • Updated Feb 4, 2024 • 251k • 96
facebook/timesformer-base-finetuned-k400

Video Classification • Updated Jan 2, 2023 • 22.1k • 42
facebook/timesformer-base-finetuned-k600

Video Classification • Updated Dec 12, 2022 • 10.1k • 12
google/vivit-b-16x2

Video Classification • Updated Aug 3, 2023 • 459 • 11

Image-to-Text Models 📝

This collection contains image captioning and OCR models.

Salesforce/blip-image-captioning-large

Image-to-Text • 0.5B • Updated Feb 3 • 1.65M • 1.38k
Salesforce/blip-image-captioning-base

Image-to-Text • Updated Feb 3 • 1.96M • 763
microsoft/trocr-base-handwritten

Image-to-Text • 0.3B • Updated Feb 11 • 562k • 424
microsoft/git-large-coco

Image-to-Text • 0.4B • Updated Jun 26, 2023 • 3.19k • 104

Text-to-Image Models 🥑

stabilityai/stable-diffusion-xl-base-1.0

Text-to-Image • Updated Oct 30, 2023 • 2.46M • 6.8k
warp-ai/wuerstchen

Text-to-Image • Updated Mar 12, 2024 • 457 • 174
Deci/DeciDiffusion-v1-0

Text-to-Image • Updated Feb 15, 2024 • 10 • 138
stabilityai/stable-diffusion-xl-refiner-1.0

Image-to-Image • Updated Sep 25, 2023 • 488k • 1.94k

Foundation Models for Vision 🧩

Foundation models for computer vision.

Running

85

Grounding DINO Demo

💻

Cutting-edge open-vocabulary object detection app
Running

88

Owlv2

👀

State-of-the-art Zero-shot Object Detection
Runtime error

41

BLIP2 with transformers

🌖

BLIP2 (cutting-edge image captioning) in 🤗 transformers
Runtime error

377

IDEFICS Playground

🐨

Segment Anything Model

This collection contains models and demos of SAM and its smaller friends.

facebook/sam-vit-huge

Mask Generation • 0.6B • Updated Jan 11, 2024 • 135k • 174
facebook/sam-vit-base

Mask Generation • 0.1B • Updated Jan 11, 2024 • 274k • 144
facebook/sam-vit-large

Mask Generation • 0.3B • Updated Jan 11, 2024 • 247k • 28
Runtime error

43

Grounded SAM

💩

OWL-series 🦉

Models and applications of OWL-ViT and OWLv2.

Running

88

Owlv2

👀

State-of-the-art Zero-shot Object Detection
Running on Zero

64

Owl Tracking

⚡

Powerful foundation model for zero-shot object tracking
Running

25

Search and Detect (CLIP/OWL-ViT)

🦉

Search and detect objects in images using text queries
Running on Zero

102

OWLSAM

😻

State-of-the-art open-vocabulary image segmentation ⚡️

SigLIP

A collection dedicated to SigLIP applications

Running on Zero

71

Draw To Search Art

🐠

Draw/upload image and search among WikiART using SigLIP
Running on CPU Upgrade

22

Compare Clip Siglip

🏃

Compare strong zero-shot image classification models
Running on Zero

13

Multilingual Zero Shot Image Clf

🏢

Comparing powerful multilingual zero-shot image clf models
BAAI/bunny-phi-2-siglip-lora

Text Generation • Updated Mar 28, 2024 • 21 • 48

Awesome Document AI

A collection of open-source document AI 📄 📝 📈

Runtime error

84

UDOP

🏃

Generate text from document images
Runtime error

40

Pix2struct

📚

Play with all the pix2struct variants in this d
Running

25

Compare Docvqa Models

🦀

Compare different visual question answering
Runtime error

289

DocQuery — Document Query Engine

🦉

SegGPT

A collection of everything SegGPT.

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Paper • 2212.02499 • Published Dec 5, 2022
SegGPT: Segmenting Everything In Context

Paper • 2304.03284 • Published Apr 6, 2023 • 1
BAAI/seggpt-vit-large

0.4B • Updated Feb 22, 2024 • 2.4k • 4
BAAI/SegGPT

Updated Apr 21, 2023 • 19

Vision Language Models Papers 🖼️💬📝

Papers about vision-language models; the most important ones are at the top of the list.

Improved Baselines with Visual Instruction Tuning

Paper • 2310.03744 • Published Oct 5, 2023 • 38
DeepSeek-VL: Towards Real-World Vision-Language Understanding

Paper • 2403.05525 • Published Mar 8, 2024 • 47
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Paper • 2308.12966 • Published Aug 24, 2023 • 9
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Paper • 2404.01331 • Published Mar 29, 2024 • 28

gvhf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 134k • 135
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 7.22k • 12
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 45.1k • 25
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 67.9k • 27

gv-hf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 134k • 135
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 7.22k • 12
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 45.1k • 25
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 67.9k • 27

merve/owl2

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 134k • 135
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 7.22k • 12
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 45.1k • 25
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 67.9k • 27

Depth Anything v2 Release

A comprehensive collection on DAv2

depth-anything/Depth-Anything-V2-Small

Depth Estimation • Updated Jul 8, 2024 • 9.8k • 69
depth-anything/Depth-Anything-V2-Large

Depth Estimation • Updated Jul 8, 2024 • 91.4k • 111
Running on Zero

495

Depth Anything V2

🌖

Generate depth maps from images
depth-anything/DA-2K

Viewer • Updated Jun 14, 2024 • 1.04k • 679 • 12

Document VLM Papers

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Paper • 2407.12594 • Published Jul 17, 2024 • 19

Vision Language Leaderboards

This collection has all the vision language leaderboards.

Running

168

Vidore Leaderboard

🥇

Explore visual document retrieval benchmark results
Running on CPU Upgrade

842

Open VLM Leaderboard

🌎

VLMEvalKit Evaluation Results Collection
Running

552

Vision Arena (Testing VLMs side-by-side)

🖼

Analyze images to detect and label objects
Running

85

SEED-Bench Leaderboard

🏆

Video Language Models

A collection of video-language models

Running

21

Video Llava

🐨

Generate descriptions by uploading images or videos
llava-hf/LLaVA-NeXT-Video-7B-hf

Video-Text-to-Text • 7B • Updated 14 days ago • 70.6k • 105
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf

Video-Text-to-Text • 7B • Updated 14 days ago • 1.74k • 9
llava-hf/LLaVA-NeXT-Video-7B-32K-hf

Image-Text-to-Text • 8B • Updated Feb 23 • 826 • 7

SAM2

All the models and demos for SAM2

merve/sam2-hiera-tiny

Mask Generation • Updated Aug 2, 2024 • 30
merve/sam2-hiera-small

Mask Generation • Updated Aug 2, 2024 • 60 • 1
merve/sam2-hiera-large

Mask Generation • Updated Aug 2, 2024 • 1.21k • 2
merve/sam2-hiera-base-plus

Mask Generation • Updated Aug 2, 2024 • 46

NVEagle

NVEagle/Eagle-X5-13B

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 54 • 15
NVEagle/Eagle-X5-13B-Chat

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 870 • 28
NVEagle/Eagle-X5-7B

Image-Text-to-Text • 9B • Updated Sep 16, 2024 • 1.86k • 26
Running on Zero

64

Eagle X5 13B Chat

🚀

Combine text and images to generate responses

Multimodal RAG

vidore/colpali-v1.2

Visual Document Retrieval • Updated Mar 14 • 37.5k • 109
Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6 • 570k • 1.22k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Jan 12 • 689k • 434
Qwen/Qwen2-72B-Instruct

Text Generation • 73B • Updated Oct 8, 2024 • 45.6k • 715

Zero-shot Segmentation

sam-hq-team/SegInW

Updated Jul 13, 2023 • 1
xdecoder/X-Decoder

Updated Dec 27, 2023 • 5
xdecoder/SEEM

Updated Dec 30, 2023 • 8
Sleeping

60

OWLSAM2

🏃
