Collections
Collections including paper arxiv:2503.21776

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
  Paper • 2501.02955 • Published • 44
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 106
- MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
  Paper • 2501.12380 • Published • 85
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
  Paper • 2501.09781 • Published • 28

- Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
  Paper • 2503.24290 • Published • 62
- I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
  Paper • 2503.18878 • Published • 120
- START: Self-taught Reasoner with Tools
  Paper • 2503.04625 • Published • 113
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
  Paper • 2503.14476 • Published • 141

- InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
  Paper • 2502.11573 • Published • 9
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
  Paper • 2502.02339 • Published • 22
- video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
  Paper • 2502.11775 • Published • 9
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
  Paper • 2412.18319 • Published • 39

- MLLM-as-a-Judge for Image Safety without Human Labeling
  Paper • 2501.00192 • Published • 31
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 106
- Xmodel-2 Technical Report
  Paper • 2412.19638 • Published • 26
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  Paper • 2412.18925 • Published • 104

- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
  Paper • 2503.21380 • Published • 38
- Video-R1: Reinforcing Video Reasoning in MLLMs
  Paper • 2503.21776 • Published • 79
- Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
  Paper • 2503.21696 • Published • 23
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
  Paper • 2504.10449 • Published • 15

- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement
  Paper • 2503.17352 • Published • 24
- When Less is Enough: Adaptive Token Reduction for Efficient Image Representation
  Paper • 2503.16660 • Published • 72
- CoMP: Continual Multimodal Pre-training for Vision Foundation Models
  Paper • 2503.18931 • Published • 30
- MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
  Paper • 2503.13964 • Published • 20

- Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
  Paper • 2502.08590 • Published • 43
- Distillation Scaling Laws
  Paper • 2502.08606 • Published • 48
- Soundwave: Less is More for Speech-Text Alignment in LLMs
  Paper • 2502.12900 • Published • 85
- Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
  Paper • 2503.09419 • Published • 6