Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2504.17789

about 8 hours ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 13
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 45
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 24

Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Paper • 2504.17789 • Published Apr 24 • 23
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Paper • 2505.09568 • Published May 14 • 97

Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Paper • 2504.17789 • Published Apr 24 • 23

Auto-regressive

Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Paper • 2502.20388 • Published Feb 27 • 16
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Paper • 2504.17789 • Published Apr 24 • 23

How to Synthesize Text Data without Model Collapse?

Paper • 2412.14689 • Published Dec 19, 2024 • 53
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

Paper • 2412.12094 • Published Dec 16, 2024 • 11
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Paper • 2306.07691 • Published Jun 13, 2023 • 11
iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

Paper • 2203.02395 • Published Mar 4, 2022 • 1

Image Generation

OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation

Paper • 2506.07977 • Published Jun 9 • 41
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Paper • 2506.07986 • Published Jun 9 • 19
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Paper • 2506.06276 • Published Jun 6 • 22
Aligning Latent Spaces with Flow Priors

Paper • 2506.05240 • Published Jun 5 • 27

generation-token

Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Paper • 2504.17789 • Published Apr 24 • 23

CoRAG: Collaborative Retrieval-Augmented Generation

Paper • 2504.01883 • Published Apr 2 • 10
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Paper • 2504.08837 • Published Apr 10 • 43
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

Paper • 2504.10068 • Published Apr 14 • 30
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Paper • 2504.10481 • Published Apr 14 • 84

paper maybe useful

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Paper • 2502.08590 • Published Feb 12 • 44
Distillation Scaling Laws

Paper • 2502.08606 • Published Feb 12 • 49
Soundwave: Less is More for Speech-Text Alignment in LLMs

Paper • 2502.12900 • Published Feb 18 • 86
Alias-Free Latent Diffusion Models:Improving Fractional Shift Equivariance of Diffusion Latent Space

Paper • 2503.09419 • Published Mar 12 • 6

General Multimodal Learning

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

Paper • 2401.14405 • Published Jan 25, 2024 • 13
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Paper • 2406.18521 • Published Jun 26, 2024 • 30
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Paper • 2408.12590 • Published Aug 22, 2024 • 37
Law of Vision Representation in MLLMs

Paper • 2408.16357 • Published Aug 29, 2024 • 96

about 8 hours ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 13
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 45
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 24

Image Generation

OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation

Paper • 2506.07977 • Published Jun 9 • 41
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Paper • 2506.07986 • Published Jun 9 • 19
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Paper • 2506.06276 • Published Jun 6 • 22
Aligning Latent Spaces with Flow Priors

Paper • 2506.05240 • Published Jun 5 • 27

Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Paper • 2504.17789 • Published Apr 24 • 23
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Paper • 2505.09568 • Published May 14 • 97

generation-token

Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Paper • 2504.17789 • Published Apr 24 • 23

Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Paper • 2504.17789 • Published Apr 24 • 23

CoRAG: Collaborative Retrieval-Augmented Generation

Paper • 2504.01883 • Published Apr 2 • 10
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Paper • 2504.08837 • Published Apr 10 • 43
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

Paper • 2504.10068 • Published Apr 14 • 30
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Paper • 2504.10481 • Published Apr 14 • 84

Auto-regressive

Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Paper • 2502.20388 • Published Feb 27 • 16
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Paper • 2504.17789 • Published Apr 24 • 23

paper maybe useful

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

Paper • 2502.08590 • Published Feb 12 • 44
Distillation Scaling Laws

Paper • 2502.08606 • Published Feb 12 • 49
Soundwave: Less is More for Speech-Text Alignment in LLMs

Paper • 2502.12900 • Published Feb 18 • 86
Alias-Free Latent Diffusion Models:Improving Fractional Shift Equivariance of Diffusion Latent Space

Paper • 2503.09419 • Published Mar 12 • 6

How to Synthesize Text Data without Model Collapse?

Paper • 2412.14689 • Published Dec 19, 2024 • 53
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

Paper • 2412.12094 • Published Dec 16, 2024 • 11
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Paper • 2306.07691 • Published Jun 13, 2023 • 11
iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

Paper • 2203.02395 • Published Mar 4, 2022 • 1

General Multimodal Learning

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

Paper • 2401.14405 • Published Jan 25, 2024 • 13
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Paper • 2406.18521 • Published Jun 26, 2024 • 30
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Paper • 2408.12590 • Published Aug 22, 2024 • 37
Law of Vision Representation in MLLMs

Paper • 2408.16357 • Published Aug 29, 2024 • 96

Previous
1
2
Next

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs