-
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 31 -
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 106 -
Xmodel-2 Technical Report
Paper • 2412.19638 • Published • 26 -
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 104
Collections
Discover the best community collections!
Collections including paper arxiv:2504.05741
-
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
Paper • 2506.07977 • Published • 41 -
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
Paper • 2506.07986 • Published • 19 -
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
Paper • 2506.06276 • Published • 22 -
Aligning Latent Spaces with Flow Priors
Paper • 2506.05240 • Published • 27
-
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Paper • 2504.16064 • Published • 14 -
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Paper • 2504.14032 • Published • 4 -
Towards Understanding Camera Motions in Any Video
Paper • 2504.15376 • Published • 158 -
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Paper • 2504.17192 • Published • 116
-
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper • 2412.09871 • Published • 108 -
Causal Diffusion Transformers for Generative Modeling
Paper • 2412.12095 • Published • 23 -
Tensor Product Attention Is All You Need
Paper • 2501.06425 • Published • 88 -
TransMLA: Multi-head Latent Attention Is All You Need
Paper • 2502.07864 • Published • 56
-
Depth Anything V2
Paper • 2406.09414 • Published • 103 -
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 51 -
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
Paper • 2406.04338 • Published • 39 -
SAM 2: Segment Anything in Images and Videos
Paper • 2408.00714 • Published • 116
-
DDT: Decoupled Diffusion Transformer
Paper • 2504.05741 • Published • 76 -
DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning
Paper • 2504.14509 • Published • 50 -
Flow-GRPO: Training Flow Matching Models via Online RL
Paper • 2505.05470 • Published • 83 -
Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
Paper • 2508.04825 • Published • 58
-
LLM Pruning and Distillation in Practice: The Minitron Approach
Paper • 2408.11796 • Published • 57 -
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
Paper • 2408.09174 • Published • 52 -
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Paper • 2408.10914 • Published • 43 -
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Paper • 2408.11878 • Published • 63
-
FLAME: Factuality-Aware Alignment for Large Language Models
Paper • 2405.01525 • Published • 28 -
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
Paper • 2405.14333 • Published • 41 -
Transformers Can Do Arithmetic with the Right Embeddings
Paper • 2405.17399 • Published • 54 -
EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture
Paper • 2405.18991 • Published • 12
-
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 31 -
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 106 -
Xmodel-2 Technical Report
Paper • 2412.19638 • Published • 26 -
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 104
-
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
Paper • 2506.07977 • Published • 41 -
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
Paper • 2506.07986 • Published • 19 -
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
Paper • 2506.06276 • Published • 22 -
Aligning Latent Spaces with Flow Priors
Paper • 2506.05240 • Published • 27
-
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Paper • 2504.16064 • Published • 14 -
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Paper • 2504.14032 • Published • 4 -
Towards Understanding Camera Motions in Any Video
Paper • 2504.15376 • Published • 158 -
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Paper • 2504.17192 • Published • 116
-
DDT: Decoupled Diffusion Transformer
Paper • 2504.05741 • Published • 76 -
DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning
Paper • 2504.14509 • Published • 50 -
Flow-GRPO: Training Flow Matching Models via Online RL
Paper • 2505.05470 • Published • 83 -
Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
Paper • 2508.04825 • Published • 58
-
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper • 2412.09871 • Published • 108 -
Causal Diffusion Transformers for Generative Modeling
Paper • 2412.12095 • Published • 23 -
Tensor Product Attention Is All You Need
Paper • 2501.06425 • Published • 88 -
TransMLA: Multi-head Latent Attention Is All You Need
Paper • 2502.07864 • Published • 56
-
LLM Pruning and Distillation in Practice: The Minitron Approach
Paper • 2408.11796 • Published • 57 -
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
Paper • 2408.09174 • Published • 52 -
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Paper • 2408.10914 • Published • 43 -
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Paper • 2408.11878 • Published • 63
-
Depth Anything V2
Paper • 2406.09414 • Published • 103 -
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 51 -
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
Paper • 2406.04338 • Published • 39 -
SAM 2: Segment Anything in Images and Videos
Paper • 2408.00714 • Published • 116
-
FLAME: Factuality-Aware Alignment for Large Language Models
Paper • 2405.01525 • Published • 28 -
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
Paper • 2405.14333 • Published • 41 -
Transformers Can Do Arithmetic with the Right Embeddings
Paper • 2405.17399 • Published • 54 -
EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture
Paper • 2405.18991 • Published • 12