Lanorman (Lavico)
AI & ML interests: None yet
Organizations: None yet
Autoregressive
Benchmark
camera_object_control_gen
Scene_Gen
Diffusion
4D_Diffusion
layout
- Generating Compositional Scenes via Text-to-image RGBA Instance Generation
  Paper • 2411.10913 • Published • 4
- ROICtrl: Boosting Instance Control for Visual Generation
  Paper • 2411.17949 • Published • 87
- Pathways on the Image Manifold: Image Editing via Video Generation
  Paper • 2411.16819 • Published • 37
Diffusion_GS
Virtual Try-On
- Fashion-VDM: Video Diffusion Model for Virtual Try-On
  Paper • 2411.00225 • Published • 11
- GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
  Paper • 2411.03047 • Published • 9
- FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on
  Paper • 2411.10499 • Published • 13
- TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
  Paper • 2411.18350 • Published • 29
Motion_Gen
Dynamic_GS
- SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes
  Paper • 2410.17249 • Published • 42
- VeGaS: Video Gaussian Splatting
  Paper • 2411.11024 • Published • 7
- MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
  Paper • 2501.03714 • Published • 9
Video-Gen Training-Free
Relighting
- Subsurface Scattering for 3D Gaussian Splatting
  Paper • 2408.12282 • Published • 7
- Instant Facial Gaussians Translator for Relightable and Interactable Facial Rendering
  Paper • 2409.07441 • Published • 11
- GS^3: Efficient Relighting with Triple Gaussian Splatting
  Paper • 2410.11419 • Published • 12
- SpotLight: Shadow-Guided Object Relighting via Diffusion
  Paper • 2411.18665 • Published • 3
NeRF
- ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models
  Paper • 2406.06133 • Published • 12
- Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View Images
  Paper • 2406.13393 • Published • 5
- BoostMVSNeRFs: Boosting MVS-based NeRFs to Generalizable View Synthesis in Large-scale Scenes
  Paper • 2407.15848 • Published • 17
- NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields
  Paper • 2404.01300 • Published • 4
3D-Edit
- DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation
  Paper • 2407.11394 • Published • 12
- Jointly Generating Multi-view Consistent PBR Textures using Collaborative Control
  Paper • 2410.06985 • Published • 5
- ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion
  Paper • 2410.08168 • Published • 9
- MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
  Paper • 2411.02336 • Published • 24
Text-to-3D
- YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals
  Paper • 2406.16273 • Published • 43
- Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images
  Paper • 2407.06191 • Published • 14
- PlacidDreamer: Advancing Harmony in Text-to-3D Generation
  Paper • 2407.13976 • Published • 5
- MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation
  Paper • 2411.17945 • Published • 27
Image-Gen Personalization
- pOps: Photo-Inspired Diffusion Operators
  Paper • 2406.01300 • Published • 18
- AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
  Paper • 2406.06911 • Published • 12
- Interpreting the Weight Space of Customized Diffusion Models
  Paper • 2406.09413 • Published • 20
- EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts
  Paper • 2406.09162 • Published • 14
Image-Gen Lightweight
- LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models
  Paper • 2405.14477 • Published • 20
- Phased Consistency Model
  Paper • 2405.18407 • Published • 48
- SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
  Paper • 2411.05007 • Published • 22
- SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
  Paper • 2412.09619 • Published • 27
Image-Gen Edit
- Zero-shot Image Editing with Reference Imitation
  Paper • 2406.07547 • Published • 33
- The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing
  Paper • 2406.10601 • Published • 70
- UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
  Paper • 2407.05282 • Published • 15
- Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
  Paper • 2407.16982 • Published • 42
Image-Gen Text
- FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation
  Paper • 2406.08392 • Published • 21
- Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering
  Paper • 2406.10208 • Published • 22
- AMO Sampler: Enhancing Text Rendering with Overshooting
  Paper • 2411.19415 • Published • 5
Image-Gen StyleInject
- Magic Insert: Style-Aware Drag-and-Drop
  Paper • 2407.02489 • Published • 22
- ZePo: Zero-Shot Portrait Stylization with Faster Sampling
  Paper • 2408.05492 • Published • 7
- CSGO: Content-Style Composition in Text-to-Image Generation
  Paper • 2408.16766 • Published • 18
- Style-Friendly SNR Sampler for Style-Driven Generation
  Paper • 2411.14793 • Published • 39
Video-Gen Edit
- I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models
  Paper • 2405.16537 • Published • 17
- ReVideo: Remake a Video with Motion and Content Control
  Paper • 2405.13865 • Published • 25
- FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models
  Paper • 2406.16863 • Published • 11
- Portrait Video Editing Empowered by Multimodal Generative Priors
  Paper • 2409.13591 • Published • 17
Video-Gen Customization
- Still-Moving: Customized Video Generation without Customized Video Data
  Paper • 2407.08674 • Published • 13
- CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
  Paper • 2408.13239 • Published • 12
- MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling
  Paper • 2409.16160 • Published • 33
- Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
  Paper • 2409.17280 • Published • 11
Video-Gen Benchmark
Video-Gen LLM-based
Video-Gen-GS
- Animate3D: Animating Any 3D Model with Multi-view Video Diffusion
  Paper • 2407.11398 • Published • 10
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control
  Paper • 2407.12781 • Published • 13
- V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians
  Paper • 2409.13648 • Published • 12
- DressRecon: Freeform 4D Human Reconstruction from Monocular Video
  Paper • 2409.20563 • Published • 9
Video-Gen Dataset
Image-Captioning
SAM_based
- Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning
  Paper • 2408.07931 • Published • 22
- SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
  Paper • 2408.16768 • Published • 28
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
  Paper • 2410.16268 • Published • 69
RL
- Lessons from Learning to Spin "Pens"
  Paper • 2407.18902 • Published • 21
- Diffusion Models Are Real-Time Game Engines
  Paper • 2408.14837 • Published • 126
- WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents
  Paper • 2410.07484 • Published • 51
- Autonomous Character-Scene Interaction Synthesis from Text Instruction
  Paper • 2410.03187 • Published • 7
Dataset
- VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads
  Paper • 2407.18245 • Published • 11
- AIM 2024 Sparse Neural Rendering Challenge: Dataset and Benchmark
  Paper • 2409.15041 • Published • 14
- LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
  Paper • 2410.09732 • Published • 55
- MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
  Paper • 2410.13754 • Published • 75
Video-Gen Long
- Training-free Long Video Generation with Chain of Diffusion Model Experts
  Paper • 2408.13423 • Published • 23
- Loong: Generating Minute-level Long Videos with Autoregressive Language Models
  Paper • 2410.02757 • Published • 36
- MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
  Paper • 2411.13807 • Published • 11
- TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
  Paper • 2411.18671 • Published • 20
Omni-Generation
- OmniGen: Unified Image Generation
  Paper • 2409.11340 • Published • 115
- Video-Guided Foley Sound Generation with Multimodal Controls
  Paper • 2411.17698 • Published • 10
- FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
  Paper • 2412.01064 • Published • 46
- OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
  Paper • 2412.01169 • Published • 13
GAN
VLM
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
  Paper • 2501.03895 • Published • 52
- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
  Paper • 2501.04001 • Published • 46
- Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
  Paper • 2501.04003 • Published • 27
- VideoRAG: Retrieval-Augmented Generation over Video Corpus
  Paper • 2501.05874 • Published • 73
3D_GEN_lightweight
Light Memory Diffusion
Pose_Estimation
Animating
- Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters
  Paper • 2411.18197 • Published • 14
- SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
  Paper • 2412.00174 • Published • 23
- One Shot, One Talk: Whole-body Talking Avatar from a Single Image
  Paper • 2412.01106 • Published • 23
- DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
  Paper • 2412.09349 • Published • 8
Accelerator
Medical
FLUX_related
- Training-free Regional Prompting for Diffusion Transformers
  Paper • 2411.02395 • Published • 25
- Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
  Paper • 2411.06558 • Published • 36
- FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
  Paper • 2412.09611 • Published • 10
- 3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering
  Paper • 2501.05131 • Published • 37
Image_Restoration
Dynamic_Gen
Video-Gen Pose
Vision Task
- An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
  Paper • 2406.09415 • Published • 51
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
  Paper • 2406.09406 • Published • 15
- VideoGUI: A Benchmark for GUI Automation from Instructional Videos
  Paper • 2406.10227 • Published • 9
- What If We Recaption Billions of Web Images with LLaMA-3?
  Paper • 2406.08478 • Published • 41
3D
- GECO: Generative Image-to-3D within a SECOnd
  Paper • 2405.20327 • Published • 11
- Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
  Paper • 2406.03184 • Published • 22
- NPGA: Neural Parametric Gaussian Avatars
  Paper • 2405.19331 • Published • 10
- Unified Text-to-Image Generation and Retrieval
  Paper • 2406.05814 • Published • 16
GS
- GFlow: Recovering 4D World from Monocular Video
  Paper • 2405.18426 • Published • 17
- 3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting
  Paper • 2405.18424 • Published • 9
- HDR-GS: Efficient High Dynamic Range Novel View Synthesis at 1000x Speed via Gaussian Splatting
  Paper • 2405.15125 • Published • 8
- PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting
  Paper • 2405.19957 • Published • 10
3D_Diffusion
- SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views
  Paper • 2408.10195 • Published • 13
- MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing
  Paper • 2408.08000 • Published • 9
- DC3DO: Diffusion Classifier for 3D Objects
  Paper • 2408.06693 • Published • 11
- MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement
  Paper • 2408.14211 • Published • 11
Image-Gen Theoretical
- Guiding a Diffusion Model with a Bad Version of Itself
  Paper • 2406.02507 • Published • 17
- Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
  Paper • 2406.04314 • Published • 30
- An Image is Worth 32 Tokens for Reconstruction and Generation
  Paper • 2406.07550 • Published • 59
- Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?
  Paper • 2406.07546 • Published • 9
Image-Gen Accelerator(Distill)
- MLCM: Multistep Consistency Distillation of Latent Diffusion Model
  Paper • 2406.05768 • Published • 13
- Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps
  Paper • 2406.14539 • Published • 27
- HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration
  Paper • 2410.01723 • Published • 5
- NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
  Paper • 2412.02030 • Published • 19
Image-Gen Autoregressive
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
  Paper • 2406.06525 • Published • 71
- Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
  Paper • 2405.21048 • Published • 16
- Scalable Autoregressive Image Generation with Mamba
  Paper • 2408.12245 • Published • 26
- DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
  Paper • 2410.08159 • Published • 26
Image-Gen
- Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
  Paper • 2406.09416 • Published • 29
- Wavelets Are All You Need for Autoregressive Image Generation
  Paper • 2406.19997 • Published • 31
- ViPer: Visual Personalization of Generative Models via Individual Preference Learning
  Paper • 2407.17365 • Published • 13
- MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning
  Paper • 2408.11001 • Published • 13
Image-Gen Story
- SEED-Story: Multimodal Long Story Generation with Large Language Model
  Paper • 2407.08683 • Published • 25
- Story-Adapter: A Training-free Iterative Framework for Long Story Visualization
  Paper • 2410.06244 • Published • 19
- Unbounded: A Generative Infinite Game of Character Life Simulation
  Paper • 2410.18975 • Published • 37
- Generative AI for Cel-Animation: A Survey
  Paper • 2501.06250 • Published • 13
Video-Gen
- MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
  Paper • 2405.20222 • Published • 11
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation
  Paper • 2406.00908 • Published • 12
- CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation
  Paper • 2406.02509 • Published • 10
- I4VGen: Image as Stepping Stone for Text-to-Video Generation
  Paper • 2406.02230 • Published • 18
Video_Gen lightweight
Video_Gen Translation
Video-Gen Trajectory
- Tora: Trajectory-oriented Diffusion Transformer for Video Generation
  Paper • 2407.21705 • Published • 27
- TrackGo: A Flexible and Efficient Method for Controllable Video Generation
  Paper • 2408.11475 • Published • 18
- TVG: A Training-free Transition Video Generation Method with Diffusion Models
  Paper • 2408.13413 • Published • 14
- PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
  Paper • 2409.18964 • Published • 26
Video-3D
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control
  Paper • 2407.12781 • Published • 13
- Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models
  Paper • 2409.07452 • Published • 21
- Novel View Extrapolation with Video Diffusion Priors
  Paper • 2411.14208 • Published • 10
- World-consistent Video Diffusion with Explicit 3D Modeling
  Paper • 2412.01821 • Published • 4
Video-Gen Diffusion_4D(DiT etc)
- Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer
  Paper • 2405.17405 • Published • 16
- EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture
  Paper • 2405.18991 • Published • 12
- 4Diffusion: Multi-view Video Diffusion Model for 4D Generation
  Paper • 2405.20674 • Published • 15
- 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models
  Paper • 2406.07472 • Published • 13
Video-Audio
Segmentation
- SAM 2: Segment Anything in Images and Videos
  Paper • 2408.00714 • Published • 116
- Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space
  Paper • 2408.07416 • Published • 7
- SMITE: Segment Me In TimE
  Paper • 2410.18538 • Published • 16
- ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
  Paper • 2410.23287 • Published • 19
depthmap
- Depth Anything V2
  Paper • 2406.09414 • Published • 103
- Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation
  Paper • 2406.12849 • Published • 50
- BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
  Paper • 2407.17952 • Published • 32
- Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
  Paper • 2409.18124 • Published • 33
Robot-related
- RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
  Paper • 2406.02523 • Published • 12
- UniT: Unified Tactile Representation for Robot Learning
  Paper • 2408.06481 • Published • 10
- Latent Action Pretraining from Videos
  Paper • 2410.11758 • Published • 3
- Neural Fields in Robotics: A Survey
  Paper • 2410.20220 • Published • 5
Evaluation
Captioning
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
  Paper • 2409.02889 • Published • 54
- MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis
  Paper • 2409.07129 • Published • 8
- LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
  Paper • 2409.18125 • Published • 34
- FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
  Paper • 2411.15411 • Published • 8
Detection
LLM
GAN
Autoregressive
VLM
-
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Paper • 2501.03895 • Published • 52 -
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Paper • 2501.04001 • Published • 46 -
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Paper • 2501.04003 • Published • 27 -
VideoRAG: Retrieval-Augmented Generation over Video Corpus
Paper • 2501.05874 • Published • 73
Benchmark
3D_GEN_lightweight
camera_object_control_gen
Light Memory Diffusion
Scene_Gen
Pose_Estimation
Diffusion
Animating
-
Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters
Paper • 2411.18197 • Published • 14 -
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Paper • 2412.00174 • Published • 23 -
One Shot, One Talk: Whole-body Talking Avatar from a Single Image
Paper • 2412.01106 • Published • 23 -
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
Paper • 2412.09349 • Published • 8
4D_Diffusion
Accelerator
layout
-
Generating Compositional Scenes via Text-to-image RGBA Instance Generation
Paper • 2411.10913 • Published • 4 -
ROICtrl: Boosting Instance Control for Visual Generation
Paper • 2411.17949 • Published • 87 -
Pathways on the Image Manifold: Image Editing via Video Generation
Paper • 2411.16819 • Published • 37
Medical
Diffusion_GS
FLUX_related
-
Training-free Regional Prompting for Diffusion Transformers
Paper • 2411.02395 • Published • 25 -
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
Paper • 2411.06558 • Published • 36 -
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
Paper • 2412.09611 • Published • 10 -
3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering
Paper • 2501.05131 • Published • 37
Virtual-Try On
-
Fashion-VDM: Video Diffusion Model for Virtual Try-On
Paper • 2411.00225 • Published • 11 -
GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
Paper • 2411.03047 • Published • 9 -
FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on
Paper • 2411.10499 • Published • 13 -
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
Paper • 2411.18350 • Published • 29
Image_Restoration
Motion_Gen
Dynamic_Gen
Dynamic_GS
-
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes
Paper • 2410.17249 • Published • 42 -
VeGaS: Video Gaussian Splatting
Paper • 2411.11024 • Published • 7 -
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
Paper • 2501.03714 • Published • 9
Video-Gen Pose
Video-Gen Training-Free
Vision Task
-
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 51 -
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper • 2406.09406 • Published • 15 -
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Paper • 2406.10227 • Published • 9 -
What If We Recaption Billions of Web Images with LLaMA-3?
Paper • 2406.08478 • Published • 41
Relighting
-
Subsurface Scattering for 3D Gaussian Splatting
Paper • 2408.12282 • Published • 7 -
Instant Facial Gaussians Translator for Relightable and Interactable Facial Rendering
Paper • 2409.07441 • Published • 11 -
GS^3: Efficient Relighting with Triple Gaussian Splatting
Paper • 2410.11419 • Published • 12 -
SpotLight: Shadow-Guided Object Relighting via Diffusion
Paper • 2411.18665 • Published • 3
3D
-
GECO: Generative Image-to-3D within a SECOnd
Paper • 2405.20327 • Published • 11 -
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
Paper • 2406.03184 • Published • 22 -
NPGA: Neural Parametric Gaussian Avatars
Paper • 2405.19331 • Published • 10 -
Unified Text-to-Image Generation and Retrieval
Paper • 2406.05814 • Published • 16
NeRF
-
ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models
Paper • 2406.06133 • Published • 12 -
Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View Images
Paper • 2406.13393 • Published • 5 -
BoostMVSNeRFs: Boosting MVS-based NeRFs to Generalizable View Synthesis in Large-scale Scenes
Paper • 2407.15848 • Published • 17 -
NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields
Paper • 2404.01300 • Published • 4
GS
-
GFlow: Recovering 4D World from Monocular Video
Paper • 2405.18426 • Published • 17 -
3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting
Paper • 2405.18424 • Published • 9 -
HDR-GS: Efficient High Dynamic Range Novel View Synthesis at 1000x Speed via Gaussian Splatting
Paper • 2405.15125 • Published • 8 -
PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting
Paper • 2405.19957 • Published • 10
3D-Edit
-
DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation
Paper • 2407.11394 • Published • 12 -
Jointly Generating Multi-view Consistent PBR Textures using Collaborative Control
Paper • 2410.06985 • Published • 5 -
ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion
Paper • 2410.08168 • Published • 9 -
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
Paper • 2411.02336 • Published • 24
3D_Diffusion
-
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views
Paper • 2408.10195 • Published • 13 -
MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing
Paper • 2408.08000 • Published • 9 -
DC3DO: Diffusion Classifier for 3D Objects
Paper • 2408.06693 • Published • 11 -
MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement
Paper • 2408.14211 • Published • 11
Text-to-3D
-
YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals
Paper • 2406.16273 • Published • 43 -
Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images
Paper • 2407.06191 • Published • 14 -
PlacidDreamer: Advancing Harmony in Text-to-3D Generation
Paper • 2407.13976 • Published • 5 -
MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation
Paper • 2411.17945 • Published • 27
Image-Gen Theoretical
-
Guiding a Diffusion Model with a Bad Version of Itself
Paper • 2406.02507 • Published • 17 -
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
Paper • 2406.04314 • Published • 30 -
An Image is Worth 32 Tokens for Reconstruction and Generation
Paper • 2406.07550 • Published • 59 -
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?
Paper • 2406.07546 • Published • 9
Image-Gen Personalization
-
pOps: Photo-Inspired Diffusion Operators
Paper • 2406.01300 • Published • 18 -
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Paper • 2406.06911 • Published • 12 -
Interpreting the Weight Space of Customized Diffusion Models
Paper • 2406.09413 • Published • 20 -
EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts
Paper • 2406.09162 • Published • 14
Image-Gen Accelerator(Distill)
-
MLCM: Multistep Consistency Distillation of Latent Diffusion Model
Paper • 2406.05768 • Published • 13 -
Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps
Paper • 2406.14539 • Published • 27 -
HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration
Paper • 2410.01723 • Published • 5 -
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
Paper • 2412.02030 • Published • 19
Image-Gen Lightweight
-
LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models
Paper • 2405.14477 • Published • 20 -
Phased Consistency Model
Paper • 2405.18407 • Published • 48 -
SVDQunat: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Paper • 2411.05007 • Published • 22 -
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
Paper • 2412.09619 • Published • 27
Image-Gen Autoregressive
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Paper • 2406.06525 • Published • 71 -
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
Paper • 2405.21048 • Published • 16 -
Scalable Autoregressive Image Generation with Mamba
Paper • 2408.12245 • Published • 26 -
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
Paper • 2410.08159 • Published • 26
Image-Gen Edit
-
Zero-shot Image Editing with Reference Imitation
Paper • 2406.07547 • Published • 33 -
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing
Paper • 2406.10601 • Published • 70 -
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
Paper • 2407.05282 • Published • 15 -
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
Paper • 2407.16982 • Published • 42
Image-Gen
-
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models
Paper • 2406.09416 • Published • 29 -
Wavelets Are All You Need for Autoregressive Image Generation
Paper • 2406.19997 • Published • 31 -
ViPer: Visual Personalization of Generative Models via Individual Preference Learning
Paper • 2407.17365 • Published • 13 -
MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning
Paper • 2408.11001 • Published • 13
Image-Gen Text
-
FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation
Paper • 2406.08392 • Published • 21 -
Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering
Paper • 2406.10208 • Published • 22 -
AMO Sampler: Enhancing Text Rendering with Overshooting
Paper • 2411.19415 • Published • 5
Image-Gen Story
-
SEED-Story: Multimodal Long Story Generation with Large Language Model
Paper • 2407.08683 • Published • 25 -
Story-Adapter: A Training-free Iterative Framework for Long Story Visualization
Paper • 2410.06244 • Published • 19 -
Unbounded: A Generative Infinite Game of Character Life Simulation
Paper • 2410.18975 • Published • 37 -
Generative AI for Cel-Animation: A Survey
Paper • 2501.06250 • Published • 13
Image-Gen StyleInject
-
Magic Insert: Style-Aware Drag-and-Drop
Paper • 2407.02489 • Published • 22 -
ZePo: Zero-Shot Portrait Stylization with Faster Sampling
Paper • 2408.05492 • Published • 7 -
CSGO: Content-Style Composition in Text-to-Image Generation
Paper • 2408.16766 • Published • 18 -
Style-Friendly SNR Sampler for Style-Driven Generation
Paper • 2411.14793 • Published • 39
Video-Gen
-
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
Paper • 2405.20222 • Published • 11 -
ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation
Paper • 2406.00908 • Published • 12 -
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation
Paper • 2406.02509 • Published • 10 -
I4VGen: Image as Stepping Stone for Text-to-Video Generation
Paper • 2406.02230 • Published • 18
Video-Gen Edit
-
I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models
Paper • 2405.16537 • Published • 17 -
ReVideo: Remake a Video with Motion and Content Control
Paper • 2405.13865 • Published • 25 -
FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models
Paper • 2406.16863 • Published • 11 -
Portrait Video Editing Empowered by Multimodal Generative Priors
Paper • 2409.13591 • Published • 17
Video_Gen lightweight
Video-Gen Customization
-
Still-Moving: Customized Video Generation without Customized Video Data
Paper • 2407.08674 • Published • 13 -
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
Paper • 2408.13239 • Published • 12 -
MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling
Paper • 2409.16160 • Published • 33 -
Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
Paper • 2409.17280 • Published • 11
Video_Gen Translation
Video-Gen Benchmark
Video-Gen Trajectory
-
Tora: Trajectory-oriented Diffusion Transformer for Video Generation
Paper • 2407.21705 • Published • 27 -
TrackGo: A Flexible and Efficient Method for Controllable Video Generation
Paper • 2408.11475 • Published • 18 -
TVG: A Training-free Transition Video Generation Method with Diffusion Models
Paper • 2408.13413 • Published • 14 -
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
Paper • 2409.18964 • Published • 26
Video-Gen LLM-based
Video-3D
-
VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control
Paper • 2407.12781 • Published • 13 -
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models
Paper • 2409.07452 • Published • 21 -
Novel View Extrapolation with Video Diffusion Priors
Paper • 2411.14208 • Published • 10 -
World-consistent Video Diffusion with Explicit 3D Modeling
Paper • 2412.01821 • Published • 4
Video-Gen-GS
- Animate3D: Animating Any 3D Model with Multi-view Video Diffusion
  Paper • 2407.11398 • Published • 10
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control
  Paper • 2407.12781 • Published • 13
- V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians
  Paper • 2409.13648 • Published • 12
- DressRecon: Freeform 4D Human Reconstruction from Monocular Video
  Paper • 2409.20563 • Published • 9
Video-Gen Diffusion_4D (DiT etc.)
- Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer
  Paper • 2405.17405 • Published • 16
- EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture
  Paper • 2405.18991 • Published • 12
- 4Diffusion: Multi-view Video Diffusion Model for 4D Generation
  Paper • 2405.20674 • Published • 15
- 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models
  Paper • 2406.07472 • Published • 13
Video-Gen Dataset
Video-Audio
Image-Captioning
Segmentation
- SAM 2: Segment Anything in Images and Videos
  Paper • 2408.00714 • Published • 116
- Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space
  Paper • 2408.07416 • Published • 7
- SMITE: Segment Me In TimE
  Paper • 2410.18538 • Published • 16
- ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
  Paper • 2410.23287 • Published • 19
SAM_based
- Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning
  Paper • 2408.07931 • Published • 22
- SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
  Paper • 2408.16768 • Published • 28
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
  Paper • 2410.16268 • Published • 69
depthmap
- Depth Anything V2
  Paper • 2406.09414 • Published • 103
- Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation
  Paper • 2406.12849 • Published • 50
- BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
  Paper • 2407.17952 • Published • 32
- Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
  Paper • 2409.18124 • Published • 33
RL
- Lessons from Learning to Spin "Pens"
  Paper • 2407.18902 • Published • 21
- Diffusion Models Are Real-Time Game Engines
  Paper • 2408.14837 • Published • 126
- WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents
  Paper • 2410.07484 • Published • 51
- Autonomous Character-Scene Interaction Synthesis from Text Instruction
  Paper • 2410.03187 • Published • 7
Robot-related
- RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
  Paper • 2406.02523 • Published • 12
- UniT: Unified Tactile Representation for Robot Learning
  Paper • 2408.06481 • Published • 10
- Latent Action Pretraining from Videos
  Paper • 2410.11758 • Published • 3
- Neural Fields in Robotics: A Survey
  Paper • 2410.20220 • Published • 5
Dataset
- VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads
  Paper • 2407.18245 • Published • 11
- AIM 2024 Sparse Neural Rendering Challenge: Dataset and Benchmark
  Paper • 2409.15041 • Published • 14
- LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
  Paper • 2410.09732 • Published • 55
- MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
  Paper • 2410.13754 • Published • 75
Evaluation
Video-Gen Long
- Training-free Long Video Generation with Chain of Diffusion Model Experts
  Paper • 2408.13423 • Published • 23
- Loong: Generating Minute-level Long Videos with Autoregressive Language Models
  Paper • 2410.02757 • Published • 36
- MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
  Paper • 2411.13807 • Published • 11
- TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
  Paper • 2411.18671 • Published • 20
Captioning
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
  Paper • 2409.02889 • Published • 54
- MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis
  Paper • 2409.07129 • Published • 8
- LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
  Paper • 2409.18125 • Published • 34
- FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
  Paper • 2411.15411 • Published • 8
Omni-Generation
- OmniGen: Unified Image Generation
  Paper • 2409.11340 • Published • 115
- Video-Guided Foley Sound Generation with Multimodal Controls
  Paper • 2411.17698 • Published • 10
- FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
  Paper • 2412.01064 • Published • 46
- OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
  Paper • 2412.01169 • Published • 13
Detection