---
title: GEN3C Project (built from existing Docker image)
emoji: 🫁
colorFrom: green
colorTo: blue
sdk: docker
image: elungky/gen3c:latest
pinned: false
# app_port: 7860 # Remove or comment this line as the image handles the port
---

# GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

<!-- Note: this video is hosted by GitHub and gets embedded automatically when viewing in the GitHub UI -->
https://github.com/user-attachments/assets/247e1719-9f8f-4504-bfa3-f9706bd8682d

**GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control**<br>
[Xuanchi Ren*](https://xuanchiren.com/),
[Tianchang Shen*](https://www.cs.toronto.edu/~shenti11/),
[Jiahui Huang](https://huangjh-pub.github.io/),
[Huan Ling](https://www.cs.toronto.edu/~linghuan/),
[Yifan Lu](https://yifanlu0227.github.io/),
[Merlin Nimier-David](https://merlin.nimierdavid.fr/),
[Thomas Müller](https://research.nvidia.com/person/thomas-muller),
[Alexander Keller](https://research.nvidia.com/person/alex-keller),
[Sanja Fidler](https://www.cs.toronto.edu/~fidler/),
[Jun Gao](https://www.cs.toronto.edu/~jungao/) <br>
\* indicates equal contribution <br>

**[Paper](https://arxiv.org/pdf/2503.03751), [Project Page](https://research.nvidia.com/labs/toronto-ai/GEN3C/), [HuggingFace](https://huggingface.co/collections/nvidia/gen3c-683f3f9540a8f9c98cf46a8d)**

Abstract: We present GEN3C, a generative video model with precise Camera Control and
temporal 3D Consistency. Prior video models already generate realistic videos,
but they tend to leverage little 3D information, leading to inconsistencies,
such as objects popping in and out of existence. Camera control, if implemented
at all, is imprecise, because camera parameters are mere inputs to the neural
network which must then infer how the video depends on the camera. In contrast,
GEN3C is guided by a 3D cache: point clouds obtained by predicting the
pixel-wise depth of seed images or previously generated frames. When generating
the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with
the new camera trajectory provided by the user. Crucially, this means that
GEN3C neither has to remember what it previously generated nor does it have to
infer the image structure from the camera pose. The model, instead, can focus
all its generative power on previously unobserved regions, as well as advancing
the scene state to the next frame. Our results demonstrate more precise camera
control than prior work, as well as state-of-the-art results in sparse-view
novel view synthesis, even in challenging settings such as driving scenes and
monocular dynamic video. Results are best viewed in videos.

For business inquiries, please visit our website and submit the form: [NVIDIA Research Licensing](https://www.nvidia.com/en-us/research/inquiries/).

For any other questions related to the model, please contact Xuanchi, Tianchang, or Jun.

## News

- 2025-06-06: Code and model released! In a future update, we plan to include the pipeline for jointly predicting depth and camera pose from video, as well as a driving-finetuned model. Stay tuned!

## Installation

Please follow the "Inference" section in [INSTALL.md](INSTALL.md) to set up your environment.

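For orientation only, a conda-based setup consistent with the `CUDA_HOME=$CONDA_PREFIX` convention used in the commands below might look like the sketch here. The repository URL, environment name, Python version, and requirements file are assumptions; [INSTALL.md](INSTALL.md) remains the authoritative reference.

```bash
# Sketch only -- follow INSTALL.md for the authoritative steps.
git clone https://github.com/nv-tlabs/GEN3C.git   # repository URL assumed
cd GEN3C
conda create -n gen3c python=3.10 -y              # environment name and Python version assumed
conda activate gen3c
pip install -r requirements.txt                   # dependency file assumed; see INSTALL.md
```
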
## Inference

### Download checkpoints

1. Generate a [Hugging Face](https://huggingface.co/settings/tokens) access token (if you haven't done so already). Set the access token to `Read` permission (the default is `Fine-grained`).
2. Log in to Hugging Face with the access token:
   ```bash
   huggingface-cli login
   ```
3. Download the GEN3C model weights from [Hugging Face](https://huggingface.co/nvidia/GEN3C-Cosmos-7B):
   ```bash
   CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_gen3c_checkpoints.py --checkpoint_dir checkpoints
   ```

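To confirm that the login succeeded and that the weights landed in `checkpoints/`, a quick sanity check such as the following can help; the exact directory layout is determined by the download script, so the size listing is only indicative.

```bash
# Sanity check: confirm the Hugging Face login and the downloaded weights.
huggingface-cli whoami   # should print your Hugging Face user name
du -sh checkpoints/*     # indicative sizes only; layout depends on the download script
```
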
### Interactive GUI usage

<div align="center">
  <img src="gui/assets/gui_preview.webp" alt="GEN3C interactive GUI" width="1080"/>
</div>

GEN3C can be used through an interactive GUI, which lets you visualize the inputs in 3D, author arbitrary camera trajectories, and start inference from a single window.
Please see the [dedicated instructions](gui/README.md).

### Command-line usage

GEN3C supports both images and videos as input. Below are examples of running GEN3C on single images and videos with predefined camera trajectory patterns.

### Example 1: Single Image to Video Generation

#### Single GPU

Generate a 121-frame video from a single image:

```bash
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/gen3c_single_image.py \
    --checkpoint_dir checkpoints \
    --input_image_path assets/diffusion/000000.png \
    --video_save_name test_single_image \
    --guidance 1 \
    --foreground_masking
```

#### Multi-GPU (8 GPUs)

```bash
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/diffusion/inference/gen3c_single_image.py \
    --checkpoint_dir checkpoints \
    --input_image_path assets/diffusion/000000.png \
    --video_save_name test_single_image_multigpu \
    --num_gpus ${NUM_GPUS} \
    --guidance 1 \
    --foreground_masking
```

#### Additional Options

- To generate longer videos autoregressively, specify the number of frames with `--num_video_frames`. The frame count must follow the pattern 121 * N - 1 (e.g., 241, 361, etc.); see the example after this list.
- To save buffer images alongside the output video, add the `--save_buffer` flag.
- You can control camera trajectories with the `--trajectory`, `--camera_rotation`, and `--movement_distance` arguments. See the "Camera Movement Options" section below for details.

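For example, a 241-frame (121 * 2 - 1) autoregressive run based on the single-GPU command above might look like the following; the output name is arbitrary:

```bash
# Longer autoregressive generation: 241 = 121 * 2 - 1 frames.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/gen3c_single_image.py \
    --checkpoint_dir checkpoints \
    --input_image_path assets/diffusion/000000.png \
    --video_save_name test_single_image_241 \
    --num_video_frames 241 \
    --guidance 1 \
    --foreground_masking
```
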
#### Camera Movement Options

##### Trajectory Types

The `--trajectory` argument controls the path the camera takes during video generation. Available options:

| Option | Description |
|--------|-------------|
| `left` | Camera moves to the left (default) |
| `right` | Camera moves to the right |
| `up` | Camera moves upward |
| `down` | Camera moves downward |
| `zoom_in` | Camera moves closer to the scene |
| `zoom_out` | Camera moves away from the scene |
| `clockwise` | Camera moves in a clockwise circular path |
| `counterclockwise` | Camera moves in a counterclockwise circular path |

##### Camera Rotation Modes

The `--camera_rotation` argument controls how the camera rotates during movement. Available options:

| Option | Description |
|--------|-------------|
| `center_facing` | Camera always rotates to look at the (estimated) center of the scene (default) |
| `no_rotation` | Camera maintains its original orientation while moving |
| `trajectory_aligned` | Camera rotates to align with the direction of movement |

##### Movement Distance

The `--movement_distance` argument controls how far the camera moves from its initial position. The default value is 0.3; larger values produce more dramatic camera movement, while smaller values produce more subtle movement.

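Putting these options together, the following variant of the single-GPU command from Example 1 orbits the camera counterclockwise while keeping it aimed at the scene center; the movement distance and output name are arbitrary example values:

```bash
# Counterclockwise orbit with the camera kept pointed at the scene center.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/gen3c_single_image.py \
    --checkpoint_dir checkpoints \
    --input_image_path assets/diffusion/000000.png \
    --video_save_name test_single_image_orbit \
    --trajectory counterclockwise \
    --camera_rotation center_facing \
    --movement_distance 0.4 \
    --guidance 1 \
    --foreground_masking
```
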
##### GPU Memory Requirements

We have tested GEN3C only on H100 and A100 GPUs. For GPUs with limited memory, you can fully offload all models by appending the following flags to your command:

```bash
--offload_diffusion_transformer \
--offload_tokenizer \
--offload_text_encoder_model \
--offload_prompt_upsampler \
--offload_guardrail_models \
--disable_guardrail \
--disable_prompt_encoder
```

The maximum memory we observed during inference with full offloading is ~43 GB. Note that memory usage may vary with system specifications; this figure is provided for reference only.

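As a concrete sketch, the single-GPU command from Example 1 with full offloading appended would look like this (only the output name differs from the documented flags):

```bash
# Single-image inference with all models offloaded to reduce GPU memory usage.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/gen3c_single_image.py \
    --checkpoint_dir checkpoints \
    --input_image_path assets/diffusion/000000.png \
    --video_save_name test_single_image_offload \
    --guidance 1 \
    --foreground_masking \
    --offload_diffusion_transformer \
    --offload_tokenizer \
    --offload_text_encoder_model \
    --offload_prompt_upsampler \
    --offload_guardrail_models \
    --disable_guardrail \
    --disable_prompt_encoder
```
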
### Example 2: Video to Video Generation

For video input, GEN3C requires additional depth information, camera intrinsics, and extrinsics. These can be obtained with the SLAM package of your choice. For testing purposes, we provide example data.

First, download the test samples:

```bash
# Download test samples from Hugging Face
huggingface-cli download nvidia/GEN3C-Testing-Example --repo-type dataset --local-dir assets/diffusion/dynamic_video_samples
```

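To see what the dynamic pipeline expects as input, you can inspect the downloaded example batch; the exact file names and formats are defined by the dataset, so the listing is only for orientation:

```bash
# Inspect the example batch (RGB frames, depth, and camera parameters,
# in whatever layout the test dataset uses).
ls -R assets/diffusion/dynamic_video_samples/batch_0000
```
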
#### Single GPU

```bash
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/gen3c_dynamic.py \
    --checkpoint_dir checkpoints \
    --input_image_path assets/diffusion/dynamic_video_samples/batch_0000 \
    --video_save_name test_dynamic_video \
    --guidance 1
```

#### Multi-GPU (8 GPUs)

```bash
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/diffusion/inference/gen3c_dynamic.py \
    --checkpoint_dir checkpoints \
    --input_image_path assets/diffusion/dynamic_video_samples/batch_0000 \
    --video_save_name test_dynamic_video_multigpu \
    --num_gpus ${NUM_GPUS} \
    --guidance 1
```

## Gallery

- **GEN3C** can be easily applied to video/scene creation from a single image:

  <div align="center">
    <img src="assets/demo_3.gif" alt="" width="1100" />
  </div>

- ... or to sparse-view images (we use 5 images here):

  <div align="center">
    <img src="assets/demo_2.gif" alt="" width="1100" />
  </div>

- ... and to dynamic videos:

  <div align="center">
    <img src="assets/demo_dynamic.gif" alt="" width="1100" />
  </div>

## Acknowledgement

Our model is based on [NVIDIA Cosmos](https://github.com/NVIDIA/Cosmos) and [Stable Video Diffusion](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid).
We are also grateful to several other open-source repositories that we drew inspiration from or built upon during the development of our pipeline:

- [MoGe](https://github.com/microsoft/MoGe)
- [TrajectoryCrafter](https://github.com/TrajectoryCrafter/TrajectoryCrafter)
- [DimensionX](https://github.com/wenqsun/DimensionX)
- [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2)
- [Video Depth Anything](https://github.com/DepthAnything/Video-Depth-Anything)

## Citation

```bibtex
@inproceedings{ren2025gen3c,
    title={GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control},
    author={Ren, Xuanchi and Shen, Tianchang and Huang, Jiahui and Ling, Huan and
            Lu, Yifan and Nimier-David, Merlin and Müller, Thomas and Keller, Alexander and
            Fidler, Sanja and Gao, Jun},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2025}
}
```

## License and Contact

This project will download and install additional third-party open-source software projects. Review the license terms of these open-source projects before use.

The GEN3C source code is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
The GEN3C models are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). For a custom license, please visit our website and submit the form: [NVIDIA Research Licensing](https://www.nvidia.com/en-us/research/inquiries/).

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference |