Add model card metadata (pipeline, library, tags) and abstract
This PR enhances the model card for `ARC-Hunyuan-Video-7B` by:
* Adding `pipeline_tag: video-text-to-text` to ensure the model is discoverable under the appropriate task filter on the Hugging Face Hub.
* Adding `library_name: transformers` to indicate compatibility with the Hugging Face `transformers` library, allowing users to easily load and use the model with standard `transformers` API calls (see the loading sketch after this list).
* Adding additional `tags` (`multimodal`, `video-understanding`, `video-qa`, `video-captioning`, `audio-understanding`) for better categorization and searchability.
* Including the paper abstract in a dedicated `## Abstract` section.
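
For the `library_name: transformers` metadata, here is a minimal loading sketch for illustration. It assumes the Hub repository ships custom modeling and processor code that `transformers` can resolve with `trust_remote_code=True`; the specific auto classes (`AutoModel`, `AutoProcessor`) and loading arguments below are assumptions rather than the repo's documented API, and the repo's own inference script remains the reference for preprocessing and generation.

```python
# Minimal loading sketch (assumption: the repo exposes custom code that the
# transformers auto classes can resolve via trust_remote_code=True).
import torch
from transformers import AutoModel, AutoProcessor

model_id = "TencentARC/ARC-Hunyuan-Video-7B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision load; adjust to your hardware
    trust_remote_code=True,
    device_map="auto",            # requires `accelerate` for automatic placement
)
# Video frame/audio preprocessing and generation are model-specific; follow the
# repo's inference script for the expected inputs and prompts.
```
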
README.md (CHANGED)
@@ -1,3 +1,14 @@
+---
+pipeline_tag: video-text-to-text
+library_name: transformers
+tags:
+- multimodal
+- video-understanding
+- video-qa
+- video-captioning
+- audio-understanding
+---
+
# ARC-Hunyuan-Video-7B

[](https://arxiv.org/abs/2507.20939)
@@ -11,6 +22,8 @@ Please note that in our Demo, ARC-Hunyuan-Video-7B is the model consistent with
Due to API file size limits, our demo uses compressed input video resolutions, which may cause slight performance differences from the paper. For original results, please run locally.
</span>

+## Abstract
+Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on H20 GPU.

## Introduction

@@ -24,10 +37,10 @@

Compared to prior art, we introduce a new paradigm of **Structured Video Comprehension**, with capabilities including:

+- **Deep Understanding of Real-World Short Videos:** ARC-Hunyuan-Video-7B excels at analyzing user-generated content from platforms like WeChat Channels and TikTok. It goes beyond surface-level descriptions to grasp the creator's intent, emotional expression, and core message by processing complex visual elements, dense audio cues, and rapid pacing.
+- **Synchronized Audio-Visual Reasoning:** The synchronization of raw visual and audio signals allows our model to answer complex questions that are impossible to solve with only one modality, such as understanding humor in a skit or details in a product review.
+- **Precise Temporal Awareness:** ARC-Hunyuan-Video-7B knows not just _what_ happens, but _when_ it happens. It supports multi-granularity timestamped captioning, temporal video grounding, and detailed event summarization, making it perfect for applications like video search, highlight generation, and content analysis.
+- **Advanced Reasoning and Application Versatility:** Leveraging a comprehensive multi-stage training regimen including Reinforcement Learning (RL), ARC-Hunyuan-Video-7B demonstrates strong reasoning capabilities. It supports zero-shot or few-shot fine-tuning for diverse downstream applications like video tagging, recommendation, and retrieval.

The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and
video reasoning as below,
@@ -38,11 +51,11 @@

Specifically, ARC-Hunyuan-Video-7B is built on top of the Hunyuan-7B vision-language model with the following key designs to meet the requirements of effective structured video comprehension:

+- An extra audio encoder with fine-grained visual-audio synchronization for temporally aligned visual-audio inputs
+- A timestamp overlay mechanism on visual frames that explicitly provides the model with temporal awareness
+- Millions of real-world videos with a totally automated bootstrapped annotation pipeline
+- A comprehensive training regimen based on the finding that grounding the model in objective
+tasks with RL is key to unlocking high-quality, subjective understanding

<p align="center">
<img src="https://github.com/TencentARC/ARC-Hunyuan-Video-7B/blob/master/figures/method.jpg?raw=true" width="95%"/>
@@ -50,13 +63,13 @@ Specifically, ARC-Hunyuan-Video-7B is built on top of the Hunyuan-7B vision-lang

## News

+- 2025.07.25: We release the [model checkpoint](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) and inference code of ARC-Hunyuan-Video-7B, including a [vLLM](https://github.com/vllm-project/vllm) version.
+- 2025.07.25: We release the [API service](https://arc.tencent.com/zh/document/ARC-Hunyuan-Video-7B) of ARC-Hunyuan-Video-7B, which is supported by [vLLM](https://github.com/vllm-project/vllm). We release two versions: one is V0, which only supports video description and summarization in Chinese; the other is the version consistent with the model checkpoint and the one described in the paper.

## Usage
### Dependencies
+- Our inference can be performed on a single NVIDIA A100 40GB GPU.
+- For the vLLM deployment version, we recommend using two NVIDIA A100 40GB GPUs.
### Installation

Clone the repo and install dependent packages
@@ -87,7 +100,7 @@ pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.

### Model Weights

+- Download [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) (including the ViT and LLM) and the original [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3).

### Inference
```bash
@@ -136,6 +149,4 @@ If you find the work helpful, please consider citing:
journal={arXiv preprint arXiv:2507.20939},
year={2025}
}
+```
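
As a convenience note on the Model Weights bullet added above, both referenced checkpoints can also be fetched programmatically with `huggingface_hub` (not mentioned in the card itself); the `local_dir` paths below are illustrative, and the repo's own download instructions take precedence.

```python
# Minimal sketch: fetch the checkpoints named in the Model Weights section.
# Assumes `pip install huggingface_hub`; local_dir paths are illustrative only.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TencentARC/ARC-Hunyuan-Video-7B",
    local_dir="checkpoints/ARC-Hunyuan-Video-7B",
)
snapshot_download(
    repo_id="openai/whisper-large-v3",
    local_dir="checkpoints/whisper-large-v3",
)
```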