nielsr (HF Staff) committed
Commit 29a8a97 · verified · 1 parent: bbc9a99

Add model card metadata (pipeline, library, tags) and abstract

This PR enhances the model card for `ARC-Hunyuan-Video-7B` by:

* Adding `pipeline_tag: video-text-to-text` to ensure the model is discoverable under the appropriate task filter on the Hugging Face Hub.
* Adding `library_name: transformers` to indicate compatibility with the Hugging Face `transformers` library, allowing users to load and use the model with standard `transformers` API calls (a hypothetical loading sketch follows this list).
* Adding additional `tags` (`multimodal`, `video-understanding`, `video-qa`, `video-captioning`, `audio-understanding`) for better categorization and searchability.
* Including the paper abstract in a dedicated `## Abstract` section, so the model's scope and capabilities are summarized directly on the card.
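
For context on the `library_name` addition, here is a minimal, hypothetical sketch of what loading the checkpoint through the standard `transformers` API could look like. The class choices (`AutoModel`, `AutoProcessor`), the `trust_remote_code=True` flag, and the dtype/device settings are assumptions rather than documented usage; the repository's own inference scripts remain the authoritative entry point.

```python
# Hypothetical loading sketch -- not taken from the official README.
# Assumes the Hub repo ships custom modeling/processing code that
# transformers can resolve via trust_remote_code; entry points may differ.
import torch
from transformers import AutoModel, AutoProcessor

model_id = "TencentARC/ARC-Hunyuan-Video-7B"

# Video/audio preprocessing is model-specific; AutoProcessor is a placeholder here.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumption: half precision for a single A100 40GB
    trust_remote_code=True,
).eval().to("cuda")
```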

Files changed (1): README.md (+28 -17)

@@ -1,3 +1,14 @@
+ ---
+ pipeline_tag: video-text-to-text
+ library_name: transformers
+ tags:
+ - multimodal
+ - video-understanding
+ - video-qa
+ - video-captioning
+ - audio-understanding
+ ---
+
# ARC-Hunyuan-Video-7B

[![arXiv](https://img.shields.io/badge/arXiv-2507.20939-b31b1b.svg)](https://arxiv.org/abs/2507.20939)
@@ -11,6 +22,8 @@ Please note that in our Demo, ARC-Hunyuan-Video-7B is the model consistent with
Due to API file size limits, our demo uses compressed input video resolutions, which may cause slight performance differences from the paper. For original results, please run locally.
</span>

+ ## Abstract
+ Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on H20 GPU.

## Introduction

@@ -24,10 +37,10 @@ inference accelerated by the vLLM framework.

Compared to prior arts, we introduces a new paradigm of **Structured Video Comprehension**, with capabilities including:

- - **Deep Understanding of Real-World Short Videos:** ARC-Hunyuan-Video-7B excels at analyzing user-generated content from platforms like WeChat Channels and TikTok. It goes beyond surface-level descriptions to grasp the creator's intent, emotional expression, and core message by processing complex visual elements, dense audio cues, and rapid pacing.
- - **Synchronized Audio-Visual Reasoning:** The synchronization of raw visual and audio signals allows our model to answer complex questions that are impossible to solve with only one modality, such as understanding humor in a skit or details in a product review.
- - **Precise Temporal Awareness:** ARC-Hunyuan-Video-7B knows not just _what_ happens, but _when_ it happens. It supports multi-granularity timestamped captioning, temporal video grounding, and detailed event summarization, making it perfect for applications like video search, highlight generation, and content analysis.
- - **Advanced Reasoning and Application Versatility:** Leveraging a comprehensive multi-stage training regimen including Reinforcement Learning (RL), ARC-Hunyuan-Video-7B demonstrates strong reasoning capabilities. It supports zero-shot or few-shot fine-tuning for diverse downstream applications like video tagging, recommendation, and retrieval.
+ - **Deep Understanding of Real-World Short Videos:** ARC-Hunyuan-Video-7B excels at analyzing user-generated content from platforms like WeChat Channels and TikTok. It goes beyond surface-level descriptions to grasp the creator's intent, emotional expression, and core message by processing complex visual elements, dense audio cues, and rapid pacing.
+ - **Synchronized Audio-Visual Reasoning:** The synchronization of raw visual and audio signals allows our model to answer complex questions that are impossible to solve with only one modality, such as understanding humor in a skit or details in a product review.
+ - **Precise Temporal Awareness:** ARC-Hunyuan-Video-7B knows not just _what_ happens, but _when_ it happens. It supports multi-granularity timestamped captioning, temporal video grounding, and detailed event summarization, making it perfect for applications like video search, highlight generation, and content analysis.
+ - **Advanced Reasoning and Application Versatility:** Leveraging a comprehensive multi-stage training regimen including Reinforcement Learning (RL), ARC-Hunyuan-Video-7B demonstrates strong reasoning capabilities. It supports zero-shot or few-shot fine-tuning for diverse downstream applications like video tagging, recommendation, and retrieval.

The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and
video reasoning as below,
@@ -38,11 +51,11 @@ video reasoning as below,

Specifically, ARC-Hunyuan-Video-7B is built on top of the Hunyuan-7B vision-language model with the following key designs to meet the requirements of effective structured video comprehension:

- - An extra audio encoder with fine-grained visual-audio synchronization for temporally aligned visual-audio inputs
- - A timestamp overlay mechanism on visual frames that explicitly provides the model with temporal awareness
- - Millions of real-world videos with a totally automated bootstrapped annotation pipeline
- - A comprehensive training regimen based on the finding that grounding the model in objective
- tasks with RL is key to unlocking high-quality, subjective understanding
+ - An extra audio encoder with fine-grained visual-audio synchronization for temporally aligned visual-audio inputs
+ - A timestamp overlay mechanism on visual frames that explicitly provides the model with temporal awareness
+ - Millions of real-world videos with a totally automated bootstrapped annotation pipeline
+ - A comprehensive training regimen based on the finding that grounding the model in objective
+ tasks with RL is key to unlocking high-quality, subjective understanding

<p align="center">
<img src="https://github.com/TencentARC/ARC-Hunyuan-Video-7B/blob/master/figures/method.jpg?raw=true" width="95%"/>
@@ -50,13 +63,13 @@ Specifically, ARC-Hunyuan-Video-7B is built on top of the Hunyuan-7B vision-lang

## News

- - 2025.07.25: We release the [model checkpoint](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) and inference code of ARC-Hunyuan-Video-7B including [vLLM](https://github.com/vllm-project/vllm) version.
- - 2025.07.25: We release the [API service](https://arc.tencent.com/zh/document/ARC-Hunyuan-Video-7B) of ARC-Hunyuan-Video-7B, which is supported by [vLLM](https://github.com/vllm-project/vllm). We release two versions: one is V0, which only supports video description and summarization in Chinese; the other is the version consistent with the model checkpoint and the one described in the paper.
+ - 2025.07.25: We release the [model checkpoint](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) and inference code of ARC-Hunyuan-Video-7B including [vLLM](https://github.com/vllm-project/vllm) version.
+ - 2025.07.25: We release the [API service](https://arc.tencent.com/zh/document/ARC-Hunyuan-Video-7B) of ARC-Hunyuan-Video-7B, which is supported by [vLLM](https://github.com/vllm-project/vllm). We release two versions: one is V0, which only supports video description and summarization in Chinese; the other is the version consistent with the model checkpoint and the one described in the paper.

## Usage
### Dependencies
- - Our inference can be performed on a single NVIDIA A100 40GB GPU.
- - For the vLLM deployment version, we recommend using two NVIDIA A100 40GB GPUs.
+ - Our inference can be performed on a single NVIDIA A100 40GB GPU.
+ - For the vLLM deployment version, we recommend using two NVIDIA A100 40GB GPUs.
### Installation

Clone the repo and install dependent packages
@@ -87,7 +100,7 @@ pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.

### Model Weights

- - Download [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) including ViT and LLM and the original [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) .
+ - Download [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) including ViT and LLM and the original [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) .

### Inference
```bash
@@ -136,6 +149,4 @@ If you find the work helpful, please consider citing:
journal={arXiv preprint arXiv:2507.20939},
year={2025}
}
- ```
-
-
+ ```
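
As a supplement to the "Model Weights" step shown in the diff above, the following sketch fetches both checkpoints with `huggingface_hub`. The local directory names are placeholder assumptions; the repository's README governs where its scripts actually expect the weights.

```python
# Sketch: download the two checkpoints referenced in the Model Weights step.
# Target directories are placeholders, not paths mandated by the repo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TencentARC/ARC-Hunyuan-Video-7B",
    local_dir="checkpoints/ARC-Hunyuan-Video-7B",
)
snapshot_download(
    repo_id="openai/whisper-large-v3",
    local_dir="checkpoints/whisper-large-v3",
)
```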