Image-to-Text · Transformers · Safetensors · Cosmos · English · qwen2_5_vl · nvidia · text-generation-inference
zekunhao committed · Commit 0caf724 · 1 parent: 1674a72

08/01 release: Added support for spatial-temporal reasoning of city and industrial operations

README.md CHANGED
@@ -17,7 +17,6 @@ tags:
 - cosmos
 ---
 
-
 # **Cosmos-Reason1: Physical AI Common Sense and Embodied Reasoning Models**
 
 [**Cosmos**](https://huggingface.co/collections/nvidia/cosmos-reason1-67c9e926206426008f1da1b7) | [**Code**](https://github.com/nvidia-cosmos/cosmos-reason1) | [**Paper**](https://arxiv.org/abs/2503.15558) | [**Paper Website**](https://research.nvidia.com/labs/dir/cosmos-reason1)
@@ -64,6 +63,7 @@ Physical AI: Space, time, fundamental physics understanding and embodied reasoni
 
 * Github: [05/17/2025](https://github.com/nvidia-cosmos/cosmos-reason1)
 * Huggingface:
+  * [08/01/2025](https://huggingface.co/nvidia/Cosmos-Reason1-7B/commit/1d4cbdc7a277affb4a69eca40b60b4479a5c63b8). Shipped improvements, including captions with temporal timestamps and Set-of-Mark prompting.
   * [06/10/2025](https://huggingface.co/nvidia/Cosmos-Reason1-7B/commit/2464fff43c5c0bfb1916ac8c009feda4aed81be9). Enhanced critic capability for physical plausibility.
   * [05/17/2025](https://huggingface.co/nvidia/Cosmos-Reason1-7B/commit/098a5bb62a1f4fc05e5c4ac89aae8005e301aa18). Initial release.
 
@@ -75,6 +75,21 @@ Network Architecture: Qwen2.5-VL-7B-Instruct.
 Cosmos-Reason-7B is post-trained based on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and follows the same model architecture.
 
 
+**Number of model parameters:**
+
+Cosmos-Reason1-7B:<br>
+* Vision Transformer (ViT): 675.76M (675,759,104)
+* Language Model (LLM): 7.07B (7,070,619,136)
+* Other components (output projection layer): 545.00M (544,997,376)
+
+
+## Computational Load:
+
+* Cumulative Compute: 3.2603016e+21 FLOPs
+* Estimated Energy and Emissions for Model Training:
+  * Total kWh = 16658432
+  * Total Emissions (tCO2e) = 5380.674
+
 ## Input
 
 **Input Type(s)**: Text+Video/Image
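The parameter breakdown added in the hunk above can be sanity-checked against the released checkpoint. A minimal sketch, assuming the upstream `transformers` (>= 4.49) Qwen2.5-VL class that this model is stated to share, and a machine with enough memory to load the roughly 16.6 GB of shards listed at the end of this commit; module names may differ slightly across transformers versions:

```python
# Print parameter counts per top-level module of the checkpoint so they can be
# compared with the ViT / LLM / other-components figures listed above.
from collections import defaultdict

from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "nvidia/Cosmos-Reason1-7B", torch_dtype="auto"
)

counts = defaultdict(int)
for name, param in model.named_parameters():
    counts[name.split(".")[0]] += param.numel()  # group by top-level module name

for module, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{module}: {n:,}")
print(f"total: {sum(counts.values()):,}")
```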
@@ -95,7 +110,7 @@ Cosmos-Reason-7B is post-trained based on [Qwen2.5-VL-7B-Instruct](https://huggi
 
 ## Output
 
-**Output Type(s)**: Text
+**Output Type(s)**: Text
 
 **Output Format**: String
 
@@ -103,6 +118,9 @@ Cosmos-Reason-7B is post-trained based on [Qwen2.5-VL-7B-Instruct](https://huggi
 
 **Other Properties Related to Output**:
 * Recommend using 4096 or more max output tokens to avoid truncation of long chain-of-thought responses.
+
+* Our AI model recognizes timestamps added at the bottom of each frame for accurate temporal localization.
+
 * Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
 
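The max-output-token recommendation above can be applied through the standard Qwen2.5-VL generation flow that this model follows. A minimal sketch, assuming `transformers` >= 4.49 and the `qwen_vl_utils` helper package; the video path and prompt are placeholders, and the repository's own Inference section remains the reference:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "nvidia/Cosmos-Reason1-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder clip and question; fps controls how densely frames are sampled.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 4},
        {"type": "text", "text": "Is the action in this video physically plausible? Explain your reasoning."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# 4096+ new tokens so long chain-of-thought responses are not truncated.
output_ids = model.generate(**inputs, max_new_tokens=4096)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```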
 
@@ -129,11 +147,47 @@ Cosmos-Reason-7B is post-trained based on [Qwen2.5-VL-7B-Instruct](https://huggi
 See [Cosmos-Reason1](https://github.com/nvidia-cosmos/cosmos-reason1) for details.
 * Post Training: [Cosmos-Reason1](https://github.com/nvidia-cosmos/cosmos-reason1) provides examples of supervised fine-tuning and reinforcement learning on embodied reasoning datasets.
 
-# Evaluation
-
+## Training and Evaluation Sections:
+### 05/17/2025
 Please see our [technical paper](https://arxiv.org/pdf/2503.15558) for detailed evaluations on physical common sense and embodied reasoning. Some of the evaluation datasets are released under [Cosmos-Reason1-Benchmark](https://huggingface.co/datasets/nvidia/Cosmos-Reason1-Benchmark). The embodied reasoning datasets and benchmarks focus on the following areas: robotics (RoboVQA, BridgeDataV2, AgiBot, RoboFail), ego-centric human demonstration (HoloAssist), and Autonomous Vehicle (AV) driving video data. The AV dataset is collected and annotated by NVIDIA.
+
 All datasets go through the data annotation process described in the technical paper to prepare training and evaluation data and annotations.
 
+### 08/01/2025
+We enhanced the model's capabilities with augmented training data. PLM-Video-Human and Nexar are used to enable dense temporal captioning. Describe Anything is added to enhance Set-of-Mark (SoM) prompting. We enriched data for intelligent transportation systems (ITS) and warehouse applications. Lastly, the Visual Critics dataset contains a collection of AI-generated videos from Cosmos-Predict2 and Wan2.1 with human annotations describing the physical correctness of the AI videos.
+
+
+## Training Datasets:
+
+**Data Collection Method**:
+* RoboVQA: Hybrid: Automatic/Sensors
+* BridgeDataV2: Automatic/Sensors
+* AgiBot: Automatic/Sensors
+* RoboFail: Automatic/Sensors
+* HoloAssist: Human
+* AV: Automatic/Sensors
+* PLM-Video-Human: Human
+* Nexar: Automatic/Sensors
+* Describe Anything: Human
+* ITS / Warehouse: Human, Automatic
+* Visual Critics: Automatic
+
+**Labeling Method**:
+* RoboVQA: Hybrid: Human, Automated
+* BridgeDataV2: Hybrid: Human, Automated
+* AgiBot: Hybrid: Human, Automated
+* RoboFail: Hybrid: Human, Automated
+* HoloAssist: Hybrid: Human, Automated
+* AV: Hybrid: Human, Automated
+* PLM-Video-Human: Human, Automated
+* Nexar: Human
+* Describe Anything: Human, Automated
+* ITS / Warehouse: Human, Automated
+* Visual Critics: Human, Automated
+
+
+# Evaluation Datasets:
+
 **Data Collection Method**:
 * RoboVQA: Hybrid: Automatic/Sensors
 * BridgeDataV2: Automatic/Sensors
@@ -142,6 +196,7 @@ All datasets go through the data annotation process described in the technical p
 * HoloAssist: Human
 * AV: Automatic/Sensors
 
+
 **Labeling Method**:
 * RoboVQA: Hybrid: Human, Automated
 * BridgeDataV2: Hybrid: Human, Automated
@@ -150,6 +205,7 @@ All datasets go through the data annotation process described in the technical p
 * HoloAssist: Hybrid: Human, Automated
 * AV: Hybrid: Human, Automated
 
+
 **Metrics**:
 We report the model accuracy on the embodied reasoning benchmark introduced in [Cosmos-Reason1](https://arxiv.org/abs/2503.15558). The results differ from those presented in Table 9 due to additional training aimed at supporting a broader range of Physical AI tasks beyond the benchmark.
 | | [RoboVQA](https://robovqa.github.io/) | AV | [BridgeDataV2](https://rail-berkeley.github.io/bridgedata/) | [Agibot](https://github.com/OpenDriveLab/AgiBot-World) | [HoloAssist](https://holoassist.github.io/) | [RoboFail](https://robot-reflect.github.io/) | Average |
@@ -160,6 +216,7 @@ We report the model accuracy on the embodied reasoning benchmark introduced in [
 Modality: Video (mp4) and Text
 
 ## Dataset Quantification
+### 05/17/2025
 We release the embodied reasoning data and benchmarks. Each data sample is a pair of video and text. The text annotations include understanding and reasoning annotations described in the Cosmos-Reason1 paper. Each video may have multiple text annotations. The quantity of the video and text pairs is described in the table below.
 **The AV data is currently unavailable and will be uploaded soon!**
 
@@ -169,9 +226,13 @@ We release the embodied reasoning data and benchmarks. Each data sample is a pai
 | **RL Data** | 252 | 200 | 240 | 200 | 200 | N/A | **2.6GB** |
 | **Benchmark Data** | 110 | 100 | 100 | 100 | 100 | 100 | **1.5GB** |
 
-We release text annotations for all embodied reasoning datasets and videos for RoboVQA and AV datasets. For other datasets, users may download the source videos from the original data source and find corresponding video sources via the video names. The held-out RoboFail benchmark is released for measuring the generalization capability.
+We release text annotations for all embodied reasoning datasets, and videos for the RoboVQA and AV datasets. For other datasets, users may download the source videos from the original data source and find the corresponding videos via the video names. The held-out RoboFail benchmark is released for measuring generalization capability.
 
+### 08/01/2025
+| | [PLM-Video-Human](https://huggingface.co/datasets/facebook/PLM-Video-Human) | Nexar | [Describe Anything](https://huggingface.co/datasets/nvidia/describe-anything-dataset) | ITS / Warehouse | Visual Critics | Total Storage Size |
+|---|---|---|---|---|---|---|
+| **SFT Data** | 39k | 240k | 178k | 24k | 24k | **2.6TB** |
 
 ## Inference:
@@ -310,4 +371,5 @@ We value you, the datasets, the diversity they represent, and what we have been
 | Model Application(s): | Physical AI common sense understanding and embodied reasoning |
 | Describe the life critical impact (if present). | None Known |
 | Use Case Restrictions: | [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) |
-| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. Model checkpoints are made available on Hugging Face, and may become available on cloud providers' model catalog. |
+| Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. Model checkpoints are made available on Hugging Face and may become available on cloud providers' model catalogs. |
+
 
model-00001-of-00004.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cb3789b9843f08b8a181ee43fc074f79edfacb9081677b8f35fae69c34de9efd
+oid sha256:c28404126221997ae8eb70a23b919c96174d42e35ae1d537e0c95093d50b359a
 size 4968243304
model-00002-of-00004.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:522dd37d8d1ed41b2134903d374167e7336c7f4e5eac3ca7b310dfbb42606a05
+oid sha256:f281081864c10992d3e03874c79d526c84407e049d713747f19eb9c79cd16db3
 size 4991495816
model-00003-of-00004.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2f74ab185f8f51ceb0d5521dbe6ec51b11e16d986412269456a0e5c043f526a6
+oid sha256:3bb3a62a8d0e83c6283388ddea99395b221f908f9181b8edd0f7f91d02260ebe
 size 4932751040
model-00004-of-00004.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5e0879e4d16019a1af04bec65f3058d114de947bd1412dd5cc8096b3fb7c6969
+oid sha256:91bacbe1ad798e16daa05023b4e4bec70b53c8cd7d757db86c5bc76c4e0bbf15
 size 1691924384
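The updated LFS pointers above can be checked after downloading the shards. A minimal sketch, assuming the four safetensors files sit in the current working directory; the expected digests are the post-commit oids listed above:

```python
import hashlib

# Post-commit Git LFS sha256 digests from the pointer files above.
expected = {
    "model-00001-of-00004.safetensors": "c28404126221997ae8eb70a23b919c96174d42e35ae1d537e0c95093d50b359a",
    "model-00002-of-00004.safetensors": "f281081864c10992d3e03874c79d526c84407e049d713747f19eb9c79cd16db3",
    "model-00003-of-00004.safetensors": "3bb3a62a8d0e83c6283388ddea99395b221f908f9181b8edd0f7f91d02260ebe",
    "model-00004-of-00004.safetensors": "91bacbe1ad798e16daa05023b4e4bec70b53c8cd7d757db86c5bc76c4e0bbf15",
}

for name, oid in expected.items():
    digest = hashlib.sha256()
    with open(name, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
            digest.update(chunk)
    status = "OK" if digest.hexdigest() == oid else "MISMATCH"
    print(f"{name}: {status}")
```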