Robotics
Safetensors
gr00t_n1
kkundalia committed on
Commit
b35b73b
·
verified ·
1 Parent(s): 32e1fd2

Update model card

Files changed (1)
  1. README.md +135 -1
README.md CHANGED

Github page: https://github.com/NVIDIA/Isaac-GR00T/

## Description:
NVIDIA Isaac GR00T N1 is the world's first open foundation model for generalized humanoid robot reasoning and skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments. Developers and researchers can post-train GR00T N1 with real or synthetic data for their specific humanoid robot or task.

Isaac GR00T N1-1B is the lightweight version of the model. It is built on pre-trained vision and language encoders and uses a flow-matching action transformer to model a chunk of actions conditioned on vision, language, and proprioception.

A detailed description of the Isaac GR00T N1-1B architecture is provided in the [Whitepaper](https://arxiv.org/abs/2503.14734).

## License/Terms of Use
NSCL V1 License
NVIDIA OneWay Noncommercial License_22Mar2022

### Deployment Geography:
Global

### Use Case:
* Researchers, Academics, Open-Source Community: AI-driven robotics research and algorithm development.
* Developers: Integrate and customize AI for various robotic applications.
* Startups & Companies: Accelerate robotics development and reduce training costs.

### Release Date:
Github 03/17/2025 via [URL]
Huggingface 03/17/2025 via [URL]

## Reference(s):
NVIDIA-EAGLE:
Li, Zhiqi, et al. "Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models." arXiv preprint arXiv:2501.14818 (2025).
Rectified Flow:
Liu, Xingchao, and Chengyue Gong. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." The Eleventh International Conference on Learning Representations. [link]
Flow Matching Policy:
Black, Kevin, et al. "π0: A Vision-Language-Action Flow Model for General Robot Control." arXiv preprint arXiv:2410.24164 (2024).

## Model Architecture:
**Architecture Type:** Vision Transformer, Multilayer Perceptron, Flow-Matching Transformer

Isaac GR00T N1 uses vision and text transformers to encode the robot's image observations and text instructions. The architecture handles a varying number of views per embodiment by concatenating image token embeddings from all frames into a sequence, followed by language token embeddings.

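As a rough illustration of how a variable number of camera views is folded into a single conditioning sequence, a minimal sketch follows; the batch size, token counts, and embedding width are illustrative assumptions, not the released implementation.

```python
import torch

# Hypothetical per-view image token embeddings from the vision transformer
# (one tensor of shape [batch, tokens_per_view, d_model] per camera view).
view_tokens = [torch.randn(2, 256, 1024) for _ in range(3)]  # e.g. 3 camera views

# Hypothetical language token embeddings from the text encoder.
lang_tokens = torch.randn(2, 20, 1024)

# Concatenate all image tokens into one sequence, then append language tokens.
context = torch.cat(view_tokens + [lang_tokens], dim=1)  # shape [2, 3*256 + 20, 1024]
```
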
To model proprioception and a sequence of actions conditioned on observations, Isaac GR00T N1-1B uses a flow-matching transformer. The flow-matching transformer interleaves self-attention over proprioception and actions with cross-attention to the vision and language embeddings. During training, the input actions are corrupted by randomly interpolating between the clean action vector and a Gaussian noise vector. At inference time, the policy first samples a Gaussian noise vector and iteratively reconstructs a continuous-valued action using its velocity prediction.

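The following is a minimal sketch of that flow-matching recipe, assuming a hypothetical `velocity_model(noisy_actions, t, context)` callable that returns a velocity prediction with the same shape as the action chunk; it is illustrative only and not the GR00T N1 code.

```python
import torch

def flow_matching_loss(velocity_model, actions, context):
    """Training: corrupt clean actions by interpolating toward Gaussian noise."""
    b = actions.shape[0]
    noise = torch.randn_like(actions)
    t = torch.rand(b, 1, 1)                        # interpolation time in [0, 1]
    noisy_actions = (1 - t) * noise + t * actions  # random interpolation
    target_velocity = actions - noise              # straight-line (rectified flow) target
    pred_velocity = velocity_model(noisy_actions, t, context)
    return torch.mean((pred_velocity - target_velocity) ** 2)

@torch.no_grad()
def sample_actions(velocity_model, context, action_shape, steps=10):
    """Inference: start from Gaussian noise and integrate the predicted velocity."""
    x = torch.randn(action_shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((action_shape[0], 1, 1), i * dt)
        x = x + dt * velocity_model(x, t, context)  # Euler step along the learned flow
    return x  # continuous-valued action chunk
```

The number of integration steps and the exact noise schedule here are placeholders; the released model's denoising schedule and conditioning interface may differ.
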
**Network Architecture:**
Illustrated in [Whitepaper](https://arxiv.org/abs/2503.14734) Figure 2.
* RGB camera frames are processed through a pre-trained vision transformer (SigLIP-2).
* Text is encoded by a pre-trained transformer (T5).
* Robot proprioception is encoded using a multi-layer perceptron (MLP) indexed by the embodiment ID. To handle variable-dimension proprioception, inputs are padded to a configurable max length before feeding into the MLP (see the sketch after this list).
* Actions are encoded, and velocity predictions decoded, by an MLP, one per unique embodiment.
* The flow-matching transformer is implemented as a diffusion transformer (DiT), in which the diffusion step conditioning is implemented using adaptive layer normalization (AdaLN).

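A minimal sketch of how variable-dimension proprioception can be padded to a shared maximum length and routed through an embodiment-indexed MLP encoder; the class name, sizes, and number of embodiments are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProprioEncoder(nn.Module):
    """One small MLP per embodiment; proprio vectors are padded to a shared max length."""

    def __init__(self, max_state_dim=64, hidden_dim=256, d_model=1024, num_embodiments=4):
        super().__init__()
        self.max_state_dim = max_state_dim
        self.mlps = nn.ModuleList(
            nn.Sequential(
                nn.Linear(max_state_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, d_model),
            )
            for _ in range(num_embodiments)
        )

    def forward(self, proprio, embodiment_id):
        # Pad the (possibly shorter) proprio vector to the configured max length.
        pad = self.max_state_dim - proprio.shape[-1]
        proprio = F.pad(proprio, (0, pad))
        # Select the MLP for this embodiment (assumes one embodiment per batch).
        return self.mlps[embodiment_id](proprio)

# Example: a 23-DoF state from embodiment 2, batch of 1.
encoder = ProprioEncoder()
state_embedding = encoder(torch.randn(1, 23), embodiment_id=2)  # shape [1, 1024]
```
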
## Input:
**Input Type(s):**
* Vision: Image frames
* State: Robot proprioception
* Language Instruction: Text
* Embodiment ID: Integer

**Input Format(s):**
* Vision: Variable number of 224x224 uint8 image frames, coming from robot cameras
* State: Floating point
* Language Instruction: String
* Embodiment ID: Integer indicating which of the training embodiments is observed

**Input Parameters:**
* Vision: 2D - RGB image, square
* State: 1D - Floating-point vector
* Language Instruction: 1D - String
* Embodiment ID: 1D - Integer

## Output:
**Output Type(s):** Actions
**Output Format:** Continuous-valued vectors that correspond to different motor controls on a robot.

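To make the input/output contract above concrete, here is a hypothetical example of assembling one observation and reading back the resulting action chunk. The `DummyPolicy` stand-in and its `get_action` method are placeholders for whatever inference wrapper you use (see the GitHub page); only the field shapes follow the tables above.

```python
import numpy as np

class DummyPolicy:
    """Stand-in for the actual inference wrapper from the Isaac-GR00T repository."""
    def get_action(self, obs):
        # Returns a hypothetical 16-step chunk of 32-dimensional actions.
        return np.zeros((16, 32), dtype=np.float32)

policy = DummyPolicy()

# One observation, following the input spec above (shapes are illustrative).
observation = {
    "video": np.random.randint(0, 256, size=(2, 224, 224, 3), dtype=np.uint8),  # 2 camera views
    "state": np.zeros(23, dtype=np.float32),  # robot proprioception vector
    "language_instruction": "pick up the apple and place it in the basket",
    "embodiment_id": 2,  # which training embodiment is observed
}

# Output: continuous-valued action vectors, one row per control step,
# with each dimension mapping to a motor control on the robot.
action_chunk = policy.get_action(observation)
print(action_chunk.shape)  # e.g. (16, action_dim)
```
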
## Software Integration:
**Runtime Engine(s):** PyTorch

**Supported Hardware Microarchitecture Compatibility:**
All of the below:
* NVIDIA Ampere
* NVIDIA Blackwell
* NVIDIA Jetson
* NVIDIA Hopper
* NVIDIA Lovelace

**Preferred/Supported Operating System(s):**
* Linux

## Model Version(s):
This is the initial version of the model, version 1.0.

# Training and Evaluation Datasets:

## Training Dataset:
GR00T Pretraining Data
**Link:** <Dataset link>
**Data Collection Method by dataset:** Hybrid: Human, Synthetic
**Labeling Method by dataset:** Hybrid: Human, Automated
**Properties:**
* Cross-embodiment: Data collected on various robot embodiments
* Sensor types: RGB camera, robot proprioception, robot actuator data
**Dataset License(s):** Release Legal Tracker (GR00T-N1)

## Evaluation:
We evaluate on both simulation and real-robot benchmarks, as defined in the [Whitepaper](https://arxiv.org/abs/2503.14734).

Sim evaluation benchmarks for upper-body control (nspect NSPECT-5WZF-67VI):
* 9 DexMG tasks ([Whitepaper](https://arxiv.org/abs/2503.14734))
* 24 RoboCasa simulated mobile manipulator tasks
* 24 Digital Cousin simulated GR-1 humanoid manipulation tasks

For sim, we automatically measure the success rate for each manipulation behavior.

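For instance, per-task success rates over simulated rollouts can be aggregated as in the following sketch; the task names and results are illustrative, not measured numbers.

```python
from collections import defaultdict

# Each rollout records which task it exercised and whether it succeeded.
rollouts = [
    {"task": "robocasa_open_drawer", "success": True},
    {"task": "robocasa_open_drawer", "success": False},
    {"task": "dexmg_pouring", "success": True},
]

totals, successes = defaultdict(int), defaultdict(int)
for r in rollouts:
    totals[r["task"]] += 1
    successes[r["task"]] += int(r["success"])

for task in totals:
    print(f"{task}: {successes[task] / totals[task]:.2%} success rate")
```
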
For real robot (nspect NSPECT-IDAT-9M9L):
* Grocery packing task
* Industrial multi-robot coordination with handoffs
* Evaluated by human observers in the lab

## Inference:
**Engine:** PyTorch
**Test Hardware:** A6000

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Model Limitations:

This model is not tested or intended for use in mission-critical applications that require functional safety. The use of the model in those applications is at the user's own risk and sole responsibility, including taking the necessary steps to add needed guardrails or safety mechanisms.

Risk: Model underperformance in highly dynamic environments.
Mitigation: Enhance the dataset with dynamic obstacle scenarios and fine-tune models accordingly.

Risk: Integration challenges in specific customer environments.
Mitigation: Provide detailed integration guides and support, leveraging NVIDIA's ecosystem.

Risk: Limited initial support for certain robot embodiments.
Mitigation: Expand testing and validation across a wider range of robot platforms.