zhilinw committed · verified · Commit 01225db · Parent(s): 78e5850

Update README.md

Files changed (1): README.md (+65 −28)
README.md CHANGED
@@ -35,12 +35,41 @@ library_name: transformers
 
 Llama-3.3-Nemotron-Super-49B-GenRM-Multilingual is a generative reward model that leverages Llama-3.3-Nemotron-Super-49B-v1 as the foundation and is fine-tuned using Reinforcement Learning to predict the quality of LLM-generated responses.
 
 See details on how this model was trained at [https://arxiv.org/abs/2505.11475](https://arxiv.org/abs/2505.11475)
 
 ## License/Terms of Use:
 
 GOVERNING TERMS: Use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Additional Information: [Llama 3.3 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama.
 
 ## RM-Bench LeaderBoard
 
 As of 15 May 2025, our reward models trained with HelpSteer3-Preference are the top-performing Bradley-Terry reward models on [RM-Bench](https://arxiv.org/abs/2410.16184), an improved variant of RewardBench for evaluating Reward Models in Chat, Math, Code and Safety. Our GenRMs also outperform the corresponding Bradley-Terry reward models.
@@ -78,34 +107,11 @@ As of 15 May 2025, our reward models trained with HelpSteer3-Preference are the
 *Note that Skywork-Reward-Gemma-2-27B was the best performing reward model reported on JudgeBench and we evaluated all other numbers.*
 
- ## Use Case:
-
- Llama-3.3-Nemotron-Super-49B-GenRM-Multilingual can be used to judge the quality of one response, or the ranking between two responses given a multilingual conversation history. It will first generate reasoning traces then output an integer score.
-
- ## Release Date:
-
- 05/30/2025
-
- ## Referencess:
-
- * [HelpSteer3-Preference](https://arxiv.org/abs/2505.11475)
- * [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
- * [SteerLM method](https://arxiv.org/abs/2310.05344)
- * [HelpSteer](https://arxiv.org/abs/2311.09528)
- * [HelpSteer2](https://arxiv.org/abs/2406.08673)
- * [Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)
- * [The future of AI: Built with Llama](https://ai.meta.com/blog/future-of-ai-built-with-llama/)
- * [Meta's Llama 3.3 Webpage](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3)
- * [Meta's Llama 3.3 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md)
-
 ## Model Architecture:
 **Architecture Type:** Transformer <br>
 **Network Architecture:** Llama-3.3-Nemotron-Super-49B-v1 <br>
 
- We developed this model using Llama-3.3-Nemotron-Super-49B-v1 as its foundation. This model contains 49 billion parameters.
 
 ## Input:
 **Input Type(s):** Text <br>
@@ -119,19 +125,21 @@ We developed this model using Llama-3.3-Nemotron-Super-49B-v1 as its foundation.
 **Output Parameters:** One-Dimensional (1D) <br>
 **Other Properties Related to Output:** The output contains a reasoning trace and a final score. <br>
 
 ## Software Integration:
 **Runtime Engine(s):** <br>
 * vLLM 0.8.3 <br>
 
 **Supported Hardware Microarchitecture Compatibility:** <br>
 * NVIDIA Ampere <br>
- * NVIDIA Hopper
 
 **Supported Operating System(s):** Linux <br>
 
 ## Quick Start
 
- We recommend serving the model with vLLM. You can use the model with 2 or more 80GB GPUs (NVIDIA Ampere or newer) with at least 100GB of free disk space to accomodate the download.
 
 ```
 pip install vllm==0.8.3
@@ -221,7 +229,7 @@ Response 2 states "1+2=3", which is accurate and directly addresses the user's q
 ```
 Note that the conversation history should be presented in "user" and "assistant" roles, where the last turn is a user turn. The responses to be judged should be in "response_1" (and "response_2") roles.
 
- ### Intepretation of Scores
 When judging one response, the model will generate a helpfulness score from 1 to 5, where higher is better.
 
 When judging two responses, the model will generate an individual helpfulness score for each response, then a ranking score. The ranking score is a number between 1 and 6, where:
@@ -243,7 +251,7 @@ For details, please see Appendix J in our [paper](https://arxiv.org/abs/2505.114
 ## Model Version:
 v1.0
 
- # Training and Testing Datasets:
 
 ## Training Datasets:
@@ -273,6 +281,33 @@ v1.0
 **Properties:** <br>
 * 403 prompts, each with a pair of responses as well as human preferences between the pair of responses.
 
 # Inference:
 **Engine:** vLLM <br>
@@ -281,6 +316,8 @@ v1.0
 ## Ethical Considerations:
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 
 Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
 
 ## Citation
 
 Llama-3.3-Nemotron-Super-49B-GenRM-Multilingual is a generative reward model that leverages Llama-3.3-Nemotron-Super-49B-v1 as the foundation and is fine-tuned using Reinforcement Learning to predict the quality of LLM-generated responses.
 
+ Llama-3.3-Nemotron-Super-49B-GenRM-Multilingual can be used to judge the quality of one response, or the ranking between two responses, given a multilingual conversation history. It will first generate reasoning traces and then output an integer score. A higher score means the response is of higher quality.
+
 See details on how this model was trained at [https://arxiv.org/abs/2505.11475](https://arxiv.org/abs/2505.11475)
 
+ This model is ready for commercial/non-commercial use.
+
 ## License/Terms of Use:
 
 GOVERNING TERMS: Use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Additional Information: [Llama 3.3 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama.
 
+ ### Deployment Geography
+
+ Global
+
+ ## Use Case:
+
+ Llama-3.3-Nemotron-Super-49B-GenRM-Multilingual can be used to judge the quality of one response, or the ranking between two responses, given a multilingual conversation history. It will first generate reasoning traces and then output an integer score.
+
+ ## Release Date:
+
+ HuggingFace 06/27/2025 via https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-GenRM-Multilingual
+
+ ## References:
+
+ * [HelpSteer3-Preference](https://arxiv.org/abs/2505.11475)
+ * [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
+ * [SteerLM method](https://arxiv.org/abs/2310.05344)
+ * [HelpSteer](https://arxiv.org/abs/2311.09528)
+ * [HelpSteer2](https://arxiv.org/abs/2406.08673)
+ * [Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)
+ * [The future of AI: Built with Llama](https://ai.meta.com/blog/future-of-ai-built-with-llama/)
+ * [Meta's Llama 3.3 Webpage](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3)
+ * [Meta's Llama 3.3 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md)
+
 ## RM-Bench LeaderBoard
 
 As of 15 May 2025, our reward models trained with HelpSteer3-Preference are the top-performing Bradley-Terry reward models on [RM-Bench](https://arxiv.org/abs/2410.16184), an improved variant of RewardBench for evaluating Reward Models in Chat, Math, Code and Safety. Our GenRMs also outperform the corresponding Bradley-Terry reward models.
 
 *Note that Skywork-Reward-Gemma-2-27B was the best performing reward model reported on JudgeBench and we evaluated all other numbers.*
 
 ## Model Architecture:
 **Architecture Type:** Transformer <br>
 **Network Architecture:** Llama-3.3-Nemotron-Super-49B-v1 <br>
 
+ We developed this model using [Llama-3.3-Nemotron-Super-49B-v1](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1) as its foundation. This model contains 49 billion parameters.
 
 ## Input:
 **Input Type(s):** Text <br>
 
 **Output Parameters:** One-Dimensional (1D) <br>
 **Other Properties Related to Output:** The output contains a reasoning trace and a final score. <br>
 
+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
+
 ## Software Integration:
 **Runtime Engine(s):** <br>
 * vLLM 0.8.3 <br>
 
 **Supported Hardware Microarchitecture Compatibility:** <br>
 * NVIDIA Ampere <br>
+ * NVIDIA Hopper <br>
 
 **Supported Operating System(s):** Linux <br>
 
 ## Quick Start
 
+ We recommend serving the model with vLLM. You can use the model with 2 or more 80GB GPUs (NVIDIA Ampere or newer) with at least 100GB of free disk space to accommodate the download.
 
 ```
 pip install vllm==0.8.3
 
 ```
 Note that the conversation history should be presented in "user" and "assistant" roles, where the last turn is a user turn. The responses to be judged should be in "response_1" (and "response_2") roles.
 
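As a concrete illustration of that message layout, here is a minimal sketch. The helper name and the example conversation are hypothetical (only the role names come from the model card), and the exact payload a given vLLM deployment accepts should be checked against the served chat template.

```python
# Hypothetical helper: assemble the judging payload from a multilingual
# conversation plus one or two candidate responses. The conversation uses
# "user"/"assistant" roles and must end on a user turn; the candidates are
# appended under the "response_1"/"response_2" roles.
def build_judge_messages(conversation, response_1, response_2=None):
    if conversation[-1]["role"] != "user":
        raise ValueError("conversation must end with a user turn")
    messages = list(conversation)
    messages.append({"role": "response_1", "content": response_1})
    if response_2 is not None:
        # Pairwise judging: the model reasons about both candidates,
        # then emits a ranking score.
        messages.append({"role": "response_2", "content": response_2})
    return messages

# Pairwise example, reusing the "1+2" prompt from the model card.
msgs = build_judge_messages(
    [{"role": "user", "content": "What is 1+2?"}],
    response_1="1+2=4",
    response_2="1+2=3",
)
```

Omitting `response_2` produces the single-response form, in which the model scores one candidate instead of ranking a pair.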
+ ### Interpretation of Scores
 When judging one response, the model will generate a helpfulness score from 1 to 5, where higher is better.
 
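Because the generation contains a free-form reasoning trace before the final score, downstream code needs a small post-processing step. A minimal sketch, under the assumption that the final score is the last integer in the generated text; the model's actual output format should be verified against real generations before relying on this.

```python
import re


def extract_final_score(generation: str) -> int:
    """Return the last integer in the generation, assumed to be the final score."""
    matches = re.findall(r"-?\d+", generation)
    if not matches:
        raise ValueError("no score found in generation")
    return int(matches[-1])


trace = "The response is accurate and directly answers the question. Helpfulness: 4"
score = extract_final_score(trace)  # 4, under the last-integer assumption
```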
 When judging two responses, the model will generate an individual helpfulness score for each response, then a ranking score. The ranking score is a number between 1 and 6, where:
 
 ## Model Version:
 v1.0
 
+ # Training, Testing and Evaluation Datasets:
 
 ## Training Datasets:
 
 **Properties:** <br>
 * 403 prompts, each with a pair of responses as well as human preferences between the pair of responses.
 
+ ## Evaluation Datasets
+
+ **Dataset Name:** RM-Bench <br>
+ **Dataset Link:** https://huggingface.co/datasets/THU-KEG/RM-Bench
+
+ **Data Collection Method by dataset** <br>
+ * [Hybrid: Human, Synthetic] <br>
+
+ **Labeling Method by dataset** <br>
+ * [Hybrid: Human, Synthetic] <br>
+
+ **Properties:** <br>
+ * 1,327 prompts, each with three pairs of responses as well as preferences within each pair of responses.
+
+ **Dataset Name:** JudgeBench <br>
+ **Dataset Link:** https://huggingface.co/datasets/ScalerLab/JudgeBench
+
+ **Data Collection Method by dataset** <br>
+ * [Hybrid: Human, Synthetic] <br>
+
+ **Labeling Method by dataset** <br>
+ * [Hybrid: Human, Synthetic] <br>
+
+ **Properties:** <br>
+ * 350 prompts, each with a pair of responses as well as preferences between the pair of responses.
+
 # Inference:
 **Engine:** vLLM <br>
 
 ## Ethical Considerations:
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+ For more detailed information on ethical considerations for this model, please see the Model Card++ [Explainability](explainability.md), [Bias](bias.md), [Safety & Security](safety.md), and [Privacy](privacy.md) Subcards.
+
 Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
 
 ## Citation