Update README.md
Browse files
README.md
CHANGED
@@ -221,7 +221,7 @@ We follow the jinja chat template provided below. This template conditionally ad
|
|
221 |
|
222 |
## Training, Testing, and Evaluation Datasets
|
223 |
|
224 |
-
The post-training corpus for Nemotron-H-47B-Reasoning-128K consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English). Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering, and alignment style data to improve model accuracies. For several of the domains listed above we used syntheitc data, specifically reasoning traces, from R1.
|
225 |
|
226 |
**Data Collection for Training & Testing Datasets:** Hybrid: Automated, Human, Synthetic
|
227 |
|
@@ -251,7 +251,7 @@ We evaluated our BF16 and FP8 models in **Reasoning-On** mode against [Llama-Nem
|
|
251 |
|
252 |
The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
|
253 |
|
254 |
-
The model demonstrates weakness to indirect prompt injection via some encodings, including Base16, Hex/ASCII, and Braille, though is more resilient than other similar models to injections using the more common Base64 vector.
|
255 |
|
256 |
## Inference
|
257 |
|
|
|
221 |
|
222 |
## Training, Testing, and Evaluation Datasets
|
223 |
|
224 |
+
The post-training corpus for Nemotron-H-47B-Reasoning-128K consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English). Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering, and alignment style data to improve model accuracies. For several of the domains listed above we used syntheitc data, specifically reasoning traces, from DeepSeek R1.
|
225 |
|
226 |
**Data Collection for Training & Testing Datasets:** Hybrid: Automated, Human, Synthetic
|
227 |
|
|
|
251 |
|
252 |
The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
|
253 |
|
254 |
+
The model demonstrates weakness to indirect prompt injection via some encodings, including Base16, Hex/ASCII, and Braille, though it is more resilient than other similar models to injections using the more common Base64 vector.
|
255 |
|
256 |
## Inference
|
257 |
|