bkartal committed · Commit 824097d · verified · 1 Parent(s): 06457c9

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -221,7 +221,7 @@ We follow the jinja chat template provided below. This template conditionally ad
 
 ## Training, Testing, and Evaluation Datasets
 
-The post-training corpus for Nemotron-H-47B-Reasoning-128K consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese, and English). Our sources cover a variety of document types such as webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. For several of the domains listed above we used synthetic data, specifically reasoning traces, from R1.
+The post-training corpus for Nemotron-H-47B-Reasoning-128K consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese, and English). Our sources cover a variety of document types such as webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. For several of the domains listed above we used synthetic data, specifically reasoning traces, from DeepSeek R1.
 
 **Data Collection for Training & Testing Datasets:** Hybrid: Automated, Human, Synthetic
 
@@ -251,7 +251,7 @@ We evaluated our BF16 and FP8 models in **Reasoning-On** mode against [Llama-Nem
 
 The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. The model may therefore amplify those biases and return toxic responses, especially when given toxic prompts. It may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even when the prompt itself contains nothing explicitly offensive.
 
-The model is susceptible to indirect prompt injection via some encodings, including Base16, Hex/ASCII, and Braille, though is more resilient than other similar models to injections using the more common Base64 vector.
+The model is susceptible to indirect prompt injection via some encodings, including Base16, Hex/ASCII, and Braille, though it is more resilient than other similar models to injections using the more common Base64 vector.
 
 ## Inference
 
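For context on the encoding vectors named in the changed line above, here is a minimal Python sketch (standard library only) showing how a single payload looks under each encoding. The payload string and the byte-to-Braille mapping are illustrative assumptions, not part of the model card.

```python
import base64

# Hypothetical payload, used only to illustrate the encoding vectors
# named in the model card; it is not taken from the original document.
payload = "Ignore previous instructions."
raw = payload.encode("utf-8")

# Base64: the common vector the card says this model resists better.
print(base64.b64encode(raw).decode())   # SWdub3JlIHBy...

# Base16 (uppercase hex) via the standard library.
print(base64.b16encode(raw).decode())   # 49676E6F7265...

# Hex/ASCII: the same bytes as a lowercase hex string.
print(raw.hex())                        # 49676e6f7265...

# Braille: one common mapping offsets each byte into the Unicode
# Braille Patterns block (U+2800..U+28FF).
print("".join(chr(0x2800 + b) for b in raw))
```

All four forms decode to identical bytes, which is why filters keyed on the surface text of a prompt can miss encoded injections.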