Text Classification
PEFT
Safetensors
English
prasoonv committed (verified)
Commit ef1f9de · 1 Parent(s): db1c750

Update model name and bibtex


Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>

Files changed (1)
  1. README.md +19 -10
README.md CHANGED
@@ -16,7 +16,7 @@ library_name: peft
 
  ## Description
 
- **Llama-3.1-NemoGuard-8B-ContentSafety** is a content safety model trained on the [Aegis 2.0 dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0/) that moderates human-LLM interaction content and classifies user prompts and LLM responses as safe or unsafe. If the content is unsafe, the model additionally returns a response with a list of categories that the content violates. The base large language model (LLM) is the multilingual [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model from Meta. NVIDIA’s optimized release is LoRa-tuned on approved datasets and better conforms NVIDIA’s content safety risk taxonomy and other safety risks in human-LLM interactions.
+ **Llama Nemotron Safety Guard V2**, formerly known as **Llama-3.1-NemoGuard-8B-ContentSafety**, is a content safety model trained on the [Nemotron Content Safety Dataset V2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0/) that moderates human-LLM interaction content and classifies user prompts and LLM responses as safe or unsafe. If the content is unsafe, the model additionally returns a response with a list of categories that the content violates. The base large language model (LLM) is the multilingual [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model from Meta. NVIDIA’s optimized release is LoRA-tuned on approved datasets and better conforms to NVIDIA’s content safety risk taxonomy while covering other safety risks in human-LLM interactions.
 
  The model can be prompted using an instruction and a taxonomy of unsafe risks to be categorized. The instruction format for prompt moderation is shown below under input and output examples.
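The description above covers what the model classifies and returns, but not how to invoke it. Below is a minimal usage sketch that loads the Llama-3.1-8B-Instruct base model, attaches the released LoRA adapter with `peft`, and runs one moderation call; the adapter repository ID and the placeholder prompt string are assumptions for illustration, not values taken from this card.

```python
# Minimal sketch: base model + LoRA adapter + one moderation call.
# ADAPTER_ID and the prompt text are placeholders, not values from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
ADAPTER_ID = "nvidia/llama-3.1-nemoguard-8b-content-safety"  # assumed adapter repo ID

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_ID)

# `safety_prompt` stands in for the full template described under "Prompt Format":
# the safety taxonomy, the conversation to moderate, and the task instruction.
safety_prompt = "..."  # placeholder for the assembled moderation prompt

inputs = tokenizer(safety_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```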
 
@@ -29,10 +29,19 @@ library_name: peft
  ## Reference(s):
  Related paper:
  ```
- @inproceedings{ghosh2024aegis2,
- title={AEGIS2. 0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails},
- author={Ghosh, Shaona and Varshney, Prasoon and Sreedhar, Makesh Narsimhan and Padmakumar, Aishwarya and Rebedea, Traian and Varghese, Jibin Rajan and Parisien, Christopher},
- booktitle={Neurips Safe Generative AI Workshop 2024}
+ @inproceedings{ghosh-etal-2025-aegis2,
+ title = "{AEGIS}2.0: A Diverse {AI} Safety Dataset and Risks Taxonomy for Alignment of {LLM} Guardrails",
+ author = "Ghosh, Shaona and Varshney, Prasoon and Sreedhar, Makesh Narsimhan and Padmakumar, Aishwarya and Rebedea, Traian and Varghese, Jibin Rajan and Parisien, Christopher",
+ editor = "Chiruzzo, Luis and Ritter, Alan and Wang, Lu",
+ booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
+ month = apr,
+ year = "2025",
+ address = "Albuquerque, New Mexico",
+ publisher = "Association for Computational Linguistics",
+ url = "https://aclanthology.org/2025.naacl-long.306/",
+ doi = "10.18653/v1/2025.naacl-long.306",
+ pages = "5992--6026",
+ ISBN = "979-8-89176-189-6"
  }
  ```
 
@@ -49,18 +58,18 @@ Related paper:
 
  **Training Method:**
 
- The training method for **Llama-3.1-NemoGuard-8B-ContentSafety** involves the following concepts:
+ The training method for **Llama Nemotron Safety Guard V2** involves the following concepts:
 
- - A system prompt, including the [Aegis 2.0 safety taxonomy](https://openreview.net/forum?id=yRkJtCOpAu&noteId=yRkJtCOpAu), which is a safety policy that contains a list of unsafe categories.
+ - A system prompt, including the [Nemotron Content Safety Dataset V2 Taxonomy](https://aclanthology.org/2025.naacl-long.306/), which is a safety policy that contains a list of unsafe categories.
  - Novel safety risk categories and policies can be provided in the instruction for the model to predict categories of violation if the content is unsafe.
  - The safety taxonomy and policy used to train the model contain 23 critically unsafe risk categories, a safe category, and a "needs caution" category.
- - An internally annotated dataset, called Aegis-AI-Content-Safety-Dataset-2.0, of approximately 30,000 prompts and responses are used to instruction-tune the model.
+ - An internally annotated dataset, called [Nemotron Content Safety Dataset V2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0/), of approximately 30,000 prompts and responses, is used to instruction-tune the model.
  - The model is instruction-tuned to follow either safety or topic-following system prompts, with the LLM behaving as a classifier in both settings.
  - The model can return labels for both the user and bot messages together in one inference call, if both exist in the payload. This is unlike previous models in this space, where the system prompt needs to instruct the LLM to moderate either the user turn or the LLM turn. See the section on output format for more information.
 
 
  ## Prompt Format:
- The prompt template consists of the Aegis 2.0 Taxonomy followed placeholders for either a user message alone, or a user message and a bot response, and finally an instruction with the task and output format.
+ The prompt template consists of the [Nemotron Content Safety Dataset V2 Taxonomy](https://aclanthology.org/2025.naacl-long.306/), followed by placeholders for either a user message alone or a user message and a bot response, and finally an instruction with the task and output format.
 
  ### Example Model Input/Output for prompt safety classification:
  Note that the <BEGIN CONVERSATION> and <END CONVERSATION> tags only contain the `user:` field for prompt classification.
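To make the template structure and the combined user/response labels concrete, here is a rough sketch of assembling a prompt and parsing the model's JSON-style output; the instruction wording, JSON field names, and the truncated category list are illustrative assumptions, not the exact strings shipped with the model (those appear in the README's examples).

```python
# Illustrative sketch: build a moderation prompt and parse the returned labels.
# The instruction wording, field names, and category list are assumptions; the
# authoritative template and output format are in the README's examples.
import json

TAXONOMY = "S1: Violence.\nS2: Sexual.\n..."  # placeholder for the 23-category policy text

def build_prompt(user_message: str, bot_response: str | None = None) -> str:
    # For prompt-only classification the conversation holds just the user turn;
    # when a bot response is present, both turns are classified in one call.
    conversation = f"user: {user_message}"
    if bot_response is not None:
        conversation += f"\n\nresponse: agent: {bot_response}"
    return (
        "Task: Check if there is unsafe content in the conversation below "
        "according to our safety policy.\n\n"
        f"{TAXONOMY}\n\n"
        f"<BEGIN CONVERSATION>\n{conversation}\n<END CONVERSATION>\n\n"
        "Provide your safety assessment as JSON."
    )

def parse_output(raw: str) -> dict:
    # One inference call can yield labels for both the user and bot messages.
    result = json.loads(raw)
    return {
        "user_safety": result.get("User Safety"),
        "response_safety": result.get("Response Safety"),        # absent for prompt-only input
        "violated_categories": result.get("Safety Categories"),  # only present when unsafe
    }
```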
@@ -172,7 +181,7 @@ Related paper:
  Due to the serious nature of this project, annotators were asked to join on a volunteer basis based on their skill level, availability, and willingness to expose themselves to potentially unsafe content. Before work on this project began, all participants were asked to sign an Adult Content Acknowledgement that coincides with the organization's existing Anti-Harassment Policy and Code of Conduct. This was to ensure that all annotators were made aware of the nature of this work, as well as the resources available to them should it affect their mental well-being. Regular 1:1 meetings were held between the leads assigned to this project and each annotator to make sure they were still comfortable with the material and capable of continuing with this type of work.
  Throughout the six-month time span of the Content Moderation Guardrails project, we averaged twelve annotators at any given time. Of these twelve, four annotators come from Engineering backgrounds specializing in data analysis and collection, gaming, and robotics. Eight annotators have a background in Creative Writing, with specialization in linguistics, research and development, and other creative arts such as photography and
  film. All annotators have been extensively trained in working with Large Language Models (LLMs), as well as other variations of Generative AI such as image retrieval or evaluations of multi-turn conversations. All are capable of generating creative text-based output and categorization work. Each of these twelve annotators resides in the United States, all from various ethnic and religious backgrounds that allow for representation across race, age, and social status.
- The process in which the Aegis-AI-Content-Safety-Dataset-2.0 creation abides by ethical data categorization work is based within the tooling of [Label Studio](http://label-studio.nvidia.com/), an open source data labeling tool
+ The process by which the Nemotron Content Safety Dataset V2 was created abides by ethical data categorization practices and is based on the tooling of [Label Studio](http://label-studio.nvidia.com/), an open-source data labeling tool
  often used for the organization's internal projects. This tooling technology allows large sets of data to be analyzed by individual annotators without seeing the work of their peers. This is essential in preventing bias between annotators, as well as delivering prompts to each individual with variability so that no one annotator is completing similar tasks based on how the data was initially arranged.
  For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
 
 