GGUF support
Hello, model looks very promising!
I want to try it locally via llama.cpp/ollama, will the model be available in GGUF format?
Thank you.
Always the same bulls*** .... nerds get top priority, but the average person who uses GGUF comes second... sigh
I pushed a safetensors fp8 you can run on a 3090 for now.
Working on llama.cpp support today, which is required to even get a GGUF. Nemotron-H is a new hybrid architecture.
It's not some trivial thing. It's a 57-layer hybrid state space model interwoven with transformer MLP layers.
Thank you for your interest and your support!
There's ongoing discussion and work on Nemotron-H support for GGUF/llama.cpp. Please join the discussion and effort. Thank you!
https://github.com/ggml-org/llama.cpp/issues/15409
I have it working up to text gen.
Everything else is done up to token generation.
I'll push the code up sometime today.
Hi all! I've been working alongside @weathermanj on the llama.cpp support, and it's now fully working: https://github.com/ggml-org/llama.cpp/pull/15507
One NOTE: there may be one more change to the architecture string name on my branch (nemotronh -> nemotron_h), so GGUF files generated using my branch may be invalid after this change.
NVIDIA Nemotron-Nano-9B-v2 in GGUF format is now available at: https://huggingface.co/dominguesm/NVIDIA-Nemotron-Nano-9B-v2-GGUF
Nice. Mine are going up now as well!
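For anyone who just wants to try one of these uploads, here is a minimal sketch of pulling a single quant and smoke-testing it with llama-cli. The Q8_0 pattern, local paths, and prompt are only illustrative, and the llama-cli path assumes a local llama.cpp build:

# download one quant from the GGUF repo linked above
huggingface-cli download dominguesm/NVIDIA-Nemotron-Nano-9B-v2-GGUF \
  --include "*Q8_0*.gguf" --local-dir ./nemotron-gguf

# quick generation test; -ngl 99 offloads all layers to the GPU, lower it if VRAM is tight
./llama.cpp/build/bin/llama-cli -m ./nemotron-gguf/*Q8_0*.gguf \
  -c 8192 -ngl 99 -p "What is the capital of France?"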
As of release b6315, the nemotron_h architecture is officially supported in llama.cpp. It will take some time for it to roll out to other inference platforms such as LM Studio, Ollama, and Docker Model Runner, but it should be picked up by those platforms once they bump their llama.cpp dependency version.
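If you'd rather produce a GGUF yourself instead of waiting for those platforms to update, the usual llama.cpp flow looks roughly like this. The local checkpoint path, the f16 intermediate, and the Q8_0 target are assumptions for illustration, not the exact steps any of the uploaders used:

git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build && cmake --build llama.cpp/build --config Release -j

# convert a locally downloaded Hugging Face checkpoint to an f16 GGUF
pip install -r llama.cpp/requirements.txt
python llama.cpp/convert_hf_to_gguf.py /path/to/NVIDIA-Nemotron-Nano-9B-v2 \
  --outtype f16 --outfile nemotron-nano-9b-v2-f16.gguf

# quantize to Q8_0 (any other supported quant type works the same way)
./llama.cpp/build/bin/llama-quantize \
  nemotron-nano-9b-v2-f16.gguf nemotron-nano-9b-v2-Q8_0.gguf Q8_0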
Anyone having an issue with the <think> and/or </think> tags not being output? I'm using the official template, and I can't spot anything in it that would cause the issue:
{%- if add_generation_prompt -%}
  {{- "<SPECIAL_11>Assistant\n" -}}
  {%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
    {{- "<think></think>" -}}
  {%- else -%}
    {{- "<think>\n" -}}
  {%- endif -%}
  {%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != "" -%}
    {{- ns.last_turn_assistant_content -}}
  {%- endif -%}
{%- elif ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != "" -%}
  {{- "<SPECIAL_11>Assistant\n" -}}
  {%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
    {{- "<think></think>" -}}
  {%- else -%}
    {{- "<think>\n" -}}
  {%- endif -%}
  {{- ns.last_turn_assistant_content -}}
  {%- if continue_final_message is defined -%}
    {%- if continue_final_message is false -%}
      {{- "\n<SPECIAL_12>\n" -}}
    {%- endif -%}
  {%- else -%}
    {{- "\n<SPECIAL_12>\n" -}}
  {%- endif -%}
{%- endif -%}
- When using /think as the system prompt or --chat-template-kwargs '{"enable_thinking":true}' as a llama-server option:
  - reasoning behavior is correctly triggered
  - the model correctly ends its generation with </think> as the last token
  - but the opening <think> tag isn't output
- When using /no_think as the system prompt or --chat-template-kwargs '{"enable_thinking":false}' as a llama-server option:
  - non-reasoning behavior is correctly triggered
  - but <think></think> is not prepended to the output
My llama.cpp is freshly built from master, and here is the command I use to serve it:
/home/user/llama.cpp/build/bin/llama-server \
  --model /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q8_0.gguf \
  --ctx-size 131000 \
  --no-context-shift \
  --n-gpu-layers 57 \
  --temp 0.6 \
  --top-p 0.95 \
  --jinja \
  --host 0.0.0.0 \
  --port ${PORT} \
  --flash-attn \
  --chat-template-file /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/template.jinja \
  --chat-template-kwargs '{"enable_thinking":true}'
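If you want to see exactly what prompt the server builds from that template before any generation happens, here is a debugging sketch; it assumes your llama-server build is recent enough to expose the /apply-template endpoint, which only renders the chat template and returns the resulting prompt string:

curl -s -X POST http://127.0.0.1:${PORT}/apply-template \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "/think" },
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
# with thinking enabled, the returned "prompt" should end with
# "<SPECIAL_11>Assistant\n<think>\n", matching the playground output below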
However, when I play with https://huggingface.co/spaces/huggingfacejs/chat-template-playground?modelId=nvidia/NVIDIA-Nemotron-Nano-9B-v2 , the template produces the correct output:
reasoning input
{
messages: [
{
role: 'system',
content: '/think',
},
{
role: 'user',
content: 'What is the capital of France?',
}
],
add_generation_prompt: true,
}
reasoning output
<SPECIAL_10>System
<SPECIAL_11>User
What is the capital of France?
<SPECIAL_11>Assistant
<think>
non-reasoning input
{
messages: [
{
role: 'system',
content: '/no_think',
},
{
role: 'user',
content: 'What is the capital of France?',
}
],
add_generation_prompt: true,
}
non-reasoning output
<SPECIAL_10>System
<SPECIAL_11>User
What is the capital of France?
<SPECIAL_11>Assistant
<think></think>
I just tried b6315 and b6318 to check, just in case: same result.
Interesting, I haven't played with the chat template extensively, nor have I gone deep on llama.cpp's use of minja to implement jinja2. I know there are some subtle differences between what is supported with minja vs. the Python jinja2 implementation, so it's possible that there are some rendering differences. The first thing I'd try to debug would be to do the rendering client-side and see if the raw /completions output includes the output you expect.
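A sketch of that client-side check: take the prompt the playground renders (shown above) and send it straight to the raw /completion endpoint, so no server-side templating is involved at all. The port and sampling values below are just examples. One thing worth noting from the template above: the opening <think> is emitted as part of the prompt (the generation prompt ends with it), not by the model, so it would not appear in the returned completion text in the first place, which may be why it looks "missing" in the chat output.

curl -s -X POST http://127.0.0.1:8678/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<SPECIAL_10>System\n<SPECIAL_11>User\nWhat is the capital of France?\n<SPECIAL_11>Assistant\n<think>\n",
    "n_predict": 256,
    "temperature": 0.6,
    "top_p": 0.95
  }'
# if "content" starts with reasoning text and closes with </think>, the model side
# is behaving; the opening <think> was simply part of the prompt, not the output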
I also did not mess with the chat template at all; I left it as it was. I would recommend checking the completions output as well. When I get a second I will take a peek. I've been working on a training pipeline for nemotron_h.
You've done enough @gabegoodhart ! I'll figure out how to try what you suggested, and report my findings. Thanks for all your efforts! And same to @weathermanj !
The raw completion from v1/chat/completions already misses the tokens - same exact behavior:

/think --> no opening <think>, but the model "reasons" and outputs the closing </think>
~/l/b/bin ❯❯❯ curl -X POST http://127.0.0.1:8678/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "/think" },
{ "role": "user", "content": "Hello" }
],
"add_generation_prompt": true
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Okay, the user just said \"Hello\". That's a greeting. I should respond politely. Let me make sure to acknowledge their greeting and offer help. Maybe say something like \"Hello! How can I assist you today?\" That's friendly and opens the door for them to ask questions.\n</think>\n\nHello! How can I assist you today? π\n"}}],"created":1756499399,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6319-792b44f2","object":"chat.completion","usage":{"completion_tokens":76,"prompt_tokens":15,"total_tokens":91},"id":"chatcmpl-KtFzvtnW1be9O4RjfzFKiMyd43CVbUkm","timings":{"prompt_n":15,"prompt_ms":27.912,"prompt_per_token_ms":1.8608,"prompt_per_second":537.4032674118658,"predicted_n":76,"predicted_ms":955.159,"predicted_per_token_ms":12.567881578947368,"predicted_per_second":79.56790440125677}}
/no_think --> no <think></think> prepended, but the model answers immediately as expected
~/l/b/bin ❯❯❯ curl -X POST http://127.0.0.1:8678/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "/no_think" },
{ "role": "user", "content": "Hello" }
],
"add_generation_prompt": true
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today? π\n"}}],"created":1756499420,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6319-792b44f2","object":"chat.completion","usage":{"completion_tokens":14,"prompt_tokens":17,"total_tokens":31},"id":"chatcmpl-rAVCr1OTTh2Q5bmlQrjZJeNo5lOFuGK6","timings":{"prompt_n":17,"prompt_ms":35.473,"prompt_per_token_ms":2.0866470588235293,"prompt_per_second":479.2377301045866,"predicted_n":14,"predicted_ms":165.365,"predicted_per_token_ms":11.811785714285715,"predicted_per_second":84.66120400326551}}
I tried replacing the {{- and -}} with {{ and }} in the template, in case minja was not liking them, but same behavior.
I'm sorry, I'm not sure what I could try next. I still have many things to learn... I'd like to try parsing the template using minja directly, but I have absolutely zero C++ experience, so I feel a bit out of my depth here.
I can likely start helping tomorrow morning. I was running into weird chat template issues with vLLM 4.4 running my fp8 version. I just finished this up and it seems to train - just a basic LoRA pipeline for now. Might work on adding it to Unsloth. https://github.com/jwjohns/nvidia-nemotron-h-training
Continuing there.