GGUF support
Hello, model looks very promising!
I want to try it locally via llama.cpp/ollama, will the model be available in GGUF format?
Thank you.
Always the same bulls*** .... nerds get top priority, but the average person who uses GGUF comes second... sigh
I pushed a safetensors fp8 you can run on a 3090 for now.
Working on llama.cpp support today, which is required to even get a GGUF. Nemotron-H is a new hybrid architecture.
It's not some trivial thing. It's a 57-layer hybrid state space model interwoven with transformer MLP layers.
Thank you for your interest and your support!
There's ongoing discussion and work on Nemotron-H support for GGUF/llama.cpp. Please join the discussion and effort. Thank you!
https://github.com/ggml-org/llama.cpp/issues/15409
I have it working up to text gen.
Everything else is done up to token generation.
I'll push the code up sometime today.
Hi all! I've been working alongside @weathermanj on the llama.cpp support, and it's now fully working: https://github.com/ggml-org/llama.cpp/pull/15507
One NOTE: there may be one more change to the architecture string name on my branch (nemotronh -> nemotron_h), so GGUF files generated using my branch may be invalid after this change.
NVIDIA Nemotron-Nano-9B-v2 in GGUF format is now available at: https://huggingface.co/dominguesm/NVIDIA-Nemotron-Nano-9B-v2-GGUF
Nice. Mine are going up now as well!
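For anyone who just wants to try one of these uploads, here is a minimal sketch of pulling a single quant and smoke-testing it with llama-cli. The Q8_0 pattern, local paths, and prompt are only illustrative, and the llama-cli path assumes a local llama.cpp build:

# download one quant from the GGUF repo linked above
huggingface-cli download dominguesm/NVIDIA-Nemotron-Nano-9B-v2-GGUF \
  --include "*Q8_0*.gguf" --local-dir ./nemotron-gguf

# quick generation test; -ngl 99 offloads all layers to the GPU, lower it if VRAM is tight
./llama.cpp/build/bin/llama-cli -m ./nemotron-gguf/*Q8_0*.gguf \
  -c 8192 -ngl 99 -p "What is the capital of France?"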
As of release b6315, the nemotron_h architecture is officially supported in llama.cpp. It will take some time for it to roll out to other inference platforms such as LM Studio, Ollama, and Docker Model Runner, but it should be picked up by those platforms once they bump their llama.cpp dependency version.
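If you'd rather produce a GGUF yourself instead of waiting for those platforms to update, the usual llama.cpp flow looks roughly like this. The local checkpoint path, the f16 intermediate, and the Q8_0 target are assumptions for illustration, not the exact steps any of the uploaders used:

git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build && cmake --build llama.cpp/build --config Release -j

# convert a locally downloaded Hugging Face checkpoint to an f16 GGUF
pip install -r llama.cpp/requirements.txt
python llama.cpp/convert_hf_to_gguf.py /path/to/NVIDIA-Nemotron-Nano-9B-v2 \
  --outtype f16 --outfile nemotron-nano-9b-v2-f16.gguf

# quantize to Q8_0 (any other supported quant type works the same way)
./llama.cpp/build/bin/llama-quantize \
  nemotron-nano-9b-v2-f16.gguf nemotron-nano-9b-v2-Q8_0.gguf Q8_0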
Anyone having an issue with the <think> and/or </think> tags not being output? I'm using the official template, and I can't spot anything in it that would cause the issue:
{%- if add_generation_prompt -%}
  {{- "<SPECIAL_11>Assistant\n" -}}
  {%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
    {{- "<think></think>" -}}
  {%- else -%}
    {{- "<think>\n" -}}
  {%- endif -%}
  {%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != "" -%}
    {{- ns.last_turn_assistant_content -}}
  {%- endif -%}
{%- elif ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != "" -%}
  {{- "<SPECIAL_11>Assistant\n" -}}
  {%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
    {{- "<think></think>" -}}
  {%- else -%}
    {{- "<think>\n" -}}
  {%- endif -%}
  {{- ns.last_turn_assistant_content -}}
  {%- if continue_final_message is defined -%}
    {%- if continue_final_message is false -%}
      {{- "\n<SPECIAL_12>\n" -}}
    {%- endif -%}
  {%- else -%}
    {{- "\n<SPECIAL_12>\n" -}}
  {%- endif -%}
{%- endif -%}
- When using /think as the system prompt or --chat-template-kwargs '{"enable_thinking":true}' as a llama-server option:
  - reasoning behavior is correctly triggered
  - the model correctly ends its generation with </think> as the last token
  - but the opening <think> tag isn't output
- When using /no_think as the system prompt or --chat-template-kwargs '{"enable_thinking":false}' as a llama-server option:
  - non-reasoning behavior is correctly triggered
  - but <think></think> is not prepended to the output
My llama.cpp is freshly built from master, and here is the command I use to serve it:
/home/user/llama.cpp/build/bin/llama-server \
  --model /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q8_0.gguf \
  --ctx-size 131000 \
  --no-context-shift \
  --n-gpu-layers 57 \
  --temp 0.6 \
  --top-p 0.95 \
  --jinja \
  --host 0.0.0.0 \
  --port ${PORT} \
  --flash-attn \
  --chat-template-file /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/template.jinja \
  --chat-template-kwargs '{"enable_thinking":true}'
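If you want to see exactly what prompt the server builds from that template before any generation happens, here is a debugging sketch; it assumes your llama-server build is recent enough to expose the /apply-template endpoint, which only renders the chat template and returns the resulting prompt string:

curl -s -X POST http://127.0.0.1:${PORT}/apply-template \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "/think" },
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
# with thinking enabled, the returned "prompt" should end with
# "<SPECIAL_11>Assistant\n<think>\n", matching the playground output below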
However, when I play with https://huggingface.co/spaces/huggingfacejs/chat-template-playground?modelId=nvidia/NVIDIA-Nemotron-Nano-9B-v2 , the template produces the correct output:
reasoning input
{
messages: [
{
role: 'system',
content: '/think',
},
{
role: 'user',
content: 'What is the capital of France?',
}
],
add_generation_prompt: true,
}
reasoning output
<SPECIAL_10>System
<SPECIAL_11>User
What is the capital of France?
<SPECIAL_11>Assistant
<think>
non-reasoning input
{
messages: [
{
role: 'system',
content: '/no_think',
},
{
role: 'user',
content: 'What is the capital of France?',
}
],
add_generation_prompt: true,
}
non-reasoning output
<SPECIAL_10>System
<SPECIAL_11>User
What is the capital of France?
<SPECIAL_11>Assistant
<think></think>
I just tried b6315 and b6318 to check, just in case: same result.
Interesting, I haven't played with the chat template extensively, nor have I gone deep on llama.cpp's use of minja to implement jinja2. I know there are some subtle differences between what is supported with minja vs. the Python jinja2 implementation, so it's possible that there are some rendering differences. The first thing I'd try to debug would be to do the rendering client-side and see if the raw /completions output includes the output you expect.
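A sketch of that client-side check: take the prompt the playground renders (shown above) and send it straight to the raw /completion endpoint, so no server-side templating is involved at all. The port and sampling values below are just examples. One thing worth noting from the template above: the opening <think> is emitted as part of the prompt (the generation prompt ends with it), not by the model, so it would not appear in the returned completion text in the first place, which may be why it looks "missing" in the chat output.

curl -s -X POST http://127.0.0.1:8678/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<SPECIAL_10>System\n<SPECIAL_11>User\nWhat is the capital of France?\n<SPECIAL_11>Assistant\n<think>\n",
    "n_predict": 256,
    "temperature": 0.6,
    "top_p": 0.95
  }'
# if "content" starts with reasoning text and closes with </think>, the model side
# is behaving; the opening <think> was simply part of the prompt, not the output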
I also did not mess with the chat template at all; I left it as it was. I would recommend checking the completions output as well. When I get a second I will take a peek. I've been working on a training pipeline for nemotron_h.
You've done enough @gabegoodhart ! I'll figure out how to try what you suggested, and report my findings. Thanks for all your efforts! And same to @weathermanj !
The raw completion from v1/chat/completions already misses the tokens - same exact behavior:

/think --> no opening <think>, but the model "reasons" and outputs the closing </think>
~/l/b/bin ❯❯❯ curl -X POST http://127.0.0.1:8678/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "/think" },
{ "role": "user", "content": "Hello" }
],
"add_generation_prompt": true
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Okay, the user just said \"Hello\". That's a greeting. I should respond politely. Let me make sure to acknowledge their greeting and offer help. Maybe say something like \"Hello! How can I assist you today?\" That's friendly and opens the door for them to ask questions.\n</think>\n\nHello! How can I assist you today? π\n"}}],"created":1756499399,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6319-792b44f2","object":"chat.completion","usage":{"completion_tokens":76,"prompt_tokens":15,"total_tokens":91},"id":"chatcmpl-KtFzvtnW1be9O4RjfzFKiMyd43CVbUkm","timings":{"prompt_n":15,"prompt_ms":27.912,"prompt_per_token_ms":1.8608,"prompt_per_second":537.4032674118658,"predicted_n":76,"predicted_ms":955.159,"predicted_per_token_ms":12.567881578947368,"predicted_per_second":79.56790440125677}}
/no_think --> no <think></think> prepended, but the model answers immediately as expected
~/l/b/bin ❯❯❯ curl -X POST http://127.0.0.1:8678/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "/no_think" },
{ "role": "user", "content": "Hello" }
],
"add_generation_prompt": true
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today? π\n"}}],"created":1756499420,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6319-792b44f2","object":"chat.completion","usage":{"completion_tokens":14,"prompt_tokens":17,"total_tokens":31},"id":"chatcmpl-rAVCr1OTTh2Q5bmlQrjZJeNo5lOFuGK6","timings":{"prompt_n":17,"prompt_ms":35.473,"prompt_per_token_ms":2.0866470588235293,"prompt_per_second":479.2377301045866,"predicted_n":14,"predicted_ms":165.365,"predicted_per_token_ms":11.811785714285715,"predicted_per_second":84.66120400326551}}
I tried replacing the {{- and -}} with {{ and }} in the template, in case minja was not liking them, but same behavior.
I'm sorry, I'm not sure what I could try next. I still have many things to learn... I'd like to try parsing the template using minja directly, but I have absolutely zero C++ experience, so I feel a bit out of my depth here.
I can likely start helping tomorrow morning. I was running into weird chat template issues with vLLM 4.4 running my fp8 version. I just finished this up and it seems to train - just a basic LoRA pipeline for now. Might work on adding it to Unsloth. https://github.com/jwjohns/nvidia-nemotron-h-training
Continuing there.