Start of thinking token appears to be missing
Have seen this effect with several thinking distills recently: the <think> token isn't sent back to start the reasoning block, but the </think> termination token is. A number of clients expect that block to be correctly encapsulated, which creates a bit of a mess in outputs since that block is normally omitted from what's displayed. Running through candle-vllm (at almost 70 T/s on V100s - their fused MoE kernels seem to work even better with DeepSeek and Qwen3MoE together) at current master, if that makes any difference.
Download the chat template from the original model and save it as qwen3.jinja — edit the last few lines of it and remove the <think> prefill. Then pass the file as the model’s chat template to whatever inference engine you’re using.
Interesting, thanks @jpbwin - will take a shot at that.
@BasedBase: was the template change intentional, does that token break something or confuse context?
The template wasn't changed, fwiw. I think some inference engines send along the prefill if the model matches one known to do it, like Qwen3, but they don't recognize this one even with "qwen" in the name. It's the only explanation I have.
@jpbwin: are you suggesting that I change:
{{- '<|im_start|>assistant\n<think>\n' }}
{%- endif %}
to
{{- '<think>\n' }}
{%- endif %}
?
I use candle-vllm, so the chat template is part of the tokenizer config file rather than a standalone file; I'm guessing the edit needs to be made in there.
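If it helps, here is a minimal sketch of making that edit in tokenizer_config.json directly. The model directory path and the exact prefill string are assumptions; check what your copy of the template actually contains before running it:

```python
import json

# Hypothetical local path; point it at the model directory candle-vllm loads.
path = "Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill/tokenizer_config.json"

with open(path, "r", encoding="utf-8") as f:
    cfg = json.load(f)

# Strip only the <think> prefill from the assistant turn; the exact string
# may differ in your copy of the template, so verify it first.
cfg["chat_template"] = cfg["chat_template"].replace(
    "{{- '<|im_start|>assistant\\n<think>\\n' }}",
    "{{- '<|im_start|>assistant\\n' }}",
)

with open(path, "w", encoding="utf-8") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)
```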
The template wasn't changed from the original model.
I haven't been able to replicate this issue on my end using LM Studio, which suggests it might be specific to the interaction between the model and candle-vllm, especially with the fused MoE kernels you mentioned.
The behavior you're describing, where <think> is omitted but </think> is generated, strongly suggests that the <think> token is not being correctly processed as a special token in your setup. When this happens, the model might either skip it or, as is common with some tokenizers, break it down into smaller, regular tokens that don't trigger the intended "thinking" block. It may also be due to you using V100 GPUs, since they are pretty much considered legacy devices now.
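For what it's worth, a quick way to check whether the tag is being split like that is to run the tokenizer over it directly. A sketch using transformers (the repo id is just the one mentioned later in this thread; substitute whatever you're actually serving):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "BasedBase/Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill-FP32"
)

for tag in ["<think>", "</think>"]:
    ids = tok.encode(tag, add_special_tokens=False)
    print(tag, "->", ids, tok.convert_ids_to_tokens(ids))
    # A properly registered special token encodes to a single id;
    # if it splits into several pieces, the tag is being treated as plain text.
```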
Change:
{{- '<|im_start|>assistant\n<think>\n' }}
{%- endif %}
to:
{{- '<|im_start|>assistant\n' }}
{%- endif %}
Would you mind telling me what candle-vllm is, btw? I've never heard of it before now. Is it a fork? Does it provide any advantages over vllm / sglang / llama.cpp?
fwiw: this is one half step below 'hacky', but I've had to do it to a ton of models on various inference engines. The model is so heavily weighted towards the first token being <think> that it will just output it first if it isn't there.
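If editing the template isn't an option, a rough client-side workaround (purely a sketch, not something any of these engines do for you) is to re-insert the missing opener when only the closer shows up:

```python
def normalize_thinking(text: str) -> str:
    """Re-add a missing opening <think> tag so clients that collapse the
    reasoning block see it properly encapsulated."""
    if "</think>" in text and "<think>" not in text:
        return "<think>\n" + text
    return text

# Example: a completion that starts mid-reasoning and only emits the closer.
print(normalize_thinking("Checking the rope factor first...\n</think>\nThe answer is 42."))
```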
Thank you both. Will see if I can fix that.
Candle-vllm is a Rust-based runtime for text-gen models. It uses a chunked prefix cache these days, which allows absolutely massive context windows in conjunction with the recent rope-scaling work without blowing out VRAM. It's GPU-only though, for CUDA and MKL devices. Most of the innovations are coming from vllm.rs and being downstreamed by https://github.com/guoqingbao into https://github.com/EricLBuehler/candle-vllm
In your prompt: if you include the <think> tags in your question to the LLM, it will mess up its formatting, since it will print out <think> and </think> many times as it is thinking, so some sentences will be inside a chain-of-thought section while others won't, even if it's still reasoning. This seems to be something Qwen models do, or a formatting error in a lot of backends.
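If you want to guard against that on the client side, a minimal sketch that just strips think-style tags from user input before it ever reaches the API (the tag pattern here is an assumption; widen it if your client injects others):

```python
import re

# Strip any think-style tags a user pastes into their message so they can't
# collide with the model's own reasoning markup.
THINK_TAGS = re.compile(r"</?think(?:ing)?>", re.IGNORECASE)

def sanitize_user_prompt(text: str) -> str:
    return THINK_TAGS.sub("", text)

print(sanitize_user_prompt("Why does <think>this</think> doc mention rope scaling?"))
# -> "Why does this doc mention rope scaling?"
```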
@BasedBase: sorry, not enough coffee yet - mind breaking this down barney style for the crayon-eaters among us? :)
Specifically:
- "in your prompt" meaning in the chat template or in what i the human am typing then sending from the client to the LLMs API in my system or user prompt sections (in between the top options potentially being tags the clients themselvse add)?
- "include the tags in your question" meaning the somehow push the
<thinking>
in the query initiated to the API? I recall this being a thing with the hybrid qwen models before this split-out thinking/non-thinking to include being able to pass</nothinking>
or something like that to disable it.
@jpbwin: thank you, did the trick.
@BasedBase: with a 2X YaRN rope factor making the ctx window 512k, I've so far been able to feed it >300k of document data and it's stable at:
candle-vllm-swe | 2025-09-07T15:38:51.707288Z INFO candle_vllm: Pipeline config PipelineConfig { max_model_len: 524288, default_max_tokens: 16384, generation_cfg: Some(GenerationConfig { temperature: Some(0.6), top_p: Some(0.95), top_k: Some(20), penalty: Some(1.1) }) }
Going to push to 0.75 and see how it fares. The merge seems to have made its output somewhat drier and more lacking than the "factory" Qwen3, although it's not really making any mistakes or "typos."
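In case anyone wants to reproduce the 512k setup: a sketch of the YaRN change, assuming the usual transformers-style rope_scaling keys in config.json and a 262,144-token native window (which matches the max_model_len of 524288 in the log above). candle-vllm may expose this differently, so treat the exact keys as illustrative:

```python
import json

# Hypothetical local path to the served checkpoint's config.json.
path = "Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill/config.json"

with open(path, "r", encoding="utf-8") as f:
    cfg = json.load(f)

# 2x YaRN over the assumed 262,144-token native window -> ~524,288 positions.
cfg["rope_scaling"] = {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 262144,
}
# Optional: keep the advertised max positions in sync with the scaled window.
cfg["max_position_embeddings"] = 524288

with open(path, "w", encoding="utf-8") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)
```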
Having any LLM stable with 300k worth of context used is impressive. I can use Gemini 2.5 Pro and by the time I get to 100k tokens its answers are too unstable to keep using that chat for coding. Thank you for the insight.
@BasedBase - tune temp down to 0.45 and penalty to 1.11+ and it looks like you can get into the 400k range. You do need to prompt-focus it to stay on task with labels like TASK, but it seems to be able to find its way through the work... more or less. Need BF16 and some axle grease to really measure efficacy there (Blackwell hardware en route).
FWIW, the following vllm command is working well for me with regard to collapsing the thinking in clients like Open WebUI. Tool calling also works in Charm Crush and llxprt.
vllm serve BasedBase/Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill-FP32 -tp 2 --max-model-len 262144 --dtype bfloat16 --reasoning_parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes
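And a sketch of what the reasoning parser gives you on the client side, assuming the standard OpenAI-compatible endpoint that command exposes (the host/port and placeholder api_key are assumptions): with a reasoning parser enabled, the thinking should come back in a separate reasoning_content field instead of inline in the answer.

```python
from openai import OpenAI

# Local vLLM server started with the command above; the api_key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="BasedBase/Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill-FP32",
    messages=[{"role": "user", "content": "Summarize YaRN rope scaling in one sentence."}],
    temperature=0.6,
)

msg = resp.choices[0].message
# With --reasoning_parser set, the chain of thought is split out here...
print("reasoning:", getattr(msg, "reasoning_content", None))
# ...and the final answer stays clean in content, so clients can collapse it.
print("answer:", msg.content)
```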