Created with llm-compressor's latest changes, quantized on a GH200. Works well for me with vLLM's main branch on my RTX 3090Ti as of 2025-07-01.
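For context, here is a minimal sketch of the kind of llm-compressor oneshot recipe that produces a symmetric 4-bit AWQ checkpoint like this one. The model class, scheme name, ignore patterns, and calibration settings are assumptions drawn from llm-compressor's published examples, not the exact script used for this repo:

```python
# Hypothetical reproduction sketch -- not the exact script used for this
# checkpoint. Assumes a recent llm-compressor with AWQModifier support.
from transformers import AutoModelForImageTextToText, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

# Mistral Small 3.2 is multimodal, hence the image-text model class.
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# W4A16 = symmetric 4-bit weights, 16-bit activations (the "sym" in this
# repo's name). The vision tower and projector are left unquantized.
recipe = [
    AWQModifier(
        targets=["Linear"],
        scheme="W4A16",
        ignore=["lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
    )
]

oneshot(
    model=model,
    dataset="open_platypus",  # assumed calibration set; any chat-style set works
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)

model.save_pretrained("Mistral-Small-3.2-24B-Instruct-2506-awq-sym")
tokenizer.save_pretrained("Mistral-Small-3.2-24B-Instruct-2506-awq-sym")
```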
What about tool calling?
Per https://vllm-dev.slack.com/archives/C07QP347J4D/p1751401629797809?thread_ts=1751399869.254259&cid=C07QP347J4D, there is currently no way to get tool calling with Mistral-HF formatted models.
I've worked around this on a GitHub branch: https://github.com/sjuxax/vllm/tree/Mistral3.1-rebase. It includes code to remap the weights from HF-Mistral to Mistral format, allowing use of MistralTokenizer.
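The remapping itself is conceptually just a state-dict key translation. Purely for illustration (the real mapping lives in the branch above; these exact key names are hypothetical), something like:

```python
import re

# Illustrative only: a few hypothetical HF-Mistral -> Mistral-native key
# rewrites of the kind the branch performs. The real table is in the
# Mistral3.1-rebase branch linked above.
HYPOTHETICAL_RULES = [
    (re.compile(r"^model\.layers\.(\d+)\.self_attn\.q_proj\.(.*)$"), r"layers.\1.attention.wq.\2"),
    (re.compile(r"^model\.layers\.(\d+)\.self_attn\.k_proj\.(.*)$"), r"layers.\1.attention.wk.\2"),
    (re.compile(r"^model\.layers\.(\d+)\.mlp\.gate_proj\.(.*)$"), r"layers.\1.feed_forward.w1.\2"),
]

def remap_key(hf_key: str) -> str:
    """Translate one HF-style weight name to the Mistral-native spelling."""
    for pattern, repl in HYPOTHETICAL_RULES:
        if pattern.match(hf_key):
            return pattern.sub(repl, hf_key)
    return hf_key  # keys without a rule pass through unchanged
```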
I've updated the config.json to be compatible with this approach, and I'm about to push the tekken.json tokenizer. With that, if you build that branch, you should be able to run this checkpoint with MistralTokenizer and get tool calling.
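Assuming the branch is built and accepts the same Mistral-format flags as official checkpoints, a minimal offline smoke test with vLLM's Python API might look like this (`tokenizer_mode="mistral"` is what selects MistralTokenizer):

```python
from vllm import LLM, SamplingParams

# Assumes you've built the Mistral3.1-rebase branch of vLLM from source.
# tokenizer_mode="mistral" selects MistralTokenizer (backed by tekken.json).
llm = LLM(
    model="jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym",
    tokenizer_mode="mistral",
)

params = SamplingParams(max_tokens=64, temperature=0.0)
out = llm.chat([{"role": "user", "content": "Say hello."}], params)
print(out[0].outputs[0].text)
```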
Note: I spoke a little too soon above. We also needed https://github.com/vllm-project/vllm/pull/20503 to get tool calling working properly. I've merged that PR into the Mistral3.1-rebase branch and pushed it.
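To check tool calling end to end, serve the checkpoint with vLLM's OpenAI-compatible server on the patched branch and fire a standard tools request. The serve flags in the comment and the placeholder tool schema below are my assumptions about a typical setup, not the only valid one:

```python
from openai import OpenAI

# Assumes the server was started on the patched branch with something like:
#   vllm serve jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym \
#     --tokenizer-mode mistral --enable-auto-tool-choice --tool-call-parser mistral
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # placeholder tool for the smoke test
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# With tool calling working, the model should emit a get_weather call here.
print(resp.choices[0].message.tool_calls)
```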