Created with llm-compressor's latest changes, quantized on a GH200. Works well for me with vLLM's main branch on my RTX 3090Ti as of 2025-07-01.
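For context, here is a minimal sketch of the kind of llm-compressor oneshot recipe that produces a symmetric 4-bit AWQ checkpoint like this one. The model class, scheme name, ignore patterns, and calibration settings are assumptions drawn from llm-compressor's published examples, not the exact script used for this repo:

```python
# Hypothetical reproduction sketch -- not the exact script used for this
# checkpoint. Assumes a recent llm-compressor with AWQModifier support.
from transformers import AutoModelForImageTextToText, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

# Mistral Small 3.2 is multimodal, hence the image-text model class.
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# W4A16 = symmetric 4-bit weights, 16-bit activations (the "sym" in this
# repo's name). The vision tower and projector are left unquantized.
recipe = [
    AWQModifier(
        targets=["Linear"],
        scheme="W4A16",
        ignore=["lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
    )
]

oneshot(
    model=model,
    dataset="open_platypus",  # assumed calibration set; any chat-style set works
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)

model.save_pretrained("Mistral-Small-3.2-24B-Instruct-2506-awq-sym")
tokenizer.save_pretrained("Mistral-Small-3.2-24B-Instruct-2506-awq-sym")
```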
What about tool calling?
Per https://vllm-dev.slack.com/archives/C07QP347J4D/p1751401629797809?thread_ts=1751399869.254259&cid=C07QP347J4D, there is currently no way to get tool calling with Mistral-HF formatted models.
I've worked around this on a GitHub branch: https://github.com/sjuxax/vllm/tree/Mistral3.1-rebase. It includes code to remap the weights from HF-Mistral to Mistral format, allowing use of MistralTokenizer.
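The remapping itself is conceptually just a state-dict key translation. Purely for illustration (the real mapping lives in the branch above; these exact key names are hypothetical), something like:

```python
import re

# Illustrative only: a few hypothetical HF-Mistral -> Mistral-native key
# rewrites of the kind the branch performs. The real table is in the
# Mistral3.1-rebase branch linked above.
HYPOTHETICAL_RULES = [
    (re.compile(r"^model\.layers\.(\d+)\.self_attn\.q_proj\.(.*)$"), r"layers.\1.attention.wq.\2"),
    (re.compile(r"^model\.layers\.(\d+)\.self_attn\.k_proj\.(.*)$"), r"layers.\1.attention.wk.\2"),
    (re.compile(r"^model\.layers\.(\d+)\.mlp\.gate_proj\.(.*)$"), r"layers.\1.feed_forward.w1.\2"),
]

def remap_key(hf_key: str) -> str:
    """Translate one HF-style weight name to the Mistral-native spelling."""
    for pattern, repl in HYPOTHETICAL_RULES:
        if pattern.match(hf_key):
            return pattern.sub(repl, hf_key)
    return hf_key  # keys without a rule pass through unchanged
```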
I've updated the config.json to be compatible with this approach, and I'm about to push the tekken.json tokenizer. With that, if you build that branch, you should be able to run this checkpoint with MistralTokenizer and get tool calling.
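Assuming the branch is built and accepts the same Mistral-format flags as official checkpoints, a minimal offline smoke test with vLLM's Python API might look like this (`tokenizer_mode="mistral"` is what selects MistralTokenizer):

```python
from vllm import LLM, SamplingParams

# Assumes you've built the Mistral3.1-rebase branch of vLLM from source.
# tokenizer_mode="mistral" selects MistralTokenizer (backed by tekken.json).
llm = LLM(
    model="jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym",
    tokenizer_mode="mistral",
)

params = SamplingParams(max_tokens=64, temperature=0.0)
out = llm.chat([{"role": "user", "content": "Say hello."}], params)
print(out[0].outputs[0].text)
```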
Note: I spoke a little too soon above. We also needed https://github.com/vllm-project/vllm/pull/20503 to get tool calling working properly. I've merged that PR into the Mistral3.1-rebase branch and pushed it.
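To check tool calling end to end, serve the checkpoint with vLLM's OpenAI-compatible server on the patched branch and fire a standard tools request. The serve flags in the comment and the placeholder tool schema below are my assumptions about a typical setup, not the only valid one:

```python
from openai import OpenAI

# Assumes the server was started on the patched branch with something like:
#   vllm serve jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym \
#     --tokenizer-mode mistral --enable-auto-tool-choice --tool-call-parser mistral
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # placeholder tool for the smoke test
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# With tool calling working, the model should emit a get_weather call here.
print(resp.choices[0].message.tool_calls)
```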