Any pointers on how to run this model in vLLM?
I'm trying to load this model on an RTX 3090 (24 GB VRAM), but I always get a CUDA out-of-memory error. My arguments to vLLM are:
--model neody/mistralai-Devstral-Small-2507-GPTQ-8bit
--enable-auto-tool-choice --tool-call-parser mistral
--quantization gptq --dtype float16
--gpu_memory_utilization 0.6
--block_size 16
--max_num_seqs 32
--override-generation-config '{"temperature": 0.25, "min_p": 0, "top_p": 0.8, "top_k": 10}'
I've tried various settings, but no matter what I do I end up with:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB. GPU 0 has a total capacity of 23.56 GiB of which 764.00 MiB is free. Process 3827213 has 22.81 GiB memory in use. Of the allocated memory 22.50 GiB is allocated by PyTorch, and 18.20 MiB is reserved by PyTorch but unallocated.
There is nothing else using the card and it's running completely headless (nvidia-smi shows all VRAM as available).
You need at least 24 GB of VRAM allocated just to load this model's weights.
--gpu_memory_utilization 0.6
means vLLM is only allowed to use about 14.4 GB of your 24 GB, which is not enough.
Even if you let it use the full 24 GB, you still need extra VRAM for the KV cache, so you cannot load this model on your RTX 3090.
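As a rough sanity check (assuming ~24B parameters for Devstral Small 2507), the 8-bit weights alone nearly fill the card before vLLM allocates any KV cache:

```python
# Back-of-envelope VRAM estimate. Assumptions: ~24B parameters and
# 8 bits per weight for the GPTQ-8bit quant; quantization scales and
# zero-points add a little on top, so real numbers are slightly higher.
params = 24e9                       # ~24B parameters
weights_gib = params * 1 / 2**30    # 1 byte per weight at 8-bit
print(f"weights alone: ~{weights_gib:.1f} GiB")      # ~22.4 GiB

budget_gib = 23.56 * 0.6            # what --gpu_memory_utilization 0.6 hands to vLLM
print(f"vLLM budget at 0.6: ~{budget_gib:.1f} GiB")  # ~14.1 GiB
```

That ~22.4 GiB matches the 22.50 GiB your traceback says PyTorch has allocated, and even at utilization 1.0 the leftover ~1 GiB would have to hold the KV cache plus CUDA overhead.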
Try another model or a lower quant.
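With a 4-bit quant the weights drop to roughly 12 GiB, which leaves headroom for the KV cache. A minimal sketch using vLLM's offline LLM API, assuming a hypothetical 4-bit repo (the model id below is a placeholder, not a real upload; for the server you'd pass the equivalent flags to vllm serve):

```python
from vllm import LLM, SamplingParams

# Sketch only: the model id below is a placeholder for whatever 4-bit
# GPTQ (or AWQ) quant of Devstral Small 2507 you end up using.
llm = LLM(
    model="some-org/Devstral-Small-2507-GPTQ-4bit",  # hypothetical repo
    quantization="gptq",
    dtype="float16",
    gpu_memory_utilization=0.90,  # let vLLM use most of the 24 GB
    max_model_len=32768,          # cap context length to bound KV-cache needs
)

sampling = SamplingParams(temperature=0.25, top_p=0.8, top_k=10)
out = llm.generate(["Write a hello-world HTTP server in Python."], sampling)
print(out[0].outputs[0].text)
```

Capping max_model_len (or --max-model-len on the server) helps too, since vLLM refuses to start if the space left over for the KV cache can't hold at least one full-length sequence.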
Thank you! I hadn't factored in the kv-cache requirements.