vLLM - FlashAttention 3
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match
vllm serve openai/gpt-oss-120b
This does not work for me:
[cuda.py:323] Using Flash Attention backend on V1 engine.
(VllmWorker TP2 pid=2994061) INFO 08-05 15:47:01 [cuda.py:323] Using Flash Attention backend on V1 engine.
(VllmWorker TP1 pid=2994060) INFO 08-05 15:47:01 [cuda.py:323] Using Flash Attention backend on V1 engine.
(VllmWorker TP0 pid=2994059) INFO 08-05 15:47:01 [cuda.py:323] Using Flash Attention backend on V1 engine.
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] WorkerProc failed to start.
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] Traceback (most recent call last):
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 533, in worker_main
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] worker = WorkerProc(*args, **kwargs)
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 402, in init
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] self.worker.load_model()
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] self.model = model_loader.load_model(
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] model = initialize_model(vllm_config=vllm_config,
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] return model_class(vllm_config=vllm_config, prefix=prefix)
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 241, in init
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] self.model = GptOssModel(
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] ^^^^^^^^^^^^
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 183, in init
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 214, in init
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] TransformerBlock(
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 183, in init
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] self.attn = OAIAttention(config, prefix=f"{prefix}.attn")
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 110, in init
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] self.attn = Attention(
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] ^^^^^^^^^^
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/attention/layer.py", line 176, in init
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] File "/lib/python3.12/site-packages/vllm/v1/attention/backends/flash_attn.py", line 417, in init
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] assert self.vllm_flash_attn_version == 3, (
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP3 pid=2994062) ERROR 08-05 15:47:02 [multiproc_executor.py:559] AssertionError: Sinks are only supported in FlashAttention 3
Facing a similar issue on Blackwell and Ada Lovelace GPUs. I also tried with the vLLM Docker image vllm/vllm-openai:gptoss.
Yes, I just found the page https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html but thought the Docker image would work. Nope, same error.
It needs FlashAttention 3, to my understanding, and that is not supported on Ada Lovelace. Any workarounds, since we have a bunch of L40S cards?
If you are using a Blackwell GPU, please make sure you are passing in the right env vars following https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html!
For Ada Lovelace, we currently don't have support for it, but it is certainly something we have in mind to support.
Will A100 GPUs work? I had the same issue with FlashAttention 3.
I'm experiencing the same issue and haven't found a workaround. I guess we will have to move to llama.cpp, which the blog post's figure of speeds on RTX shows for the 50-series GPUs:
https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss/
I am not happy with this workaround, though, and will likely not bother moving to another framework.
I still get this on the Blackwell 6000:
Cannot use FA version 3 is not supported due to FA3 is only supported on devices with compute capability >= 8 excluding 8.6 and 8.9 and Blackwell archs (>=10)
(VllmWorker pid=3422006) ERROR 08-05 18:34:04 [multiproc_executor.py:559] assert self.vllm_flash_attn_version == 3, (
(VllmWorker pid=3422006) ERROR 08-05 18:34:04 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker pid=3422006) ERROR 08-05 18:34:04 [multiproc_executor.py:559] AssertionError: Sinks are only supported in FlashAttention 3
And as far as I can tell, FA3 does not support Blackwell yet?
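The error message quoted above spells out the support rule ("compute capability >= 8 excluding 8.6 and 8.9 and Blackwell archs (>=10)"). Here is a hedged sketch of that rule in plain Python; fa3_supported is a hypothetical helper for illustration, not vLLM's actual function:

```python
def fa3_supported(major: int, minor: int) -> bool:
    """Sketch of the FA3 support rule quoted in the vLLM error message:
    compute capability >= 8.0, excluding 8.6, 8.9 (Ada), and >= 10 (Blackwell)."""
    if major >= 10:  # Blackwell and newer are excluded
        return False
    if (major, minor) in ((8, 6), (8, 9)):  # RTX 30xx (8.6), Ada/L40S (8.9)
        return False
    return (major, minor) >= (8, 0)

# A100 (8.0) and H100 (9.0) pass; L40S (8.9) and Blackwell (12.0) do not.
print(fa3_supported(8, 0), fa3_supported(9, 0), fa3_supported(8, 9), fa3_supported(12, 0))
```

This is why, per this check, an L40S (compute capability 8.9) and a Blackwell card both hit the same assertion even though their architectures are generations apart.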
I had to use export VLLM_ATTENTION_BACKEND=FLASHINFER as well.
Also, watch out if you have FlashInfer 2.10 or higher: vllm==0.10.1+gptoss still has a version-comparison bug and will reject it as too old unless you patch the check.
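A version check that falsely flags 2.10 as older than 2.9 is typical of a lexicographic (string) comparison; the exact bug in that vLLM build may differ, but a minimal illustration:

```python
# String comparison walks character by character, so "2.10" is wrongly
# considered older than "2.9" ('1' < '9' at the third character).
print("2.10" < "2.9")  # True -- lexicographic, wrong for versions

def version_tuple(v: str) -> tuple:
    """Parse a dotted version string into a tuple of ints for correct ordering."""
    return tuple(int(part) for part in v.split("."))

print(version_tuple("2.10") > version_tuple("2.9"))  # True -- numerically correct
```

Patching the check to compare parsed tuples (or using packaging.version) avoids the false "too old" failure.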
Same issue on A100 GPUs.
Here's a working example of serving gpt-oss-120b with vLLM on an H100: http://playground.tracto.ai/playground?pr=notebooks/bulk-inference-gpt-oss-120b
We ended up customizing vLLM and pulling in the NVIDIA toolkit because the model uses nvcc directly.
I did get a bit further after passing VLLM_ATTENTION_BACKEND. Here is what I have when trying Blackwell (RTX Pro 6000):
docker run --runtime nvidia --gpus '"device=0"' \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-e VLLM_USE_TRTLLM_ATTENTION=1 \
-e VLLM_USE_TRTLLM_DECODE_ATTENTION=1 \
-e VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 \
-e VLLM_USE_FLASHINFER_MXFP4_BF16_MOE=1 \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm/vllm-openai:gptoss \
--model $MODEL \
--disable-log-requests --tensor-parallel-size 1 --port 8000 \
--max-model-len 4096 --max-num-seqs 1 --async-scheduling
The model downloads, etc., but then I get:
(VllmWorker pid=144) ERROR 08-06 11:51:06 [multiproc_executor.py:559] shuffle_matrix_a(w13_bias[i].clone().reshape(-1, 1),
(VllmWorker pid=144) ERROR 08-06 11:51:06 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^
(VllmWorker pid=144) ERROR 08-06 11:51:06 [multiproc_executor.py:559] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
IIRC, this error means either that CUDA 12.8 wasn't used to build some dependencies in the Docker image, or that some modules were not built with the SM_120 flag for Blackwell.
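In other words, "no kernel image is available" means the binary's compiled architecture list doesn't cover the device's SM version. A simplified sketch of that matching (the helper name is made up, and this ignores CUDA's PTX JIT fallback):

```python
def has_kernel_image(built_archs, device_cc):
    """Return True if a binary compiled for `built_archs` (e.g. ["sm_80", "sm_90"])
    carries a native kernel image for a device with compute capability `device_cc`.
    Simplified: real CUDA binaries can also JIT-compile embedded PTX for newer devices."""
    major, minor = device_cc
    return f"sm_{major}{minor}" in built_archs

# A build without SM_120 has no image for Blackwell (compute capability 12.0):
print(has_kernel_image(["sm_80", "sm_90"], (12, 0)))             # False
print(has_kernel_image(["sm_80", "sm_90", "sm_120"], (12, 0)))   # True
```

In practice, you can inspect which architectures a PyTorch build was compiled for with torch.cuda.get_arch_list() (assuming PyTorch is installed) to confirm whether sm_120 is present.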