How to run on 16GB VRAM

#4
by aahila - opened

Hello,

Thank you for your model. I am trying to load it on 16GB VRAM (NVIDIA 4060 Ti). You mentioned in your comments that it is possible. How can I do this?

See the example code in the model card. Have you tried that?

Thanks for your response,

I did try that, but my GPU runs out of memory (OOM) before it even reaches the print("pipeline loaded") statement.

Try this:

pipeline = QwenImageEditPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="balanced")

Thanks for looking into that,

I did actually try that as well but I saw that it fails with:

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 22.01it/s]
Loading pipeline components...:  17%|█▋        | 1/6 [00:00<00:03,  1.64it/s]
Traceback (most recent call last):
  File "/home/[USERNAME]/workspace/[PROJECT]/qwen-30b.py", line 9, in <module>
    pipeline = QwenImageEditPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="balanced")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/diffusers/pipelines/pipeline_utils.py", line 1025, in from_pretrained
    loaded_sub_model = load_sub_model(
                       ^^^^^^^^^^^^^^^
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/diffusers/pipelines/pipeline_loading_utils.py", line 860, in load_sub_model
    dispatch_model(
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/accelerate/big_modeling.py", line 426, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/accelerate/hooks.py", line 658, in attach_align_device_hook_on_blocks
    attach_execution_device_hook(
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/accelerate/hooks.py", line 451, in attach_execution_device_hook
    attach_execution_device_hook(
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/accelerate/hooks.py", line 440, in attach_execution_device_hook
    if not hasattr(module, "_hf_hook") and len(module.state_dict()) > 0:
                                               ^^^^^^^^^^^^^^^^^^^
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2260, in state_dict
    module.state_dict(
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2260, in state_dict
    module.state_dict(
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2260, in state_dict
    module.state_dict(
  [Previous line repeated 2 more times]
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2257, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/bitsandbytes/nn/modules.py", line 528, in _save_to_state_dict
    for k, v in self.weight.quant_state.as_dict(packed=True).items():
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/bitsandbytes/functional.py", line 524, in as_dict
    "nested_offset": self.offset.item(),
                     ^^^^^^^^^^^^^^^^^^
  File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/torch/_meta_registrations.py", line 7457, in meta_local_scalar_dense
    raise RuntimeError("Tensor.item() cannot be called on meta tensors")
RuntimeError: Tensor.item() cannot be called on meta tensors

Have you seen this issue before?

I am currently using:

transformers              4.55.4
bitsandbytes              0.47.0
torch                     2.8.0
torchaudio                2.8.0
torchvision               0.23.0
diffusers                 0.35.1

It will work on 16 GB, but you need to know how device mapping works and apply it correctly. I haven't seen this problem before, but it seems to me you did not install accelerate as per your requirements.txt.

Ah sorry, I do have accelerate 1.10.0. Does device_map="balanced" work for you?

Could you share with me your pip freeze for when device_map="balanced" works for you? Thank you!
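
For comparing environments without a full pip freeze, a small sketch like the one below prints just the packages discussed in this thread. The helper name installed_versions is made up for illustration; it relies only on the standard library's importlib.metadata and reports None for anything that is not installed.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages mentioned in this thread; extend the list as needed.
PACKAGES = ["transformers", "bitsandbytes", "torch", "diffusers", "accelerate"]


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name:<16}{ver or 'NOT INSTALLED'}")
```

Running it in the same virtual environment as the failing script should show quickly whether a package such as accelerate is missing.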

I will post updated code for 16 GB. I don't have a 16 GB card, but I can force the behavior and test.

By rights, the pipeline should load the model as is, so it is made for 20 GB of VRAM; it should automatically handle the loading in NF4. To make it work on 16 GB you have to map the components manually between CPU and GPU, e.g. keep the text encoder (TE) on the CPU and then construct the pipeline. To make it easy on 16 GB I would have to look at more quantization, but I don't think that is necessary if you can code this one out.
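
As a rough back-of-envelope check of why moving one component to the CPU can help, the sketch below estimates the size of the weights alone from a parameter count and a bit width. The parameter counts are hypothetical placeholders, not figures from this thread, and the estimate ignores activations, the VAE, quantization constants and CUDA overhead, so real usage will be higher.

```python
GIB = 1024 ** 3


def weight_gib(num_params, bits_per_param):
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


# name -> (parameter count, bits per parameter, kept on GPU?)
# The parameter counts are HYPOTHETICAL and only illustrate the arithmetic.
components = {
    "transformer": (20e9, 4, True),    # 4-bit (NF4-style) weights
    "text_encoder": (7e9, 4, False),   # moved to the CPU in this scenario
}

if __name__ == "__main__":
    on_gpu = 0.0
    for name, (params, bits, gpu) in components.items():
        size = weight_gib(params, bits)
        on_gpu += size if gpu else 0.0
        print(f"{name:<14}{size:6.2f} GiB  ({'GPU' if gpu else 'CPU'})")
    print(f"weights on GPU: {on_gpu:.2f} GiB")
```

Flipping the text_encoder flag to True shows how much GPU memory that component would add back.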

Here is a starting point for the code. Download the model locally, then use the folder path and manual loading:

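from transformers import Qwen2_5_VLForConditionalGeneration

# model_name is the local folder path of the downloaded model.
text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, subfolder="text_encoder", trust_remote_code=True,
    device_map="cpu",  # or "balanced"
)
# Check the docs for the CPU device map; I am just typing as I think.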

Do the same for the other components; you can use device_map="cuda" for the rest. It is not even necessary to specify the other ones: just override the text encoder when you build the pipeline, with text_encoder=your_text_encoder_on_cpu.

Then the rest is the usual inference. On my GPU I can reproduce this with a memory limit of 16 GB, so this is one way it would work. This is also how we use the other model we posted for Qwen_image, and it works on 16 GB.

ovedrive changed discussion status to closed

Thanks! Let me try to do this!

Nice, I was able to make it work with:

from PIL import Image
import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from diffusers import QwenImageEditPipeline

# Load text encoder
text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "path/to/your/model/",
    subfolder="text_encoder",
    trust_remote_code=True,
    device_map="cpu",
    torch_dtype=torch.bfloat16
)

# Load pipeline
pipeline = QwenImageEditPipeline.from_pretrained(
    "path/to/your/model/",
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

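# Clear the device map set at load time, then let diffusers move each
# component to the GPU only while it is in use.
pipeline.reset_device_map()
pipeline.enable_model_cpu_offload()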

# Load input image
image = Image.open("input_image.png").convert("RGB")

# Define prompt
prompt = "Convert to painting art style."

# Set up inputs
inputs = {
    "image": image,
    "prompt": prompt,
    "generator": torch.manual_seed(0),
    "true_cfg_scale": 4,
    "negative_prompt": "blurry, low quality,",
    "num_inference_steps": 20,
}

# Generate image
with torch.inference_mode():
    output = pipeline(**inputs)

# Save output
output_image = output.images[0]
output_image.save("output.png")

It’s always good to see people figure things out and share their working examples. Glad it worked and I hope you post some benchmarks from your card.
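
For anyone who wants to post numbers, a minimal timing sketch could look like the one below. The helper time_runs is a made-up name, and the workload is a stand-in so the snippet runs on its own; with the script above, the callable would be lambda: pipeline(**inputs).

```python
import statistics
import time


def time_runs(fn, runs=3, warmup=1):
    """Call fn() warmup + runs times; return the measured timings in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload; replace with: lambda: pipeline(**inputs)
    timings = time_runs(lambda: sum(range(1_000_000)))
    print(
        f"mean {statistics.mean(timings):.3f}s, "
        f"min {min(timings):.3f}s over {len(timings)} runs"
    )
```

Peak GPU memory can be reported alongside the timings by reading torch.cuda.max_memory_allocated() after a run.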
