How to run on 16GB VRAM
Hello,
Thank you for your model. I am trying to load it on 16GB VRAM (NVIDIA 4060 Ti). You mentioned in your comments that it is possible. How can I do this?
See the example code in the model card. Have you tried that?
Thanks for your response,
I did try that, but my GPU runs out of memory (OOM) before it even reaches the print("pipeline loaded") statement.
Try this:
pipeline = QwenImageEditPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="balanced")
Thanks for looking into that,
I did actually try that as well but I saw that it fails with:
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 22.01it/s]
Loading pipeline components...:  17%|██        | 1/6 [00:00<00:03,  1.64it/s]
Traceback (most recent call last):
File "/home/[USERNAME]/workspace/[PROJECT]/qwen-30b.py", line 9, in <module>
pipeline = QwenImageEditPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="balanced")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/diffusers/pipelines/pipeline_utils.py", line 1025, in from_pretrained
loaded_sub_model = load_sub_model(
^^^^^^^^^^^^^^^
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/diffusers/pipelines/pipeline_loading_utils.py", line 860, in load_sub_model
dispatch_model(
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/accelerate/big_modeling.py", line 426, in dispatch_model
attach_align_device_hook_on_blocks(
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/accelerate/hooks.py", line 658, in attach_align_device_hook_on_blocks
attach_execution_device_hook(
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/accelerate/hooks.py", line 451, in attach_execution_device_hook
attach_execution_device_hook(
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/accelerate/hooks.py", line 440, in attach_execution_device_hook
if not hasattr(module, "_hf_hook") and len(module.state_dict()) > 0:
^^^^^^^^^^^^^^^^^^^
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2260, in state_dict
module.state_dict(
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2260, in state_dict
module.state_dict(
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2260, in state_dict
module.state_dict(
[Previous line repeated 2 more times]
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/torch/nn/modules/module.py", line 2257, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/bitsandbytes/nn/modules.py", line 528, in _save_to_state_dict
for k, v in self.weight.quant_state.as_dict(packed=True).items():
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/bitsandbytes/functional.py", line 524, in as_dict
"nested_offset": self.offset.item(),
^^^^^^^^^^^^^^^^^^
File "/home/[USERNAME]/workspace/[PROJECT]/[VENV]/lib/python3.12/site-packages/torch/_meta_registrations.py", line 7457, in meta_local_scalar_dense
raise RuntimeError("Tensor.item() cannot be called on meta tensors")
RuntimeError: Tensor.item() cannot be called on meta tensors
Have you seen this issue before?
I am currently using:
transformers 4.55.4
bitsandbytes 0.47.0
torch 2.8.0
torchaudio 2.8.0
torchvision 0.23.0
diffusers 0.35.1
It will work on 16 GB, but you need to know how device mapping works and apply it correctly. I haven't seen this problem before, but going by your requirements.txt it looks to me like you did not install accelerate.
Ah, sorry, I do have accelerate 1.10.0. Does device_map="balanced" work for you?
Could you share your pip freeze output from an environment where device_map="balanced" works? Thank you!
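For anyone comparing environments in a thread like this, a short sketch that reports only the packages discussed here may be quicker to read than a full pip freeze. It uses only the standard library. The helper name report_versions and the package list are illustrative choices, not something taken from the model card.

```python
from importlib import metadata

# Packages discussed in this thread; extend the list as needed.
PACKAGES = ["torch", "diffusers", "transformers", "accelerate", "bitsandbytes"]


def report_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions(PACKAGES).items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

A missing package shows up as NOT INSTALLED, which would settle the accelerate question above at a glance.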
I will post updated code for 16 GB. I don't have a 16 GB card, but I can force the behavior and test.
By rights, the pipeline should load the model as is, so it is made for 20 GB of VRAM; it should automatically handle the loading in NF4. To make it work on 16 GB you have to map the components manually between CPU and GPU, e.g. keep the text encoder (TE) on the CPU and then construct the pipeline. To make it that easy on 16 GB I would have to look at more quantization, but I don't think that is necessary if you can code this one out.
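To see why moving the text encoder off the GPU helps, here is a rough back-of-the-envelope sketch of weight storage at a given bit width. The parameter counts below are assumed round numbers for illustration only, not measured values from this model; check the model card for the real component sizes. The estimate also ignores quantization metadata, the VAE, and activation memory.

```python
def weight_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB (weights only, no activations)."""
    return num_params * bits_per_param / 8 / 1024**3


# Hypothetical parameter counts, for illustration only.
components = {
    "transformer (NF4)": (20e9, 4),
    "text_encoder (NF4)": (7e9, 4),
}

if __name__ == "__main__":
    for name, (params, bits) in components.items():
        print(f"{name:20s} ~{weight_gib(params, bits):.1f} GiB")
```

Whatever the real numbers are, the same arithmetic applies: every component kept on the CPU frees its share of the VRAM budget for the transformer and its activations.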
Here is a starting point for the code. Download the model locally, use the folder path, and load the components manually:
text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, subfolder="text_encoder", trust_remote_code=True, device_map="cpu",  # or "balanced"
)
# Check the docs for the CPU device map; I am just typing as I think.
Do the same for the other components; you can use device_map="cuda" for the rest. It is not even necessary to specify the other ones: just override the text encoder when you create the pipeline, with text_encoder=your_text_encoder_on_cpu.
The rest is the usual inference. On my GPU I can reproduce this with a memory limit of 16 GB, so this is one way it would work. This is also how we use the other model we posted for Qwen_image, and it works on 16 GB.
Thanks! Let me try to do this!
Nice, I was able to make it work with:
import os
from PIL import Image
import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from diffusers import QwenImageEditPipeline

# Load text encoder
text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "path/to/your/model/",
    subfolder="text_encoder",
    trust_remote_code=True,
    device_map="cpu",
    torch_dtype=torch.bfloat16
)

# Load pipeline
pipeline = QwenImageEditPipeline.from_pretrained(
    "path/to/your/model/",
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)
pipeline.reset_device_map()
pipeline.enable_model_cpu_offload()

# Load input image
image = Image.open("input_image.png").convert("RGB")

# Define prompt
prompt = "Convert to painting art style."

# Set up inputs
inputs = {
    "image": image,
    "prompt": prompt,
    "generator": torch.manual_seed(0),
    "true_cfg_scale": 4,
    "negative_prompt": "blurry, low quality,",
    "num_inference_steps": 20,
}

# Generate image
with torch.inference_mode():
    output = pipeline(**inputs)

# Save output
output_image = output.images[0]
output_image.save("output.png")
It's always good to see people figure things out and share their working examples. Glad it worked and I hope you post some benchmarks from your card.
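For anyone who wants to post such benchmarks, here is a minimal sketch for timing one call and recording peak VRAM. The benchmark helper is a hypothetical name introduced here, not part of diffusers. It assumes that PyTorch's allocator statistics on the current CUDA device are a good enough proxy for VRAM use, and it falls back to timing only when torch or CUDA is unavailable.

```python
import time

try:
    import torch  # optional; only needed for the VRAM measurement
except ImportError:
    torch = None


def benchmark(fn):
    """Run fn() once; return (result, seconds, peak VRAM in bytes or None)."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        # Start from a clean peak counter and wait for pending GPU work.
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    if use_cuda:
        # GPU kernels run asynchronously; wait before stopping the clock.
        torch.cuda.synchronize()
    seconds = time.perf_counter() - start
    peak = torch.cuda.max_memory_allocated() if use_cuda else None
    return result, seconds, peak
```

With the script above, this could be used as output, seconds, peak = benchmark(lambda: pipeline(**inputs)), then reporting seconds and peak / 1024**3 in GiB alongside the card model and num_inference_steps.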