Permanent residence in VRAM
Every time I do an image generation, even if I keep the pipe alive, it seems to be loading/unloading stuff, which adds tons of waiting time.
I tried removing the `enable_model_cpu_offload()` call, but then there are no memory savings compared to base Qwen-Image.
Use case is: I run this as a service and I want to have it all in VRAM permanently for fast serving (reaping the benefit of lower VRAM usage with DF11).
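For context, my serving loop looks roughly like this (a simplified sketch; the actual loading follows the model card, and `get_next_request`/`send_response` are just stand-ins for my serving code):

```python
import torch
from diffusers import DiffusionPipeline

# Load once at startup and keep the pipe alive for the lifetime of the service.
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
# ... DFloat11 transformer gets attached here, as in the model card ...
pipe.enable_model_cpu_offload()  # saves VRAM, but weights shuffle CPU<->GPU on every call

while True:
    prompt = get_next_request()   # hypothetical: however requests arrive
    image = pipe(prompt=prompt, num_inference_steps=50).images[0]
    send_response(image)          # hypothetical: however results are returned
```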
I spent a whole day trying what you are suggesting. I am not the author and not at his level. What I learned was that it's a trade-off: you are getting a lossless full model, but at the cost of longer generation times. The internals of DFloat11 handle CPU offloading better than the default CPU offloads available in PyTorch, which means the offloading is managed inside DFloat11 itself. The way DFloat11 does this, it probably moves one block or a few blocks at a time (I think it's just one), so it uses little VRAM during inference. I am not sure if you are using `pin_memory=True` (the default) or not, but if you have more than enough RAM you will see "some" speedup.
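In case it helps, this is roughly where the `pin_memory` option goes (a hedged sketch; I am going from memory of the model card, so double-check the exact argument names there):

```python
import torch
from diffusers import DiffusionPipeline
from dfloat11 import DFloat11Model

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)

# Attach the losslessly compressed DF11 weights to the pipe's transformer, with
# DFloat11's own CPU offloading. pin_memory=True keeps the offloaded blocks in
# pinned RAM so CPU->GPU copies are faster (at the cost of extra RAM).
DFloat11Model.from_pretrained(
    "DFloat11/Qwen-Image-DF11",   # repo name as I recall it; check the model card
    device="cpu",
    cpu_offload=True,
    pin_memory=True,
    bfloat16_model=pipe.transformer,
)
```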
Only the DFloat11 author can optimize their package, but that may not be possible. The other option might be using a quantized version that can fully load into memory; then you don't need DFloat11.
Thank you for bringing this to my attention!
I have added a feature in the DFloat11 package for configuring the number of blocks to offload, which means
- offloading more blocks uses less GPU memory and more CPU memory,
- offloading fewer blocks uses more GPU memory and less CPU memory, and could be faster.
This will allow you to configure the optimal number of blocks to offload for the best balance between memory efficiency and speed. To try it, upgrade to the latest pip version with `pip install -U dfloat11[cuda12]` and follow the instructions in this model card.
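A rough sketch of how this might look, assuming the new argument is called `cpu_offload_blocks` (an assumption on my part; the model card has the exact usage):

```python
import torch
from diffusers import DiffusionPipeline
from dfloat11 import DFloat11Model

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)

# Offload only part of the transformer: fewer offloaded blocks -> more VRAM but
# faster; more offloaded blocks -> less VRAM but slower.
DFloat11Model.from_pretrained(
    "DFloat11/Qwen-Image-DF11",
    device="cpu",
    cpu_offload=True,
    cpu_offload_blocks=30,        # assumed argument name; tune up/down for your GPU
    pin_memory=True,
    bfloat16_model=pipe.transformer,
)

pipe.enable_model_cpu_offload()
```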
> Every time I do an image generation, even if I keep the pipe alive, it seems to be loading/unloading stuff, which adds tons of waiting time.
> I tried removing the `enable_model_cpu_offload()` call, but then there are no memory savings compared to base Qwen-Image.
> Use case is: I run this as a service and I want to have it all in VRAM permanently for fast serving (reaping the benefit of lower VRAM usage with DF11).
To answer your question, the DFloat11 version does save around 11GB in VRAM usage if you load everything into VRAM.
The problem is that the Qwen-Image model is larger than you think. It has a diffusion transformer (41 GB), a text encoder (14 GB), and a VAE (0.25 GB). If you load everything into VRAM, it would consume around 55 GB, which is probably larger than your GPU capacity. The DFloat11 model reduces the diffusion transformer from 41 GB to 28.5 GB, so fully loading everything with DFloat11 still takes around 43 GB (28.5 + 14 + 0.25).
To load everything into VRAM, replace the line `pipe.enable_model_cpu_offload()` with `pipe = pipe.to('cuda')`. This should remain compatible with DFloat11.
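A minimal sketch of the full change, assuming the loading code from the model card (repo name and arguments are from memory, so check the card for the exact usage):

```python
import torch
from diffusers import DiffusionPipeline
from dfloat11 import DFloat11Model

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)

# Swap in the DF11-compressed transformer weights, without DFloat11 CPU offloading.
DFloat11Model.from_pretrained(
    "DFloat11/Qwen-Image-DF11",
    device="cpu",
    bfloat16_model=pipe.transformer,
)

# Keep the whole pipeline (transformer + text encoder + VAE, ~43 GB) resident on the GPU.
pipe = pipe.to("cuda")

image = pipe(prompt="a cup of coffee on a wooden table", num_inference_steps=50).images[0]
```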