DFloat11 Compressed Model: Qwen/Qwen-Image

This is a DFloat11 losslessly compressed version of the original Qwen/Qwen-Image model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

🔥🔥🔥 Thanks to DFloat11 compression, Qwen-Image can now run on a single 32GB GPU, or on a single 16GB GPU with CPU offloading, while maintaining full model quality. 🔥🔥🔥

📊 Performance Comparison

| Model | Model Size | Peak GPU Memory (1328x1328 image generation) | Generation Time (A100 GPU) |
|---|---|---|---|
| Qwen-Image (BFloat16) | ~41 GB | OOM | - |
| Qwen-Image (DFloat11) | 28.42 GB | 29.74 GB | 100 seconds |
| Qwen-Image (DFloat11 + CPU Offloading) | 28.42 GB | 16.68 GB | 260 seconds |

🔧 How to Use

  1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):

    pip install -U "dfloat11[cuda12]"
    
  2. Install or upgrade diffusers:

    pip install git+https://github.com/huggingface/diffusers
    
  3. Save the following code to a Python file named qwen_image.py:

    from diffusers import DiffusionPipeline, QwenImageTransformer2DModel
    import torch
    from transformers.modeling_utils import no_init_weights
    from dfloat11 import DFloat11Model
    import argparse
    
    def parse_args():
        parser = argparse.ArgumentParser(description='Generate images using Qwen-Image model')
        parser.add_argument('--cpu_offload', action='store_true', help='Enable CPU offloading')
        parser.add_argument('--no_pin_memory', action='store_true', help='Disable memory pinning')
        parser.add_argument('--prompt', type=str, default='A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197".',
                            help='Text prompt for image generation')
        parser.add_argument('--negative_prompt', type=str, default=' ',
                            help='Negative prompt for image generation')
        parser.add_argument('--aspect_ratio', type=str, default='16:9', choices=['1:1', '16:9', '9:16', '4:3', '3:4'],
                            help='Aspect ratio of generated image')
        parser.add_argument('--num_inference_steps', type=int, default=50,
                            help='Number of denoising steps')
        parser.add_argument('--true_cfg_scale', type=float, default=4.0,
                            help='Classifier-free guidance scale')
        parser.add_argument('--seed', type=int, default=42,
                            help='Random seed for generation')
        parser.add_argument('--output', type=str, default='example.png',
                            help='Output image path')
        parser.add_argument('--language', type=str, default='en', choices=['en', 'zh'],
                            help='Language for positive magic prompt')
        return parser.parse_args()
    
    args = parse_args()
    
    model_name = "Qwen/Qwen-Image"
    
    # Build the transformer architecture without materializing its weights;
    # the DFloat11 checkpoint (loaded next) supplies them in compressed form.
    with no_init_weights():
        transformer = QwenImageTransformer2DModel.from_config(
            QwenImageTransformer2DModel.load_config(
                model_name, subfolder="transformer",
            ),
        ).to(torch.bfloat16)
    
    # Load the losslessly compressed DFloat11 weights into the transformer.
    # With --cpu_offload, compressed weights stay in (optionally pinned) CPU
    # memory and are decompressed on the GPU only when needed.
    DFloat11Model.from_pretrained(
        "DFloat11/Qwen-Image-DF11",
        device="cpu",
        cpu_offload=args.cpu_offload,
        pin_memory=not args.no_pin_memory,
        bfloat16_model=transformer,
    )
    
    pipe = DiffusionPipeline.from_pretrained(
        model_name,
        transformer=transformer,
        torch_dtype=torch.bfloat16,
    )
    pipe.enable_model_cpu_offload()
    
    # Quality-boosting suffix appended to the prompt (from the official Qwen-Image examples)
    positive_magic = {
        "en": "Ultra HD, 4K, cinematic composition.",  # for English prompts
        "zh": "超清，4K，电影级构图",  # for Chinese prompts
    }
    
    # Supported aspect ratios mapped to (width, height) in pixels
    aspect_ratios = {
        "1:1": (1328, 1328),
        "16:9": (1664, 928),
        "9:16": (928, 1664),
        "4:3": (1472, 1140),
        "3:4": (1140, 1472),
    }
    
    width, height = aspect_ratios[args.aspect_ratio]
    
    image = pipe(
        prompt=args.prompt + positive_magic[args.language],
        negative_prompt=args.negative_prompt,
        width=width,
        height=height,
        num_inference_steps=args.num_inference_steps,
        true_cfg_scale=args.true_cfg_scale,
        generator=torch.Generator(device="cuda").manual_seed(args.seed)
    ).images[0]
    
    image.save(args.output)
    
    # Report the peak GPU memory allocated during generation
    max_memory = torch.cuda.max_memory_allocated()
    print(f"Max memory: {max_memory / (1000 ** 3):.2f} GB")
    
  4. To run without CPU offloading (requires 32 GB of VRAM):

    python qwen_image.py
    

    To run with CPU offloading (requires 16 GB of VRAM):

    python qwen_image.py --cpu_offload
    

    If you encounter out-of-memory errors, try disabling memory pinning:

    python qwen_image.py --cpu_offload --no_pin_memory
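
    Optionally, sanity-check reproducibility: with a fixed seed, repeated runs on the same hardware and software stack should produce the same image (this checks the determinism of the DFloat11 pipeline; a direct comparison against the uncompressed model would require loading its ~41 GB of weights). For example:

    python qwen_image.py --seed 42 --output run1.png
    python qwen_image.py --seed 42 --output run2.png
    sha256sum run1.png run2.png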
    

๐Ÿ” How It Works

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
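
As a rough illustration of why the exponents compress so well (a toy sketch, not the DFloat11 implementation), the following measures the empirical entropy of the exponent field of a synthetic, roughly zero-centered BFloat16 weight tensor:

    import torch

    # Trained network weights are roughly zero-centered with small magnitudes,
    # so most BFloat16 exponent values fall into a narrow band.
    w = (torch.randn(1_000_000) * 0.02).to(torch.bfloat16)

    # A BFloat16 value is [1 sign bit | 8 exponent bits | 7 mantissa bits].
    # Reinterpret the raw 16-bit patterns and extract the exponent field.
    bits = w.view(torch.int16).to(torch.int32) & 0xFFFF
    exponents = (bits >> 7) & 0xFF

    # Empirical Shannon entropy of the exponent distribution, in bits.
    counts = torch.bincount(exponents, minlength=256).float()
    probs = counts[counts > 0] / counts.sum()
    entropy = -(probs * probs.log2()).sum().item()
    print(f"exponent entropy: {entropy:.2f} bits out of 8")

A Huffman code assigns short codewords to the frequent exponent values, bringing the average cost per exponent close to this entropy; the sign and mantissa bits are stored unchanged, which is what keeps the scheme lossless.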

The result is a model that is ~32% smaller (1 sign bit + 7 mantissa bits + ~2.6 bits of exponent entropy ≈ 11 bits per weight instead of 16, hence the name DFloat11), delivers bit-identical outputs, and achieves inference performance comparable to the original BFloat16 model.
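
For intuition about the inference path, here is a deliberately simplified, hypothetical PyTorch sketch of the on-the-fly decompression idea (the real package implements this as a fused CUDA kernel with its own storage format; DecompressOnTheFlyLinear and decompress_fn are illustrative names, not part of the dfloat11 API):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecompressOnTheFlyLinear(nn.Module):
        """Hypothetical linear layer that keeps its weight compressed at rest."""

        def __init__(self, compressed, shape, decompress_fn):
            super().__init__()
            self.compressed = compressed        # entropy-coded weight bytes
            self.shape = shape                  # original weight shape
            self.decompress_fn = decompress_fn  # stand-in for the fused CUDA kernel

        def forward(self, x):
            # Losslessly rebuild the exact BFloat16 weight just in time, so the
            # matmul is bit-identical to one using the uncompressed weight. The
            # temporary tensor is freed after use, keeping resident memory close
            # to the compressed size.
            weight = self.decompress_fn(self.compressed).view(self.shape)
            return F.linear(x, weight)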

Learn more in our research paper.
