LeanQuant committed
Commit ff195eb · verified · 1 parent: 01ca8f9

Add files using upload-large-folder tool

Files changed (3):
  1. README.md +150 -0
  2. config.json +28 -0
  3. diffusion_pytorch_model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,150 @@
---
base_model:
- Qwen/Qwen-Image
base_model_relation: quantized
tags:
- dfloat11
- df11
- lossless compression
- 70% size, 100% accuracy
---

# DFloat11 Compressed Model: `Qwen/Qwen-Image`

This is a **DFloat11 losslessly compressed** version of the original `Qwen/Qwen-Image` model. It reduces model size by **32%** compared to the original BFloat16 model, while maintaining **bit-identical outputs** and supporting **efficient GPU inference**.

🔥🔥🔥 Thanks to DFloat11 compression, Qwen-Image can now run on **a single 32GB GPU**, or on **a single 16GB GPU with CPU offloading**, while maintaining full model quality. 🔥🔥🔥

### 📊 Performance Comparison

| Model                                  | Model Size | Peak GPU Memory (1328x1328 image generation) | Generation Time (A100 GPU) |
|----------------------------------------|------------|----------------------------------------------|----------------------------|
| Qwen-Image (BFloat16)                  | ~41 GB     | OOM                                          | -                          |
| Qwen-Image (DFloat11)                  | 28.42 GB   | 29.74 GB                                     | 100 seconds                |
| Qwen-Image (DFloat11 + CPU Offloading) | 28.42 GB   | 16.68 GB                                     | 260 seconds                |

### 🔧 How to Use

1. Install or upgrade the DFloat11 pip package *(installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed)*:

```bash
pip install -U dfloat11[cuda12]
```

2. Install or upgrade diffusers:

```bash
pip install git+https://github.com/huggingface/diffusers
```

3. Save the following code to a Python file `qwen_image.py`:

```python
from diffusers import DiffusionPipeline, QwenImageTransformer2DModel
import torch
from transformers.modeling_utils import no_init_weights
from dfloat11 import DFloat11Model
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description='Generate images using Qwen-Image model')
    parser.add_argument('--cpu_offload', action='store_true', help='Enable CPU offloading')
    parser.add_argument('--prompt', type=str, default='A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197".',
                        help='Text prompt for image generation')
    parser.add_argument('--negative_prompt', type=str, default=' ',
                        help='Negative prompt for image generation')
    parser.add_argument('--aspect_ratio', type=str, default='16:9', choices=['1:1', '16:9', '9:16', '4:3', '3:4'],
                        help='Aspect ratio of generated image')
    parser.add_argument('--num_inference_steps', type=int, default=50,
                        help='Number of denoising steps')
    parser.add_argument('--true_cfg_scale', type=float, default=4.0,
                        help='Classifier-free guidance scale')
    parser.add_argument('--seed', type=int, default=42,
                        help='Random seed for generation')
    parser.add_argument('--output', type=str, default='example.png',
                        help='Output image path')
    parser.add_argument('--language', type=str, default='en', choices=['en', 'zh'],
                        help='Language for positive magic prompt')
    return parser.parse_args()

args = parse_args()

model_name = "Qwen/Qwen-Image"

# Build the transformer skeleton without materializing BFloat16 weights;
# the DFloat11-compressed weights are loaded into it below.
with no_init_weights():
    transformer = QwenImageTransformer2DModel.from_config(
        QwenImageTransformer2DModel.load_config(
            model_name, subfolder="transformer",
        ),
    ).to(torch.bfloat16)

DFloat11Model.from_pretrained(
    "DFloat11/Qwen-Image-DF11",
    device="cpu",
    cpu_offload=args.cpu_offload,
    bfloat16_model=transformer,
)

pipe = DiffusionPipeline.from_pretrained(
    model_name,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

positive_magic = {
    "en": "Ultra HD, 4K, cinematic composition.",  # for English prompts
    "zh": "超清,4K,电影级构图",  # for Chinese prompts
}

# Supported aspect ratios and their pixel dimensions
aspect_ratios = {
    "1:1": (1328, 1328),
    "16:9": (1664, 928),
    "9:16": (928, 1664),
    "4:3": (1472, 1140),
    "3:4": (1140, 1472),
}

width, height = aspect_ratios[args.aspect_ratio]

image = pipe(
    prompt=args.prompt + positive_magic[args.language],
    negative_prompt=args.negative_prompt,
    width=width,
    height=height,
    num_inference_steps=args.num_inference_steps,
    true_cfg_scale=args.true_cfg_scale,
    generator=torch.Generator(device="cuda").manual_seed(args.seed),
).images[0]

image.save(args.output)

max_memory = torch.cuda.max_memory_allocated()
print(f"Max memory: {max_memory / (1000 ** 3):.2f} GB")
```

4. To run without CPU offloading (32GB VRAM required):

```bash
python qwen_image.py
```

To run with CPU offloading (16GB VRAM required):

```bash
python qwen_image.py --cpu_offload
```

### 🔍 How It Works

We apply **Huffman coding** to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
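
As a quick illustration of why exponent bits compress so well, here is a minimal sketch (not part of the `dfloat11` package) that measures the empirical entropy of the exponent field in a random BFloat16 tensor; for Gaussian-like weights it comes out close to the ~2.6 bits quoted above:

```python
import math
from collections import Counter

import torch

# Stand-in for real model weights; roughly normal-distributed values.
w = torch.randn(1_000_000).to(torch.bfloat16)

# BFloat16 layout: 1 sign bit | 8 exponent bits | 7 mantissa bits.
raw = w.view(torch.uint16).int()          # reinterpret the raw 16-bit payload
exponents = ((raw >> 7) & 0xFF).tolist()  # extract the 8 exponent bits

# Empirical Shannon entropy of the exponent distribution.
n = len(exponents)
counts = Counter(exponents)
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
print(f"exponent entropy ≈ {entropy:.2f} bits (8 bits stored)")
```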

The result is a model that is **~32% smaller**, delivers **bit-identical outputs**, and achieves performance **comparable to the original** BFloat16 model.

Learn more in our [research paper](https://arxiv.org/abs/2504.11651).

### 📄 Learn More

* **Paper**: [70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float](https://arxiv.org/abs/2504.11651)
* **GitHub**: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11)
* **HuggingFace**: [https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)
config.json ADDED
@@ -0,0 +1,28 @@
{
  "dfloat11_config": {
    "bytes_per_thread": 8,
    "pattern_dict": {
      "transformer_blocks\\.\\d+": [
        "img_mod.1",
        "attn.to_q",
        "attn.to_k",
        "attn.to_v",
        "attn.add_k_proj",
        "attn.add_v_proj",
        "attn.add_q_proj",
        "attn.to_out.0",
        "attn.to_add_out",
        "img_mlp.net.0.proj",
        "img_mlp.net.2",
        "txt_mod.1",
        "txt_mlp.net.0.proj",
        "txt_mlp.net.2"
      ]
    },
    "threads_per_block": [
      512
    ],
    "version": "0.3.1"
  },
  "model_type": "qwen2_5_vl"
}
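
For readers wondering how `pattern_dict` is used: each key is a regular expression over module paths in the transformer, and the listed names are the submodules within every matching block whose weights are stored in DFloat11 form. The exact matching logic lives inside the `dfloat11` package; the sketch below (with hypothetical module paths and a reduced suffix set) only illustrates the idea:

```python
import re

# One pattern from the config above ("\\." in JSON is "\." in the regex).
block_pattern = re.compile(r"transformer_blocks\.\d+")
compressed_submodules = {"attn.to_q", "attn.to_k", "attn.to_v", "img_mlp.net.0.proj"}

# Hypothetical module paths, as they might appear in the transformer.
paths = [
    "transformer_blocks.0.attn.to_q",            # matched -> DFloat11
    "transformer_blocks.17.img_mlp.net.0.proj",  # matched -> DFloat11
    "time_text_embed.timestep_embedder",         # unmatched -> plain BFloat16
]

for path in paths:
    m = block_pattern.match(path)
    is_compressed = bool(m) and path[m.end():].lstrip(".") in compressed_submodules
    print(f"{path}: {'DFloat11' if is_compressed else 'BFloat16'}")
```
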
diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:37837fc3e638dbb9296584e8a417fa8d624fc637e2efb5902ee3cb1f903ddbcd
size 28423288808
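
This is a standard Git LFS pointer file; the weights themselves are stored via LFS. The recorded size, 28,423,288,808 bytes ≈ 28.42 GB, matches the DFloat11 model size reported in the README's performance table.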