---
tags:
- vllm
- vision
- audio
- int4
license: mit
base_model: google/gemma-3n-E4B-it
library_name: transformers
---

# RedHatAI/gemma-3n-E4B-it-quantized.w4a16

## Model Overview
- **Model Architecture:** gemma-3n-E4B-it
- **Input:** Audio-Vision-Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
  - **Activation quantization:** FP16
- **Release Date:** 08/01/2025
- **Version:** 1.0
- **Model Developers:** RedHatAI

Quantized version of [google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it).

### Model Optimizations

This model was obtained by quantizing the weights of [google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it) to the INT4 data type. It is ready for inference with vLLM >= 0.10.0.

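As a quick check, the quantization scheme recorded in the checkpoint can be inspected without downloading the full model weights. The snippet below is an illustrative sketch: it assumes the repository ships a compressed-tensors style `quantization_config` block in its `config.json`, as llm-compressor checkpoints typically do.

```python
import json

from huggingface_hub import hf_hub_download

# Fetch only the config file, not the full checkpoint.
config_path = hf_hub_download(
    repo_id="RedHatAI/gemma-3n-E4B-it-quantized.w4a16",
    filename="config.json",
)

with open(config_path) as f:
    config = json.load(f)

# llm-compressor / compressed-tensors checkpoints record the scheme here;
# expect 4-bit integer weights and unquantized 16-bit activations.
print(json.dumps(config.get("quantization_config", {}), indent=2))
```
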
## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="RedHatAI/gemma-3n-E4B-it-quantized.w4a16",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
)

# prepare inputs
question = "What is the content of this image?"
inputs = {
    "prompt": f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n",
    "multi_modal_data": {
        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
    },
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT  : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

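For example, the model can be exposed through vLLM's OpenAI-compatible server and queried with the standard `openai` client. This is a minimal sketch; the serve command, port, and sampling parameters are illustrative defaults rather than tuned settings.

```python
# Start an OpenAI-compatible server first (illustrative flags):
#   vllm serve RedHatAI/gemma-3n-E4B-it-quantized.w4a16 --trust-remote-code --max-model-len 4096
from openai import OpenAI

# vLLM serves the OpenAI API under /v1; no API key is required by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/gemma-3n-E4B-it-quantized.w4a16",
    messages=[{"role": "user", "content": "Summarize what this model can do in one sentence."}],
    temperature=0.2,
    max_tokens=64,
)
print(response.choices[0].message.content)
```
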
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

<details>
<summary>Model Creation Code</summary>

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.utils import dispatch_for_generation

# Load model.
model_id = "google/gemma-3n-E4B-it"
model = Gemma3nForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Oneshot arguments
DATASET_ID = "flickr30k"
DATASET_SPLIT = {"calibration": "test[:512]"}
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048


# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}


dampening_frac = 0.01

# Recipe
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        ignore=[
            "re:.*embed_audio.*",
            "re:.*embed_vision.*",
            "re:.*audio_tower.*",
            "re:.*vision_tower.*",
            "re:.*altup.*",
            "re:.*lm_head.*",
            "re:.*laurel.*",
            r"re:model\.language_model\.layers\.\d+\.per_layer_input_gate",
            r"re:model\.language_model\.layers\.\d+\.per_layer_projection",
            "model.language_model.per_layer_model_projection",
        ],
        dampening_frac=dampening_frac,
    ),
]

SAVE_DIR = f"{model_id.split('/')[1]}-quantized.{recipe[0].scheme}"

# Perform oneshot
oneshot(
    model=model,
    tokenizer=model_id,
    dataset=DATASET_ID,
    splits=DATASET_SPLIT,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    data_collator=data_collator,
    # gemma3n has broken weight offloading, which is required by the sequential pipeline
    pipeline="basic",
    # gemma3n does not support untying word embeddings
    tie_word_embeddings=True,
    output_dir=SAVE_DIR,
)

# Save to disk compressed.
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```
</details>

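Once saved, the compressed checkpoint can be reloaded for a quick sanity check. The snippet below is a minimal sketch, assuming the `compressed-tensors` package is installed so that transformers can load the quantized weights; `SAVE_DIR` refers to the output directory produced by the creation script above.

```python
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

SAVE_DIR = "gemma-3n-E4B-it-quantized.W4A16"  # output directory from the creation script

# Reload the compressed checkpoint (requires the compressed-tensors package).
model = Gemma3nForConditionalGeneration.from_pretrained(SAVE_DIR, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(SAVE_DIR, trust_remote_code=True)

# Text-only smoke test using the model's chat template.
messages = [{"role": "user", "content": [{"type": "text", "text": "Describe a cherry blossom in one sentence."}]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
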
## Evaluation

The model was evaluated using [lm_evaluation_harness](https://github.com/EleutherAI/lm-evaluation-harness) on the OpenLLM V1 and Leaderboard V2 text-based benchmarks. The evaluations were conducted using the following commands:

<details>
<summary>Evaluation Commands</summary>

### OpenLLM V1

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=false,max_model_len=4096,gpu_memory_utilization=0.8,enable_chunked_prefill=True,enforce_eager=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto \
  --apply_chat_template \
  --fewshot_as_multiturn
```

### Leaderboard V2

```bash
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=false,max_model_len=15000,gpu_memory_utilization=0.5,enable_chunked_prefill=True,enforce_eager=True,trust_remote_code=True \
  --tasks leaderboard \
  --batch_size auto \
  --apply_chat_template \
  --fewshot_as_multiturn
```
</details>

### Accuracy

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>google/gemma-3n-E4B-it</th>
      <th>RedHatAI/gemma-3n-E4B-it-quantized.w4a16</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="7"><b>OpenLLM V1</b></td>
      <td>arc_challenge</td>
      <td>60.24</td>
      <td>59.30</td>
      <td>98.44%</td>
    </tr>
    <tr>
      <td>gsm8k</td>
      <td>60.12</td>
      <td>65.13</td>
      <td>108.34%</td>
    </tr>
    <tr>
      <td>hellaswag</td>
      <td>74.94</td>
      <td>73.31</td>
      <td>97.82%</td>
    </tr>
    <tr>
      <td>mmlu</td>
      <td>64.14</td>
      <td>63.08</td>
      <td>98.35%</td>
    </tr>
    <tr>
      <td>truthfulqa_mc2</td>
      <td>54.87</td>
      <td>54.31</td>
      <td>99.00%</td>
    </tr>
    <tr>
      <td>winogrande</td>
      <td>68.35</td>
      <td>66.77</td>
      <td>97.68%</td>
    </tr>
    <tr>
      <td><b>Average</b></td>
      <td>63.78</td>
      <td>63.65</td>
      <td><b>99.80%</b></td>
    </tr>
    <tr>
      <td rowspan="7"><b>Leaderboard</b></td>
      <td>bbh</td>
      <td>55.46</td>
      <td>54.89</td>
      <td>98.97%</td>
    </tr>
    <tr>
      <td>mmlu_pro</td>
      <td>34.38</td>
      <td>32.05</td>
      <td>93.23%</td>
    </tr>
    <tr>
      <td>musr</td>
      <td>33.20</td>
      <td>34.66</td>
      <td>104.40%</td>
    </tr>
    <tr>
      <td>ifeval</td>
      <td>84.41</td>
      <td>81.65</td>
      <td>96.73%</td>
    </tr>
    <tr>
      <td>gpqa</td>
      <td>30.87</td>
      <td>28.69</td>
      <td>92.95%</td>
    </tr>
    <tr>
      <td>math_hard</td>
      <td>45.54</td>
      <td>39.95</td>
      <td>87.72%</td>
    </tr>
    <tr>
      <td><b>Average</b></td>
      <td>47.31</td>
      <td>45.32</td>
      <td><b>95.78%</b></td>
    </tr>
  </tbody>
</table>
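
For reference, the recovery column reports the quantized score as a percentage of the baseline score. A minimal sketch of the calculation, using the arc_challenge row above:

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score expressed as a percentage of the baseline score."""
    return 100.0 * quantized / baseline

# arc_challenge: 59.30 vs. 60.24 -> approximately 98.44%
print(f"{recovery(60.24, 59.30):.2f}%")
```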