kyleavery committed on
Commit 347a6fc · verified · 1 Parent(s): 79ca94a

Upload sliced model checkpoint

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,536 @@
1
+ ---
2
+ license: gemma
3
+ library_name: transformers
4
+ pipeline_tag: image-text-to-text
5
+ extra_gated_button_content: Acknowledge license
6
+ base_model: google/gemma-3n-E4B-it
7
+ tags:
8
+ - automatic-speech-recognition
9
+ - automatic-speech-translation
10
+ - audio-text-to-text
11
+ - video-text-to-text
12
+ - matformer
13
+ ---
14
+
15
+ > [!Note]
16
+ > This is a submodel derived from `google/gemma-3n-E4B-it`. It has been modified by slicing specific layers and resizing FFN dimensions. It is not the original model.
17
+ > To learn more about MatFormers, please review the [launch blog](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide) and generate your own submodels
18
+ > with the [MatFormer Lab](https://goo.gle/gemma3n-matformer-lab).
19
+ >
20
+
21
+ Skipped layers: []
22
+
23
+ FFN hidden dimensions: [2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4]
24
+
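+ Because only the FFN widths differ from the base model here (the per-layer values above work out to 8_192 for layers 0–19 and 25–34, and 16_384 for layers 20–24), you can sanity-check the sliced architecture straight from the checkpoint's configuration. A minimal sketch, assuming the files in this repository are loaded by local path or repo id (the id below is a placeholder):
+
+ ```python
+ from transformers import AutoConfig
+
+ # Placeholder repo id -- substitute this repository's actual path.
+ config = AutoConfig.from_pretrained("<this-repo-id>")
+
+ text_cfg = config.text_config
+ print(text_cfg.num_hidden_layers)   # 35 (no layers were skipped)
+ print(text_cfg.intermediate_size)   # per-layer FFN widths, matching the list above
+ print(text_cfg.hidden_size)         # 2048
+ ```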
25
+
26
+ > [!Note]
27
+ > This model card corresponds to the launch version of Gemma 3n E4B IT (Instruct), to be used with Hugging Face `transformers`,
28
+ > supporting text, audio, and vision (image and video) inputs.
29
+ >
30
+ > Gemma 3n models have multiple architecture innovations:
31
+ > * They are available in two sizes based on [effective parameters](https://ai.google.dev/gemma/docs/gemma-3n#parameters). While the raw parameter count of this model is 8B, the architecture design allows the model to be run with a memory footprint comparable to a traditional 4B model by offloading low-utilization matrices from the accelerator.
32
+ > * They use a MatFormer architecture that allows nesting sub-models within the E4B model. We provide one sub-model (an [E2B](https://huggingface.co/google/gemma-3n-E2B-it)), or you can access a spectrum of custom-sized models using the [Mix-and-Match method](https://goo.gle/gemma3n-matformer-lab).
33
+ >
34
+ > Learn more about these techniques in the [technical blog post](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide)
35
+ > and the [Gemma documentation](https://ai.google.dev/gemma/docs/gemma-3n).
36
+
37
+ # Gemma 3n model card
38
+
39
+ **Model Page**: [Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n)
40
+
41
+ **Resources and Technical Documentation**:
42
+
43
+ - [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
44
+ - [Gemma on Kaggle](https://www.kaggle.com/models/google/gemma-3n)
45
+ - [Gemma on HuggingFace](https://huggingface.co/collections/google/gemma-3n-685065323f5984ef315c93f4)
46
+ - [Gemma on Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3n)
47
+
48
+ **Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)\
49
+ **Authors**: Google DeepMind
50
+
51
+ ## Model Information
52
+
53
+ Summary description and brief definition of inputs and outputs.
54
+
55
+ ### Description
56
+
57
+ Gemma is a family of lightweight, state-of-the-art open models from Google,
58
+ built from the same research and technology used to create the Gemini models.
59
+ Gemma 3n models are designed for efficient execution on low-resource devices.
60
+ They are capable of multimodal input, handling text, image, video, and audio
61
+ input, and generating text outputs, with open weights for pre-trained and
62
+ instruction-tuned variants. These models were trained with data in over 140
63
+ spoken languages.
64
+
65
+ Gemma 3n models use selective parameter activation technology to reduce resource
66
+ requirements. This technique allows the models to operate at an effective size
67
+ of 2B and 4B parameters, which is lower than the total number of parameters they
68
+ contain. For more information on Gemma 3n's efficient parameter management
69
+ technology, see the
70
+ [Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n#parameters)
71
+ page.
72
+
73
+ ### Inputs and outputs
74
+
75
+ - **Input:**
76
+ - Text string, such as a question, a prompt, or a document to be
77
+ summarized
78
+ - Images, normalized to 256x256, 512x512, or 768x768 resolution
79
+ and encoded to 256 tokens each
80
+ - Audio data encoded to 6.25 tokens per second from a single channel
81
+ - Total input context of 32K tokens
82
+ - **Output:**
83
+ - Generated text in response to the input, such as an answer to a
84
+ question, analysis of image content, or a summary of a document
85
+ - Total output length up to 32K tokens, subtracting the request
86
+ input tokens
87
+
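+ The per-modality costs above make context budgeting a matter of simple arithmetic. As a rough sketch (assuming the 32K window is exactly 32,768 tokens and that the per-image and per-second audio costs are fixed):
+
+ ```python
+ # Rough context-budget arithmetic based on the figures listed above.
+ CONTEXT_WINDOW = 32_768           # "32K tokens" total budget (assumed exact value)
+ TOKENS_PER_IMAGE = 256
+ AUDIO_TOKENS_PER_SECOND = 6.25
+
+ def remaining_budget(num_images: int, audio_seconds: float, text_tokens: int) -> int:
+     """Tokens left for generation after accounting for the multimodal inputs."""
+     used = (num_images * TOKENS_PER_IMAGE
+             + int(audio_seconds * AUDIO_TOKENS_PER_SECOND)
+             + text_tokens)
+     return CONTEXT_WINDOW - used
+
+ # Example: one image, 30 s of audio, and a ~200-token text prompt.
+ print(remaining_budget(1, 30, 200))  # 32768 - (256 + 187 + 200) = 32125
+ ```
+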
88
+ ### Usage
89
+
90
+ Below are some code snippets to help you get started quickly with running
91
+ the model. First, install the Transformers library. Gemma 3n is supported
92
+ starting from transformers 4.53.0.
93
+
94
+ ```sh
95
+ $ pip install -U transformers
96
+ ```
97
+
98
+ Then, copy the snippet from the section that is relevant for your use case.
99
+
100
+ #### Running with the `pipeline` API
101
+
102
+ You can initialize the model and processor for inference with `pipeline` as
103
+ follows.
104
+
105
+ ```python
106
+ from transformers import pipeline
107
+ import torch
108
+
109
+ pipe = pipeline(
110
+ "image-text-to-text",
111
+ model="google/gemma-3n-e4b-it",
112
+ device="cuda",
113
+ torch_dtype=torch.bfloat16,
114
+ )
115
+ ```
116
+
117
+ With instruction-tuned models, you need to use chat templates to process your
118
+ inputs first. Then, you can pass them to the pipeline.
119
+
120
+ ```python
121
+ messages = [
122
+ {
123
+ "role": "system",
124
+ "content": [{"type": "text", "text": "You are a helpful assistant."}]
125
+ },
126
+ {
127
+ "role": "user",
128
+ "content": [
129
+ {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
130
+ {"type": "text", "text": "What animal is on the candy?"}
131
+ ]
132
+ }
133
+ ]
134
+
135
+ output = pipe(text=messages, max_new_tokens=200)
136
+ print(output[0]["generated_text"][-1]["content"])
137
+ # Okay, let's take a look!
138
+ # Based on the image, the animal on the candy is a **turtle**.
139
+ # You can see the shell shape and the head and legs.
140
+ ```
141
+
142
+ #### Running the model on a single GPU
143
+
144
+ ```python
145
+ from transformers import AutoProcessor, Gemma3nForConditionalGeneration
146
+ from PIL import Image
147
+ import requests
148
+ import torch
149
+
150
+ model_id = "google/gemma-3n-e4b-it"
151
+
152
+ model = Gemma3nForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16).eval()
153
+
154
+ processor = AutoProcessor.from_pretrained(model_id)
155
+
156
+ messages = [
157
+ {
158
+ "role": "system",
159
+ "content": [{"type": "text", "text": "You are a helpful assistant."}]
160
+ },
161
+ {
162
+ "role": "user",
163
+ "content": [
164
+ {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
165
+ {"type": "text", "text": "Describe this image in detail."}
166
+ ]
167
+ }
168
+ ]
169
+
170
+ inputs = processor.apply_chat_template(
171
+ messages,
172
+ add_generation_prompt=True,
173
+ tokenize=True,
174
+ return_dict=True,
175
+ return_tensors="pt",
176
+ ).to(model.device)
177
+
178
+ input_len = inputs["input_ids"].shape[-1]
179
+
180
+ with torch.inference_mode():
181
+ generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
182
+ generation = generation[0][input_len:]
183
+
184
+ decoded = processor.decode(generation, skip_special_tokens=True)
185
+ print(decoded)
186
+
187
+ # **Overall Impression:** The image is a close-up shot of a vibrant garden scene,
188
+ # focusing on a cluster of pink cosmos flowers and a busy bumblebee.
189
+ # It has a slightly soft, natural feel, likely captured in daylight.
190
+ ```
191
+
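+ #### Transcribing audio (sketch)
+
+ Gemma 3n also accepts audio input through the same chat-template flow. The snippet below is a minimal sketch rather than an official example: it assumes the processor accepts `{"type": "audio", "audio": <URL or local path>}` content entries (mirroring the image entries above), and the audio URL is a placeholder you should replace.
+
+ ```python
+ from transformers import AutoProcessor, Gemma3nForConditionalGeneration
+ import torch
+
+ model_id = "google/gemma-3n-e4b-it"
+
+ model = Gemma3nForConditionalGeneration.from_pretrained(
+     model_id, device_map="auto", torch_dtype=torch.bfloat16
+ ).eval()
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             # Placeholder URL -- point this at a real audio file (e.g. a .wav).
+             {"type": "audio", "audio": "https://example.com/sample.wav"},
+             {"type": "text", "text": "Transcribe this audio clip."},
+         ],
+     }
+ ]
+
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ input_len = inputs["input_ids"].shape[-1]
+
+ with torch.inference_mode():
+     generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)
+
+ print(processor.decode(generation[0][input_len:], skip_special_tokens=True))
+ ```
+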
192
+ ### Citation
193
+
194
+ ```
195
+ @article{gemma_3n_2025,
196
+ title={Gemma 3n},
197
+ url={https://ai.google.dev/gemma/docs/gemma-3n},
198
+ publisher={Google DeepMind},
199
+ author={Gemma Team},
200
+ year={2025}
201
+ }
202
+ ```
203
+
204
+ ## Model Data
205
+
206
+ Data used for model training and how the data was processed.
207
+
208
+ ### Training Dataset
209
+
210
+ These models were trained on a dataset that includes a wide variety of sources
211
+ totalling approximately 11 trillion tokens. The knowledge cutoff date for the
212
+ training data was June 2024. Here are the key components:
213
+
214
+ - **Web Documents**: A diverse collection of web text ensures the model
215
+ is exposed to a broad range of linguistic styles, topics, and vocabulary.
216
+ The training dataset includes content in over 140 languages.
217
+ - **Code**: Exposing the model to code helps it to learn the syntax and
218
+ patterns of programming languages, which improves its ability to generate
219
+ code and understand code-related questions.
220
+ - **Mathematics**: Training on mathematical text helps the model learn
221
+ logical reasoning, symbolic representation, and how to address mathematical queries.
222
+ - **Images**: A wide range of images enables the model to perform image
223
+ analysis and visual data extraction tasks.
224
+ - **Audio**: A diverse set of sound samples enables the model to recognize
225
+ speech, transcribe text from recordings, and identify information in audio data.
226
+
227
+ The combination of these diverse data sources is crucial for training a
228
+ powerful multimodal model that can handle a wide variety of different tasks and
229
+ data formats.
230
+
231
+ ### Data Preprocessing
232
+
233
+ Here are the key data cleaning and filtering methods applied to the training
234
+ data:
235
+
236
+ - **CSAM Filtering**: Rigorous CSAM (Child Sexual Abuse Material)
237
+ filtering was applied at multiple stages in the data preparation process to
238
+ ensure the exclusion of harmful and illegal content.
239
+ - **Sensitive Data Filtering**: As part of making Gemma pre-trained models
240
+ safe and reliable, automated techniques were used to filter out certain
241
+ personal information and other sensitive data from training sets.
242
+ - **Additional methods**: Filtering based on content quality and safety in
243
+ line with
244
+ [our policies](https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf).
245
+
246
+ ## Implementation Information
247
+
248
+ Details about the model internals.
249
+
250
+ ### Hardware
251
+
252
+ Gemma was trained using [Tensor Processing Unit
253
+ (TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu) hardware (TPUv4p, TPUv5p
254
+ and TPUv5e). Training generative models requires significant computational
255
+ power. TPUs, designed specifically for matrix operations common in machine
256
+ learning, offer several advantages in this domain:
257
+
258
+ - **Performance**: TPUs are specifically designed to handle the massive
259
+ computations involved in training generative models. They can speed up
260
+ training considerably compared to CPUs.
261
+ - **Memory**: TPUs often come with large amounts of high-bandwidth memory,
262
+ allowing for the handling of large models and batch sizes during training.
263
+ This can lead to better model quality.
264
+ - **Scalability**: TPU Pods (large clusters of TPUs) provide a scalable
265
+ solution for handling the growing complexity of large foundation models.
266
+ You can distribute training across multiple TPU devices for faster and more
267
+ efficient processing.
268
+ - **Cost-effectiveness**: In many scenarios, TPUs can provide a more
269
+ cost-effective solution for training large models compared to CPU-based
270
+ infrastructure, especially when considering the time and resources saved
271
+ due to faster training.
272
+
273
+ These advantages are aligned with
274
+ [Google's commitments to operate sustainably](https://sustainability.google/operating-sustainably/).
275
+
276
+ ### Software
277
+
278
+ Training was done using [JAX](https://github.com/jax-ml/jax) and
279
+ [ML Pathways](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/).
280
+ JAX allows researchers to take advantage of the latest generation of hardware,
281
+ including TPUs, for faster and more efficient training of large models. ML
282
+ Pathways is Google's latest effort to build artificially intelligent systems
283
+ capable of generalizing across multiple tasks. This is specially suitable for
284
+ foundation models, including large language models like these ones.
285
+
286
+ Together, JAX and ML Pathways are used as described in the
287
+ [paper about the Gemini family of models](https://goo.gle/gemma2report):
288
+ *"the 'single controller' programming model of Jax and Pathways allows a single
289
+ Python process to orchestrate the entire training run, dramatically simplifying
290
+ the development workflow."*
291
+
292
+ ## Evaluation
293
+
294
+ Model evaluation metrics and results.
295
+
296
+ ### Benchmark Results
297
+
298
+ These models were evaluated at full precision (float32) against a large
299
+ collection of different datasets and metrics to cover different aspects of
300
+ content generation. Evaluation results marked with **IT** are for
301
+ instruction-tuned models. Evaluation results marked with **PT** are for
302
+ pre-trained models.
303
+
304
+ #### Reasoning and factuality
305
+
306
+ | Benchmark | Metric | n-shot | E2B PT | E4B PT |
307
+ | ------------------------------ |----------------|----------|:--------:|:--------:|
308
+ | [HellaSwag][hellaswag] | Accuracy | 10-shot | 72.2 | 78.6 |
309
+ | [BoolQ][boolq] | Accuracy | 0-shot | 76.4 | 81.6 |
310
+ | [PIQA][piqa] | Accuracy | 0-shot | 78.9 | 81.0 |
311
+ | [SocialIQA][socialiqa] | Accuracy | 0-shot | 48.8 | 50.0 |
312
+ | [TriviaQA][triviaqa] | Accuracy | 5-shot | 60.8 | 70.2 |
313
+ | [Natural Questions][naturalq] | Accuracy | 5-shot | 15.5 | 20.9 |
314
+ | [ARC-c][arc] | Accuracy | 25-shot | 51.7 | 61.6 |
315
+ | [ARC-e][arc] | Accuracy | 0-shot | 75.8 | 81.6 |
316
+ | [WinoGrande][winogrande] | Accuracy | 5-shot | 66.8 | 71.7 |
317
+ | [BIG-Bench Hard][bbh] | Accuracy | few-shot | 44.3 | 52.9 |
318
+ | [DROP][drop] | Token F1 score | 1-shot | 53.9 | 60.8 |
319
+
320
+ [hellaswag]: https://arxiv.org/abs/1905.07830
321
+ [boolq]: https://arxiv.org/abs/1905.10044
322
+ [piqa]: https://arxiv.org/abs/1911.11641
323
+ [socialiqa]: https://arxiv.org/abs/1904.09728
324
+ [triviaqa]: https://arxiv.org/abs/1705.03551
325
+ [naturalq]: https://github.com/google-research-datasets/natural-questions
326
+ [arc]: https://arxiv.org/abs/1911.01547
327
+ [winogrande]: https://arxiv.org/abs/1907.10641
328
+ [bbh]: https://paperswithcode.com/dataset/bbh
329
+ [drop]: https://arxiv.org/abs/1903.00161
330
+
331
+ #### Multilingual
332
+
333
+ | Benchmark | Metric | n-shot | E2B IT | E4B IT |
334
+ | ------------------------------------|-------------------------|----------|:--------:|:--------:|
335
+ | [MGSM][mgsm] | Accuracy | 0-shot | 53.1 | 60.7 |
336
+ | [WMT24++][wmt24pp] (ChrF) | Character-level F-score | 0-shot | 42.7 | 50.1 |
337
+ | [Include][include] | Accuracy | 0-shot | 38.6 | 57.2 |
338
+ | [MMLU][mmlu] (ProX) | Accuracy | 0-shot | 8.1 | 19.9 |
339
+ | [OpenAI MMLU][openai-mmlu] | Accuracy | 0-shot | 22.3 | 35.6 |
340
+ | [Global-MMLU][global-mmlu] | Accuracy | 0-shot | 55.1 | 60.3 |
341
+ | [ECLeKTic][eclektic] | ECLeKTic score | 0-shot | 2.5 | 1.9 |
342
+
343
+ [mgsm]: https://arxiv.org/abs/2210.03057
344
+ [wmt24pp]: https://arxiv.org/abs/2502.12404v1
345
+ [include]:https://arxiv.org/abs/2411.19799
346
+ [mmlu]: https://arxiv.org/abs/2009.03300
347
+ [openai-mmlu]: https://huggingface.co/datasets/openai/MMMLU
348
+ [global-mmlu]: https://huggingface.co/datasets/CohereLabs/Global-MMLU
349
+ [eclektic]: https://arxiv.org/abs/2502.21228
350
+
351
+ #### STEM and code
352
+
353
+ | Benchmark | Metric | n-shot | E2B IT | E4B IT |
354
+ | ------------------------------------|--------------------------|----------|:--------:|:--------:|
355
+ | [GPQA][gpqa] Diamond | RelaxedAccuracy/accuracy | 0-shot | 24.8 | 23.7 |
356
+ | [LiveCodeBench][lcb] v5 | pass@1 | 0-shot | 18.6 | 25.7 |
357
+ | Codegolf v2.2 | pass@1 | 0-shot | 11.0 | 16.8 |
358
+ | [AIME 2025][aime-2025] | Accuracy | 0-shot | 6.7 | 11.6 |
359
+
360
+ [gpqa]: https://arxiv.org/abs/2311.12022
361
+ [lcb]: https://arxiv.org/abs/2403.07974
362
+ [aime-2025]: https://www.vals.ai/benchmarks/aime-2025-05-09
363
+
364
+ #### Additional benchmarks
365
+
366
+ | Benchmark | Metric | n-shot | E2B IT | E4B IT |
367
+ | ------------------------------------ |------------|----------|:--------:|:--------:|
368
+ | [MMLU][mmlu] | Accuracy | 0-shot | 60.1 | 64.9 |
369
+ | [MBPP][mbpp] | pass@1 | 3-shot | 56.6 | 63.6 |
370
+ | [HumanEval][humaneval] | pass@1 | 0-shot | 66.5 | 75.0 |
371
+ | [LiveCodeBench][lcb] | pass@1 | 0-shot | 13.2 | 13.2 |
372
+ | HiddenMath | Accuracy | 0-shot | 27.7 | 37.7 |
373
+ | [Global-MMLU-Lite][global-mmlu-lite] | Accuracy | 0-shot | 59.0 | 64.5 |
374
+ | [MMLU][mmlu] (Pro) | Accuracy | 0-shot | 40.5 | 50.6 |
375
+
376
+ [gpqa]: https://arxiv.org/abs/2311.12022
377
+ [mbpp]: https://arxiv.org/abs/2108.07732
378
+ [humaneval]: https://arxiv.org/abs/2107.03374
379
+ [lcb]: https://arxiv.org/abs/2403.07974
380
+ [global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
381
+
382
+ ## Ethics and Safety
383
+
384
+ Ethics and safety evaluation approach and results.
385
+
386
+ ### Evaluation Approach
387
+
388
+ Our evaluation methods include structured evaluations and internal red-teaming
389
+ testing of relevant content policies. Red-teaming was conducted by a number of
390
+ different teams, each with different goals and human evaluation metrics. These
391
+ models were evaluated against a number of different categories relevant to
392
+ ethics and safety, including:
393
+
394
+ - **Child Safety**: Evaluation of text-to-text and image-to-text prompts
395
+ covering child safety policies, including child sexual abuse and
396
+ exploitation.
397
+ - **Content Safety**: Evaluation of text-to-text and image-to-text prompts
398
+ covering safety policies including harassment, violence and gore, and hate
399
+ speech.
400
+ - **Representational Harms**: Evaluation of text-to-text and image-to-text
401
+ prompts covering safety policies including bias, stereotyping, and harmful
402
+ associations or inaccuracies.
403
+
404
+ In addition to development level evaluations, we conduct "assurance
405
+ evaluations" which are our 'arms-length' internal evaluations for responsibility
406
+ governance decision making. They are conducted separately from the model
407
+ development team, to inform decision making about release. High level findings
408
+ are fed back to the model team, but prompt sets are held-out to prevent
409
+ overfitting and preserve the results' ability to inform decision making. Notable
410
+ assurance evaluation results are reported to our Responsibility & Safety Council
411
+ as part of release review.
412
+
413
+ ### Evaluation Results
414
+
415
+ For all areas of safety testing, we saw safe levels of performance across the
416
+ categories of child safety, content safety, and representational harms relative
417
+ to previous Gemma models. All testing was conducted without safety filters to
418
+ evaluate the model capabilities and behaviors. For text-to-text, image-to-text,
419
+ and audio-to-text, and across all model sizes, the model produced minimal policy
420
+ violations, and showed significant improvements over previous Gemma models'
421
+ performance with respect to high severity violations. A limitation of our
422
+ evaluations was that they included primarily English-language prompts.
423
+
424
+ ## Usage and Limitations
425
+
426
+ These models have certain limitations that users should be aware of.
427
+
428
+ ### Intended Usage
429
+
430
+ Open generative models have a wide range of applications across various
431
+ industries and domains. The following list of potential uses is not
432
+ comprehensive. The purpose of this list is to provide contextual information
433
+ about the possible use-cases that the model creators considered as part of model
434
+ training and development.
435
+
436
+ - Content Creation and Communication
437
+ - **Text Generation**: Generate creative text formats such as
438
+ poems, scripts, code, marketing copy, and email drafts.
439
+ - **Chatbots and Conversational AI**: Power conversational
440
+ interfaces for customer service, virtual assistants, or interactive
441
+ applications.
442
+ - **Text Summarization**: Generate concise summaries of a text
443
+ corpus, research papers, or reports.
444
+ - **Image Data Extraction**: Extract, interpret, and summarize
445
+ visual data for text communications.
446
+ - **Audio Data Extraction**: Transcribe spoken language, translate speech
447
+ to text in other languages, and analyze sound-based data.
448
+ - Research and Education
449
+ - **Natural Language Processing (NLP) and generative model
450
+ Research**: These models can serve as a foundation for researchers to
451
+ experiment with generative models and NLP techniques, develop
452
+ algorithms, and contribute to the advancement of the field.
453
+ - **Language Learning Tools**: Support interactive language
454
+ learning experiences, aiding in grammar correction or providing writing
455
+ practice.
456
+ - **Knowledge Exploration**: Assist researchers in exploring large
457
+ bodies of data by generating summaries or answering questions about
458
+ specific topics.
459
+
460
+ ### Limitations
461
+
462
+ - Training Data
463
+ - The quality and diversity of the training data significantly
464
+ influence the model's capabilities. Biases or gaps in the training data
465
+ can lead to limitations in the model's responses.
466
+ - The scope of the training dataset determines the subject areas
467
+ the model can handle effectively.
468
+ - Context and Task Complexity
469
+ - Models are better at tasks that can be framed with clear
470
+ prompts and instructions. Open-ended or highly complex tasks might be
471
+ challenging.
472
+ - A model's performance can be influenced by the amount of context
473
+ provided (longer context generally leads to better outputs, up to a
474
+ certain point).
475
+ - Language Ambiguity and Nuance
476
+ - Natural language is inherently complex. Models might struggle
477
+ to grasp subtle nuances, sarcasm, or figurative language.
478
+ - Factual Accuracy
479
+ - Models generate responses based on information they learned
480
+ from their training datasets, but they are not knowledge bases. They
481
+ may generate incorrect or outdated factual statements.
482
+ - Common Sense
483
+ - Models rely on statistical patterns in language. They might
484
+ lack the ability to apply common sense reasoning in certain situations.
485
+
486
+ ### Ethical Considerations and Risks
487
+
488
+ The development of generative models raises several ethical concerns. In
489
+ creating an open model, we have carefully considered the following:
490
+
491
+ - Bias and Fairness
492
+ - Generative models trained on large-scale, real-world text and image data
493
+ can reflect socio-cultural biases embedded in the training material.
494
+ These models underwent careful scrutiny, with input data pre-processing
495
+ described and subsequent evaluations reported in this card.
496
+ - Misinformation and Misuse
497
+ - Generative models can be misused to generate text that is
498
+ false, misleading, or harmful.
499
+ - Guidelines are provided for responsible use with the model, see the
500
+ [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
501
+ - Transparency and Accountability
502
+ - This model card summarizes details on the models' architecture,
503
+ capabilities, limitations, and evaluation processes.
504
+ - A responsibly developed open model offers the opportunity to
505
+ share innovation by making generative model technology accessible to
506
+ developers and researchers across the AI ecosystem.
507
+
508
+ Risks identified and mitigations:
509
+
510
+ - **Perpetuation of biases**: Developers are encouraged to perform continuous monitoring
511
+ (using evaluation metrics and human review) and to explore de-biasing
512
+ techniques during model training, fine-tuning, and other use cases.
513
+ - **Generation of harmful content**: Mechanisms and guidelines for content
514
+ safety are essential. Developers are encouraged to exercise caution and
515
+ implement appropriate content safety safeguards based on their specific
516
+ product policies and application use cases.
517
+ - **Misuse for malicious purposes**: Technical limitations and developer
518
+ and end-user education can help mitigate malicious applications of
519
+ generative models. Educational resources and reporting mechanisms for users
520
+ to flag misuse are provided. Prohibited uses of Gemma models are outlined
521
+ in the
522
+ [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
523
+ - **Privacy violations**: Models were trained on data filtered for removal of
524
+ certain personal information and other sensitive data. Developers are
525
+ encouraged to adhere to privacy regulations with privacy-preserving
526
+ techniques.
527
+
528
+ ### Benefits
529
+
530
+ At the time of release, this family of models provides high-performance open
531
+ generative model implementations designed from the ground up for responsible AI
532
+ development compared to similarly sized models.
533
+
534
+ Using the benchmark evaluation metrics described in this document, these models
535
+ have been shown to provide superior performance to other, comparably sized open model
536
+ alternatives.
chat_template.jinja ADDED
@@ -0,0 +1,49 @@
1
+ {{ bos_token }}
2
+ {%- if messages[0]['role'] == 'system' -%}
3
+ {%- if messages[0]['content'] is string -%}
4
+ {%- set first_user_prefix = messages[0]['content'] + '
5
+
6
+ ' -%}
7
+ {%- else -%}
8
+ {%- set first_user_prefix = messages[0]['content'][0]['text'] + '
9
+
10
+ ' -%}
11
+ {%- endif -%}
12
+ {%- set loop_messages = messages[1:] -%}
13
+ {%- else -%}
14
+ {%- set first_user_prefix = "" -%}
15
+ {%- set loop_messages = messages -%}
16
+ {%- endif -%}
17
+ {%- for message in loop_messages -%}
18
+ {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
19
+ {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
20
+ {%- endif -%}
21
+ {%- if (message['role'] == 'assistant') -%}
22
+ {%- set role = "model" -%}
23
+ {%- else -%}
24
+ {%- set role = message['role'] -%}
25
+ {%- endif -%}
26
+ {{ '<start_of_turn>' + role + '
27
+ ' + (first_user_prefix if loop.first else "") }}
28
+ {%- if message['content'] is string -%}
29
+ {{ message['content'] | trim }}
30
+ {%- elif message['content'] is iterable -%}
31
+ {%- for item in message['content'] -%}
32
+ {%- if item['type'] == 'audio' -%}
33
+ {{ '<audio_soft_token>' }}
34
+ {%- elif item['type'] == 'image' -%}
35
+ {{ '<image_soft_token>' }}
36
+ {%- elif item['type'] == 'text' -%}
37
+ {{ item['text'] | trim }}
38
+ {%- endif -%}
39
+ {%- endfor -%}
40
+ {%- else -%}
41
+ {{ raise_exception("Invalid content type") }}
42
+ {%- endif -%}
43
+ {{ '<end_of_turn>
44
+ ' }}
45
+ {%- endfor -%}
46
+ {%- if add_generation_prompt -%}
47
+ {{'<start_of_turn>model
48
+ '}}
49
+ {%- endif -%}
config.json ADDED
@@ -0,0 +1,227 @@
1
+ {
2
+ "architectures": [
3
+ "Gemma3nForConditionalGeneration"
4
+ ],
5
+ "audio_config": {
6
+ "conf_attention_chunk_size": 12,
7
+ "conf_attention_context_left": 13,
8
+ "conf_attention_context_right": 0,
9
+ "conf_attention_logit_cap": 50.0,
10
+ "conf_conv_kernel_size": 5,
11
+ "conf_num_attention_heads": 8,
12
+ "conf_num_hidden_layers": 12,
13
+ "conf_positional_bias_size": 256,
14
+ "conf_reduction_factor": 4,
15
+ "conf_residual_weight": 0.5,
16
+ "gradient_clipping": 10000000000.0,
17
+ "hidden_size": 1536,
18
+ "input_feat_size": 128,
19
+ "model_type": "gemma3n_audio",
20
+ "rms_norm_eps": 1e-06,
21
+ "sscp_conv_channel_size": [
22
+ 128,
23
+ 32
24
+ ],
25
+ "sscp_conv_eps": 0.001,
26
+ "sscp_conv_group_norm_eps": 0.001,
27
+ "sscp_conv_kernel_size": [
28
+ [
29
+ 3,
30
+ 3
31
+ ],
32
+ [
33
+ 3,
34
+ 3
35
+ ]
36
+ ],
37
+ "sscp_conv_stride_size": [
38
+ [
39
+ 2,
40
+ 2
41
+ ],
42
+ [
43
+ 2,
44
+ 2
45
+ ]
46
+ ],
47
+ "torch_dtype": "bfloat16",
48
+ "vocab_offset": 262272,
49
+ "vocab_size": 128
50
+ },
51
+ "audio_soft_tokens_per_image": 188,
52
+ "audio_token_id": 262273,
53
+ "boa_token_id": 256000,
54
+ "boi_token_id": 255999,
55
+ "eoa_token_id": 262272,
56
+ "eoi_token_id": 262144,
57
+ "eos_token_id": [
58
+ 1,
59
+ 106
60
+ ],
61
+ "image_token_id": 262145,
62
+ "initializer_range": 0.02,
63
+ "model_type": "gemma3n",
64
+ "text_config": {
65
+ "activation_sparsity_pattern": [
66
+ 0.95,
67
+ 0.95,
68
+ 0.95,
69
+ 0.95,
70
+ 0.95,
71
+ 0.95,
72
+ 0.95,
73
+ 0.95,
74
+ 0.95,
75
+ 0.95,
76
+ 0,
77
+ 0,
78
+ 0,
79
+ 0,
80
+ 0,
81
+ 0,
82
+ 0,
83
+ 0,
84
+ 0,
85
+ 0,
86
+ 0,
87
+ 0,
88
+ 0,
89
+ 0,
90
+ 0,
91
+ 0,
92
+ 0,
93
+ 0,
94
+ 0,
95
+ 0,
96
+ 0,
97
+ 0,
98
+ 0,
99
+ 0,
100
+ 0
101
+ ],
102
+ "altup_active_idx": 0,
103
+ "altup_coef_clip": 120.0,
104
+ "altup_correct_scale": true,
105
+ "altup_lr_multiplier": 1.0,
106
+ "altup_num_inputs": 4,
107
+ "attention_bias": false,
108
+ "attention_dropout": 0.0,
109
+ "final_logit_softcapping": 30.0,
110
+ "head_dim": 256,
111
+ "hidden_activation": "gelu_pytorch_tanh",
112
+ "hidden_size": 2048,
113
+ "hidden_size_per_layer_input": 256,
114
+ "initializer_range": 0.02,
115
+ "intermediate_size": [
116
+ 8192,
117
+ 8192,
118
+ 8192,
119
+ 8192,
120
+ 8192,
121
+ 8192,
122
+ 8192,
123
+ 8192,
124
+ 8192,
125
+ 8192,
126
+ 8192,
127
+ 8192,
128
+ 8192,
129
+ 8192,
130
+ 8192,
131
+ 8192,
132
+ 8192,
133
+ 8192,
134
+ 8192,
135
+ 8192,
136
+ 16384,
137
+ 16384,
138
+ 16384,
139
+ 16384,
140
+ 16384,
141
+ 8192,
142
+ 8192,
143
+ 8192,
144
+ 8192,
145
+ 8192,
146
+ 8192,
147
+ 8192,
148
+ 8192,
149
+ 8192,
150
+ 8192
151
+ ],
152
+ "laurel_rank": 64,
153
+ "layer_types": [
154
+ "sliding_attention",
155
+ "sliding_attention",
156
+ "sliding_attention",
157
+ "sliding_attention",
158
+ "full_attention",
159
+ "sliding_attention",
160
+ "sliding_attention",
161
+ "sliding_attention",
162
+ "sliding_attention",
163
+ "full_attention",
164
+ "sliding_attention",
165
+ "sliding_attention",
166
+ "sliding_attention",
167
+ "sliding_attention",
168
+ "full_attention",
169
+ "sliding_attention",
170
+ "sliding_attention",
171
+ "sliding_attention",
172
+ "sliding_attention",
173
+ "full_attention",
174
+ "sliding_attention",
175
+ "sliding_attention",
176
+ "sliding_attention",
177
+ "sliding_attention",
178
+ "full_attention",
179
+ "sliding_attention",
180
+ "sliding_attention",
181
+ "sliding_attention",
182
+ "sliding_attention",
183
+ "full_attention",
184
+ "sliding_attention",
185
+ "sliding_attention",
186
+ "sliding_attention",
187
+ "sliding_attention",
188
+ "full_attention"
189
+ ],
190
+ "max_position_embeddings": 32768,
191
+ "model_type": "gemma3n_text",
192
+ "num_attention_heads": 8,
193
+ "num_hidden_layers": 35,
194
+ "num_key_value_heads": 2,
195
+ "num_kv_shared_layers": 15,
196
+ "query_pre_attn_scalar": 256,
197
+ "rms_norm_eps": 1e-06,
198
+ "rope_local_base_freq": 10000.0,
199
+ "rope_scaling": null,
200
+ "rope_theta": 1000000.0,
201
+ "sliding_window": 512,
202
+ "torch_dtype": "bfloat16",
203
+ "use_cache": true,
204
+ "vocab_size": 262400,
205
+ "vocab_size_per_layer_input": 262144
206
+ },
207
+ "torch_dtype": "bfloat16",
208
+ "transformers_version": "4.53.0",
209
+ "vision_config": {
210
+ "architecture": "mobilenetv5_300m_enc",
211
+ "do_pooling": true,
212
+ "hidden_size": 2048,
213
+ "initializer_range": 0.02,
214
+ "label_names": [
215
+ "LABEL_0",
216
+ "LABEL_1"
217
+ ],
218
+ "model_args": null,
219
+ "model_type": "gemma3n_vision",
220
+ "num_classes": 2,
221
+ "rms_norm_eps": 1e-06,
222
+ "torch_dtype": "bfloat16",
223
+ "vocab_offset": 262144,
224
+ "vocab_size": 128
225
+ },
226
+ "vision_soft_tokens_per_image": 256
227
+ }
model-00001-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bbe6c970a97829c4047779c407cee6fd91040a6decdd2f3afffe701d668d4e39
3
+ size 7789419312
model-00002-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d8f32096f92c615671ff0b24b3f9d4651ec5d11e02704ec4375f5fcc663237db
3
+ size 4026499632
model-00003-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a261b9d7a544c3e9b76b87e36010a3151b2639c73bf7b3f42536da71ff314822
3
+ size 864363152
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,36 @@
1
+ {
2
+ "audio_token": "<audio_soft_token>",
3
+ "boa_token": "<start_of_audio>",
4
+ "boi_token": "<start_of_image>",
5
+ "bos_token": {
6
+ "content": "<bos>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false
11
+ },
12
+ "eoa_token": "<end_of_audio>",
13
+ "eoi_token": "<end_of_image>",
14
+ "eos_token": {
15
+ "content": "<eos>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false
20
+ },
21
+ "image_token": "<image_soft_token>",
22
+ "pad_token": {
23
+ "content": "<pad>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false
28
+ },
29
+ "unk_token": {
30
+ "content": "<unk>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false
35
+ }
36
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b6c35ee648c07754b44cd9e371c75d4caa05c4504910b7ad29b1847ee9d8ba5d
3
+ size 33442553
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff