danielhanchen committed on
Commit
97ebfb0
·
verified ·
1 Parent(s): f0b07d1

Add files using upload-large-folder tool

README.md CHANGED
@@ -1,60 +1,22 @@
1
  ---
2
- base_model: Qwen/Qwen2.5-VL-72B-Instruct
 
 
3
  language:
4
  - en
5
- library_name: transformers
6
  pipeline_tag: image-text-to-text
7
- license: apache-2.0
8
  tags:
9
  - multimodal
10
- - qwen
11
- - qwen2
12
  - unsloth
13
- - transformers
14
- - vision
 
15
  ---
16
 
17
- <div>
18
- <p style="margin-bottom: 0;">
19
- <em>Unsloth's <a href="https://unsloth.ai/blog/dynamic-4bit">Dynamic 4-bit Quants</a> is selectively quantized, greatly improving accuracy over standard 4-bit.</em>
20
- </p>
21
- <div style="display: flex; gap: 5px; align-items: center; ">
22
- <a href="https://github.com/unslothai/unsloth/">
23
- <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
24
- </a>
25
- <a href="https://discord.gg/unsloth">
26
- <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
27
- </a>
28
- <a href="https://docs.unsloth.ai/">
29
- <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
30
- </a>
31
- </div>
32
- <h1 style="margin-top: 0rem;">Finetune LLMs 2-5x faster with 70% less memory via Unsloth</h2>
33
- </div>
34
- We have a free Google Colab Tesla T4 notebook for Qwen2-VL (7B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb
35
-
36
- ## ✨ Finetune for Free
37
-
38
- All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
39
-
40
- | Unsloth supports | Free Notebooks | Performance | Memory use |
41
- |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------|
42
- | **Llama-3.2 (3B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) | 2.4x faster | 58% less |
43
- | **Llama-3.2 (11B vision)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb) | 2x faster | 60% less |
44
- | **Qwen2 VL (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb) | 1.8x faster | 60% less |
45
- | **Qwen2.5 (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb) | 2x faster | 60% less |
46
- | **Llama-3.1 (8B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb) | 2.4x faster | 58% less |
47
- | **Phi-3.5 (mini)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_3.5_Mini-Conversational.ipynb) | 2x faster | 50% less |
48
- | **Gemma 2 (9B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma2_(9B)-Alpaca.ipynb) | 2.4x faster | 58% less |
49
- | **Mistral (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb) | 2.2x faster | 62% less |
50
-
51
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="200"/>](https://docs.unsloth.ai)
52
-
53
- - This [Llama 3.2 conversational notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates.
54
- - This [text completion notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr.
55
- - \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
56
-
57
- # Qwen2.5-VL
58
 
59
  ## Introduction
60
 
@@ -82,13 +44,12 @@ We extend dynamic resolution to the temporal dimension by adopting dynamic FPS s
82
  <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
83
  <p>
84
 
85
-
86
  * **Streamlined and Efficient Vision Encoder**
87
 
88
  We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
89
 
90
 
91
- We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
92
 
93
 
94
 
@@ -96,50 +57,51 @@ We have three models with 3, 7 and 72 billion parameters. This repo contains the
96
 
97
  ### Image benchmark
98
 
99
 
100
- | Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B |**Qwen2.5-VL-7B** |
101
- | :--- | :---: | :---: | :---: | :---: | :---: |
102
- | MMMU<sub>val</sub> | 56 | 50.4 | **60**| 54.1 | 58.6|
103
- | MMMU-Pro<sub>val</sub> | 34.3 | - | 37.6| 30.5 | 41.0|
104
- | DocVQA<sub>test</sub> | 93 | 93 | - | 94.5 | **95.7** |
105
- | InfoVQA<sub>test</sub> | 77.6 | - | - |76.5 | **82.6** |
106
- | ChartQA<sub>test</sub> | 84.8 | - |- | 83.0 |**87.3** |
107
- | TextVQA<sub>val</sub> | 79.1 | 80.1 | -| 84.3 | **84.9**|
108
- | OCRBench | 822 | 852 | 785 | 845 | **864** |
109
- | CC_OCR | 57.7 | | | 61.6 | **77.8**|
110
- | MMStar | 62.8| | |60.7| **63.9**|
111
- | MMBench-V1.1-En<sub>test</sub> | 79.4 | 78.0 | 76.0| 80.7 | **82.6** |
112
- | MMT-Bench<sub>test</sub> | - | - | - |**63.7** |63.6 |
113
- | MMStar | **61.5** | 57.5 | 54.8 | 60.7 |63.9 |
114
- | MMVet<sub>GPT-4-Turbo</sub> | 54.2 | 60.0 | 66.9 | 62.0 | **67.1**|
115
- | HallBench<sub>avg</sub> | 45.2 | 48.1 | 46.1| 50.6 | **52.9**|
116
- | MathVista<sub>testmini</sub> | 58.3 | 60.6 | 52.4 | 58.2 | **68.2**|
117
- | MathVision | - | - | - | 16.3 | **25.07** |
118
-
119
- ### Video Benchmarks
120
-
121
- | Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
122
- | :--- | :---: | :---: |
123
- | MVBench | 67.0 | **69.6** |
124
- | PerceptionTest<sub>test</sub> | 66.9 | **70.5** |
125
- | Video-MME<sub>wo/w subs</sub> | 63.3/69.0 | **65.1**/**71.6** |
126
- | LVBench | | 45.3 |
127
- | LongVideoBench | | 54.7 |
128
- | MMBench-Video | 1.44 | 1.79 |
129
- | TempCompass | | 71.7 |
130
- | MLVU | | 70.2 |
131
- | CharadesSTA/mIoU | 43.6|
132
 
133
  ### Agent benchmark
134
- | Benchmarks | Qwen2.5-VL-7B |
135
- |-------------------------|---------------|
136
- | ScreenSpot | 84.7 |
137
- | ScreenSpot Pro | 29.0 |
138
- | AITZ_EM | 81.9 |
139
- | Android Control High_EM | 60.1 |
140
- | Android Control Low_EM | 93.7 |
141
- | AndroidWorld_SR | 25.5 |
142
- | MobileMiniWob++_SR | 91.4 |
 
 
 
143
 
144
  ## Requirements
145
  The code of Qwen2.5-VL has been in the latest Hugging face transformers and we advise you to build from source with command:
@@ -185,25 +147,25 @@ from qwen_vl_utils import process_vision_info
185
 
186
  # default: Load the model on the available device(s)
187
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
188
- "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
189
  )
190
 
191
  # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
192
  # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
193
- # "Qwen/Qwen2.5-VL-7B-Instruct",
194
  # torch_dtype=torch.bfloat16,
195
  # attn_implementation="flash_attention_2",
196
  # device_map="auto",
197
  # )
198
 
199
  # default processer
200
- processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
201
 
202
  # The default range for the number of visual tokens per image in the model is 4-16384.
203
  # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
204
  # min_pixels = 256*28*28
205
  # max_pixels = 1280*28*28
206
- # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
207
 
208
  messages = [
209
  {
@@ -472,7 +434,7 @@ The model supports a wide range of resolution inputs. By default, it uses the na
472
  min_pixels = 256 * 28 * 28
473
  max_pixels = 1280 * 28 * 28
474
  processor = AutoProcessor.from_pretrained(
475
- "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
476
  )
477
  ```
478
 
@@ -522,6 +484,7 @@ To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://ar
522
 
523
  For supported frameworks, you could add the following to `config.json` to enable YaRN:
524
 
 
525
  {
526
  ...,
527
  "type": "yarn",
@@ -533,6 +496,7 @@ For supported frameworks, you could add the following to `config.json` to enable
533
  "factor": 4,
534
  "original_max_position_embeddings": 32768
535
  }
 
536
 
537
  However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
538
 
@@ -540,7 +504,6 @@ At the same time, for long video inputs, since MRoPE itself is more economical w
540
 
541
 
542
 
543
-
544
  ## Citation
545
 
546
  If you find our work helpful, feel free to give us a cite.
@@ -568,4 +531,3 @@ If you find our work helpful, feel free to give us a cite.
568
  year={2023}
569
  }
570
  ```
571
-
 
1
  ---
2
+ license: other
3
+ license_name: qwen
4
+ license_link: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE
5
  language:
6
  - en
 
7
  pipeline_tag: image-text-to-text
 
8
  tags:
9
  - multimodal
 
 
10
  - unsloth
11
+ library_name: transformers
12
+ base_model:
13
+ - Qwen/Qwen2.5-VL-72B-Instruct
14
  ---
15
 
16
+ # Qwen2.5-VL-72B-Instruct
17
+ <a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
18
+ <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
19
+ </a>
20
 
21
  ## Introduction
22
 
 
44
  <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
45
  <p>
46
 
 
47
  * **Streamlined and Efficient Vision Encoder**
48
 
49
  We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
50
 
51
 
52
+ We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 72B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
53
 
54
 
55
 
 
57
 
58
  ### Image benchmark
59
 
60
+ | Benchmarks | GPT4o | Claude3.5 Sonnet | Gemini-2-flash | InternVL2.5-78B | Qwen2-VL-72B | Qwen2.5-VL-72B |
61
+ |-----------------------|-----------|-------------------|-----------------|-----------------|--------------|----------------|
62
+ | MMMU<sub>val</sub> | 70.3 | 70.4 | 70.7 | 70.1 | 64.5 | 70.2 |
63
+ | MMMU_Pro | 54.5 | 54.7 | 57.0 | 48.6 | 46.2 | 51.1 |
64
+ | MathVista_MINI | 63.8 | 65.4 | 73.1 | 76.6 | 70.5 | 74.8 |
65
+ | MathVision_FULL | 30.4 | 38.3 | 41.3 | 32.2 | 25.9 | 38.1 |
66
+ | Hallusion Bench | 55.0 | 55.16 | | 57.4 | 58.1 | 55.16 |
67
+ | MMBench_DEV_EN_V11 | 82.1 | 83.4 | 83.0 | 88.5 | 86.6 | 88 |
68
+ | AI2D_TEST | 84.6 | 81.2 | | 89.1 | 88.1 | 88.4 |
69
+ | ChartQA_TEST | 86.7 | 90.8 | 85.2 | 88.3 | 88.3 | 89.5 |
70
+ | DocVQA_VAL | 91.1 | 95.2 | 92.1 | 96.5 | 96.1 | 96.4 |
71
+ | MMStar | 64.7 | 65.1 | 69.4 | 69.5 | 68.3 | 70.8 |
72
+ | MMVet_turbo | 69.1 | 70.1 | | 72.3 | 74.0 | 76.19 |
73
+ | OCRBench | 736 | 788 | | 854 | 877 | 885 |
74
+ | OCRBench-V2(en/zh) | 46.5/32.3 | 45.2/39.6 | 51.9/43.1 | 45/46.2 | 47.8/46.1 | 61.5/63.7 |
75
+ | CC-OCR | 66.6 | 62.7 | 73.0 | 64.7 | 68.7 |79.8 |
76
+
77
+
78
+ ### Video benchmark
79
+ | Benchmarks | GPT4o | Gemini-1.5-Pro | InternVL2.5-78B | Qwen2VL-72B | Qwen2.5VL-72B |
80
+ |---------------------|-------|----------------|-----------------|-------------|---------------|
81
+ | VideoMME w/o sub. | 71.9 | 75.0 | 72.1 | 71.2 | 73.3 |
82
+ | VideoMME w sub. | 77.2 | 81.3 | 74.0 | 77.8 | 79.1 |
83
+ | MVBench | 64.6 | 60.5 | 76.4 | 73.6 | 70.4 |
84
+ | MMBench-Video | 1.63 | 1.30 | 1.97 | 1.70 | 2.02 |
85
+ | LVBench | 30.8 | 33.1 | - | 41.3 | 47.3 |
86
+ | EgoSchema | 72.2 | 71.2 | - | 77.9 | 76.2 |
87
+ | PerceptionTest_test | - | - | - | 68.0 | 73.2 |
88
+ | MLVU_M-Avg_dev | 64.6 | - | 75.7 | | 74.6 |
89
+ | TempCompass_overall | 73.8 | - | - | | 74.8 |
90
 
91
 
92
  ### Agent benchmark
93
+
94
+ | Benchmarks | GPT4o | Gemini 2.0 | Claude | Aguvis-72B | Qwen2VL-72B | Qwen2.5VL-72B |
95
+ |-------------------------|-------------|------------|--------|------------|-------------|---------------|
96
+ | ScreenSpot | 18.1 | 84.0 | 83.0 | | | 87.1 |
97
+ | ScreenSpot Pro | | | 17.1 | | 1.6 | 43.6 |
98
+ | AITZ_EM | 35.3 | | | | 72.8 | 83.2 |
99
+ | Android Control High_EM | | | | 66.4 | 59.1 | 67.36 |
100
+ | Android Control Low_EM | | | | 84.4 | 59.2 | 93.7 |
101
+ | AndroidWorld_SR | 34.5% (SoM) | | 27.9% | 26.1% | | 35% |
102
+ | MobileMiniWob++_SR | | | | 66% | | 68% |
103
+ | OSWorld | | | 14.90 | 10.26 | | 8.83 |
104
+
105
 
106
  ## Requirements
107
  The code of Qwen2.5-VL has been in the latest Hugging face transformers and we advise you to build from source with command:
 
147
 
148
  # default: Load the model on the available device(s)
149
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
150
+ "Qwen/Qwen2.5-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
151
  )
152
 
153
  # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
154
  # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
155
+ # "Qwen/Qwen2.5-VL-72B-Instruct",
156
  # torch_dtype=torch.bfloat16,
157
  # attn_implementation="flash_attention_2",
158
  # device_map="auto",
159
  # )
160
 
161
  # default processer
162
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")
163
 
164
  # The default range for the number of visual tokens per image in the model is 4-16384.
165
  # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
166
  # min_pixels = 256*28*28
167
  # max_pixels = 1280*28*28
168
+ # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
169
 
170
  messages = [
171
  {
 
434
  min_pixels = 256 * 28 * 28
435
  max_pixels = 1280 * 28 * 28
436
  processor = AutoProcessor.from_pretrained(
437
+ "Qwen/Qwen2.5-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
438
  )
439
  ```
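The `28 * 28` factor in `min_pixels` and `max_pixels` can be read as the pixel area covered by one visual token, assuming the `patch_size` of 14 and `spatial_merge_size` of 2 listed in this commit's `vision_config`. The sketch below works through that arithmetic. `pixel_budget` and `visual_token_count` are hypothetical helper names, not part of `transformers` or `qwen_vl_utils`, and the sketch assumes the image has already been resized so both sides are multiples of 28.

```python
PATCH_SIZE = 14          # assumed from vision_config.patch_size
SPATIAL_MERGE_SIZE = 2   # assumed from vision_config.spatial_merge_size
TOKEN_SIDE = PATCH_SIZE * SPATIAL_MERGE_SIZE  # 28 pixels per token side


def pixel_budget(num_tokens: int) -> int:
    """Pixel area that corresponds to a given number of visual tokens."""
    return num_tokens * TOKEN_SIDE * TOKEN_SIDE


def visual_token_count(height: int, width: int) -> int:
    """Visual tokens for an image whose sides are multiples of 28 pixels."""
    if height % TOKEN_SIDE or width % TOKEN_SIDE:
        raise ValueError("this sketch expects sides that are multiples of 28")
    return (height // TOKEN_SIDE) * (width // TOKEN_SIDE)


min_pixels = pixel_budget(256)    # same value as 256 * 28 * 28
max_pixels = pixel_budget(1280)   # same value as 1280 * 28 * 28
print(min_pixels, max_pixels)
print(visual_token_count(448, 448))
```

Under these assumptions a 448x448 image maps to 256 tokens, which is the lower bound chosen in the snippet above.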
440
 
 
484
 
485
  For supported frameworks, you could add the following to `config.json` to enable YaRN:
486
 
487
+ ```json
488
  {
489
  ...,
490
  "type": "yarn",
 
496
  "factor": 4,
497
  "original_max_position_embeddings": 32768
498
  }
499
+ ```
500
 
501
  However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
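For orientation, the two numeric fields in the YaRN block are commonly read as "scale the original window by `factor`". The snippet below only does that arithmetic and does not implement YaRN itself. `yarn_context_length` is a hypothetical helper name used for illustration.

```python
def yarn_context_length(original_max_position_embeddings: int, factor: float) -> int:
    """Approximate context window targeted by a YaRN-style scaling factor."""
    return int(original_max_position_embeddings * factor)


# Values taken from the config fragment above.
print(yarn_context_length(32768, 4))  # 131072
```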
502
 
 
504
 
505
 
506
 
 
507
  ## Citation
508
 
509
  If you find our work helpful, feel free to give us a cite.
 
531
  year={2023}
532
  }
533
  ```
 
chat_template.jinja ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ {% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system
2
+ You are a helpful assistant.<|im_end|>
3
+ {% endif %}<|im_start|>{{ message['role'] }}
4
+ {% if message['content'] is string %}{{ message['content'] }}<|im_end|>
5
+ {% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>
6
+ {% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
7
+ {% endif %}
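To make the template's behavior easier to follow, here is a plain-Python sketch that mirrors its branches one by one. It is an illustrative re-implementation, not the code `transformers` runs (that path renders the Jinja text itself). `render_chat` is a hypothetical name.

```python
def render_chat(messages, add_generation_prompt=False, add_vision_id=False):
    """Mirror of the chat template's control flow, for illustration only."""
    out = []
    image_count = 0
    video_count = 0
    for index, message in enumerate(messages):
        # A default system prompt is injected when the first turn is not a system turn.
        if index == 0 and message["role"] != "system":
            out.append("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n")
        out.append(f"<|im_start|>{message['role']}\n")
        content = message["content"]
        if isinstance(content, str):
            out.append(f"{content}<|im_end|>\n")
            continue
        for part in content:
            if part.get("type") == "image" or "image" in part or "image_url" in part:
                image_count += 1
                if add_vision_id:
                    out.append(f"Picture {image_count}: ")
                out.append("<|vision_start|><|image_pad|><|vision_end|>")
            elif part.get("type") == "video" or "video" in part:
                video_count += 1
                if add_vision_id:
                    out.append(f"Video {video_count}: ")
                out.append("<|vision_start|><|video_pad|><|vision_end|>")
            elif "text" in part:
                out.append(part["text"])
        out.append("<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)


example = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image.jpg"},  # placeholder path
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(render_chat(example, add_generation_prompt=True))
```

Note that, unlike the `chat_template` string removed from `tokenizer_config.json` in this commit, this template has no `tools` or `tool_call` branches.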
config.json CHANGED
@@ -1,5 +1,4 @@
1
  {
2
- "_name_or_path": "Qwen/Qwen2.5-VL-72B-Instruct",
3
  "architectures": [
4
  "Qwen2_5_VLForConditionalGeneration"
5
  ],
@@ -10,7 +9,7 @@
10
  "image_token_id": 151655,
11
  "initializer_range": 0.02,
12
  "intermediate_size": 29568,
13
- "max_position_embeddings": 32768,
14
  "max_window_layers": 80,
15
  "model_type": "qwen2_5_vl",
16
  "num_attention_heads": 64,
@@ -31,72 +30,73 @@
31
  "multi_modal_projector",
32
  "merger",
33
  "modality_projection",
34
- "visual.blocks.0.attn",
35
- "visual.blocks.0.mlp",
36
- "visual.blocks.1.attn",
37
- "visual.blocks.1.mlp",
38
- "visual.blocks.2.attn",
39
- "visual.blocks.2.mlp",
40
- "visual.blocks.3.attn",
41
- "visual.blocks.3.mlp",
42
- "visual.blocks.4.attn",
43
- "visual.blocks.4.mlp",
44
- "visual.blocks.5.attn",
45
- "visual.blocks.5.mlp",
46
- "visual.blocks.6.attn",
47
- "visual.blocks.6.mlp",
48
- "visual.blocks.7.attn",
49
- "visual.blocks.7.mlp",
50
- "visual.blocks.8.attn",
51
- "visual.blocks.8.mlp",
52
- "visual.blocks.9.attn",
53
- "visual.blocks.9.mlp",
54
- "visual.blocks.10.attn",
55
- "visual.blocks.10.mlp",
56
- "visual.blocks.11.attn",
57
- "visual.blocks.11.mlp",
58
- "visual.blocks.12.attn",
59
- "visual.blocks.12.mlp",
60
- "visual.blocks.13.attn",
61
- "visual.blocks.13.mlp",
62
- "visual.blocks.14.attn",
63
- "visual.blocks.14.mlp",
64
- "visual.blocks.15.attn",
65
- "visual.blocks.15.mlp",
66
- "visual.blocks.16.attn",
67
- "visual.blocks.16.mlp",
68
- "visual.blocks.17.attn",
69
- "visual.blocks.17.mlp",
70
- "visual.blocks.18.attn",
71
- "visual.blocks.18.mlp",
72
- "visual.blocks.19.attn",
73
- "visual.blocks.19.mlp",
74
- "visual.blocks.20.attn",
75
- "visual.blocks.20.mlp",
76
- "visual.blocks.21.attn",
77
- "visual.blocks.21.mlp",
78
- "visual.blocks.22.attn",
79
- "visual.blocks.22.mlp",
80
- "visual.blocks.23.attn",
81
- "visual.blocks.23.mlp",
82
  "visual.blocks.24.attn",
83
- "visual.blocks.24.mlp",
84
- "visual.blocks.25.attn",
85
- "visual.blocks.25.mlp",
86
- "visual.blocks.26.attn",
87
- "visual.blocks.26.mlp",
88
- "visual.blocks.27.attn",
89
- "visual.blocks.27.mlp",
90
  "visual.blocks.28.attn",
91
- "visual.blocks.28.mlp",
92
  "visual.blocks.29.attn",
 
 
 
93
  "visual.blocks.29.mlp",
 
 
94
  "visual.blocks.30.attn",
 
 
95
  "visual.blocks.30.mlp",
96
  "visual.blocks.31.attn",
97
  "visual.blocks.31.mlp",
98
- "visual.merger.mlp",
99
- "model.layers.1.mlp",
100
  "visual.blocks.31.mlp.down_proj",
101
  "model.layers.7.self_attn.o_proj",
102
  "model.layers.33.self_attn.o_proj",
@@ -119,22 +119,76 @@
119
  },
120
  "rope_theta": 1000000.0,
121
  "sliding_window": 32768,
122
  "tie_word_embeddings": false,
123
  "torch_dtype": "bfloat16",
124
- "transformers_version": "4.49.0.dev0",
125
  "unsloth_fixed": true,
126
  "use_cache": true,
127
  "use_sliding_window": false,
128
  "video_token_id": 151656,
129
  "vision_config": {
 
 
 
 
 
 
 
 
130
  "hidden_size": 1280,
 
131
  "in_chans": 3,
 
132
  "intermediate_size": 3456,
133
  "model_type": "qwen2_5_vl",
 
134
  "out_hidden_size": 8192,
 
 
135
  "spatial_patch_size": 14,
 
136
  "tokens_per_second": 2,
137
- "torch_dtype": "bfloat16"
 
138
  },
139
  "vision_end_token_id": 151653,
140
  "vision_start_token_id": 151652,
 
1
  {
 
2
  "architectures": [
3
  "Qwen2_5_VLForConditionalGeneration"
4
  ],
 
9
  "image_token_id": 151655,
10
  "initializer_range": 0.02,
11
  "intermediate_size": 29568,
12
+ "max_position_embeddings": 128000,
13
  "max_window_layers": 80,
14
  "model_type": "qwen2_5_vl",
15
  "num_attention_heads": 64,
 
30
  "multi_modal_projector",
31
  "merger",
32
  "modality_projection",
33
+ "model.layers.23.mlp",
34
+ "model.layers.1.mlp",
35
  "visual.blocks.24.attn",
 
 
 
 
 
 
 
36
  "visual.blocks.28.attn",
 
37
  "visual.blocks.29.attn",
38
+ "visual.blocks.21.attn",
39
+ "visual.blocks.25.attn",
40
+ "visual.blocks.27.attn",
41
  "visual.blocks.29.mlp",
42
+ "visual.blocks.26.attn",
43
+ "visual.blocks.22.attn",
44
  "visual.blocks.30.attn",
45
+ "visual.blocks.20.attn",
46
+ "visual.merger.mlp",
47
  "visual.blocks.30.mlp",
48
  "visual.blocks.31.attn",
49
+ "visual.blocks.19.attn",
50
+ "visual.blocks.26.mlp",
51
+ "visual.blocks.10.attn",
52
+ "visual.blocks.17.attn",
53
+ "visual.blocks.27.mlp",
54
+ "visual.blocks.21.mlp",
55
+ "visual.blocks.18.attn",
56
+ "visual.blocks.28.mlp",
57
+ "visual.blocks.16.attn",
58
+ "visual.blocks.24.mlp",
59
+ "visual.blocks.19.mlp",
60
+ "visual.blocks.18.mlp",
61
+ "visual.blocks.22.mlp",
62
+ "visual.blocks.25.mlp",
63
+ "visual.blocks.23.attn",
64
+ "visual.blocks.13.attn",
65
+ "visual.blocks.11.attn",
66
+ "visual.blocks.8.mlp",
67
+ "visual.blocks.14.attn",
68
+ "visual.blocks.20.mlp",
69
+ "visual.blocks.12.attn",
70
+ "visual.blocks.6.mlp",
71
+ "visual.blocks.2.attn",
72
  "visual.blocks.31.mlp",
73
+ "visual.blocks.23.mlp",
74
+ "visual.blocks.9.attn",
75
+ "visual.blocks.5.attn",
76
+ "visual.blocks.11.mlp",
77
+ "visual.blocks.8.attn",
78
+ "visual.blocks.10.mlp",
79
+ "visual.blocks.9.mlp",
80
+ "visual.blocks.6.attn",
81
+ "visual.blocks.3.mlp",
82
+ "visual.blocks.14.mlp",
83
+ "visual.blocks.7.mlp",
84
+ "visual.blocks.12.mlp",
85
+ "visual.blocks.13.mlp",
86
+ "visual.blocks.5.mlp",
87
+ "visual.blocks.4.mlp",
88
+ "visual.blocks.1.attn",
89
+ "visual.blocks.16.mlp",
90
+ "visual.blocks.15.mlp",
91
+ "visual.blocks.2.mlp",
92
+ "visual.blocks.7.attn",
93
+ "visual.blocks.0.attn",
94
+ "visual.blocks.3.attn",
95
+ "visual.blocks.1.mlp",
96
+ "visual.blocks.15.attn",
97
+ "visual.blocks.4.attn",
98
+ "visual.blocks.0.mlp",
99
+ "visual.blocks.17.mlp",
100
  "visual.blocks.31.mlp.down_proj",
101
  "model.layers.7.self_attn.o_proj",
102
  "model.layers.33.self_attn.o_proj",
 
119
  },
120
  "rope_theta": 1000000.0,
121
  "sliding_window": 32768,
122
+ "text_config": {
123
+ "architectures": [
124
+ "Qwen2_5_VLForConditionalGeneration"
125
+ ],
126
+ "attention_dropout": 0.0,
127
+ "bos_token_id": 151643,
128
+ "eos_token_id": 151645,
129
+ "hidden_act": "silu",
130
+ "hidden_size": 8192,
131
+ "image_token_id": null,
132
+ "initializer_range": 0.02,
133
+ "intermediate_size": 29568,
134
+ "max_position_embeddings": 128000,
135
+ "max_window_layers": 80,
136
+ "model_type": "qwen2_5_vl_text",
137
+ "num_attention_heads": 64,
138
+ "num_hidden_layers": 80,
139
+ "num_key_value_heads": 8,
140
+ "rms_norm_eps": 1e-06,
141
+ "rope_scaling": {
142
+ "mrope_section": [
143
+ 16,
144
+ 24,
145
+ 24
146
+ ],
147
+ "rope_type": "default",
148
+ "type": "default"
149
+ },
150
+ "rope_theta": 1000000.0,
151
+ "sliding_window": 32768,
152
+ "torch_dtype": "bfloat16",
153
+ "use_cache": true,
154
+ "use_sliding_window": false,
155
+ "video_token_id": null,
156
+ "vision_end_token_id": 151653,
157
+ "vision_start_token_id": 151652,
158
+ "vision_token_id": 151654,
159
+ "vocab_size": 152064
160
+ },
161
  "tie_word_embeddings": false,
162
  "torch_dtype": "bfloat16",
163
+ "transformers_version": "4.52.0.dev0",
164
  "unsloth_fixed": true,
165
  "use_cache": true,
166
  "use_sliding_window": false,
167
  "video_token_id": 151656,
168
  "vision_config": {
169
+ "depth": 32,
170
+ "fullatt_block_indexes": [
171
+ 7,
172
+ 15,
173
+ 23,
174
+ 31
175
+ ],
176
+ "hidden_act": "silu",
177
  "hidden_size": 1280,
178
+ "in_channels": 3,
179
  "in_chans": 3,
180
+ "initializer_range": 0.02,
181
  "intermediate_size": 3456,
182
  "model_type": "qwen2_5_vl",
183
+ "num_heads": 16,
184
  "out_hidden_size": 8192,
185
+ "patch_size": 14,
186
+ "spatial_merge_size": 2,
187
  "spatial_patch_size": 14,
188
+ "temporal_patch_size": 2,
189
  "tokens_per_second": 2,
190
+ "torch_dtype": "bfloat16",
191
+ "window_size": 112
192
  },
193
  "vision_end_token_id": 151653,
194
  "vision_start_token_id": 151652,
generation_config.json CHANGED
@@ -5,10 +5,9 @@
5
  151645,
6
  151643
7
  ],
8
- "max_length": 32768,
9
  "pad_token_id": 151654,
10
  "repetition_penalty": 1.05,
11
- "top_k": 1,
12
- "top_p": 0.001,
13
- "transformers_version": "4.49.0.dev0"
14
  }
 
5
  151645,
6
  151643
7
  ],
8
+ "max_length": 128000,
9
  "pad_token_id": 151654,
10
  "repetition_penalty": 1.05,
11
+ "temperature": 1e-06,
12
+ "transformers_version": "4.52.0.dev0"
 
13
  }
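This hunk drops `top_k: 1` and `top_p: 0.001` and adds `temperature: 1e-06`. One way to read the change is that both settings push sampling toward the single most likely token. Whether sampling is enabled at all depends on `do_sample`, which this hunk does not show. The sketch below shows the effect of a very small temperature on a softmax in plain Python. `softmax_with_temperature` is a hypothetical helper, not a `transformers` function, and the logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum so exp() never overflows
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # illustrative values only
print(softmax_with_temperature(logits, 1.0))     # spread across all three tokens
print(softmax_with_temperature(logits, 1e-06))   # [1.0, 0.0, 0.0]
```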
model-00001-of-00010.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6dd32fb4099fd34a8bac06ad3da42b0352a088b8760de82e8a43d5a237a37c35
3
+ size 4915795445
model-00002-of-00010.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c9691f37b23fcd518b692db2bf61253f924bcdce044e3ae1e8241c165c5231bf
3
+ size 4893987633
model-00003-of-00010.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:02e1a07d2db13e7e0d1b91adb3f334bd2c4ebd12959c31ea5529cfd13c43c375
3
+ size 4981068378
model-00004-of-00010.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4e9def87c1360a3a302518c41819d7f00649b76f6a2b49a85f16c6d45d451752
3
+ size 4994299741
model-00005-of-00010.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0901b17b09079815707571e2238f85bb805c88a3e0c08b6745bc78fbd41ababd
3
+ size 4912374640
model-00006-of-00010.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6b190c3e8299804b2fdd3c105b91e280920b332ea83974edb217e86911fd9327
3
+ size 4981068365
model-00007-of-00010.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:215b3e329799850d08c90761793b60b6d3459302b474f183f7e99efd9d3aa170
3
+ size 4981068351
model-00008-of-00010.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fe9560c21ec4f7fafc8e98f8103fe4f6c219d2b385fb1c9772f38655a9e31f0d
3
+ size 4981068355
model-00009-of-00010.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c582dd6610980616a36ec1d165bd8acdd55d8fbabe25823715eca12f3beba714
3
+ size 2941548592
model-00010-of-00010.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f9b524193abf7718f1c37f3110a7540c6ca4caef5a5584038c77660acc20f02d
3
+ size 2491416704
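Each `.safetensors` entry above is a Git LFS pointer (`version`, `oid`, `size`), not the weights themselves. The sketch below parses one pointer and adds up the ten `size` fields from this commit to estimate the total download. `parse_lfs_pointer` is a hypothetical helper written for this illustration.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict, with size as an int."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:f9b524193abf7718f1c37f3110a7540c6ca4caef5a5584038c77660acc20f02d
size 2491416704
"""
info = parse_lfs_pointer(pointer)
print(info["oid"], info["size"])

# Sizes in bytes, copied from the ten pointers in this commit.
SHARD_SIZES = [
    4915795445, 4893987633, 4981068378, 4994299741, 4912374640,
    4981068365, 4981068351, 4981068355, 2941548592, 2491416704,
]
total_bytes = sum(SHARD_SIZES)
print(f"{total_bytes / 1e9:.2f} GB across {len(SHARD_SIZES)} shards")
```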
model.safetensors.index.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -195,16 +195,16 @@
195
  "<|video_pad|>"
196
  ],
197
  "bos_token": null,
198
- "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content 
}}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
199
  "clean_up_tokenization_spaces": false,
200
  "eos_token": "<|im_end|>",
201
  "errors": "replace",
202
  "extra_special_tokens": {},
203
- "model_max_length": 131072,
204
  "pad_token": "<|vision_pad|>",
205
  "padding_side": "left",
206
  "processor_class": "Qwen2_5_VLProcessor",
207
  "split_special_tokens": false,
208
  "tokenizer_class": "Qwen2Tokenizer",
209
- "unk_token": null
210
- }
 
 
195
  "<|video_pad|>"
196
  ],
197
  "bos_token": null,
 
198
  "clean_up_tokenization_spaces": false,
199
  "eos_token": "<|im_end|>",
200
  "errors": "replace",
201
  "extra_special_tokens": {},
202
+ "model_max_length": 128000,
203
  "pad_token": "<|vision_pad|>",
204
  "padding_side": "left",
205
  "processor_class": "Qwen2_5_VLProcessor",
206
  "split_special_tokens": false,
207
  "tokenizer_class": "Qwen2Tokenizer",
208
+ "unk_token": null,
209
+ "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
210
+ }