base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# nexaml/Qwen2.5-VL-7B-Instruct-4bit-MLX

## Quickstart

Run this model directly with [nexa-sdk](https://github.com/NexaAI/nexa-sdk) installed.

In the nexa-sdk CLI:

```bash
nexaml/Qwen2.5-VL-7B-Instruct-4bit-MLX
```
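
Whatever runtime you use, Qwen2.5-VL-style models consume multimodal chat messages in which each user turn is a list of typed content parts (images plus text). Below is a minimal sketch of that message structure; the file path and prompt are placeholders, not part of this card:

```python
# Sketch of the Qwen2.5-VL chat-message layout: each user turn mixes
# typed content parts (image references and text) in a single list.
messages = [
    {
        "role": "user",
        "content": [
            # Image part: a local path or URL (placeholder here).
            {"type": "image", "image": "file:///path/to/demo.jpg"},
            # Text part: the actual instruction for the model.
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Collect just the text parts, e.g. for logging the prompt.
text_parts = [
    part["text"]
    for message in messages
    for part in message["content"]
    if part["type"] == "text"
]
print(text_parts)  # ['Describe this image.']
```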

## Overview

In the past five months since Qwen2-VL's release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

### Key Enhancements:

- **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

- **Being agentic**: Qwen2.5-VL directly acts as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.

- **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over 1 hour, and it now has the new ability of capturing events by pinpointing the relevant video segments.

- **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.

- **Generating structured outputs**: for data like scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting applications in finance, commerce, and more.

## Benchmark Results

### Image Benchmarks

| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| MMMU<sub>val</sub> | 56 | 50.4 | **60** | 54.1 | 58.6 |
| MMMU-Pro<sub>val</sub> | 34.3 | - | 37.6 | 30.5 | 41.0 |
| DocVQA<sub>test</sub> | 93 | 93 | - | 94.5 | **95.7** |
| InfoVQA<sub>test</sub> | 77.6 | - | - | 76.5 | **82.6** |
| ChartQA<sub>test</sub> | 84.8 | - | - | 83.0 | **87.3** |
| TextVQA<sub>val</sub> | 79.1 | 80.1 | - | 84.3 | **84.9** |
| OCRBench | 822 | 852 | 785 | 845 | **864** |
| CC_OCR | 57.7 | - | - | 61.6 | **77.8** |
| MMStar | 62.8 | - | - | 60.7 | **63.9** |
| MMBench-V1.1-En<sub>test</sub> | 79.4 | 78.0 | 76.0 | 80.7 | **82.6** |
| MMT-Bench<sub>test</sub> | - | - | - | **63.7** | 63.6 |
| MMStar | **61.5** | 57.5 | 54.8 | 60.7 | 63.9 |
| MMVet<sub>GPT-4-Turbo</sub> | 54.2 | 60.0 | 66.9 | 62.0 | **67.1** |
| HallBench<sub>avg</sub> | 45.2 | 48.1 | 46.1 | 50.6 | **52.9** |
| MathVista<sub>testmini</sub> | 58.3 | 60.6 | 52.4 | 58.2 | **68.2** |
| MathVision | - | - | - | 16.3 | **25.07** |

### Video Benchmarks

| Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
| :--- | :---: | :---: |
| MVBench | 67.0 | **69.6** |
| PerceptionTest<sub>test</sub> | 66.9 | **70.5** |
| Video-MME<sub>wo/w subs</sub> | 63.3/69.0 | **65.1**/**71.6** |
| LVBench | - | 45.3 |
| LongVideoBench | - | 54.7 |
| MMBench-Video | 1.44 | 1.79 |
| TempCompass | - | 71.7 |
| MLVU | - | 70.2 |
| CharadesSTA/mIoU | - | 43.6 |

### Agent Benchmarks

| Benchmarks | Qwen2.5-VL-7B |
|-------------------------|---------------|
| ScreenSpot | 84.7 |
| ScreenSpot Pro | 29.0 |
| AITZ_EM | 81.9 |
| Android Control High_EM | 60.1 |
| Android Control Low_EM | 93.7 |
| AndroidWorld_SR | 25.5 |
| MobileMiniWob++_SR | 91.4 |

## Reference

**Original model card**: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)