nexaml committed
Commit 6c04825 · verified · 1 Parent(s): 98e9f88

Update README.md

Files changed (1):
  1. README.md +74 -8
README.md CHANGED
@@ -11,15 +11,81 @@ base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
  ---

- # mlx-community/Qwen2.5-VL-7B-Instruct-4bit
- This model was converted to MLX format from [`Qwen/Qwen2.5-VL-7B-Instruct`]() using mlx-vlm version **0.1.11**.
- Refer to the [original model card](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) for more details on the model.
- ## Use with mlx
-
- ```bash
- pip install -U mlx-vlm
- ```
-
- ```bash
- python -m mlx_vlm.generate --model mlx-community/Qwen2.5-VL-7B-Instruct-4bit --max-tokens 100 --temp 0.0 --prompt "Describe this image." --image <path_to_image>
- ```
+ # nexaml/Qwen2.5-VL-7B-Instruct-4bit-MLX
+
+ ## Quickstart
+
+ Run this model directly with [nexa-sdk](https://github.com/NexaAI/nexa-sdk) installed.
+ In the nexa-sdk CLI:
+
+ ```bash
+ nexaml/Qwen2.5-VL-7B-Instruct-4bit-MLX
+ ```
+
+ ## Overview
+
+ In the five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
+
+ ### Key Enhancements
+
+ - **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
+
+ - **Being agentic**: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, making it capable of computer and phone use.
+
+ - **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over one hour, and it gains a new ability to capture events by pinpointing the relevant video segments.
+
+ - **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
+
+ - **Generating structured outputs**: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting use cases in finance, commerce, and beyond.
+
+ ## Benchmark Results
+
+ ### Image Benchmarks
+
+ | Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
+ | :--- | :---: | :---: | :---: | :---: | :---: |
+ | MMMU<sub>val</sub> | 56 | 50.4 | **60** | 54.1 | 58.6 |
+ | MMMU-Pro<sub>val</sub> | 34.3 | - | 37.6 | 30.5 | 41.0 |
+ | DocVQA<sub>test</sub> | 93 | 93 | - | 94.5 | **95.7** |
+ | InfoVQA<sub>test</sub> | 77.6 | - | - | 76.5 | **82.6** |
+ | ChartQA<sub>test</sub> | 84.8 | - | - | 83.0 | **87.3** |
+ | TextVQA<sub>val</sub> | 79.1 | 80.1 | - | 84.3 | **84.9** |
+ | OCRBench | 822 | 852 | 785 | 845 | **864** |
+ | CC_OCR | 57.7 | - | - | 61.6 | **77.8** |
+ | MMStar | 62.8 | - | - | 60.7 | **63.9** |
+ | MMBench-V1.1-En<sub>test</sub> | 79.4 | 78.0 | 76.0 | 80.7 | **82.6** |
+ | MMT-Bench<sub>test</sub> | - | - | - | **63.7** | 63.6 |
+ | MMStar | **61.5** | 57.5 | 54.8 | 60.7 | 63.9 |
+ | MMVet<sub>GPT-4-Turbo</sub> | 54.2 | 60.0 | 66.9 | 62.0 | **67.1** |
+ | HallBench<sub>avg</sub> | 45.2 | 48.1 | 46.1 | 50.6 | **52.9** |
+ | MathVista<sub>testmini</sub> | 58.3 | 60.6 | 52.4 | 58.2 | **68.2** |
+ | MathVision | - | - | - | 16.3 | **25.07** |
+
+ ### Video Benchmarks
+
+ | Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
+ | :--- | :---: | :---: |
+ | MVBench | 67.0 | **69.6** |
+ | PerceptionTest<sub>test</sub> | 66.9 | **70.5** |
+ | Video-MME<sub>wo/w subs</sub> | 63.3/69.0 | **65.1**/**71.6** |
+ | LVBench | - | 45.3 |
+ | LongVideoBench | - | 54.7 |
+ | MMBench-Video | 1.44 | 1.79 |
+ | TempCompass | - | 71.7 |
+ | MLVU | - | 70.2 |
+ | CharadesSTA/mIoU | - | 43.6 |
+
+ ### Agent Benchmarks
+
+ | Benchmark | Qwen2.5-VL-7B |
+ | :--- | :---: |
+ | ScreenSpot | 84.7 |
+ | ScreenSpot Pro | 29.0 |
+ | AITZ_EM | 81.9 |
+ | Android Control High_EM | 60.1 |
+ | Android Control Low_EM | 93.7 |
+ | AndroidWorld_SR | 25.5 |
+ | MobileMiniWob++_SR | 91.4 |
+
+ ## Reference
+
+ **Original model card**: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
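The updated card claims "stable JSON outputs for coordinates and attributes" when the model localizes objects. A minimal sketch of consuming such a response, assuming a `bbox_2d`/`label` schema with `[x1, y1, x2, y2]` corner coordinates (the actual keys and values depend on the prompt and model version, and the sample data here is invented for illustration):

```python
import json

# Hypothetical grounding response of the kind described in the card:
# a JSON list of detections, each with a bounding box and a label.
raw_output = """
[
  {"bbox_2d": [112, 64, 308, 290], "label": "bird"},
  {"bbox_2d": [20, 300, 180, 410], "label": "flower"}
]
"""

detections = json.loads(raw_output)
for det in detections:
    x1, y1, x2, y2 = det["bbox_2d"]
    # Convert corner coordinates to width/height for downstream drawing code.
    print(f'{det["label"]}: box at ({x1}, {y1}), size {x2 - x1}x{y2 - y1}')
# bird: box at (112, 64), size 196x226
# flower: box at (20, 300), size 160x110
```

Because the output is plain JSON, a parse failure (`json.JSONDecodeError`) is a simple signal to re-prompt or fall back, which is the practical benefit of the structured-output behavior described above.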