base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# nexaml/Qwen2.5-VL-7B-Instruct-4bit-MLX

## Quickstart

Run this model directly with [nexa-sdk](https://github.com/NexaAI/nexa-sdk) installed.

In the nexa-sdk CLI:

```bash
nexaml/Qwen2.5-VL-7B-Instruct-4bit-MLX
```
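
Whatever runtime you use, Qwen2.5-VL-style models consume multimodal chat messages in which each user turn is a list of typed content parts (images plus text). Below is a minimal sketch of that message structure; the file path and prompt are placeholders, not part of this card:

```python
# Sketch of the Qwen2.5-VL chat-message layout: each user turn mixes
# typed content parts (image references and text) in a single list.
messages = [
    {
        "role": "user",
        "content": [
            # Image part: a local path or URL (placeholder here).
            {"type": "image", "image": "file:///path/to/demo.jpg"},
            # Text part: the actual instruction for the model.
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Collect just the text parts, e.g. for logging the prompt.
text_parts = [
    part["text"]
    for message in messages
    for part in message["content"]
    if part["type"] == "text"
]
print(text_parts)  # ['Describe this image.']
```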

## Overview

In the past five months since Qwen2-VL's release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

### Key Enhancements:

- **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

- **Being agentic**: Qwen2.5-VL directly acts as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.

- **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over 1 hour, and it now has the new ability of capturing events by pinpointing the relevant video segments.

- **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.

- **Generating structured outputs**: for data like scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting applications in finance, commerce, and more.

## Benchmark Results

### Image Benchmarks

| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| MMMU<sub>val</sub> | 56 | 50.4 | **60** | 54.1 | 58.6 |
| MMMU-Pro<sub>val</sub> | 34.3 | - | 37.6 | 30.5 | 41.0 |
| DocVQA<sub>test</sub> | 93 | 93 | - | 94.5 | **95.7** |
| InfoVQA<sub>test</sub> | 77.6 | - | - | 76.5 | **82.6** |
| ChartQA<sub>test</sub> | 84.8 | - | - | 83.0 | **87.3** |
| TextVQA<sub>val</sub> | 79.1 | 80.1 | - | 84.3 | **84.9** |
| OCRBench | 822 | 852 | 785 | 845 | **864** |
| CC_OCR | 57.7 | - | - | 61.6 | **77.8** |
| MMStar | 62.8 | - | - | 60.7 | **63.9** |
| MMBench-V1.1-En<sub>test</sub> | 79.4 | 78.0 | 76.0 | 80.7 | **82.6** |
| MMT-Bench<sub>test</sub> | - | - | - | **63.7** | 63.6 |
| MMStar | **61.5** | 57.5 | 54.8 | 60.7 | 63.9 |
| MMVet<sub>GPT-4-Turbo</sub> | 54.2 | 60.0 | 66.9 | 62.0 | **67.1** |
| HallBench<sub>avg</sub> | 45.2 | 48.1 | 46.1 | 50.6 | **52.9** |
| MathVista<sub>testmini</sub> | 58.3 | 60.6 | 52.4 | 58.2 | **68.2** |
| MathVision | - | - | - | 16.3 | **25.07** |

### Video Benchmarks

| Benchmark | Qwen2-VL-7B | **Qwen2.5-VL-7B** |
| :--- | :---: | :---: |
| MVBench | 67.0 | **69.6** |
| PerceptionTest<sub>test</sub> | 66.9 | **70.5** |
| Video-MME<sub>wo/w subs</sub> | 63.3/69.0 | **65.1**/**71.6** |
| LVBench | - | 45.3 |
| LongVideoBench | - | 54.7 |
| MMBench-Video | 1.44 | 1.79 |
| TempCompass | - | 71.7 |
| MLVU | - | 70.2 |
| CharadesSTA/mIoU | - | 43.6 |

### Agent Benchmarks

| Benchmarks | Qwen2.5-VL-7B |
|-------------------------|---------------|
| ScreenSpot | 84.7 |
| ScreenSpot Pro | 29.0 |
| AITZ_EM | 81.9 |
| Android Control High_EM | 60.1 |
| Android Control Low_EM | 93.7 |
| AndroidWorld_SR | 25.5 |
| MobileMiniWob++_SR | 91.4 |

## Reference

**Original model card**: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)