Update README.md
README.md

---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---

<div align="center">
  <h1>
    MiDashengLM
  </h1>
  <b><em>Efficient audio understanding with general audio captions</em></b>
  <p>
  </p>
  <a href="https://github.com/xiaomi-research/dasheng-lm"><img src="https://img.shields.io/badge/Homepage-GitHub-0366d6" alt="homepage"></a>
  <a href="https://arxiv.org/abs/2507.xxxxx"><img src="https://img.shields.io/badge/arXiv-2507.xxxxx-b31b1b" alt="paper"></a>
  <a href="https://huggingface.co/spaces/mispeech/MiDashengLM"><img src="https://img.shields.io/badge/Demo-Gradio-ffcc66" alt="gradio demo"></a>
  <a href="https://frankenliu.github.io/midashenglm_demo/"><img src="https://img.shields.io/badge/Demo-Page-0366d6" alt="demo page"></a>
</div>

## 🔥 Key Highlights

**State-of-the-Art Performance**
- Outperforms Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B on **multiple key audio understanding tasks**.

**High Efficiency**
- **3.2× throughput speedup** at comparable batch sizes compared to Qwen2.5-Omni-7B.
- Up to **20× throughput speedup** by further increasing the batch size: we tested batch sizes up to **512** for 30-second audio inputs on 80 GB GPUs, whereas the baselines only support a batch size of 8.
- Time-to-first-token (TTFT) speedup of up to 4× compared to Qwen2.5-Omni-7B.

**Caption-based Alignment**
- Trained with **general audio captions** (instead of ASR transcripts) to achieve holistic audio understanding.

**Full Transparency**
- **Public-source** training data and a reproducible pipeline.
- Apache License 2.0 for **both research and commercial use**.

<div align="center">
  <img src="fig/capabilities_plot_7b-1.png" width="600">
</div>

## Acknowledgment and Model Foundation

Although MiDashengLM demonstrates superior audio understanding performance and efficiency compared to the Qwen2.5-Omni models,
we acknowledge **Qwen2.5-Omni as a remarkable and respected foundational work** in the field.
Our model specifically uses the [Qwen2.5-Omni-7B Thinker](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) as the decoder starting point during training, building upon its robust architecture and weight initialization.

The audio encoder is built upon [Dasheng](https://github.com/XiaoMi/dasheng), an open-source audio encoder for general audio understanding with state-of-the-art performance.
**Dasheng serves as the core foundation enabling MiDashengLM's performance**.

## Framework

MiDashengLM integrates the Dasheng audio encoder with
the Qwen2.5-Omni-7B Thinker decoder through a caption-based alignment strategy.
Unlike conventional ASR-driven approaches,
our model leverages general audio captions to capture comprehensive audio representations, encompassing speech, environmental sounds, and musical elements
in a unified textual format. This design enables holistic audio understanding while maintaining high computational efficiency.

<img src="fig/Framework-1.png" width="800">

### Why Captions Instead of ASR?

ASR limitations:
- Discards a huge amount of non-speech audio (music/environmental sounds).
- Misses paralinguistic information (speaker emotion, acoustic properties).
- Monotonic alignment provides a trivial learning signal.

Caption advantages:
- Utilizes all audio content.
- Captures global audio context.
- Non-monotonic alignment provides a hard learning signal.

### Novel Open Source Dataset for Training: ACAVCaps

ACAVCaps is a meticulously curated 38,662-hour collection of general audio captions derived from the open-source [ACAV100M audio repository](https://acav100m.github.io/).
While leveraging ACAV100M's extensive raw audio materials, we completely re-engineered the annotation process to create a dataset for holistic audio understanding.
We divide the dataset into six categories:

| Category | Example Caption |
|----------|-----------------|
| Pure Speech | "A female voice narrates historical competition with synthetic modulation" |
| Pure Sound | "Outdoor scene with wind, birds, duck quacking and background noise" |
| Mixed Music | "Crowd cheering with electronic synthesizer-driven soundscape" |
| Mixed Music | "The audio features a crowd cheering and clapping alongside electronic music with a synthesizer-driven, dark, and energetic soundscape." |
| Mixed Speech | "A Russian voice demonstrates a synthesizer's capabilities over an experimental electronic backdrop, explaining its sound design and value in a gritty, vocal-fry tone." |
| Mixed Sound | "A man speaks in English about entering a city and village, accompanied by the sounds of a running vehicle." |

The figure below illustrates our data curation pipeline for ACAVCaps:

<img src="fig/acavcaps-1.png" width="800">

Each caption is generated through a three-step process:

1. **Multi-expert analysis** (speech, vocal, music, acoustics)
2. **LLM reasoning** synthesizing the expert metadata with [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)
3. **Filtering** for audio-text consistency with [GLAP](https://github.com/xiaomi-research/dasheng-glap) (a schematic sketch of this step follows below)
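
The consistency filter in step 3 keeps only clips whose caption actually matches the audio. The sketch below is purely illustrative and does not use GLAP's real API: `embed_audio` and `embed_text` are random placeholder encoders and the threshold is an assumed value, but it shows the thresholded cosine-similarity idea.

```python
# Schematic audio-text consistency filter (illustrative only).
# embed_audio/embed_text stand in for a real audio-text model such as GLAP;
# here they are random placeholders so the sketch runs end to end.
import numpy as np

rng = np.random.default_rng(0)

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    return rng.standard_normal(512)  # placeholder embedding

def embed_text(caption: str) -> np.ndarray:
    return rng.standard_normal(512)  # placeholder embedding

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

candidates = [
    (np.zeros(16000), "Outdoor scene with wind, birds and duck quacking"),
    (np.zeros(16000), "A completely unrelated caption"),
]

THRESHOLD = 0.3  # assumed cut-off; a real pipeline would tune this
kept = [
    (audio, caption)
    for audio, caption in candidates
    if cosine(embed_audio(audio), embed_text(caption)) >= THRESHOLD
]
print(f"kept {len(kept)} of {len(candidates)} audio-caption pairs")
```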

We will **release the ACAVCaps dataset** after the ICASSP 2026 review process.

## Usage

### Load Model

MiDashengLM uses custom model code, so pass `trust_remote_code=True` when loading it.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "mispeech/midashenglm-7B"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
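
For GPU inference you will typically also choose a dtype and device. This is a minimal sketch, assuming a CUDA device is available and that the checkpoint runs in bfloat16; neither is prescribed by this model card:

```python
import torch

# Assumption: a CUDA GPU with bfloat16 support.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.to("cuda").eval()
```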

### Construct Prompt

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
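
As the comments above indicate, the audio entry also accepts a decoded waveform via the `"audio"` key. A small sketch using `soundfile`, assuming a 16 kHz mono file (resample first if your file uses a different rate):

```python
import soundfile as sf

# Decode the file into a float32 numpy array; sr is the file's sample rate.
waveform, sr = sf.read("/path/to/example.wav", dtype="float32")

# Replace the audio entry of the user turn with the raw waveform.
messages[1]["content"][1] = {"type": "audio", "audio": waveform}
```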

### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    )
    generation = model.generate(**model_inputs)
    output = processor.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
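
When captioning many files, it can be convenient to wrap the steps above in a small helper. This is only a convenience sketch reusing the `model` and `processor` objects defined earlier; `max_new_tokens=128` is an arbitrary choice, not a recommended setting:

```python
def caption_audio(path: str, prompt: str = "Caption the audio.") -> str:
    """Apply the chat template, generate, and decode for a single audio file."""
    msgs = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful language and speech assistant."}],
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": prompt}, {"type": "audio", "path": path}],
        },
    ]
    with torch.no_grad():
        inputs = processor.apply_chat_template(
            msgs,
            tokenize=True,
            add_generation_prompt=True,
            add_special_tokens=True,
            return_dict=True,
        )
        generation = model.generate(**inputs, max_new_tokens=128)
    return processor.batch_decode(generation, skip_special_tokens=True)[0]

print(caption_audio("/path/to/example.wav"))
```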

## Results

MiDashengLM delivers solid performance across diverse audio understanding tasks.

### Audio Captioning Results

| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------:|:-------------:|:-----------:|:---------------:|:-------------------:|
| Music | MusicCaps | **59.71** | 43.71 | 35.43 |
| Music | Songdescriber | **45.39** | 45.31 | 44.63 |
| Sound | AudioCaps | **62.18** | 60.79 | 49.00 |
| Sound | ClothoV2 | **49.20** | 47.55 | 48.01 |
| Sound | AutoACD | **66.52** | 55.93 | 44.76 |

*Metrics: FENSE (higher is better).*

### Audio and Paralinguistic Classification

| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:---------------:|:------:|:-----------:|:---------------:|:-------------------:|
| VoxCeleb1 | ACC↑ | **92.36** | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | **93.41** | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 96.12 | **99.82** | 99.69 |
| VGGSound | ACC↑ | **52.11** | 0.97 | 2.20 |
| Cochlscene | ACC↑ | **74.06** | 23.88 | 18.34 |
| NSynth | ACC↑ | **80.52** | 60.45 | 38.09 |
| FMA | ACC↑ | 63.73 | **66.77** | 27.91 |
| FSDKaggle2018 | ACC↑ | **75.25** | 31.38 | 24.75 |
| AudioSet | mAP↑ | **8.86** | 6.48 | 3.47 |
| FSD50K | mAP↑ | **37.58** | 23.87 | 27.23 |
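
For reference, the mAP values above are the macro average of per-class average precision for multi-label tagging. The snippet below is only a toy illustration of that metric using scikit-learn, not the evaluation code behind the table:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy multi-label tagging example: 3 clips, 4 sound classes.
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [0, 0, 1, 1]])
y_score = np.array([[0.8, 0.1, 0.2, 0.7],
                    [0.3, 0.6, 0.1, 0.2],
                    [0.1, 0.2, 0.9, 0.4]])

# Macro-averaged average precision over classes, i.e. mAP.
print(average_precision_score(y_true, y_score, average="macro"))
```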

### ASR Performance

| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------------:|:----------:|:-----------:|:---------------:|:-------------------:|
| LibriSpeech test-clean | English | 3.7 | 1.7 | **1.3** |
| LibriSpeech test-other | English | 6.2 | 3.4 | **2.4** |
| People's Speech | English | 27.8 | 28.6 | **22.3** |
| AISHELL2 Mic | Chinese | 3.2 | **2.5** | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | **2.6** | **2.6** |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | **2.6** |
| GigaSpeech2 | Indonesian | **20.8** | 21.2 | >100 |
| GigaSpeech2 | Thai | **36.9** | 53.8 | >100 |
| GigaSpeech2 | Vietnamese | **18.1** | 18.6 | >100 |

*Metrics: WER/CER (lower is better).*
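
WER and CER measure the minimum number of word- and character-level edits needed to turn a hypothesis into the reference transcript, normalized by the reference length. Purely as an illustration (not the evaluation code used for the table above), the `jiwer` package computes both:

```python
import jiwer

reference = "an engine is idling near the platform"
hypothesis = "an engine is idling near platform"

print(jiwer.wer(reference, hypothesis))  # word error rate
print(jiwer.cer(reference, hypothesis))  # character error rate
```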

### Question Answering Results

| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------:|:-------:|:------:|:-----------:|:---------------:|:-------------------:|
| MuChoMusic | | ACC↑ | **71.35** | 64.79 | 67.40 |
| MMAU | Sound | ACC↑ | 68.47 | 67.87 | **74.17** |
| MMAU | Music | ACC↑ | 66.77 | **69.16** | 61.08 |
| MMAU | Speech | ACC↑ | **63.66** | 59.76 | 57.66 |
| MMAU | Average | ACC↑ | **66.30** | 65.60 | 64.30 |
| MusicQA | | FENSE↑ | **62.35** | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | **54.31** | 53.28 | 47.34 |

*Metrics: higher is better.*

### Reproduction Instructions

To reproduce our results, please refer to https://github.com/xiaomi-research/dasheng-lm#reproduction-instructions

## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@misc{midashenglm7b,
  title  = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author = {{Xiaomi MiLM Plus Horizon Team}},
  year   = {2025},
}
```