jimbozhang committed
Commit 4c11a4f · verified · 1 Parent(s): 7c3ae10

Update README.md

Files changed (1):
  1. README.md +212 -101

README.md CHANGED
@@ -1,133 +1,244 @@
  ---
  license: apache-2.0
- # TODO: which license?
  language:
  - en
  - zh
- # TODO: specify the supported languages
  pipeline_tag: audio-text-to-text
  tags:
  - multimodal
  - audio-language-model
  - audio
- # - audio-captioning
- # - audio-classification
- # - audio-generation
- # - audio-question-answering
- # - audio-understanding
- # - chat
- # - speech-recognition
- # - text-to-speech
- # TODO: what capabilities does it have?
  base_model:
  - mispeech/dasheng-0.6B
- - Qwen/Qwen2.5-Omni-3B
  base_model_relation: finetune
- # TODO: check whether this is correct
  ---

- # MiDashengLM

- ## Requirements

- <!-- Qwen2.5-Omni requires transformers >= 4.52, which requires Python >= 3.9. -->
- <!-- torchaudio is required to process audio and load audio files. -->
- - Python >= 3.9
- - `transformers[torch]` >= 4.52
- - `torchaudio`
- - `librosa`

  ## Usage

- > [!NOTE]
- > MiDashengLM uses custom code to define the model, so you need to set `trust_remote_code=True` when loading it.
- > You can find this code in the repository.

  ```python
- >>> from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
- >>> model = AutoModelForCausalLM.from_pretrained("zhoukz/MiDashengLM-HF-dev", trust_remote_code=True)
- >>> model.eval()
- >>> processor = AutoProcessor.from_pretrained("zhoukz/MiDashengLM-HF-dev", trust_remote_code=True)
-
- >>> tokenizer = AutoTokenizer.from_pretrained("mispeech/MiDashengLM-HF-dev")
- >>> processor = AutoProcessor.from_pretrained("mispeech/MiDashengLM-HF-dev", trust_remote_code=True)
- >>> model = AutoModelForCausalLM.from_pretrained("mispeech/MiDashengLM-HF-dev", trust_remote_code=True)
- >>> model.eval()
-
- >>> messages = [
- ...     {
- ...         "role": "system",
- ...         "content": [
- ...             {"type": "text", "text": "You are a helpful language and speech assistant."}
- ...         ],
- ...     },
- ...     {
- ...         "role": "user",
- ...         "content": [
- ...             {"type": "text", "text": "Caption the audio."},
- ...             {
- ...                 "type": "audio",
- ...                 "path": "/path/to/audio.wav",
- ...             },
- ...         ],
- ...     },
- ... ]
-
- >>> import torch
- >>> with torch.no_grad():
- ...     model_inputs = processor.apply_chat_template(
- ...         messages,
- ...         tokenize=True,
- ...         add_generation_prompt=True,
- ...         add_special_tokens=True,
- ...         return_dict=True,
- ...     )
- ...     generation = model.generate(**model_inputs)
- ...     output = tokenizer.batch_decode(generation, skip_special_tokens=True)
-
- >>> print(output)
- ["An engine is idling."]
  ```

- [`processor.apply_chat_template`] accepts audio inputs specified in various ways, including file paths, URLs, and `np.ndarray`:

- [`processor.apply_chat_template`]: https://huggingface.co/docs/transformers/v4.53.1/en/main_classes/processors#transformers.ProcessorMixin.apply_chat_template

  ```python
- >>> messages_by_path = [
- ...     {
- ...         "role": "user",
- ...         "content": [
- ...             {"type": "text", "text": "Caption the audio."},
- ...             {"type": "audio", "path": "/path/to/audio.wav"},
- ...         ],
- ...     },
- ... ]
-
- >>> messages_by_url = [
- ...     {
- ...         "role": "user",
- ...         "content": [
- ...             {"type": "text", "text": "Caption the audio."},
- ...             {"type": "audio", "url": "https://example.com/audio.wav"},
- ...         ],
- ...     },
- ... ]
-
- >>> import numpy as np
- >>> messages_by_data = [
- ...     {
- ...         "role": "user",
- ...         "content": [
- ...             {"type": "text", "text": "Caption the audio."},
- ...             {"type": "audio", "audio": np.random.randn(16000)},
- ...         ],
- ...     },
- ... ]
  ```

  ## Citation

  ```bibtex
- TODO
- ```
  ---
  license: apache-2.0
  language:
  - en
  - zh
+ - th
+ - id
+ - vi
  pipeline_tag: audio-text-to-text
  tags:
  - multimodal
  - audio-language-model
  - audio
  base_model:
  - mispeech/dasheng-0.6B
+ - Qwen/Qwen2.5-Omni-7B
  base_model_relation: finetune
  ---

+ <div align="center">
+ <h1>
+ MiDashengLM
+ </h1>
+ <b><em>Efficient audio understanding with general audio captions</em></b>
+ <p>
+ </p>
+ <a href="https://github.com/xiaomi-research/dasheng-lm"><img src="https://img.shields.io/badge/Homepage-GitHub-0366d6" alt="homepage"></a>
+ <a href="https://arxiv.org/abs/2507.xxxxx"><img src="https://img.shields.io/badge/arXiv-2507.xxxxx-b31b1b" alt="paper"></a>
+ <a href="https://huggingface.co/spaces/mispeech/MiDashengLM"><img src="https://img.shields.io/badge/Demo-Gradio-ffcc66" alt="gradio demo"></a>
+ <a href="https://frankenliu.github.io/midashenglm_demo/"><img src="https://img.shields.io/badge/Demo-Page-0366d6" alt="demo page"></a>
+ </div>

+ ## 🔥 Key Highlights

+ **State-of-the-Art Performance**
+ - Outperforms Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B on **multiple key audio understanding tasks**.
+
+ **High Efficiency**
+ - **3.2× throughput speedup** over Qwen2.5-Omni-7B at comparable batch sizes.
+ - Up to **20× throughput speedup** by further increasing the batch size: we tested batch sizes up to **512** for 30-second audio inputs on 80 GB GPUs, whereas the baselines only support a batch size of 8.
+ - Time-to-first-token (TTFT) speedup of up to 4× over Qwen2.5-Omni-7B (a rough timing sketch follows the figure below).
+
+ **Caption-based Alignment**
+ - Trained with **general audio captions** (instead of ASR transcripts) to achieve holistic audio understanding.
+
+ **Full Transparency**
+ - Training data from **public sources** and a reproducible pipeline.
+ - Apache License 2.0 for **both research and commercial use**.
+
+ <div align="center">
+ <img src="fig/capabilities_plot_7b-1.png" width="600">
+ </div>
+
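The throughput and TTFT numbers above come from the authors' own benchmark, which is not reproduced in this card. Purely as an illustrative sketch (and not the protocol behind the reported figures), one could estimate tokens per second for a single `generate()` call roughly as follows, assuming the model and `model_inputs` are prepared as in the Usage section below and live on the same device:

```python
import time

import torch


def rough_generation_throughput(model, model_inputs, max_new_tokens=128):
    """Very rough tokens-per-second estimate for one generate() call.

    This is only a sketch, not the benchmark behind the reported
    3.2x/20x speedups; `model` and `model_inputs` must share a device.
    """
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish pending GPU work before timing
    start = time.perf_counter()
    with torch.no_grad():
        generation = model.generate(**model_inputs, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Newly generated tokens = output tokens minus prompt tokens, summed over the batch.
    new_tokens = generation.numel() - model_inputs["input_ids"].numel()
    return new_tokens / elapsed
```
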
+ ## Acknowledgment and Model Foundation
+
+ Although MiDashengLM demonstrates superior audio understanding performance and efficiency compared to Qwen2.5-Omni models,
+ we acknowledge **Qwen2.5-Omni as a remarkable and respected foundational work** in the field.
+ Our model uses [Qwen2.5-Omni-7B Thinker](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) as the starting point for its decoder, building on its robust architecture and weight initialization.
+
+ The audio encoder is built upon [Dasheng](https://github.com/XiaoMi/dasheng), an open-source audio encoder for general audio understanding with state-of-the-art performance.
+ **Dasheng serves as the core foundation enabling MiDashengLM's exceptional performance**.
+
+ ## Framework
+
+ MiDashengLM integrates the powerful Dasheng audio encoder with
+ the Qwen2.5-Omni-7B Thinker decoder through a unique caption-based alignment strategy.
+ Unlike conventional ASR-driven approaches,
+ our model leverages general audio captions to capture comprehensive audio representations encompassing speech, environmental sounds, and musical elements
+ in a unified textual format. This design enables holistic audio understanding while maintaining exceptional computational efficiency.
+
+ <img src="fig/Framework-1.png" width="800">
+
+ ### Why Captions Instead of ASR?
+
+ ASR Limitations:
+ - Discards large amounts of non-speech audio (music, environmental sounds).
+ - Misses paralinguistic information (speaker emotion, acoustic properties).
+ - Monotonic alignment provides only a trivial learning signal.
+
+ Caption Advantages:
+ - Utilizes all audio content.
+ - Captures global audio context.
+ - Non-monotonic alignment provides a harder, more informative learning signal.
+
+ ### Novel Open Source Dataset for Training: ACAVCaps
+
+ ACAVCaps is a meticulously curated 38,662-hour collection of general audio captions derived from the open-source [ACAV100M audio repository](https://acav100m.github.io/).
+ While leveraging ACAV100M's extensive raw audio materials, we completely re-engineered the annotation process to create a dataset for holistic audio understanding.
+ We divide the dataset into six categories:
+
+ | Category | Example Caption |
+ |----------|-----------------|
+ | Pure Speech | "A female voice narrates historical competition with synthetic modulation" |
+ | Pure Sound | "Outdoor scene with wind, birds, duck quacking and background noise" |
+ | Mixed Music | "Crowd cheering with electronic synthesizer-driven soundscape" |
+ | Mixed Music | "The audio features a crowd cheering and clapping alongside electronic music with a synthesizer-driven, dark, and energetic soundscape." |
+ | Mixed Speech | "A Russian voice demonstrates a synthesizer’s capabilities over an experimental electronic backdrop, explaining its sound design and value in a gritty, vocal-fry tone." |
+ | Mixed Sound | "A man speaks in English about entering a city and village, accompanied by the sounds of a running vehicle." |
+
+ The figure below illustrates our data curation pipeline for ACAVCaps:
+
+ <img src="fig/acavcaps-1.png" width="800">
+
+ Each caption is generated through a three-step process (a schematic sketch of the filtering step follows the list):
+
+ 1. **Multi-expert analysis** (speech, vocal, music, acoustics)
+ 2. **LLM reasoning**, synthesizing the expert metadata with [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)
+ 3. **Filtering** for audio-text consistency with [GLAP](https://github.com/xiaomi-research/dasheng-glap)
+
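The GLAP scoring interface and the exact threshold used for ACAVCaps are not described in this card; the following is only a schematic sketch of step 3, using a generic cosine-similarity filter over placeholder audio/text embedding functions:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def filter_captions(pairs, embed_audio, embed_text, threshold=0.3):
    """Keep (audio, caption) pairs whose embeddings agree.

    `embed_audio` / `embed_text` stand in for a contrastive audio-text
    model such as GLAP; the real API and the 0.3 threshold are assumptions.
    """
    kept = []
    for audio, caption in pairs:
        score = cosine_similarity(embed_audio(audio), embed_text(caption))
        if score >= threshold:
            kept.append((audio, caption, score))
    return kept


# Toy usage with random "embedders", just to show the interface shape.
rng = np.random.default_rng(0)
kept = filter_captions([("clip.wav", "a dog barks")],
                       embed_audio=lambda _: rng.standard_normal(16),
                       embed_text=lambda _: rng.standard_normal(16))
```
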
+ We will **release the ACAVCaps dataset** after the ICASSP 2026 review process.
 
  ## Usage

+ ### Load Model
+
  ```python
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ model_id = "mispeech/midashenglm-7B"
+
+ model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
  ```
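The snippet above loads the model on the CPU with its default dtype. As an optional addition that is not part of the original card, you may want to move it to a GPU in half precision and switch to eval mode before generating; this assumes a CUDA device with bfloat16 support:

```python
import torch

# Optional, not from the original card: run on a CUDA GPU in bfloat16.
# Assumes such a device is available; otherwise keep the CPU setup above.
model = model.to(device="cuda", dtype=torch.bfloat16).eval()
```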

+ ### Construct Prompt
+
+ ```python
+ user_prompt = "Caption the audio."  # You may try any other prompt
+
+ messages = [
+     {
+         "role": "system",
+         "content": [
+             {"type": "text", "text": "You are a helpful language and speech assistant."}
+         ],
+     },
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": user_prompt},
+             {
+                 "type": "audio",
+                 "path": "/path/to/example.wav",
+                 # or "url": "https://example.com/example.wav"
+                 # or "audio": np.random.randn(16000)
+             },
+         ],
+     },
+ ]
+ ```
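The `"audio"` option above uses random noise purely as a placeholder. If you would rather pass a decoded waveform than a path or URL, a minimal sketch with `librosa` looks like the following; note that the 16 kHz target sample rate is an assumption, not a requirement documented in this card:

```python
import librosa

# Hypothetical helper: decode a local file to a mono float waveform.
# The 16 kHz resampling target is an assumption, not documented here.
waveform, sample_rate = librosa.load("/path/to/example.wav", sr=16000, mono=True)

audio_entry = {"type": "audio", "audio": waveform}  # drop-in for the dict above
```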

+ ### Generate Output

  ```python
+ import torch
+
+ with torch.no_grad():
+     model_inputs = processor.apply_chat_template(
+         messages,
+         tokenize=True,
+         add_generation_prompt=True,
+         add_special_tokens=True,
+         return_dict=True,
+     )
+     generation = model.generate(**model_inputs)
+     output = processor.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
  ```
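The call above relies on the model's default generation settings. If you want to bound the response length, standard `transformers` arguments such as `max_new_tokens` can be passed straight through `generate` (a minor variation, not shown in the original card):

```python
with torch.no_grad():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=128,  # cap on newly generated tokens; the value is arbitrary
    )
    output = processor.batch_decode(generation, skip_special_tokens=True)
```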

+ ## Results
+
+ MiDashengLM delivers solid performance across diverse audio understanding tasks.
+
+ ### Audio Captioning Results
+
+ | Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:--------:|:--------------:|:--------------:|:----------------:|:-------------------:|
+ | Music | MusicCaps | **59.71** | 43.71 | 35.43 |
+ | Music | Songdescriber | **45.39** | 45.31 | 44.63 |
+ | Sound | AudioCaps | **62.18** | 60.79 | 49.00 |
+ | Sound | ClothoV2 | **49.20** | 47.55 | 48.01 |
+ | Sound | AutoACD | **66.52** | 55.93 | 44.76 |
+
+ *Metrics: FENSE (higher is better).*
+
+ ### Audio and Paralinguistic Classification
+
+ | Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:----------------:|:------:|:--------------:|:----------------:|:------------------:|
+ | VoxCeleb1 | ACC↑ | **92.36** | 59.71 | 82.72 |
+ | VoxLingua107 | ACC↑ | **93.41** | 51.03 | 73.65 |
+ | VoxCeleb-Gender | ACC↑ | 96.12 | **99.82** | 99.69 |
+ | VGGSound | ACC↑ | **52.11** | 0.97 | 2.20 |
+ | Cochlscene | ACC↑ | **74.06** | 23.88 | 18.34 |
+ | NSynth | ACC↑ | **80.52** | 60.45 | 38.09 |
+ | FMA | ACC↑ | 63.73 | **66.77** | 27.91 |
+ | FSDKaggle2018 | ACC↑ | **75.25** | 31.38 | 24.75 |
+ | AudioSet | mAP↑ | **8.86** | 6.48 | 3.47 |
+ | FSD50K | mAP↑ | **37.58** | 23.87 | 27.23 |
+
+ ### ASR Performance
+
+ | Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:----------------------:|:-----------:|:--------------:|:---------------:|:-------------------:|
+ | LibriSpeech test-clean | English | 3.7 | 1.7 | **1.3** |
+ | LibriSpeech test-other | English | 6.2 | 3.4 | **2.4** |
+ | People's Speech | English | 27.8 | 28.6 | **22.3** |
+ | AISHELL2 Mic | Chinese | 3.2 | **2.5** | 2.7 |
+ | AISHELL2 iOS | Chinese | 2.9 | **2.6** | **2.6** |
+ | AISHELL2 Android | Chinese | 3.1 | 2.7 | **2.6** |
+ | GigaSpeech2 | Indonesian | **20.8** | 21.2 | >100 |
+ | GigaSpeech2 | Thai | **36.9** | 53.8 | >100 |
+ | GigaSpeech2 | Vietnamese | **18.1** | 18.6 | >100 |
+
+ *Metrics: WER/CER (lower is better).*
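
As an aside, word error rate itself can be computed with the `jiwer` package; this is only an illustration of the metric, not necessarily the scoring setup used for the table above:

```python
import jiwer

reference = "an engine is idling"
hypothesis = "an engine is idle"
print(jiwer.wer(reference, hypothesis))  # 0.25: one of four reference words is wrong
```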
+
+ ### Question Answering Results
+
+ | Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:------------:|:-------:|:------:|:--------------:|:----------------:|:-------------------:|
+ | MuChoMusic | | ACC↑ | **71.35** | 64.79 | 67.40 |
+ | MMAU | Sound | ACC↑ | 68.47 | 67.87 | **74.17** |
+ | MMAU | Music | ACC↑ | 66.77 | **69.16** | 61.08 |
+ | MMAU | Speech | ACC↑ | **63.66** | 59.76 | 57.66 |
+ | MMAU | Average | ACC↑ | **66.30** | 65.60 | 64.30 |
+ | MusicQA | | FENSE↑ | **62.35** | 60.60 | 40.00 |
+ | AudioCaps-QA | | FENSE↑ | **54.31** | 53.28 | 47.34 |
+
+ *Metrics: Higher is better.*
+
+ ### Reproduction Instructions
+
+ To reproduce our results, please refer to https://github.com/xiaomi-research/dasheng-lm#reproduction-instructions
+
  ## Citation

+ MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and commercial applications**.
+
+ If you find MiDashengLM useful in your research, please consider citing our work:
+
  ```bibtex
+ @misc{midashenglm7b,
+   title  = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
+   author = {Xiaomi MiLM Plus Horizon Team},
+   year   = {2025},
+ }
+ ```