Update README.md
README.md

---
license: apache-2.0
language:
- en
- zh
- th
- id
- vi
pipeline_tag: audio-text-to-text
tags:
- multimodal
- audio-language-model
- audio
base_model:
- mispeech/dasheng-0.6B
- Qwen/Qwen2.5-Omni-7B
base_model_relation: finetune
---

<div align="center">
  <h1>
    MiDashengLM
  </h1>
  <b><em>Efficient audio understanding with general audio captions</em></b>
  <p>
  </p>
  <a href="https://github.com/xiaomi-research/dasheng-lm"><img src="https://img.shields.io/badge/Homepage-GitHub-0366d6" alt="homepage"></a>
  <a href="https://arxiv.org/abs/2507.xxxxx"><img src="https://img.shields.io/badge/arXiv-2507.xxxxx-b31b1b" alt="paper"></a>
  <a href="https://huggingface.co/spaces/mispeech/MiDashengLM"><img src="https://img.shields.io/badge/Demo-Gradio-ffcc66" alt="gradio demo"></a>
  <a href="https://frankenliu.github.io/midashenglm_demo/"><img src="https://img.shields.io/badge/Demo-Page-0366d6" alt="demo page"></a>
</div>

## 🔥 Key Highlights

**State-of-the-Art Performance**
- Outperforms Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B on **multiple key audio understanding tasks**.

**High Efficiency**
- **3.2× throughput speedup** at comparable batch sizes compared to Qwen2.5-Omni-7B.
- Up to **20× throughput speedup** by further increasing the batch size: we tested batch sizes up to **512** for 30-second audio inputs on 80 GB GPUs, whereas the baselines only support a batch size of 8.
- Time-to-first-token (TTFT) speedup of up to 4× compared to Qwen2.5-Omni-7B.

**Caption-based Alignment**
- Trained with **general audio captions** (instead of ASR transcripts) to achieve holistic audio understanding.

**Full Transparency**
- **Public-source** training data and a reproducible pipeline.
- Apache License 2.0 for **both research and commercial use**.

<div align="center">
  <img src="fig/capabilities_plot_7b-1.png" width="600">
</div>

## Acknowledgment and Model Foundation

Although MiDashengLM demonstrates superior audio understanding performance and efficiency compared to the Qwen2.5-Omni models,
we acknowledge **Qwen2.5-Omni as a remarkable and respected foundational work** in the field.
Our model specifically uses the [Qwen2.5-Omni-7B Thinker](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) as the decoder starting point during training, building upon its robust architecture and weight initialization.

The audio encoder is built upon [Dasheng](https://github.com/XiaoMi/dasheng), an open-source audio encoder for general audio understanding with state-of-the-art performance.
**Dasheng serves as the core foundation enabling MiDashengLM's performance**.

## Framework

MiDashengLM integrates the Dasheng audio encoder with
the Qwen2.5-Omni-7B Thinker decoder through a caption-based alignment strategy.
Unlike conventional ASR-driven approaches,
our model leverages general audio captions to capture comprehensive audio representations, encompassing speech, environmental sounds, and musical elements
in a unified textual format. This design enables holistic audio understanding while maintaining high computational efficiency.

<img src="fig/Framework-1.png" width="800">

### Why Captions Instead of ASR?

ASR limitations:
- Discards a huge amount of non-speech audio (music/environmental sounds).
- Misses paralinguistic information (speaker emotion, acoustic properties).
- Monotonic alignment provides a trivial learning signal.

Caption advantages:
- Utilizes all audio content.
- Captures global audio context.
- Non-monotonic alignment provides a hard learning signal.

### Novel Open Source Dataset for Training: ACAVCaps

ACAVCaps is a meticulously curated 38,662-hour collection of general audio captions derived from the open-source [ACAV100M audio repository](https://acav100m.github.io/).
While leveraging ACAV100M's extensive raw audio materials, we completely re-engineered the annotation process to create a dataset for holistic audio understanding.
We divide the dataset into six categories:

| Category | Example Caption |
|----------|-----------------|
| Pure Speech | "A female voice narrates historical competition with synthetic modulation" |
| Pure Sound | "Outdoor scene with wind, birds, duck quacking and background noise" |
| Mixed Music | "Crowd cheering with electronic synthesizer-driven soundscape" |
| Mixed Music | "The audio features a crowd cheering and clapping alongside electronic music with a synthesizer-driven, dark, and energetic soundscape." |
| Mixed Speech | "A Russian voice demonstrates a synthesizer's capabilities over an experimental electronic backdrop, explaining its sound design and value in a gritty, vocal-fry tone." |
| Mixed Sound | "A man speaks in English about entering a city and village, accompanied by the sounds of a running vehicle." |

The figure below illustrates our data curation pipeline for ACAVCaps:

<img src="fig/acavcaps-1.png" width="800">

Each caption is generated through a three-step process:

1. **Multi-expert analysis** (speech, vocal, music, acoustics)
2. **LLM reasoning** synthesizing the expert metadata with [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)
3. **Filtering** for audio-text consistency with [GLAP](https://github.com/xiaomi-research/dasheng-glap) (a schematic sketch of this step follows below)
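
The consistency filter in step 3 keeps only clips whose caption actually matches the audio. The sketch below is purely illustrative and does not use GLAP's real API: `embed_audio` and `embed_text` are random placeholder encoders and the threshold is an assumed value, but it shows the thresholded cosine-similarity idea.

```python
# Schematic audio-text consistency filter (illustrative only).
# embed_audio/embed_text stand in for a real audio-text model such as GLAP;
# here they are random placeholders so the sketch runs end to end.
import numpy as np

rng = np.random.default_rng(0)

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    return rng.standard_normal(512)  # placeholder embedding

def embed_text(caption: str) -> np.ndarray:
    return rng.standard_normal(512)  # placeholder embedding

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

candidates = [
    (np.zeros(16000), "Outdoor scene with wind, birds and duck quacking"),
    (np.zeros(16000), "A completely unrelated caption"),
]

THRESHOLD = 0.3  # assumed cut-off; a real pipeline would tune this
kept = [
    (audio, caption)
    for audio, caption in candidates
    if cosine(embed_audio(audio), embed_text(caption)) >= THRESHOLD
]
print(f"kept {len(kept)} of {len(candidates)} audio-caption pairs")
```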

We will **release the ACAVCaps dataset** after the ICASSP 2026 review process.

## Usage

### Load Model

MiDashengLM uses custom model code, so pass `trust_remote_code=True` when loading it.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "mispeech/midashenglm-7B"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
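
For GPU inference you will typically also choose a dtype and device. This is a minimal sketch, assuming a CUDA device is available and that the checkpoint runs in bfloat16; neither is prescribed by this model card:

```python
import torch

# Assumption: a CUDA GPU with bfloat16 support.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.to("cuda").eval()
```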

### Construct Prompt

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```
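
As the comments above indicate, the audio entry also accepts a decoded waveform via the `"audio"` key. A small sketch using `soundfile`, assuming a 16 kHz mono file (resample first if your file uses a different rate):

```python
import soundfile as sf

# Decode the file into a float32 numpy array; sr is the file's sample rate.
waveform, sr = sf.read("/path/to/example.wav", dtype="float32")

# Replace the audio entry of the user turn with the raw waveform.
messages[1]["content"][1] = {"type": "audio", "audio": waveform}
```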

### Generate Output

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    )
    generation = model.generate(**model_inputs)
    output = processor.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```
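
When captioning many files, it can be convenient to wrap the steps above in a small helper. This is only a convenience sketch reusing the `model` and `processor` objects defined earlier; `max_new_tokens=128` is an arbitrary choice, not a recommended setting:

```python
def caption_audio(path: str, prompt: str = "Caption the audio.") -> str:
    """Apply the chat template, generate, and decode for a single audio file."""
    msgs = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful language and speech assistant."}],
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": prompt}, {"type": "audio", "path": path}],
        },
    ]
    with torch.no_grad():
        inputs = processor.apply_chat_template(
            msgs,
            tokenize=True,
            add_generation_prompt=True,
            add_special_tokens=True,
            return_dict=True,
        )
        generation = model.generate(**inputs, max_new_tokens=128)
    return processor.batch_decode(generation, skip_special_tokens=True)[0]

print(caption_audio("/path/to/example.wav"))
```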

## Results

MiDashengLM delivers solid performance across diverse audio understanding tasks.

### Audio Captioning Results

| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------:|:-------------:|:-----------:|:---------------:|:-------------------:|
| Music | MusicCaps | **59.71** | 43.71 | 35.43 |
| Music | Songdescriber | **45.39** | 45.31 | 44.63 |
| Sound | AudioCaps | **62.18** | 60.79 | 49.00 |
| Sound | ClothoV2 | **49.20** | 47.55 | 48.01 |
| Sound | AutoACD | **66.52** | 55.93 | 44.76 |

*Metrics: FENSE (higher is better).*

### Audio and Paralinguistic Classification

| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:---------------:|:------:|:-----------:|:---------------:|:-------------------:|
| VoxCeleb1 | ACC↑ | **92.36** | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | **93.41** | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 96.12 | **99.82** | 99.69 |
| VGGSound | ACC↑ | **52.11** | 0.97 | 2.20 |
| Cochlscene | ACC↑ | **74.06** | 23.88 | 18.34 |
| NSynth | ACC↑ | **80.52** | 60.45 | 38.09 |
| FMA | ACC↑ | 63.73 | **66.77** | 27.91 |
| FSDKaggle2018 | ACC↑ | **75.25** | 31.38 | 24.75 |
| AudioSet | mAP↑ | **8.86** | 6.48 | 3.47 |
| FSD50K | mAP↑ | **37.58** | 23.87 | 27.23 |
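
For reference, the mAP values above are the macro average of per-class average precision for multi-label tagging. The snippet below is only a toy illustration of that metric using scikit-learn, not the evaluation code behind the table:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy multi-label tagging example: 3 clips, 4 sound classes.
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 0],
                   [0, 0, 1, 1]])
y_score = np.array([[0.8, 0.1, 0.2, 0.7],
                    [0.3, 0.6, 0.1, 0.2],
                    [0.1, 0.2, 0.9, 0.4]])

# Macro-averaged average precision over classes, i.e. mAP.
print(average_precision_score(y_true, y_score, average="macro"))
```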

### ASR Performance

| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------------:|:----------:|:-----------:|:---------------:|:-------------------:|
| LibriSpeech test-clean | English | 3.7 | 1.7 | **1.3** |
| LibriSpeech test-other | English | 6.2 | 3.4 | **2.4** |
| People's Speech | English | 27.8 | 28.6 | **22.3** |
| AISHELL2 Mic | Chinese | 3.2 | **2.5** | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | **2.6** | **2.6** |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | **2.6** |
| GigaSpeech2 | Indonesian | **20.8** | 21.2 | >100 |
| GigaSpeech2 | Thai | **36.9** | 53.8 | >100 |
| GigaSpeech2 | Vietnamese | **18.1** | 18.6 | >100 |

*Metrics: WER/CER (lower is better).*
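
WER and CER measure the minimum number of word- and character-level edits needed to turn a hypothesis into the reference transcript, normalized by the reference length. Purely as an illustration (not the evaluation code used for the table above), the `jiwer` package computes both:

```python
import jiwer

reference = "an engine is idling near the platform"
hypothesis = "an engine is idling near platform"

print(jiwer.wer(reference, hypothesis))  # word error rate
print(jiwer.cer(reference, hypothesis))  # character error rate
```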

### Question Answering Results

| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------:|:-------:|:------:|:-----------:|:---------------:|:-------------------:|
| MuChoMusic | | ACC↑ | **71.35** | 64.79 | 67.40 |
| MMAU | Sound | ACC↑ | 68.47 | 67.87 | **74.17** |
| MMAU | Music | ACC↑ | 66.77 | **69.16** | 61.08 |
| MMAU | Speech | ACC↑ | **63.66** | 59.76 | 57.66 |
| MMAU | Average | ACC↑ | **66.30** | 65.60 | 64.30 |
| MusicQA | | FENSE↑ | **62.35** | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | **54.31** | 53.28 | 47.34 |

*Metrics: higher is better.*

### Reproduction Instructions

To reproduce our results, please refer to https://github.com/xiaomi-research/dasheng-lm#reproduction-instructions

## Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.

If you find MiDashengLM useful in your research, please consider citing our work:

```bibtex
@misc{midashenglm7b,
  title  = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author = {{Xiaomi MiLM Plus Horizon Team}},
  year   = {2025},
}
```