MiDashengLM

Efficient audio understanding with general audio captions


🔥 Key Highlights

State-of-the-Art Performance

  • Outperforms Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B on multiple key audio understanding tasks.

High Efficiency

  • 3.2× throughput speedup at comparable batch sizes compared to Qwen2.5-Omni-7B.
  • 20× throughput speedup at larger batch sizes: we tested up to batch size 512 for 30 s audio input on 80 GB GPUs, whereas the baselines only support batch size 8.
  • Time-to-first-token (TTFT) speedup of up to 4× compared to Qwen2.5-Omni-7B.

Caption-based Alignment

  • Trained with general audio captions (instead of ASR transcripts) to achieve holistic audio understanding.

Full Transparency

  • Publicly available training data and a reproducible training pipeline.
  • Apache License 2.0 for both research and commercial use.

Acknowledgment and Model Foundation

Although MiDashengLM demonstrates superior audio understanding performance and efficiency compared to Qwen2.5-Omni models, we acknowledge Qwen2.5-Omni as a remarkable and respected foundational work in the field. Our model specifically uses Qwen2.5-Omni-7B Thinker as the initialization for decoder training, building upon its robust architecture and weight initialization.

The audio encoder is built upon Dasheng, an open-source audio encoder for general audio understanding with state-of-the-art performance. Dasheng serves as the core foundation enabling MiDashengLM's exceptional performance.

Framework

MiDashengLM integrates the powerful Dasheng audio encoder with the Qwen2.5-Omni-7B Thinker decoder through a unique caption-based alignment strategy. Unlike conventional ASR-driven approaches, our model leverages general audio captions to capture comprehensive audio representations encompassing speech, environmental sounds, and musical elements in a unified textual format. This design enables holistic audio understanding while maintaining exceptional computational efficiency.
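
Conceptually, this corresponds to an encoder-projector-decoder wiring. The sketch below is purely illustrative: the class name, the linear projector, and the tensor shapes are assumptions for exposition, not the actual MiDashengLM implementation.

import torch
import torch.nn as nn

class CaptionAlignedAudioLM(nn.Module):
    """Illustrative encoder-projector-decoder wiring, not the released MiDashengLM code."""

    def __init__(self, audio_encoder: nn.Module, text_decoder: nn.Module,
                 audio_dim: int, decoder_dim: int):
        super().__init__()
        self.audio_encoder = audio_encoder  # Dasheng-style audio encoder
        self.projector = nn.Linear(audio_dim, decoder_dim)  # assumed simple linear projection
        self.text_decoder = text_decoder    # Qwen2.5-Omni-7B Thinker-style LLM decoder

    def forward(self, waveform: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        audio_feats = self.audio_encoder(waveform)   # (batch, frames, audio_dim)
        audio_tokens = self.projector(audio_feats)   # (batch, frames, decoder_dim)
        # During caption-based alignment, audio tokens are prefixed to the caption
        # embeddings and the decoder learns to generate the general audio caption.
        return self.text_decoder(torch.cat([audio_tokens, text_embeds], dim=1))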

Why Captions Instead of ASR?

ASR Limitations:

  • Discards a huge amount of non-speech audio (music/environmental sounds).
  • Misses paralinguistic information (speaker emotion, acoustic properties).
  • Monotonic alignment provides only a trivial learning signal.

Caption Advantages:

  • Utilizes all audio content.
  • Captures global audio context.
  • Non-monotonic alignment provides a hard learning signal.

Novel Open Source Dataset for Training: ACAVCaps

ACAVCaps is a meticulously curated 38,662-hour collection of general audio captions derived from the open-source ACAV100M audio repository. While leveraging ACAV100M's extensive raw audio materials, we completely re-engineered the annotation process to create a dataset for holistic audio understanding. We divide the dataset into six categories:

| Category | Example Caption |
|---|---|
| Pure Speech | "A female voice narrates historical competition with synthetic modulation" |
| Pure Sound | "Outdoor scene with wind, birds, duck quacking and background noise" |
| Pure Music | "Crowd cheering with electronic synthesizer-driven soundscape" |
| Mixed Music | "The audio features a crowd cheering and clapping alongside electronic music with a synthesizer-driven, dark, and energetic soundscape." |
| Mixed Speech | "A Russian voice demonstrates a synthesizer's capabilities over an experimental electronic backdrop, explaining its sound design and value in a gritty, vocal-fry tone." |
| Mixed Sound | "A man speaks in English about entering a city and village, accompanied by the sounds of a running vehicle." |

The figure below illustrates our data curation pipeline for ACAVCaps:

Each caption is generated through a three-step process (a schematic sketch follows the list):

  1. Multi-expert analysis (speech, vocal, music, acoustics)
  2. LLM reasoning with DeepSeek-R1 to synthesize the expert metadata into a caption
  3. Filtering for audio-text consistency with Dasheng-GLAP
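
For illustration only, this flow can be summarized as the sketch below; every callable here is a hypothetical stand-in rather than released code.

def curate_caption(audio, experts, reasoning_llm, consistency_score, threshold=0.5):
    """Illustrative ACAVCaps-style curation flow; all callables are hypothetical stand-ins."""
    # 1. Multi-expert analysis: speech, vocal, music and acoustics experts emit metadata.
    metadata = {name: expert(audio) for name, expert in experts.items()}
    # 2. LLM reasoning: a reasoning model (DeepSeek-R1 in our pipeline) fuses the
    #    metadata into a single holistic caption.
    caption = reasoning_llm(metadata)
    # 3. Filtering: keep the caption only if audio-text consistency (Dasheng-GLAP) is high.
    return caption if consistency_score(audio, caption) >= threshold else None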

We will release the ACAVCaps dataset after the ICASSP 2026 review process.

Usage

Load Model

from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

Construct Prompt

user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful language and speech assistant."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
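
The audio entry accepts a local path, a URL, or an in-memory array, as the comments above indicate. For example, a waveform loaded with soundfile (our choice of loader here is only an assumption; any loader that yields a NumPy array should work) can be passed via the "audio" key:

import soundfile as sf

# Load a local file as a NumPy array and replace the audio entry of the user message.
waveform, sample_rate = sf.read("/path/to/example.wav")
messages[1]["content"][1] = {"type": "audio", "audio": waveform}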

Generate Output

import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    )
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
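
The example above runs on the default device and dtype. As an optional variant (an assumption on our part, not part of the official usage example), the model can be loaded in half precision on a GPU; the inputs then need to be moved to the same device:

import torch
from transformers import AutoModelForCausalLM

# Assumed GPU variant: half-precision weights on a CUDA device.
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda").eval()

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    )
    # Move tensor inputs to the same device as the model.
    model_inputs = {k: v.to("cuda") if torch.is_tensor(v) else v
                    for k, v in model_inputs.items()}
    generation = model.generate(**model_inputs)
    print(tokenizer.batch_decode(generation, skip_special_tokens=True))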

Results

MiDashengLM delivers solid performance across diverse audio understanding tasks.

Audio Captioning Results

| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|---|---|---|---|---|
| Music | MusicCaps | 59.71 | 43.71 | 35.43 |
| Music | Songdescriber | 45.39 | 45.31 | 44.63 |
| Sound | AudioCaps | 62.18 | 60.79 | 49.00 |
| Sound | ClothoV2 | 49.20 | 47.55 | 48.01 |
| Sound | AutoACD | 66.52 | 55.93 | 44.76 |

Metrics: FENSE (higher is better).

Audio and Paralinguistic Classification

| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|---|---|---|---|---|
| VoxCeleb1 | ACC↑ | 92.36 | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | 93.41 | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 96.12 | 99.82 | 99.69 |
| VGGSound | ACC↑ | 52.11 | 0.97 | 2.20 |
| Cochlscene | ACC↑ | 74.06 | 23.88 | 18.34 |
| NSynth | ACC↑ | 80.52 | 60.45 | 38.09 |
| FMA | ACC↑ | 63.73 | 66.77 | 27.91 |
| FSDKaggle2018 | ACC↑ | 75.25 | 31.38 | 24.75 |
| AudioSet | mAP↑ | 8.86 | 6.48 | 3.47 |
| FSD50K | mAP↑ | 37.58 | 23.87 | 27.23 |

ASR Performance

| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|---|---|---|---|---|
| LibriSpeech test-clean | English | 3.7 | 1.7 | 1.3 |
| LibriSpeech test-other | English | 6.2 | 3.4 | 2.4 |
| People's Speech | English | 27.8 | 28.6 | 22.3 |
| AISHELL2 Mic | Chinese | 3.2 | 2.5 | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | 2.6 | 2.6 |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | 2.6 |
| GigaSpeech2 | Indonesian | 20.8 | 21.2 | >100 |
| GigaSpeech2 | Thai | 36.9 | 53.8 | >100 |
| GigaSpeech2 | Vietnamese | 18.1 | 18.6 | >100 |

Metrics: WER/CER (lower is better).

Question Answering Results

| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|---|---|---|---|---|---|
| MuChoMusic | – | ACC↑ | 71.35 | 64.79 | 67.40 |
| MMAU | Sound | ACC↑ | 68.47 | 67.87 | 74.17 |
| MMAU | Music | ACC↑ | 66.77 | 69.16 | 61.08 |
| MMAU | Speech | ACC↑ | 63.66 | 59.76 | 57.66 |
| MMAU | Average | ACC↑ | 66.30 | 65.60 | 64.30 |
| MusicQA | – | FENSE↑ | 62.35 | 60.60 | 40.00 |
| AudioCaps-QA | – | FENSE↑ | 54.31 | 53.28 | 47.34 |

Metrics: Higher is better.

Reproduction Instructions

To reproduce our results, we provide:

  • Prompts (prompt.csv)
  • Evaluation scripts
  • Example JSONL files

1. Install Dependencies for Evaluation (not needed for inference)

pip install -r requirements.txt

2. Generate Model Outputs

Generate responses using the model's official framework with prompts from prompt.csv.
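
As a rough illustration (the column names below are assumptions; check prompt.csv for the actual schema), the prompts can be read with the standard csv module and plugged into the "Construct Prompt" example above as user_prompt:

import csv

# Hypothetical sketch: column names "dataset" and "prompt" are assumptions.
with open("prompt.csv", newline="", encoding="utf-8") as f:
    prompts = {row["dataset"]: row["prompt"] for row in csv.DictReader(f)}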

3. Convert Outputs to JSONL Format

Format model outputs using the example JSONL files (a minimal conversion sketch follows the table):

| Task | Example File |
|---|---|
| Automatic Speech Recognition | MiDashengLM_LibriSpeech_test-clean.jsonl |
| Single-target Audio Tagging | MiDashengLM_NSynth.jsonl |
| Gender Recognition | MiDashengLM_VoxCeleb-Gender.jsonl |
| Multi-target Audio Tagging | MiDashengLM_FSD50K.jsonl |
| Audio Captioning | MiDashengLM_AutoACD.jsonl |
| Open Audio Question Answering | MiDashengLM_MusicQA.jsonl |
| Audio QA with Options | MiDashengLM_MuChoMusic.jsonl |
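
A minimal conversion sketch is shown below. The field names follow the "Uses:" comments in the evaluation commands of step 4 (e.g. compute_wer.py reads lang, text, and model_output); the example JSONL files remain the authoritative reference for each task's schema.

import json

# Sketch: one record per test utterance, written as one JSON object per line.
records = [
    {"lang": "en", "text": "reference transcript", "model_output": "model transcript"},
]
with open("MiDashengLM_LibriSpeech_test-clean.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")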

4. Evaluate Results

Execute the corresponding evaluation scripts:

# Automatic Speech Recognition (WER)
# Uses: lang, text, model_output
python evaluate/wer/compute_wer.py -i evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl

# Single-target Audio Tagging (ACC)
# Uses: label, model_output
python evaluate/compute_at_acc.py -i evaluate/jsonl/MiDashengLM_NSynth.jsonl

# Gender Recognition (ACC)
# Uses: label, model_output
python evaluate/compute_gender_acc.py -i evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl

# Multi-target Audio Tagging (mAP)
# Uses: dataset_name, label, model_output, model_name
python evaluate/compute_map.py -i evaluate/jsonl/MiDashengLM_FSD50K.jsonl

# Audio Captioning (FENSE)
# Uses: audio, text, model_output
python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_AutoACD.jsonl

# Open Audio QA (FENSE)
# Uses: audio, answer, model_output
python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_MusicQA.jsonl

# Audio QA with Options (ACC)
# Uses: answer, model_output
python evaluate/compute_qa_acc.py -i evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl

5. Evaluate on MECAT and MMAU benchmarks

Please refer to the official repositories for evaluation on the MECAT and MMAU benchmarks.

Efficiency

MiDashengLM demonstrates superior inference efficiency compared to Qwen2.5-Omni-7B, achieving a 3.2× speedup at comparable batch sizes and an overall potential speedup of 20.2× with larger batches.

| Batch Size | MiDashengLM (samples/s) | Qwen2.5-Omni-7B (samples/s) | Speedup |
|---|---|---|---|
| 1 | 0.45 | 0.36 | 1.25× |
| 4 | 1.40 | 0.91 | 1.53× |
| 8 | 2.72 | 1.15 | 2.36× |
| 16 | 5.18 | OOM | - |
| 32 | 9.78 | OOM | - |
| 64 | 17.07 | OOM | - |
| 128 | 22.73 | OOM | - |
| 200 | 25.15 | OOM | - |

Tested on 80GB GPU with 30s audio, 100-token output.
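
For a rough reproduction of this kind of measurement (a sketch under our own assumptions, not the exact benchmarking harness), one can time a batched generate call and divide the batch size by the elapsed time:

import time
import torch

def samples_per_second(model, batched_inputs, batch_size, max_new_tokens=100):
    """Time one batched generate call and return throughput in samples/s (CUDA assumed)."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**batched_inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    return batch_size / (time.perf_counter() - start)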

Training Data

MiDashengLM is trained exclusively on publicly available datasets across five categories: Speech, Sound and General Audio, Speech and Paralinguistic, Music, and Question Answering. All datasets are listed below with their respective tasks, lengths, and supervised fine-tuning (SFT) usage.

Speech Training Data

This table lists speech-related datasets used for tasks like Automatic Speech Recognition (ASR), keyword spotting (KWS), and speech-to-text translation (S2TT). The column "SFT?" indicates whether the dataset is used for supervised fine-tuning.

| Dataset | Task | Length (h) | SFT? |
|---|---|---|---|
| LibriSpeech | ASR | 960 | √ |
| LibriHeavy | ASR | 50,000 | X |
| GigaSpeech | ASR | 10,000 | √ |
| GigaSpeech2 | ASR | 30,000 | √ |
| WeNetSpeech | ASR | 10,000 | √ |
| Yodas | ASR | 320,000 | X |
| CommonVoice-17.0 | ASR | 5,000 | √ |
| AISHELL-1 | ASR | 100 | √ |
| AISHELL-2 | ASR | 1,000 | √ |
| AISHELL-3 | ASR | 70 | √ |
| LJSpeech-1.1 | ASR | 37 | X |
| LibriTTS | ASR | 585 | X |
| MultiLingualSpokenWords | KWS | 5,000 | X |
| Emilia | ASR | 101,000 | √ |
| CovoST-v2 | S2TT | 2,880 | √ |
| Fleurs | S2TT | 1,224 | X |
| MSR-86K | ASR, LangID | 86,000 | √ |
| ACAV100M-Speech | ASR | 55,754 | X |
| Must-C | ASR, S2TT | 1,000 | √ |
| MLS | ASR | 50,000 | X |
| SpgiSpeech | ASR | 5,000 | X |
| PeoplesSpeech | ASR | 30,000 | X |
| KeSpeech | ASR | 1,400 | √ |
| LAION-300M | Caption | 230,000 | X |
| Total | | 997,010 | 258,410 |

Sound and General Audio Datasets

| Dataset | Task | Length (h) | SFT? |
|---|---|---|---|
| FSD50k | Sound Event | 77 | √ |
| AudioSet | Sound Event | 5,200 | |
| AudioSet-strong | Sound Event | 220 | X |
| VGGSound | Sound Event | 540 | √ |
| FSDKaggle2018 | Sound Event | 20 | √ |
| FSDKaggle2019 | Sound Event | 100 | |
| ARCA23k | Sound Event | 120 | X |
| AutoACD | Audio (Sound) Caption | 5,200 | √ |
| AudioSetCaps | Audio (Sound) Caption | 6,000 | √ |
| SoundVECaps | Audio (Sound) Caption | 5,000 | √ |
| WavCaps | Audio (Sound) Caption | 7,567 | √ |
| Audiocaps | Audio (Sound) Caption | 100 | √ |
| Clothov2 | Audio (Sound) Caption | 17 | √ |
| TACOS | Audio (Sound) Caption | 98 | √ |
| CochlScene | SoundScape | 500 | √ |
| BirdSet | SoundScape | 7,000 | X |
| ACAVCaps | General Caption | 38,662 | √ |
| Total | | 76,421 | 69,081 |

Speech and Paralinguistic Datasets

| Dataset | Task | Length (h) | SFT? |
|---|---|---|---|
| IEMOCAP | Emotion | 8 | √ |
| Meld | Emotion | 12 | √ |
| SUBESCO | Emotion | 9 | X |
| RAVDESS-Speech | Emotion | 2 | X |
| RAVDESS-Song | Emotion | 1 | X |
| CREMA-D | Emotion | 4 | X |
| ESD | Emotion | 29 | X |
| VocalSound | Vocal sound classification | 20 | √ |
| NonSpeech7k | Vocal sound classification | 3 | √ |
| VoxLingua107 | Language identification | 7,200 | √ |
| CommonLanguage | Language identification | 45 | √ |
| YLACombe | Language identification | 5 | X |
| VoxCeleb1 | Speaker verification | 76 | √ |
| CNCeleb | Speaker verification & age | 2,100 | √ |
| VoxCeleb2 | Speaker verification | 1,000 | √ |
| VoxBlink1 | Speaker verification | 1,300 | |
| VoxBlink2 | Speaker verification | 2,600 | √ |
| VoxTube | Language identification | 5,200 | √ |
| LibriCount | Speaker counting | 8 | √ |
| FluentSpeechCommands | Intent classification & gender | 17 | X |
| SpeechOcean762 | Speaker age | 5 | X |
| ASVSpoof5 | Spoof detection | 603 | X |
| Total | | 20,247 | 19,572 |

Music-Related Datasets

Covers music captioning, genre recognition, instrument classification, and singing style identification.

| Dataset | Task | Length (h) | SFT? |
|---|---|---|---|
| MusicCaps | Music Caption | 15 | √ |
| Songdescriber | Music Caption | 23 | √ |
| LPMusicCaps-MTT | Music Caption | 18 | √ |
| LPMusicCaps-MSD | Music Caption | 1,000 | √ |
| VocalSet | Singing style identification | 10 | X |
| FreeMusicArchive | Genre recognition | 610 | √ |
| MTG-Jamendo | Instrument classification, Genre recognition | 3,768 | √ |
| NSynth | Instrument classification | 360 | √ |
| GoodSounds | Instrument classification | 28 | √ |
| chMusic | Instrument classification | 1 | √ |
| CTIS | Instrument classification | 1 | √ |
| Total | | 5,824 | 5,814 |

Question Answering Datasets

Used for training on audio-visual QA, environment QA, and music QA tasks. Most support SFT.

| Dataset | Task | # QA | SFT? |
|---|---|---|---|
| AVQA | Environment QA | 36,114 | √ |
| ClothoAQA | Environment QA | 6,175 | √ |
| TACOS+ | Environment QA | 40,019 | √ |
| MusicQA | Music QA | 112,878 | √ |
| SIFT-50M | Speech QA | 21,430,000 | √ |
| ACAV-QA | General QA | 24,371 | √ |

Citation

MiDashengLM is released under the Apache License 2.0, and we encourage its use in both research and commercial applications.

If you find MiDashengLM useful in your research, please consider citing our work:

@misc{midashenglm7b,
    title = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
    author = {Xiaomi MiLM Plus Horizon Team},
    year = {2025},
}