AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
Abstract
AU-Harness is an efficient and comprehensive evaluation framework for Large Audio Language Models (LALMs) that addresses issues of speed, reproducibility, and task coverage, revealing gaps in temporal understanding and spoken language reasoning.
Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations that were previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning. Our findings also highlight a lack of standardization in instruction modality across audio benchmarks, which can lead to performance differences of up to 9.5 absolute points on challenging complex instruction-following tasks. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
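The quoted speedup rests on batching requests and dispatching evaluation batches in parallel. Below is a minimal sketch of that general pattern, assuming a generic asynchronous model client; the names `AudioSample`, `run_batch`, and `evaluate` are hypothetical illustrations, not AU-Harness's actual API.

```python
# Illustrative sketch of batched, parallel evaluation -- not AU-Harness internals.
import asyncio
from dataclasses import dataclass

@dataclass
class AudioSample:          # hypothetical record for one evaluation item
    audio_path: str
    prompt: str
    reference: str

async def run_batch(batch: list[AudioSample]) -> list[str]:
    # Stand-in for a batched inference request to a model server;
    # here we only simulate latency and return placeholder predictions.
    await asyncio.sleep(0.1)
    return [f"prediction for {s.audio_path}" for s in batch]

def batched(items: list[AudioSample], size: int):
    # Split the dataset into fixed-size batches.
    for i in range(0, len(items), size):
        yield items[i:i + size]

async def evaluate(samples: list[AudioSample], batch_size: int = 16) -> list[str]:
    # Dispatch all batches concurrently so the inference backend stays saturated.
    tasks = [run_batch(b) for b in batched(samples, batch_size)]
    results = await asyncio.gather(*tasks)
    return [pred for batch in results for pred in batch]

if __name__ == "__main__":
    data = [AudioSample(f"clip_{i}.wav", "Transcribe the audio.", "") for i in range(64)]
    preds = asyncio.run(evaluate(data))
    print(len(preds), "predictions")
```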
Community
Voice is becoming central to AI assistants as the ultimate UI. But evaluation has remained fragmented, narrow, and slow.
AU-Harness brings it all together and is:
⚡ Blazing fast and inference-efficient
🛠️ Customizable for accents, languages, long-form audio, and multi-turn dialogue
🎧 Broad task coverage: ASR → paralinguistics → understanding → reasoning → safety
📦 Modular & extensible for easy experimentation
📊 50+ datasets | 380+ subsets | 21 tasks | 9 metrics
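To make the "customizable" and "modular" claims concrete, here is a rough sketch of what a run configuration for such a toolkit could look like. All field names are illustrative assumptions for this example, not AU-Harness's real schema; consult the repository for the actual configuration format.

```python
# Hypothetical run configuration -- field names are illustrative only.
config = {
    "model": {
        "name": "my-audio-llm",
        "endpoint": "http://localhost:8000/v1",  # assumed OpenAI-compatible server
        "batch_size": 32,
    },
    "tasks": [
        "asr",                        # transcription scored with word error rate
        "paralinguistics",            # accent / emotion style tasks
        "spoken_language_reasoning",  # one of the new categories in the paper
    ],
    "prompting": {
        "instruction_modality": "text",  # text vs. spoken instructions (see abstract)
        "template": "default",
    },
    "metrics": ["wer", "llm_judge"],
}
```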
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence (2025)
- DIFFA: Large Language Diffusion Models Can Listen and Understand (2025)
- LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model (2025)
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models (2025)
- CodecBench: A Comprehensive Benchmark for Acoustic and Semantic Evaluation (2025)
- Multi-TW: Benchmarking Multimodal Models on Traditional Chinese Question Answering in Taiwan (2025)
- AHELM: A Holistic Evaluation of Audio-Language Models (2025)