arxiv:2509.08031

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

Published on Sep 9
· Submitted by amant555 on Sep 12
Abstract

AU-Harness is an efficient and comprehensive evaluation framework for Large Audio Language Models (LALMs) that addresses issues of speed, reproducibility, and task coverage, revealing gaps in temporal understanding and spoken language reasoning.

AI-generated summary

Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations that were previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning. Our findings also highlight a lack of standardization in instruction modality across audio benchmarks, which can lead to performance differences of up to 9.5 absolute points on challenging complex instruction-following downstream tasks. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
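The claimed speedup rests on batching samples and running batches in parallel. A minimal sketch of that idea follows; it is not the actual AU-Harness API, and the names `evaluate` and `toy_model` are hypothetical placeholders for illustration only:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(model_fn, samples, batch_size=4, workers=2):
    """Split samples into batches and score the batches in parallel."""
    batches = [samples[i:i + batch_size]
               for i in range(0, len(samples), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves batch order, so scores stay aligned with inputs
        per_batch = list(pool.map(model_fn, batches))
    # Flatten per-batch scores back into one list
    return [score for batch in per_batch for score in batch]

# Toy stand-in for a model call: score each utterance by its word count
def toy_model(batch):
    return [len(utt.split()) for utt in batch]

scores = evaluate(toy_model, ["hello world", "a b c", "one", "x y"],
                  batch_size=2)
# scores == [2, 3, 1, 2]
```

In a real harness, `model_fn` would wrap a batched inference call to an LALM, and batching amortizes per-request overhead while the worker pool keeps multiple requests in flight.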

Community

Paper author and submitter:

Voice is becoming central to AI assistants as the ultimate UI. But evaluation has remained fragmented, narrow, and slow.

AU-Harness brings it all together and is:
⚡ Blazing fast and inference-efficient
🛠️ Customizable for accents, languages, long-form audio, and multi-turn dialogue
📊 Broad task coverage: ASR → paralinguistics → understanding → reasoning → safety
📦 Modular & extensible for easy experimentation
👉 50+ datasets | 380+ subsets | 21 tasks | 9 metrics

