Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey
Abstract
As large audio-language models (LALMs), which augment large language models (LLMs) with auditory capabilities, continue to advance, they are expected to demonstrate universal proficiency across a wide range of auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category, highlight challenges in the field, and offer insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluation of LALMs, providing clear guidelines for the community. We will release the collection of surveyed papers and actively maintain it to support ongoing advancements in the field.
Community
A survey of works on evaluating large audio-language models across various aspects. Project page: https://github.com/ckyang1124/LALM-Evaluation-Survey
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- VideoLLM Benchmarks and Evaluation: A Survey (2025)
- AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound (2025)
- Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models (2025)
- AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models (2025)
- Multilingual Prompt Engineering in Large Language Models: A Survey Across NLP Tasks (2025)
- MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix (2025)
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding (2025)