zhoukz committed on
Commit 51de817 · unverified · 1 Parent(s): 8a0522d

Add README
.gitattributes CHANGED
@@ -34,3 +34,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ fig/acavcaps-1.png filter=lfs diff=lfs merge=lfs -text
+ fig/batchsize_1_comparison_7b-1.png filter=lfs diff=lfs merge=lfs -text
+ fig/capabilities_plot_7b-1.png filter=lfs diff=lfs merge=lfs -text
+ fig/Framework-1.png filter=lfs diff=lfs merge=lfs -text
+ fig/pretraining_sampling_rates-1.png filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,190 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ Copyright 2025 Xiaomi Inc., China
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md ADDED
@@ -0,0 +1,464 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ - th
+ - id
+ - vi
+ pipeline_tag: audio-text-to-text
+ tags:
+ - multimodal
+ - audio-language-model
+ - audio
+ base_model:
+ - mispeech/dasheng-0.6B
+ - Qwen/Qwen2.5-Omni-7B
+ base_model_relation: finetune
+ ---
+
+ <div align="center">
+ <h1>
+ MiDashengLM
+ </h1>
+ <b><em>Efficient audio understanding with general audio captions</em></b>
+ <p>
+ </p>
+ <a href="https://arxiv.org/abs/2508.03983"><img src="https://img.shields.io/badge/arXiv-2508.03983-b31b1b" alt="arXiv"></a>
+ <a href="https://github.com/xiaomi-research/dasheng-lm"><img src="https://img.shields.io/badge/Homepage-GitHub-0366d6" alt="homepage"></a>
+ <a href="https://modelscope.cn/models/midasheng/midashenglm-7b"><img src="https://img.shields.io/badge/ModelScope-7B-7448ce" alt="ModelScope"></a>
+ <a href="https://modelscope.cn/studios/midasheng/MiDashengLM-7B"><img src="https://img.shields.io/badge/Demo-Gradio-ffcc66" alt="Gradio demo"></a>
+ <a href="https://xiaomi-research.github.io/dasheng-lm/"><img src="https://img.shields.io/badge/Demo-Page-0366d6" alt="demo page"></a>
+ </div>
+
+ > [!TIP]
+ > This repository contains the **fp8 quantized** weights of the original model, which provide substantial memory savings and higher inference throughput while keeping overall task performance close to the [bf16 release](https://huggingface.co/mispeech/midashenglm-7b-bf16). As quantization introduces numerical approximations, individual outputs may differ slightly from the full-precision model. If you need maximum numerical fidelity (e.g., strict reproduction), use the [fp32 model](https://huggingface.co/mispeech/midashenglm-7b).
+
+ ## 🔥 Key Highlights
+
+ **State-of-the-Art Performance**
+ - Outperforms Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B on **multiple key audio understanding tasks**.
+
+ **High Efficiency**
+ - **3.2×** throughput speedup at comparable batch sizes compared to Qwen2.5-Omni-7B.
+ - **20×** throughput speedup by further increasing the batch size: we tested up to **batch size = 512** for 30 s audio inputs on 80 GB GPUs, whereas the baseline only supports batch size = 8.
+ - Time-to-first-token (TTFT) speedup of up to **4×** compared to Qwen2.5-Omni-7B.
+
+ **Caption-based Alignment**
+ - Trained with **general audio captions** (instead of ASR transcripts) to achieve holistic audio understanding.
+
+ **Full Transparency**
+ - **Publicly available** training data and a reproducible pipeline.
+ - Apache License 2.0 for **both research and commercial use**.
+
+ <div align="center">
+ <img src="fig/capabilities_plot_7b-1.png" width="600">
+ </div>
+
+ ## Acknowledgment and Model Foundation
+
+ Although MiDashengLM demonstrates superior audio understanding performance and efficiency compared to Qwen2.5-Omni models,
+ we acknowledge **Qwen2.5-Omni as a remarkable and respected foundational work** in the field.
+ Our model specifically uses [Qwen2.5-Omni-7B Thinker](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) as the initialization for decoder training, building upon its robust architecture and weight initialization.
+
+ The audio encoder is built upon [Dasheng](https://github.com/XiaoMi/dasheng), an open-source audio encoder for general audio understanding with state-of-the-art performance.
+ **Dasheng serves as the core foundation enabling MiDashengLM's exceptional performance**.
+
+ ## Framework
+
+ MiDashengLM integrates the powerful Dasheng audio encoder with
+ the Qwen2.5-Omni-7B Thinker decoder through a unique caption-based alignment strategy.
+ Unlike conventional ASR-driven approaches,
+ our model leverages general audio captions to capture comprehensive audio representations encompassing speech, environmental sounds, and musical elements
+ in a unified textual format. This design enables holistic audio understanding while maintaining exceptional computational efficiency.
+
+ <img src="fig/Framework-1.png" width="800">
+
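+ To make the data flow concrete, the sketch below shows the high-level forward pass this description implies: audio is encoded by Dasheng, projected into the decoder's embedding space, and consumed by the Thinker decoder alongside text tokens. This is a conceptual illustration only; the module and variable names (`audio_encoder`, `projector`, `decoder`) are placeholders, not the actual implementation.
+
+ ```python
+ import torch
+
+ def forward_sketch(audio_encoder, projector, decoder, waveform, text_ids):
+     # 1. The Dasheng encoder turns raw audio into a sequence of frame embeddings.
+     audio_states = audio_encoder(waveform)        # (batch, frames, enc_dim)
+     # 2. A projection maps encoder features into the decoder embedding space.
+     audio_embeds = projector(audio_states)        # (batch, frames, dec_dim)
+     # 3. The Thinker decoder attends over audio and text jointly and is
+     #    trained to emit a general audio caption.
+     text_embeds = decoder.embed_tokens(text_ids)  # (batch, tokens, dec_dim)
+     inputs = torch.cat([audio_embeds, text_embeds], dim=1)
+     return decoder(inputs_embeds=inputs)          # caption logits
+ ```
+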
+ ### Why Captions Instead of ASR?
+
+ ASR limitations:
+ - Discards a huge amount of non-speech audio (music/environmental sounds).
+ - Misses paralinguistic information (speaker emotion, acoustic properties).
+ - Monotonic alignment provides only a trivial learning signal.
+
+ Caption advantages:
+ - Utilizes all audio content.
+ - Captures global audio context.
+ - Non-monotonic alignment provides a harder, more informative learning signal.
+
+ ### Novel Open Source Dataset for Training: ACAVCaps
+
+ ACAVCaps is a meticulously curated 38,662-hour collection of general audio captions derived from the open-source [ACAV100M audio repository](https://acav100m.github.io/).
+ While leveraging ACAV100M's extensive raw audio materials, we completely re-engineered the annotation process to create a dataset for holistic audio understanding.
+ We divide the dataset into six categories:
+
+ | Category | Example Caption |
+ |----------|-----------------|
+ | Pure Speech | "A female voice narrates historical competition with synthetic modulation" |
+ | Pure Sound | "Outdoor scene with wind, birds, duck quacking and background noise" |
+ | Pure Music | "Crowd cheering with electronic synthesizer-driven soundscape" |
+ | Mixed Music | "The audio features a crowd cheering and clapping alongside electronic music with a synthesizer-driven, dark, and energetic soundscape." |
+ | Mixed Speech | "A Russian voice demonstrates a synthesizer’s capabilities over an experimental electronic backdrop, explaining its sound design and value in a gritty, vocal-fry tone." |
+ | Mixed Sound | "A man speaks in English about entering a city and village, accompanied by the sounds of a running vehicle." |
+
+ The figure below illustrates our data curation pipeline for ACAVCaps:
+
+ <img src="fig/acavcaps-1.png" width="800">
+
+ Each caption is generated through a three-step process:
+
+ 1. **Multi-expert analysis** (speech, vocal, music, acoustics)
+ 2. **LLM reasoning** synthesizing the expert metadata with [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)
+ 3. **Filtering** for audio-text consistency with [Dasheng-GLAP](https://github.com/xiaomi-research/dasheng-glap)
+
+ We will **release the ACAVCaps dataset** after the ICASSP 2026 review process.
+
+ ## Usage
+
+ ### Load Model
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
+
+ model_id = "mispeech/midashenglm-7b-fp8"
+
+ model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+ ```
+
+ ### Construct Prompt
+
+ ```python
+ user_prompt = "Caption the audio."  # You may try any other prompt
+
+ messages = [
+     {
+         "role": "system",
+         "content": [
+             {"type": "text", "text": "You are a helpful language and speech assistant."}
+         ],
+     },
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": user_prompt},
+             {
+                 "type": "audio",
+                 "path": "/path/to/example.wav",
+                 # or "url": "https://example.com/example.wav"
+                 # or "audio": np.random.randn(16000)
+             },
+         ],
+     },
+ ]
+ ```
+
+ ### Generate Output
+
+ ```python
+ import torch
+
+ with torch.no_grad():
+     model_inputs = processor.apply_chat_template(
+         messages,
+         tokenize=True,
+         add_generation_prompt=True,
+         add_special_tokens=True,
+         return_dict=True,
+     ).to(device=model.device, dtype=model.dtype)
+     generation = model.generate(**model_inputs)
+     output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
+ ```
+
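+ The `"audio"` field above shows that an in-memory array is accepted in place of a path or URL, so you can also decode audio yourself. A minimal sketch, assuming the `soundfile` package and a sampling rate the processor accepts (the `np.random.randn(16000)` example above suggests 16 kHz):
+
+ ```python
+ import soundfile as sf
+
+ # Decode the file to a float waveform and hand the raw array to the
+ # "audio" field of the user message constructed above.
+ waveform, sample_rate = sf.read("/path/to/example.wav")
+ messages[1]["content"][1] = {"type": "audio", "audio": waveform}
+ ```
+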
+ ## Results
+
+ MiDashengLM delivers solid performance across diverse audio understanding tasks.
+
+ ### Audio Captioning Results
+
+ | Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:------:|:-------------:|:-----------:|:---------------:|:-------------------:|
+ | Music | MusicCaps | **59.71** | 43.71 | 35.43 |
+ | Music | Songdescriber | **45.39** | 45.31 | 44.63 |
+ | Sound | AudioCaps | **62.18** | 60.79 | 49.00 |
+ | Sound | ClothoV2 | **49.20** | 47.55 | 48.01 |
+ | Sound | AutoACD | **66.52** | 55.93 | 44.76 |
+
+ *Metric: FENSE (higher is better).*
+
+ ### Audio and Paralinguistic Classification
+
+ | Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:---------------:|:------:|:-----------:|:---------------:|:-------------------:|
+ | VoxCeleb1 | ACC↑ | **92.36** | 59.71 | 82.72 |
+ | VoxLingua107 | ACC↑ | **93.41** | 51.03 | 73.65 |
+ | VoxCeleb-Gender | ACC↑ | 96.12 | **99.82** | 99.69 |
+ | VGGSound | ACC↑ | **52.11** | 0.97 | 2.20 |
+ | Cochlscene | ACC↑ | **74.06** | 23.88 | 18.34 |
+ | NSynth | ACC↑ | **80.52** | 60.45 | 38.09 |
+ | FMA | ACC↑ | 63.73 | **66.77** | 27.91 |
+ | FSDKaggle2018 | ACC↑ | **75.25** | 31.38 | 24.75 |
+ | AudioSet | mAP↑ | **8.86** | 6.48 | 3.47 |
+ | FSD50K | mAP↑ | **37.58** | 23.87 | 27.23 |
+
+ ### ASR Performance
+
+ | Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:----------------------:|:----------:|:-----------:|:---------------:|:-------------------:|
+ | LibriSpeech test-clean | English | 3.7 | 1.7 | **1.3** |
+ | LibriSpeech test-other | English | 6.2 | 3.4 | **2.4** |
+ | People's Speech | English | 27.8 | 28.6 | **22.3** |
+ | AISHELL2 Mic | Chinese | 3.2 | **2.5** | 2.7 |
+ | AISHELL2 iOS | Chinese | 2.9 | **2.6** | **2.6** |
+ | AISHELL2 Android | Chinese | 3.1 | 2.7 | **2.6** |
+ | GigaSpeech2 | Indonesian | **20.8** | 21.2 | >100 |
+ | GigaSpeech2 | Thai | **36.9** | 53.8 | >100 |
+ | GigaSpeech2 | Vietnamese | **18.1** | 18.6 | >100 |
+
+ *Metric: WER/CER (lower is better).*
+
+ ### Question Answering Results
+
+ | Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:------------:|:-------:|:------:|:-----------:|:---------------:|:-------------------:|
+ | MuChoMusic | | ACC↑ | **71.35** | 64.79 | 67.40 |
+ | MMAU | Sound | ACC↑ | 68.47 | 67.87 | **74.17** |
+ | MMAU | Music | ACC↑ | 66.77 | **69.16** | 61.08 |
+ | MMAU | Speech | ACC↑ | **63.66** | 59.76 | 57.66 |
+ | MMAU | Average | ACC↑ | **66.30** | 65.60 | 64.30 |
+ | MusicQA | | FENSE↑ | **62.35** | 60.60 | 40.00 |
+ | AudioCaps-QA | | FENSE↑ | **54.31** | 53.28 | 47.34 |
+
+ *All metrics: higher is better.*
+
+ ### Reproduction Instructions
+
+ To reproduce our results, we provide:
+
+ - Prompts ([prompt.csv](evaluate/prompt.csv))
+ - Evaluation scripts
+ - Example JSONL files
+
+ #### 1. Install Dependencies for Evaluation (not needed for inference)
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ #### 2. Generate Model Outputs
+
+ Generate responses using the model's official framework with prompts from [prompt.csv](evaluate/prompt.csv), e.g. along the lines of the sketch below.
+
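+ A minimal driver might look like the following; the column names (`dataset`, `prompt`) are purely hypothetical assumptions about the layout of `prompt.csv`, so adapt them to the actual file:
+
+ ```python
+ import csv
+
+ # Hypothetical sketch: iterate over the evaluation prompts and run the
+ # chat-template inference from the Usage section once per audio file.
+ with open("evaluate/prompt.csv", newline="") as f:
+     for row in csv.DictReader(f):
+         dataset, prompt = row["dataset"], row["prompt"]  # assumed columns
+         print(f"{dataset}: {prompt}")  # replace with actual batched inference
+ ```
+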
+ #### 3. Convert Outputs to JSONL Format
+
+ Format model outputs using the [example JSONL](evaluate/jsonl) files:
+
+ | Task | Example File |
+ |------|--------------|
+ | Automatic Speech Recognition | [MiDashengLM_LibriSpeech_test-clean.jsonl](evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl) |
+ | Single-target Audio Tagging | [MiDashengLM_NSynth.jsonl](evaluate/jsonl/MiDashengLM_NSynth.jsonl) |
+ | Gender Recognition | [MiDashengLM_VoxCeleb-Gender.jsonl](evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl) |
+ | Multi-target Audio Tagging | [MiDashengLM_FSD50K.jsonl](evaluate/jsonl/MiDashengLM_FSD50K.jsonl) |
+ | Audio Captioning | [MiDashengLM_AutoACD.jsonl](evaluate/jsonl/MiDashengLM_AutoACD.jsonl) |
+ | Open Audio Question Answering | [MiDashengLM_MusicQA.jsonl](evaluate/jsonl/MiDashengLM_MusicQA.jsonl) |
+ | Audio QA with Options | [MiDashengLM_MuChoMusic.jsonl](evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl) |
+
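+ Each evaluation script expects one JSON object per line, with the fields named in the "Uses:" comments of the commands below (e.g. `lang`, `text`, `model_output` for WER). A minimal conversion sketch, assuming your raw outputs are already paired with their references:
+
+ ```python
+ import json
+
+ # Example records for ASR scoring; field names follow the "Uses:" comments
+ # in the evaluation commands below.
+ records = [
+     {"lang": "en", "text": "reference transcript", "model_output": "model transcript"},
+ ]
+
+ with open("MiDashengLM_LibriSpeech_test-clean.jsonl", "w") as f:
+     for rec in records:
+         f.write(json.dumps(rec, ensure_ascii=False) + "\n")
+ ```
+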
+ #### 4. Evaluate Results
+
+ Execute the corresponding evaluation scripts:
+
+ ```bash
+ # Automatic Speech Recognition (WER)
+ # Uses: lang, text, model_output
+ python evaluate/wer/compute_wer.py -i evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl
+
+ # Single-target Audio Tagging (ACC)
+ # Uses: label, model_output
+ python evaluate/compute_at_acc.py -i evaluate/jsonl/MiDashengLM_NSynth.jsonl
+
+ # Gender Recognition (ACC)
+ # Uses: label, model_output
+ python evaluate/compute_gender_acc.py -i evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl
+
+ # Multi-target Audio Tagging (mAP)
+ # Uses: dataset_name, label, model_output, model_name
+ python evaluate/compute_map.py -i evaluate/jsonl/MiDashengLM_FSD50K.jsonl
+
+ # Audio Captioning (FENSE)
+ # Uses: audio, text, model_output
+ python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_AutoACD.jsonl
+
+ # Open Audio QA (FENSE)
+ # Uses: audio, answer, model_output
+ python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_MusicQA.jsonl
+
+ # Audio QA with Options (ACC)
+ # Uses: answer, model_output
+ python evaluate/compute_qa_acc.py -i evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl
+ ```
+
+ #### 5. Evaluate on MECAT and MMAU Benchmarks
+
+ Please refer to the official repositories for evaluation on the [MECAT](https://github.com/xiaomi-research/mecat)
+ and [MMAU](https://github.com/Sakshi113/mmau) benchmarks.
+
+ ## Efficiency
+
+ MiDashengLM demonstrates superior inference efficiency compared to Qwen2.5-Omni-7B,
+ achieving a 3.2× speedup at comparable batch sizes and an overall potential speedup of 20.2× with larger batches.
+
+ <img src="fig/batchsize_1_comparison_7b-1.png" width="800">
+
+ | Batch Size | MiDashengLM (samples/s) | Qwen2.5-Omni-7B (samples/s) | Speedup |
+ |:----------:|:-----------------------:|:---------------------------:|:-------:|
+ | 1 | 0.45 | 0.36 | 1.25× |
+ | 4 | 1.40 | 0.91 | 1.53× |
+ | 8 | 2.72 | 1.15 | 2.36× |
+ | 16 | 5.18 | OOM | - |
+ | 32 | 9.78 | OOM | - |
+ | 64 | 17.07 | OOM | - |
+ | 128 | 22.73 | OOM | - |
+ | 200 | 25.15 | OOM | - |
+
+ *Tested on an 80 GB GPU with 30 s audio inputs and 100-token outputs.*
+
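+ The throughput numbers above come from batched generation. Below is a minimal sketch of how such batching could be driven with the chat-template API from the Usage section; whether the processor accepts a list of conversations, and its padding behavior, are assumptions here, so treat this as illustrative rather than the benchmark script:
+
+ ```python
+ import torch
+
+ def caption_batch(processor, tokenizer, model, audio_paths, prompt="Caption the audio."):
+     """Hypothetical batched captioning: one conversation per audio file."""
+     conversations = [
+         [{"role": "user", "content": [
+             {"type": "text", "text": prompt},
+             {"type": "audio", "path": p},
+         ]}]
+         for p in audio_paths
+     ]
+     inputs = processor.apply_chat_template(
+         conversations,  # assumes a list of conversations is accepted
+         tokenize=True,
+         add_generation_prompt=True,
+         return_dict=True,
+     ).to(device=model.device, dtype=model.dtype)
+     with torch.no_grad():
+         out = model.generate(**inputs)
+     return tokenizer.batch_decode(out, skip_special_tokens=True)
+ ```
+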
+ ## Training Data
+
+ MiDashengLM is trained exclusively on publicly available datasets across five categories: speech; sound and general audio; speech and paralinguistics; music; and question answering. All datasets are listed below with their respective tasks, lengths, and supervised fine-tuning (SFT) usage.
+
+ <img src="fig/pretraining_sampling_rates-1.png" width="1200">
+
+ ### Speech Training Data
+
+ This table lists speech-related datasets used for tasks like automatic speech recognition (ASR), keyword spotting (KWS), and speech-to-text translation (S2TT).
+ The "SFT?" column indicates whether the dataset is used for supervised fine-tuning; the total in that column counts SFT data only.
+
+ | Data | Task | Length (h) | SFT? |
+ |:-----------------------:|:-----------:|:----------:|:----:|
+ | LibriSpeech | ASR | 960 | √ |
+ | LibriHeavy | ASR | 50,000 | X |
+ | GigaSpeech | ASR | 10,000 | √ |
+ | GigaSpeech2 | ASR | 30,000 | √ |
+ | WeNetSpeech | ASR | 10,000 | √ |
+ | Yodas | ASR | 320,000 | X |
+ | CommonVoice-17.0 | ASR | 5,000 | √ |
+ | AISHELL-1 | ASR | 100 | √ |
+ | AISHELL-2 | ASR | 1,000 | √ |
+ | AISHELL-3 | ASR | 70 | √ |
+ | LJSpeech-1.1 | ASR | 37 | X |
+ | LibriTTS | ASR | 585 | X |
+ | MultiLingualSpokenWords | KWS | 5,000 | X |
+ | Emilia | ASR | 101,000 | √ |
+ | CovoST-v2 | S2TT | 2,880 | √ |
+ | Fleurs | S2TT | 1,224 | X |
+ | MSR-86K | ASR, LangID | 86,000 | √ |
+ | ACAV100M-Speech | ASR | 55,754 | X |
+ | Must-C | ASR, S2TT | 1,000 | √ |
+ | MLS | ASR | 50,000 | X |
+ | SpgiSpeech | ASR | 5,000 | X |
+ | PeoplesSpeech | ASR | 30,000 | X |
+ | KeSpeech | ASR | 1,400 | √ |
+ | LAION-300M | Caption | 230,000 | X |
+ | **Total** | | **997,010** | **258,410** |
+
+ ### Sound and General Audio Datasets
+
+ | Dataset | Task | Length (h) | SFT? |
+ |:---------------:|:---------------------:|:----------:|:----:|
+ | FSD50k | Sound Event | 77 | √ |
+ | AudioSet | Sound Event | 5,200 | √ |
+ | AudioSet-strong | Sound Event | 220 | X |
+ | VGGSound | Sound Event | 540 | √ |
+ | FSDKaggle2018 | Sound Event | 20 | √ |
+ | FSDKaggle2019 | Sound Event | 100 | √ |
+ | ARCA23k | Sound Event | 120 | X |
+ | AutoACD | Audio (Sound) Caption | 5,200 | √ |
+ | AudioSetCaps | Audio (Sound) Caption | 6,000 | √ |
+ | SoundVECaps | Audio (Sound) Caption | 5,000 | √ |
+ | WavCaps | Audio (Sound) Caption | 7,567 | √ |
+ | Audiocaps | Audio (Sound) Caption | 100 | √ |
+ | Clothov2 | Audio (Sound) Caption | 17 | √ |
+ | TACOS | Audio (Sound) Caption | 98 | √ |
+ | CochlScene | Soundscape | 500 | √ |
+ | BirdSet | Soundscape | 7,000 | X |
+ | ACAVCaps | General Caption | 38,662 | √ |
+ | **Total** | | **76,421** | **69,081** |
+
+ ### Speech and Paralinguistic Datasets
+
+ | Dataset | Task | Length (h) | SFT? |
+ |:--------------------:|:------------------------------:|:----------:|:----:|
+ | IEMOCAP | Emotion | 8 | √ |
+ | Meld | Emotion | 12 | √ |
+ | SUBESCO | Emotion | 9 | X |
+ | RAVDESS-Speech | Emotion | 2 | X |
+ | RAVDESS-Song | Emotion | 1 | X |
+ | CREMA-D | Emotion | 4 | X |
+ | ESD | Emotion | 29 | X |
+ | VocalSound | Vocal sound classification | 20 | √ |
+ | NonSpeech7k | Vocal sound classification | 3 | √ |
+ | VoxLingua107 | Language identification | 7,200 | √ |
+ | CommonLanguage | Language identification | 45 | √ |
+ | YLACombe | Language identification | 5 | X |
+ | VoxCeleb1 | Speaker verification | 76 | √ |
+ | CNCeleb | Speaker verification & age | 2,100 | √ |
+ | VoxCeleb2 | Speaker verification | 1,000 | √ |
+ | VoxBlink1 | Speaker verification | 1,300 | √ |
+ | VoxBlink2 | Speaker verification | 2,600 | √ |
+ | VoxTube | Language identification | 5,200 | √ |
+ | LibriCount | Speaker counting | 8 | √ |
+ | FluentSpeechCommands | Intent classification & gender | 17 | X |
+ | SpeechOcean762 | Speaker age | 5 | X |
+ | ASVSpoof5 | Spoof detection | 603 | X |
+ | **Total** | | **20,247** | **19,572** |
+
+ ### Music-Related Datasets
+
+ Covers music captioning, genre recognition, instrument classification, and singing style identification.
+
+ | Dataset | Task | Length (h) | SFT? |
+ |:----------------:|:--------------------------------------------:|:----------:|:----:|
+ | MusicCaps | Music Caption | 15 | √ |
+ | Songdescriber | Music Caption | 23 | √ |
+ | LPMusicCaps-MTT | Music Caption | 18 | √ |
+ | LPMusicCaps-MSD | Music Caption | 1,000 | √ |
+ | VocalSet | Singing style identification | 10 | X |
+ | FreeMusicArchive | Genre recognition | 610 | √ |
+ | MTG-Jamendo | Instrument classification, genre recognition | 3,768 | √ |
+ | NSynth | Instrument classification | 360 | √ |
+ | GoodSounds | Instrument classification | 28 | √ |
+ | chMusic | Instrument classification | 1 | √ |
+ | CTIS | Instrument classification | 1 | √ |
+ | **Total** | | **5,824** | **5,814** |
+
+ ### Question Answering Datasets
+
+ Used for training on audio-visual QA, environment QA, and music QA tasks. All of them are used for SFT.
+
+ | Dataset | Task | # QA | SFT? |
+ |:---------:|:--------------:|:----------:|:----:|
+ | AVQA | Environment QA | 36,114 | √ |
+ | ClothoAQA | Environment QA | 6,175 | √ |
+ | TACOS+ | Environment QA | 40,019 | √ |
+ | MusicQA | Music QA | 112,878 | √ |
+ | SIFT-50M | Speech QA | 21,430,000 | √ |
+ | ACAV-QA | General QA | 24,371 | √ |
+
+ ## Citation
+
+ MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.
+
+ If you find MiDashengLM useful in your research, please consider citing our work:
+
+ ```bibtex
+ @techreport{midashenglm7b,
+   title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
+   author      = {{Horizon Team, MiLM Plus}},
+   institution = {Xiaomi Inc.},
+   year        = {2025},
+   note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
+   url         = {https://arxiv.org/abs/2508.03983},
+   eprint      = {2508.03983},
+ }
+ ```
fig/Framework-1.png ADDED

Git LFS Details

  • SHA256: fbc2c4bd9674b69478fa273d092f9b27d26c11c249bbdba47d6a29c12d0849bb
  • Pointer size: 132 Bytes
  • Size of remote file: 3.23 MB
fig/acavcaps-1.png ADDED

Git LFS Details

  • SHA256: c3e4774bdf5fbf78ee972af07f23c9b230a55d20e1292cd2ef56d9497d172651
  • Pointer size: 132 Bytes
  • Size of remote file: 1.85 MB
fig/batchsize_1_comparison_7b-1.png ADDED

Git LFS Details

  • SHA256: 849d0933ac04d314e185cddc8c5a4eb3efc3ca43b79be3a28b746ae2169cdd6f
  • Pointer size: 131 Bytes
  • Size of remote file: 350 kB
fig/capabilities_plot_7b-1.png ADDED

Git LFS Details

  • SHA256: 7134e7a090ee4776e53182db0dd28553028ad86b15282b69c70c09d740f2e9e3
  • Pointer size: 132 Bytes
  • Size of remote file: 1.39 MB
fig/pretraining_sampling_rates-1.png ADDED

Git LFS Details

  • SHA256: 68842e19d89517ac31c7fcafc391b8cc9ded16db21b4a486c7f5da42745d5013
  • Pointer size: 132 Bytes
  • Size of remote file: 1.8 MB