zhoukz committed on
Commit 51de817 · unverified · 1 Parent(s): 8a0522d

Add README
.gitattributes CHANGED
@@ -34,3 +34,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ fig/acavcaps-1.png filter=lfs diff=lfs merge=lfs -text
+ fig/batchsize_1_comparison_7b-1.png filter=lfs diff=lfs merge=lfs -text
+ fig/capabilities_plot_7b-1.png filter=lfs diff=lfs merge=lfs -text
+ fig/Framework-1.png filter=lfs diff=lfs merge=lfs -text
+ fig/pretraining_sampling_rates-1.png filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,190 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ Copyright 2025 Xiaomi Inc., China
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md ADDED
@@ -0,0 +1,464 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ - th
+ - id
+ - vi
+ pipeline_tag: audio-text-to-text
+ tags:
+ - multimodal
+ - audio-language-model
+ - audio
+ base_model:
+ - mispeech/dasheng-0.6B
+ - Qwen/Qwen2.5-Omni-7B
+ base_model_relation: finetune
+ ---
+
+ <div align="center">
+ <h1>
+ MiDashengLM
+ </h1>
+ <b><em>Efficient audio understanding with general audio captions</em></b>
+ <p>
+ </p>
+ <a href="https://arxiv.org/abs/2508.03983"><img src="https://img.shields.io/badge/arXiv-2508.03983-b31b1b" alt="arXiv"></a>
+ <a href="https://github.com/xiaomi-research/dasheng-lm"><img src="https://img.shields.io/badge/Homepage-GitHub-0366d6" alt="homepage"></a>
+ <a href="https://modelscope.cn/models/midasheng/midashenglm-7b"><img src="https://img.shields.io/badge/ModelScope-7B-7448ce" alt="ModelScope"></a>
+ <a href="https://modelscope.cn/studios/midasheng/MiDashengLM-7B"><img src="https://img.shields.io/badge/Demo-Gradio-ffcc66" alt="Gradio demo"></a>
+ <a href="https://xiaomi-research.github.io/dasheng-lm/"><img src="https://img.shields.io/badge/Demo-Page-0366d6" alt="demo page"></a>
+ </div>
+
+ > [!TIP]
+ > This repository contains the **fp8 quantized** weights of the original model, which provide substantial memory savings and higher inference throughput while keeping overall task performance close to the [bf16 release](https://huggingface.co/mispeech/midashenglm-7b-bf16). As quantization introduces numerical approximations, individual outputs may differ slightly from the full-precision model. If you need maximum numerical fidelity (e.g., strict reproduction), use the [fp32 model](https://huggingface.co/mispeech/midashenglm-7b).
+
+ ## 🔥 Key Highlights
+
+ **State-of-the-Art Performance**
+ - Outperforms Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B on **multiple key audio understanding tasks**.
+
+ **High Efficiency**
+ - **3.2×** throughput speedup at comparable batch sizes compared to Qwen2.5-Omni-7B.
+ - **20×** throughput speedup by further increasing the batch size: we tested up to **batch size = 512** for 30 s audio inputs on 80 GB GPUs, whereas the baseline only supports batch size = 8.
+ - Time-to-first-token (TTFT) speedup of up to **4×** compared to Qwen2.5-Omni-7B.
+
+ **Caption-based Alignment**
+ - Trained with **general audio captions** (instead of ASR transcripts) to achieve holistic audio understanding.
+
+ **Full Transparency**
+ - **Publicly available** training data and a reproducible pipeline.
+ - Apache License 2.0 for **both research and commercial use**.
+
+ <div align="center">
+ <img src="fig/capabilities_plot_7b-1.png" width="600">
+ </div>
+
+ ## Acknowledgment and Model Foundation
+
+ Although MiDashengLM demonstrates superior audio understanding performance and efficiency compared to Qwen2.5-Omni models,
+ we acknowledge **Qwen2.5-Omni as a remarkable and respected foundational work** in the field.
+ Our model specifically uses [Qwen2.5-Omni-7B Thinker](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) as the initialization for decoder training, building upon its robust architecture and weight initialization.
+
+ The audio encoder is built upon [Dasheng](https://github.com/XiaoMi/dasheng), an open-source audio encoder for general audio understanding with state-of-the-art performance.
+ **Dasheng serves as the core foundation enabling MiDashengLM's exceptional performance**.
+
+ ## Framework
+
+ MiDashengLM integrates the powerful Dasheng audio encoder with
+ the Qwen2.5-Omni-7B Thinker decoder through a unique caption-based alignment strategy.
+ Unlike conventional ASR-driven approaches,
+ our model leverages general audio captions to capture comprehensive audio representations encompassing speech, environmental sounds, and musical elements
+ in a unified textual format. This design enables holistic audio understanding while maintaining exceptional computational efficiency.
+
+ <img src="fig/Framework-1.png" width="800">
+
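+ To make the data flow concrete, the sketch below shows the high-level forward pass this description implies: audio is encoded by Dasheng, projected into the decoder's embedding space, and consumed by the Thinker decoder alongside text tokens. This is a conceptual illustration only; the module and variable names (`audio_encoder`, `projector`, `decoder`) are placeholders, not the actual implementation.
+
+ ```python
+ import torch
+
+ def forward_sketch(audio_encoder, projector, decoder, waveform, text_ids):
+     # 1. The Dasheng encoder turns raw audio into a sequence of frame embeddings.
+     audio_states = audio_encoder(waveform)        # (batch, frames, enc_dim)
+     # 2. A projection maps encoder features into the decoder embedding space.
+     audio_embeds = projector(audio_states)        # (batch, frames, dec_dim)
+     # 3. The Thinker decoder attends over audio and text jointly and is
+     #    trained to emit a general audio caption.
+     text_embeds = decoder.embed_tokens(text_ids)  # (batch, tokens, dec_dim)
+     inputs = torch.cat([audio_embeds, text_embeds], dim=1)
+     return decoder(inputs_embeds=inputs)          # caption logits
+ ```
+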
+ ### Why Captions Instead of ASR?
+
+ ASR limitations:
+ - Discards a huge amount of non-speech audio (music/environmental sounds).
+ - Misses paralinguistic information (speaker emotion, acoustic properties).
+ - Monotonic alignment provides only a trivial learning signal.
+
+ Caption advantages:
+ - Utilizes all audio content.
+ - Captures global audio context.
+ - Non-monotonic alignment provides a harder, more informative learning signal.
+
+ ### Novel Open Source Dataset for Training: ACAVCaps
+
+ ACAVCaps is a meticulously curated 38,662-hour collection of general audio captions derived from the open-source [ACAV100M audio repository](https://acav100m.github.io/).
+ While leveraging ACAV100M's extensive raw audio materials, we completely re-engineered the annotation process to create a dataset for holistic audio understanding.
+ We divide the dataset into six categories:
+
+ | Category | Example Caption |
+ |----------|-----------------|
+ | Pure Speech | "A female voice narrates historical competition with synthetic modulation" |
+ | Pure Sound | "Outdoor scene with wind, birds, duck quacking and background noise" |
+ | Pure Music | "Crowd cheering with electronic synthesizer-driven soundscape" |
+ | Mixed Music | "The audio features a crowd cheering and clapping alongside electronic music with a synthesizer-driven, dark, and energetic soundscape." |
+ | Mixed Speech | "A Russian voice demonstrates a synthesizer’s capabilities over an experimental electronic backdrop, explaining its sound design and value in a gritty, vocal-fry tone." |
+ | Mixed Sound | "A man speaks in English about entering a city and village, accompanied by the sounds of a running vehicle." |
+
+ The figure below illustrates our data curation pipeline for ACAVCaps:
+
+ <img src="fig/acavcaps-1.png" width="800">
+
+ Each caption is generated through a three-step process:
+
+ 1. **Multi-expert analysis** (speech, vocal, music, acoustics)
+ 2. **LLM reasoning** synthesizing the expert metadata with [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)
+ 3. **Filtering** for audio-text consistency with [Dasheng-GLAP](https://github.com/xiaomi-research/dasheng-glap)
+
+ We will **release the ACAVCaps dataset** after the ICASSP 2026 review process.
+
+ ## Usage
+
+ ### Load Model
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
+
+ model_id = "mispeech/midashenglm-7b-fp8"
+
+ model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+ ```
+
+ ### Construct Prompt
+
+ ```python
+ user_prompt = "Caption the audio."  # You may try any other prompt
+
+ messages = [
+     {
+         "role": "system",
+         "content": [
+             {"type": "text", "text": "You are a helpful language and speech assistant."}
+         ],
+     },
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": user_prompt},
+             {
+                 "type": "audio",
+                 "path": "/path/to/example.wav",
+                 # or "url": "https://example.com/example.wav"
+                 # or "audio": np.random.randn(16000)
+             },
+         ],
+     },
+ ]
+ ```
+
+ ### Generate Output
+
+ ```python
+ import torch
+
+ with torch.no_grad():
+     model_inputs = processor.apply_chat_template(
+         messages,
+         tokenize=True,
+         add_generation_prompt=True,
+         add_special_tokens=True,
+         return_dict=True,
+     ).to(device=model.device, dtype=model.dtype)
+     generation = model.generate(**model_inputs)
+     output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
+ ```
+
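+ The `"audio"` field above shows that an in-memory array is accepted in place of a path or URL, so you can also decode audio yourself. A minimal sketch, assuming the `soundfile` package and a sampling rate the processor accepts (the `np.random.randn(16000)` example above suggests 16 kHz):
+
+ ```python
+ import soundfile as sf
+
+ # Decode the file to a float waveform and hand the raw array to the
+ # "audio" field of the user message constructed above.
+ waveform, sample_rate = sf.read("/path/to/example.wav")
+ messages[1]["content"][1] = {"type": "audio", "audio": waveform}
+ ```
+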
+ ## Results
+
+ MiDashengLM delivers solid performance across diverse audio understanding tasks.
+
+ ### Audio Captioning Results
+
+ | Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:------:|:-------------:|:-----------:|:---------------:|:-------------------:|
+ | Music | MusicCaps | **59.71** | 43.71 | 35.43 |
+ | Music | Songdescriber | **45.39** | 45.31 | 44.63 |
+ | Sound | AudioCaps | **62.18** | 60.79 | 49.00 |
+ | Sound | ClothoV2 | **49.20** | 47.55 | 48.01 |
+ | Sound | AutoACD | **66.52** | 55.93 | 44.76 |
+
+ *Metric: FENSE (higher is better).*
+
+ ### Audio and Paralinguistic Classification
+
+ | Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:---------------:|:------:|:-----------:|:---------------:|:-------------------:|
+ | VoxCeleb1 | ACC↑ | **92.36** | 59.71 | 82.72 |
+ | VoxLingua107 | ACC↑ | **93.41** | 51.03 | 73.65 |
+ | VoxCeleb-Gender | ACC↑ | 96.12 | **99.82** | 99.69 |
+ | VGGSound | ACC↑ | **52.11** | 0.97 | 2.20 |
+ | Cochlscene | ACC↑ | **74.06** | 23.88 | 18.34 |
+ | NSynth | ACC↑ | **80.52** | 60.45 | 38.09 |
+ | FMA | ACC↑ | 63.73 | **66.77** | 27.91 |
+ | FSDKaggle2018 | ACC↑ | **75.25** | 31.38 | 24.75 |
+ | AudioSet | mAP↑ | **8.86** | 6.48 | 3.47 |
+ | FSD50K | mAP↑ | **37.58** | 23.87 | 27.23 |
+
+ ### ASR Performance
+
+ | Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:----------------------:|:----------:|:-----------:|:---------------:|:-------------------:|
+ | LibriSpeech test-clean | English | 3.7 | 1.7 | **1.3** |
+ | LibriSpeech test-other | English | 6.2 | 3.4 | **2.4** |
+ | People's Speech | English | 27.8 | 28.6 | **22.3** |
+ | AISHELL2 Mic | Chinese | 3.2 | **2.5** | 2.7 |
+ | AISHELL2 iOS | Chinese | 2.9 | **2.6** | **2.6** |
+ | AISHELL2 Android | Chinese | 3.1 | 2.7 | **2.6** |
+ | GigaSpeech2 | Indonesian | **20.8** | 21.2 | >100 |
+ | GigaSpeech2 | Thai | **36.9** | 53.8 | >100 |
+ | GigaSpeech2 | Vietnamese | **18.1** | 18.6 | >100 |
+
+ *Metric: WER/CER (lower is better).*
+
+ ### Question Answering Results
+
+ | Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
+ |:------------:|:-------:|:------:|:-----------:|:---------------:|:-------------------:|
+ | MuChoMusic | | ACC↑ | **71.35** | 64.79 | 67.40 |
+ | MMAU | Sound | ACC↑ | 68.47 | 67.87 | **74.17** |
+ | MMAU | Music | ACC↑ | 66.77 | **69.16** | 61.08 |
+ | MMAU | Speech | ACC↑ | **63.66** | 59.76 | 57.66 |
+ | MMAU | Average | ACC↑ | **66.30** | 65.60 | 64.30 |
+ | MusicQA | | FENSE↑ | **62.35** | 60.60 | 40.00 |
+ | AudioCaps-QA | | FENSE↑ | **54.31** | 53.28 | 47.34 |
+
+ *All metrics: higher is better.*
+
+ ### Reproduction Instructions
+
+ To reproduce our results, we provide:
+
+ - Prompts ([prompt.csv](evaluate/prompt.csv))
+ - Evaluation scripts
+ - Example JSONL files
+
+ #### 1. Install Dependencies for Evaluation (not needed for inference)
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ #### 2. Generate Model Outputs
+
+ Generate responses using the model's official framework with prompts from [prompt.csv](evaluate/prompt.csv), e.g. along the lines of the sketch below.
+
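+ A minimal driver might look like the following; the column names (`dataset`, `prompt`) are purely hypothetical assumptions about the layout of `prompt.csv`, so adapt them to the actual file:
+
+ ```python
+ import csv
+
+ # Hypothetical sketch: iterate over the evaluation prompts and run the
+ # chat-template inference from the Usage section once per audio file.
+ with open("evaluate/prompt.csv", newline="") as f:
+     for row in csv.DictReader(f):
+         dataset, prompt = row["dataset"], row["prompt"]  # assumed columns
+         print(f"{dataset}: {prompt}")  # replace with actual batched inference
+ ```
+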
+ #### 3. Convert Outputs to JSONL Format
+
+ Format model outputs using the [example JSONL](evaluate/jsonl) files:
+
+ | Task | Example File |
+ |------|--------------|
+ | Automatic Speech Recognition | [MiDashengLM_LibriSpeech_test-clean.jsonl](evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl) |
+ | Single-target Audio Tagging | [MiDashengLM_NSynth.jsonl](evaluate/jsonl/MiDashengLM_NSynth.jsonl) |
+ | Gender Recognition | [MiDashengLM_VoxCeleb-Gender.jsonl](evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl) |
+ | Multi-target Audio Tagging | [MiDashengLM_FSD50K.jsonl](evaluate/jsonl/MiDashengLM_FSD50K.jsonl) |
+ | Audio Captioning | [MiDashengLM_AutoACD.jsonl](evaluate/jsonl/MiDashengLM_AutoACD.jsonl) |
+ | Open Audio Question Answering | [MiDashengLM_MusicQA.jsonl](evaluate/jsonl/MiDashengLM_MusicQA.jsonl) |
+ | Audio QA with Options | [MiDashengLM_MuChoMusic.jsonl](evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl) |
+
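+ Each evaluation script expects one JSON object per line, with the fields named in the "Uses:" comments of the commands below (e.g. `lang`, `text`, `model_output` for WER). A minimal conversion sketch, assuming your raw outputs are already paired with their references:
+
+ ```python
+ import json
+
+ # Example records for ASR scoring; field names follow the "Uses:" comments
+ # in the evaluation commands below.
+ records = [
+     {"lang": "en", "text": "reference transcript", "model_output": "model transcript"},
+ ]
+
+ with open("MiDashengLM_LibriSpeech_test-clean.jsonl", "w") as f:
+     for rec in records:
+         f.write(json.dumps(rec, ensure_ascii=False) + "\n")
+ ```
+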
+ #### 4. Evaluate Results
+
+ Execute the corresponding evaluation scripts:
+
+ ```bash
+ # Automatic Speech Recognition (WER)
+ # Uses: lang, text, model_output
+ python evaluate/wer/compute_wer.py -i evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl
+
+ # Single-target Audio Tagging (ACC)
+ # Uses: label, model_output
+ python evaluate/compute_at_acc.py -i evaluate/jsonl/MiDashengLM_NSynth.jsonl
+
+ # Gender Recognition (ACC)
+ # Uses: label, model_output
+ python evaluate/compute_gender_acc.py -i evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl
+
+ # Multi-target Audio Tagging (mAP)
+ # Uses: dataset_name, label, model_output, model_name
+ python evaluate/compute_map.py -i evaluate/jsonl/MiDashengLM_FSD50K.jsonl
+
+ # Audio Captioning (FENSE)
+ # Uses: audio, text, model_output
+ python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_AutoACD.jsonl
+
+ # Open Audio QA (FENSE)
+ # Uses: audio, answer, model_output
+ python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_MusicQA.jsonl
+
+ # Audio QA with Options (ACC)
+ # Uses: answer, model_output
+ python evaluate/compute_qa_acc.py -i evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl
+ ```
+
+ #### 5. Evaluate on MECAT and MMAU Benchmarks
+
+ Please refer to the official repositories for evaluation on the [MECAT](https://github.com/xiaomi-research/mecat)
+ and [MMAU](https://github.com/Sakshi113/mmau) benchmarks.
+
+ ## Efficiency
+
+ MiDashengLM demonstrates superior inference efficiency compared to Qwen2.5-Omni-7B,
+ achieving a 3.2× speedup at comparable batch sizes and an overall potential speedup of 20.2× with larger batches.
+
+ <img src="fig/batchsize_1_comparison_7b-1.png" width="800">
+
+ | Batch Size | MiDashengLM (samples/s) | Qwen2.5-Omni-7B (samples/s) | Speedup |
+ |:----------:|:-----------------------:|:---------------------------:|:-------:|
+ | 1 | 0.45 | 0.36 | 1.25× |
+ | 4 | 1.40 | 0.91 | 1.53× |
+ | 8 | 2.72 | 1.15 | 2.36× |
+ | 16 | 5.18 | OOM | - |
+ | 32 | 9.78 | OOM | - |
+ | 64 | 17.07 | OOM | - |
+ | 128 | 22.73 | OOM | - |
+ | 200 | 25.15 | OOM | - |
+
+ *Tested on an 80 GB GPU with 30 s audio inputs and 100-token outputs.*
+
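+ The throughput numbers above come from batched generation. Below is a minimal sketch of how such batching could be driven with the chat-template API from the Usage section; whether the processor accepts a list of conversations, and its padding behavior, are assumptions here, so treat this as illustrative rather than the benchmark script:
+
+ ```python
+ import torch
+
+ def caption_batch(processor, tokenizer, model, audio_paths, prompt="Caption the audio."):
+     """Hypothetical batched captioning: one conversation per audio file."""
+     conversations = [
+         [{"role": "user", "content": [
+             {"type": "text", "text": prompt},
+             {"type": "audio", "path": p},
+         ]}]
+         for p in audio_paths
+     ]
+     inputs = processor.apply_chat_template(
+         conversations,  # assumes a list of conversations is accepted
+         tokenize=True,
+         add_generation_prompt=True,
+         return_dict=True,
+     ).to(device=model.device, dtype=model.dtype)
+     with torch.no_grad():
+         out = model.generate(**inputs)
+     return tokenizer.batch_decode(out, skip_special_tokens=True)
+ ```
+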
+ ## Training Data
+
+ MiDashengLM is trained exclusively on publicly available datasets across five categories: speech; sound and general audio; speech and paralinguistics; music; and question answering. All datasets are listed below with their respective tasks, lengths, and supervised fine-tuning (SFT) usage.
+
+ <img src="fig/pretraining_sampling_rates-1.png" width="1200">
+
+ ### Speech Training Data
+
+ This table lists speech-related datasets used for tasks like automatic speech recognition (ASR), keyword spotting (KWS), and speech-to-text translation (S2TT).
+ The "SFT?" column indicates whether the dataset is used for supervised fine-tuning; the total in that column counts SFT data only.
+
+ | Data | Task | Length (h) | SFT? |
+ |:-----------------------:|:-----------:|:----------:|:----:|
+ | LibriSpeech | ASR | 960 | √ |
+ | LibriHeavy | ASR | 50,000 | X |
+ | GigaSpeech | ASR | 10,000 | √ |
+ | GigaSpeech2 | ASR | 30,000 | √ |
+ | WeNetSpeech | ASR | 10,000 | √ |
+ | Yodas | ASR | 320,000 | X |
+ | CommonVoice-17.0 | ASR | 5,000 | √ |
+ | AISHELL-1 | ASR | 100 | √ |
+ | AISHELL-2 | ASR | 1,000 | √ |
+ | AISHELL-3 | ASR | 70 | √ |
+ | LJSpeech-1.1 | ASR | 37 | X |
+ | LibriTTS | ASR | 585 | X |
+ | MultiLingualSpokenWords | KWS | 5,000 | X |
+ | Emilia | ASR | 101,000 | √ |
+ | CovoST-v2 | S2TT | 2,880 | √ |
+ | Fleurs | S2TT | 1,224 | X |
+ | MSR-86K | ASR, LangID | 86,000 | √ |
+ | ACAV100M-Speech | ASR | 55,754 | X |
+ | Must-C | ASR, S2TT | 1,000 | √ |
+ | MLS | ASR | 50,000 | X |
+ | SpgiSpeech | ASR | 5,000 | X |
+ | PeoplesSpeech | ASR | 30,000 | X |
+ | KeSpeech | ASR | 1,400 | √ |
+ | LAION-300M | Caption | 230,000 | X |
+ | **Total** | | **997,010** | **258,410** |
+
+ ### Sound and General Audio Datasets
+
+ | Dataset | Task | Length (h) | SFT? |
+ |:---------------:|:---------------------:|:----------:|:----:|
+ | FSD50k | Sound Event | 77 | √ |
+ | AudioSet | Sound Event | 5,200 | √ |
+ | AudioSet-strong | Sound Event | 220 | X |
+ | VGGSound | Sound Event | 540 | √ |
+ | FSDKaggle2018 | Sound Event | 20 | √ |
+ | FSDKaggle2019 | Sound Event | 100 | √ |
+ | ARCA23k | Sound Event | 120 | X |
+ | AutoACD | Audio (Sound) Caption | 5,200 | √ |
+ | AudioSetCaps | Audio (Sound) Caption | 6,000 | √ |
+ | SoundVECaps | Audio (Sound) Caption | 5,000 | √ |
+ | WavCaps | Audio (Sound) Caption | 7,567 | √ |
+ | Audiocaps | Audio (Sound) Caption | 100 | √ |
+ | Clothov2 | Audio (Sound) Caption | 17 | √ |
+ | TACOS | Audio (Sound) Caption | 98 | √ |
+ | CochlScene | Soundscape | 500 | √ |
+ | BirdSet | Soundscape | 7,000 | X |
+ | ACAVCaps | General Caption | 38,662 | √ |
+ | **Total** | | **76,421** | **69,081** |
+
+ ### Speech and Paralinguistic Datasets
+
+ | Dataset | Task | Length (h) | SFT? |
+ |:--------------------:|:------------------------------:|:----------:|:----:|
+ | IEMOCAP | Emotion | 8 | √ |
+ | Meld | Emotion | 12 | √ |
+ | SUBESCO | Emotion | 9 | X |
+ | RAVDESS-Speech | Emotion | 2 | X |
+ | RAVDESS-Song | Emotion | 1 | X |
+ | CREMA-D | Emotion | 4 | X |
+ | ESD | Emotion | 29 | X |
+ | VocalSound | Vocal sound classification | 20 | √ |
+ | NonSpeech7k | Vocal sound classification | 3 | √ |
+ | VoxLingua107 | Language identification | 7,200 | √ |
+ | CommonLanguage | Language identification | 45 | √ |
+ | YLACombe | Language identification | 5 | X |
+ | VoxCeleb1 | Speaker verification | 76 | √ |
+ | CNCeleb | Speaker verification & age | 2,100 | √ |
+ | VoxCeleb2 | Speaker verification | 1,000 | √ |
+ | VoxBlink1 | Speaker verification | 1,300 | √ |
+ | VoxBlink2 | Speaker verification | 2,600 | √ |
+ | VoxTube | Language identification | 5,200 | √ |
+ | LibriCount | Speaker counting | 8 | √ |
+ | FluentSpeechCommands | Intent classification & gender | 17 | X |
+ | SpeechOcean762 | Speaker age | 5 | X |
+ | ASVSpoof5 | Spoof detection | 603 | X |
+ | **Total** | | **20,247** | **19,572** |
+
+ ### Music-Related Datasets
+
+ Covers music captioning, genre recognition, instrument classification, and singing style identification.
+
+ | Dataset | Task | Length (h) | SFT? |
+ |:----------------:|:--------------------------------------------:|:----------:|:----:|
+ | MusicCaps | Music Caption | 15 | √ |
+ | Songdescriber | Music Caption | 23 | √ |
+ | LPMusicCaps-MTT | Music Caption | 18 | √ |
+ | LPMusicCaps-MSD | Music Caption | 1,000 | √ |
+ | VocalSet | Singing style identification | 10 | X |
+ | FreeMusicArchive | Genre recognition | 610 | √ |
+ | MTG-Jamendo | Instrument classification, genre recognition | 3,768 | √ |
+ | NSynth | Instrument classification | 360 | √ |
+ | GoodSounds | Instrument classification | 28 | √ |
+ | chMusic | Instrument classification | 1 | √ |
+ | CTIS | Instrument classification | 1 | √ |
+ | **Total** | | **5,824** | **5,814** |
+
+ ### Question Answering Datasets
+
+ Used for training on audio-visual QA, environment QA, and music QA tasks. All of them are used for SFT.
+
+ | Dataset | Task | # QA | SFT? |
+ |:---------:|:--------------:|:----------:|:----:|
+ | AVQA | Environment QA | 36,114 | √ |
+ | ClothoAQA | Environment QA | 6,175 | √ |
+ | TACOS+ | Environment QA | 40,019 | √ |
+ | MusicQA | Music QA | 112,878 | √ |
+ | SIFT-50M | Speech QA | 21,430,000 | √ |
+ | ACAV-QA | General QA | 24,371 | √ |
+
+ ## Citation
+
+ MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and business applications**.
+
+ If you find MiDashengLM useful in your research, please consider citing our work:
+
+ ```bibtex
+ @techreport{midashenglm7b,
+   title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
+   author      = {{Horizon Team, MiLM Plus}},
+   institution = {Xiaomi Inc.},
+   year        = {2025},
+   note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
+   url         = {https://arxiv.org/abs/2508.03983},
+   eprint      = {2508.03983},
+ }
+ ```
fig/Framework-1.png ADDED

Git LFS Details

  • SHA256: fbc2c4bd9674b69478fa273d092f9b27d26c11c249bbdba47d6a29c12d0849bb
  • Pointer size: 132 Bytes
  • Size of remote file: 3.23 MB
fig/acavcaps-1.png ADDED

Git LFS Details

  • SHA256: c3e4774bdf5fbf78ee972af07f23c9b230a55d20e1292cd2ef56d9497d172651
  • Pointer size: 132 Bytes
  • Size of remote file: 1.85 MB
fig/batchsize_1_comparison_7b-1.png ADDED

Git LFS Details

  • SHA256: 849d0933ac04d314e185cddc8c5a4eb3efc3ca43b79be3a28b746ae2169cdd6f
  • Pointer size: 131 Bytes
  • Size of remote file: 350 kB
fig/capabilities_plot_7b-1.png ADDED

Git LFS Details

  • SHA256: 7134e7a090ee4776e53182db0dd28553028ad86b15282b69c70c09d740f2e9e3
  • Pointer size: 132 Bytes
  • Size of remote file: 1.39 MB
fig/pretraining_sampling_rates-1.png ADDED

Git LFS Details

  • SHA256: 68842e19d89517ac31c7fcafc391b8cc9ded16db21b4a486c7f5da42745d5013
  • Pointer size: 132 Bytes
  • Size of remote file: 1.8 MB