---
license: other
license_name: qwen-research
license_link: LICENSE
language:
- en
tags:
- multimodal
library_name: transformers
pipeline_tag: any-to-any
---

# nexaml/Qwen2.5-Omni-3B-GGUF

## Quickstart

Run the quantized models directly with [nexa-sdk](https://github.com/NexaAI/nexa-sdk). With nexa-sdk installed, reference the model in the CLI:

```bash
nexaml/Qwen2.5-Omni-3B-GGUF
```

### Available Quantizations

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| | | | | |

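Quantized GGUF files trade file size for output quality. As a rough rule of thumb, file size is about `parameters × bits-per-weight / 8`. The sketch below estimates sizes from that rule; the bits-per-weight figures for the llama.cpp quant types are ballpark assumptions for illustration, not measurements of the files in this repo:

```python
# Rough GGUF size estimate: bytes ≈ n_params * bits_per_weight / 8.
# The bits-per-weight values below are approximate averages (assumed).
APPROX_BPW = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 3.4,
}

def estimate_gguf_size_gib(n_params: float, quant: str) -> float:
    """Estimate a GGUF file size in GiB for a given quant type."""
    return n_params * APPROX_BPW[quant] / 8 / 2**30

# A ~3.4B-parameter model at Q4_K_M lands somewhere near 2 GiB.
q4_size = estimate_gguf_size_gib(3.4e9, "Q4_K_M")
```

Actual sizes vary because some tensors (e.g. embeddings and output layers) are kept at higher precision, so treat the estimate as a lower-order approximation.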
## Overview

### Introduction

Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/qwen_omni.png" width="80%"/>
</p>

### Key Features

* **Omni and Novel Architecture**: We propose the Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We also propose a novel position embedding, TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.

* **Real-Time Voice and Video Chat**: The architecture is designed for fully real-time interaction, supporting chunked input and immediate output.

* **Natural and Robust Speech Generation**: Qwen2.5-Omni surpasses many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation.

* **Strong Performance Across Modalities**: Benchmarked against similarly sized single-modality models, Qwen2.5-Omni exhibits exceptional performance across all modalities. It outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves performance comparable to Qwen2.5-VL-7B.

* **Excellent End-to-End Speech Instruction Following**: Qwen2.5-Omni's performance on end-to-end speech instruction following rivals its performance on text inputs, as evidenced by benchmarks such as MMLU and GSM8K.

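The time-alignment idea behind TMRoPE can be illustrated with a toy sketch. This is only a conceptual illustration, not the actual TMRoPE implementation; the function name and tick size are made up. The point it shows: audio and video tokens are mapped onto one shared temporal axis, so tokens that occur at the same timestamp receive the same temporal position id and the two streams can be interleaved in time order.

```python
# Conceptual sketch only (not the real TMRoPE): derive a shared temporal
# position id for every audio and video token from its timestamp, so that
# tokens from both streams line up on one time axis.
def time_aligned_positions(audio_ts, video_ts, tick=0.04):
    """Map per-token timestamps (seconds) to shared integer position ids.

    Returns (modality, token_index, position_id) tuples in time order.
    """
    tagged = [(t, "audio", i) for i, t in enumerate(audio_ts)]
    tagged += [(t, "video", i) for i, t in enumerate(video_ts)]
    tagged.sort()  # interleave the two streams by timestamp
    return [(mod, idx, round(t / tick)) for t, mod, idx in tagged]

# Audio tokens every 40 ms, video frames every 80 ms: tokens at the same
# timestamp end up with the same position id.
pos = time_aligned_positions([0.0, 0.04, 0.08], [0.0, 0.08])
```

In the real model the temporal component is one axis of a multimodal rotary embedding rather than a bare integer id, but the synchronization principle is the same.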
### Model Architecture

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/overview.png" width="80%"/>
</p>

### Performance

We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models (such as Qwen2.5-VL-7B and Qwen2-Audio) and to closed-source models like Gemini-1.5-Pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. In single-modality tasks, it excels in speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/bar.png" width="80%"/>
</p>

<details>
<summary>Multimodality -> Text</summary>

<table class="tg"><thead>
<tr>
<th class="tg-0lax">Datasets</th>
<th class="tg-0lax">Model</th>
<th class="tg-0lax">Performance</th>
</tr></thead>
<tbody>
<tr>
<td class="tg-0lax" rowspan="10">OmniBench<br>Speech | Sound Event | Music | Avg</td>
<td class="tg-0lax">Gemini-1.5-Pro</td>
<td class="tg-0lax">42.67%|42.26%|46.23%|42.91%</td>
</tr>
<tr>
<td class="tg-0lax">MIO-Instruct</td>
<td class="tg-0lax">36.96%|33.58%|11.32%|33.80%</td>
</tr>
<tr>
<td class="tg-0lax">AnyGPT (7B)</td>
<td class="tg-0lax">17.77%|20.75%|13.21%|18.04%</td>
</tr>
<tr>
<td class="tg-0lax">video-SALMONN</td>
<td class="tg-0lax">34.11%|31.70%|<strong>56.60%</strong>|35.64%</td>
</tr>
<tr>
<td class="tg-0lax">UnifiedIO2-xlarge</td>
<td class="tg-0lax">39.56%|36.98%|29.25%|38.00%</td>
</tr>
<tr>
<td class="tg-0lax">UnifiedIO2-xxlarge</td>
<td class="tg-0lax">34.24%|36.98%|24.53%|33.98%</td>
</tr>
<tr>
<td class="tg-0lax">MiniCPM-o</td>
<td class="tg-0lax">-|-|-|40.50%</td>
</tr>
<tr>
<td class="tg-0lax">Baichuan-Omni-1.5</td>
<td class="tg-0lax">-|-|-|42.90%</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">52.14%|52.08%|52.83%|52.19%</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>55.25%</strong>|<strong>60.00%</strong>|52.83%|<strong>56.13%</strong></td>
</tr>
</tbody></table>
</details>

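The Performance cells in these tables pack several sub-scores into a single `|`-separated string, with `-` marking a score that was not reported. A small helper (hypothetical, purely for convenience when post-processing this card) to split such a cell:

```python
def parse_scores(cell: str) -> list:
    """Split a benchmark cell like '55.25%|60.00%|52.83%|56.13%' into floats.

    '-' entries (scores not reported) become None.
    """
    scores = []
    for part in cell.split("|"):
        part = part.strip().rstrip("%")
        scores.append(None if part in ("", "-") else float(part))
    return scores

# The Qwen2.5-Omni-7B OmniBench row from the table above:
omni_7b = parse_scores("55.25%|60.00%|52.83%|56.13%")
```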
<details>
<summary>Audio -> Text</summary>

<table class="tg"><thead>
<tr>
<th class="tg-0lax">Datasets</th>
<th class="tg-0lax">Model</th>
<th class="tg-0lax">Performance</th>
</tr></thead>
<tbody>
<tr>
<td class="tg-9j4x" colspan="3">ASR</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="12">Librispeech<br>dev-clean | dev-other | test-clean | test-other</td>
<td class="tg-0lax">SALMONN</td>
<td class="tg-0lax">-|-|2.1|4.9</td>
</tr>
<tr>
<td class="tg-0lax">SpeechVerse</td>
<td class="tg-0lax">-|-|2.1|4.4</td>
</tr>
<tr>
<td class="tg-0lax">Whisper-large-v3</td>
<td class="tg-0lax">-|-|1.8|3.6</td>
</tr>
<tr>
<td class="tg-0lax">Llama-3-8B</td>
<td class="tg-0lax">-|-|-|3.4</td>
</tr>
<tr>
<td class="tg-0lax">Llama-3-70B</td>
<td class="tg-0lax">-|-|-|3.1</td>
</tr>
<tr>
<td class="tg-0lax">Seed-ASR-Multilingual</td>
<td class="tg-0lax">-|-|<strong>1.6</strong>|<strong>2.8</strong></td>
</tr>
<tr>
<td class="tg-0lax">MiniCPM-o</td>
<td class="tg-0lax">-|-|1.7|-</td>
</tr>
<tr>
<td class="tg-0lax">MinMo</td>
<td class="tg-0lax">-|-|1.7|3.9</td>
</tr>
<tr>
<td class="tg-0lax">Qwen-Audio</td>
<td class="tg-0lax">1.8|4.0|2.0|4.2</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax"><strong>1.3</strong>|<strong>3.4</strong>|<strong>1.6</strong>|3.6</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">2.0|4.1|2.2|4.5</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax">1.6|3.5|1.8|3.4</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="5">Common Voice 15<br>en | zh | yue | fr</td>
<td class="tg-0lax">Whisper-large-v3</td>
<td class="tg-0lax">9.3|12.8|10.9|10.8</td>
</tr>
<tr>
<td class="tg-0lax">MinMo</td>
<td class="tg-0lax">7.9|6.3|6.4|8.5</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">8.6|6.9|<strong>5.9</strong>|9.6</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">9.1|6.0|11.6|9.6</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>7.6</strong>|<strong>5.2</strong>|7.3|<strong>7.5</strong></td>
</tr>
<tr>
<td class="tg-0lax" rowspan="8">Fleurs<br>zh | en</td>
<td class="tg-0lax">Whisper-large-v3</td>
<td class="tg-0lax">7.7|4.1</td>
</tr>
<tr>
<td class="tg-0lax">Seed-ASR-Multilingual</td>
<td class="tg-0lax">-|<strong>3.4</strong></td>
</tr>
<tr>
<td class="tg-0lax">Megrez-3B-Omni</td>
<td class="tg-0lax">10.8|-</td>
</tr>
<tr>
<td class="tg-0lax">MiniCPM-o</td>
<td class="tg-0lax">4.4|-</td>
</tr>
<tr>
<td class="tg-0lax">MinMo</td>
<td class="tg-0lax">3.0|3.8</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">7.5|-</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">3.2|5.4</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>3.0</strong>|4.1</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="6">Wenetspeech<br>test-net | test-meeting</td>
<td class="tg-0lax">Seed-ASR-Chinese</td>
<td class="tg-0lax"><strong>4.7|5.7</strong></td>
</tr>
<tr>
<td class="tg-0lax">Megrez-3B-Omni</td>
<td class="tg-0lax">-|16.4</td>
</tr>
<tr>
<td class="tg-0lax">MiniCPM-o</td>
<td class="tg-0lax">6.9|-</td>
</tr>
<tr>
<td class="tg-0lax">MinMo</td>
<td class="tg-0lax">6.8|7.4</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">6.3|8.1</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax">5.9|7.7</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="4">Voxpopuli-V1.0-en</td>
<td class="tg-0lax">Llama-3-8B</td>
<td class="tg-0lax">6.2</td>
</tr>
<tr>
<td class="tg-0lax">Llama-3-70B</td>
<td class="tg-0lax"><strong>5.7</strong></td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">6.6</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax">5.8</td>
</tr>
<tr>
<td class="tg-9j4x" colspan="3">S2TT</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="9">CoVoST2<br>en-de | de-en | en-zh | zh-en</td>
<td class="tg-0lax">SALMONN</td>
<td class="tg-0lax">18.6|-|33.1|-</td>
</tr>
<tr>
<td class="tg-0lax">SpeechLLaMA</td>
<td class="tg-0lax">-|27.1|-|12.3</td>
</tr>
<tr>
<td class="tg-0lax">BLSP</td>
<td class="tg-0lax">14.1|-|-|-</td>
</tr>
<tr>
<td class="tg-0lax">MiniCPM-o</td>
<td class="tg-0lax">-|-|<strong>48.2</strong>|27.2</td>
</tr>
<tr>
<td class="tg-0lax">MinMo</td>
<td class="tg-0lax">-|<strong>39.9</strong>|46.7|26.0</td>
</tr>
<tr>
<td class="tg-0lax">Qwen-Audio</td>
<td class="tg-0lax">25.1|33.9|41.5|15.7</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">29.9|35.2|45.2|24.4</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">28.3|38.1|41.4|26.6</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>30.2</strong>|37.7|41.4|<strong>29.4</strong></td>
</tr>
<tr>
<td class="tg-9j4x" colspan="3">SER</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="6">Meld</td>
<td class="tg-0lax">WavLM-large</td>
<td class="tg-0lax">0.542</td>
</tr>
<tr>
<td class="tg-0lax">MiniCPM-o</td>
<td class="tg-0lax">0.524</td>
</tr>
<tr>
<td class="tg-0lax">Qwen-Audio</td>
<td class="tg-0lax">0.557</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">0.553</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">0.558</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>0.570</strong></td>
</tr>
<tr>
<td class="tg-9j4x" colspan="3">VSC</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="6">VocalSound</td>
<td class="tg-0lax">CLAP</td>
<td class="tg-0lax">0.495</td>
</tr>
<tr>
<td class="tg-0lax">Pengi</td>
<td class="tg-0lax">0.604</td>
</tr>
<tr>
<td class="tg-0lax">Qwen-Audio</td>
<td class="tg-0lax">0.929</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax"><strong>0.939</strong></td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">0.936</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>0.939</strong></td>
</tr>
<tr>
<td class="tg-9j4x" colspan="3">Music</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="3">GiantSteps Tempo</td>
<td class="tg-0lax">Llark-7B</td>
<td class="tg-0lax">0.86</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax"><strong>0.88</strong></td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>0.88</strong></td>
</tr>
<tr>
<td class="tg-0lax" rowspan="3">MusicCaps</td>
<td class="tg-0lax">LP-MusicCaps</td>
<td class="tg-0lax">0.291|0.149|0.089|<strong>0.061</strong>|0.129|0.130</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">0.325|<strong>0.163</strong>|<strong>0.093</strong>|0.057|<strong>0.132</strong>|<strong>0.229</strong></td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>0.328</strong>|0.162|0.090|0.055|0.127|0.225</td>
</tr>
<tr>
<td class="tg-9j4x" colspan="3">Audio Reasoning</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="4">MMAU<br>Sound | Music | Speech | Avg</td>
<td class="tg-0lax">Gemini-Pro-V1.5</td>
<td class="tg-0lax">56.75|49.40|58.55|54.90</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">54.95|50.98|42.04|49.20</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax"><strong>70.27</strong>|60.48|59.16|63.30</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax">67.87|<strong>69.16|59.76|65.60</strong></td>
</tr>
<tr>
<td class="tg-9j4x" colspan="3">Voice Chatting</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="9">VoiceBench<br>AlpacaEval | CommonEval | SD-QA | MMSU</td>
<td class="tg-0lax">Ultravox-v0.4.1-LLaMA-3.1-8B</td>
<td class="tg-0lax"><strong>4.55</strong>|3.90|53.35|47.17</td>
</tr>
<tr>
<td class="tg-0lax">MERaLiON</td>
<td class="tg-0lax">4.50|3.77|55.06|34.95</td>
</tr>
<tr>
<td class="tg-0lax">Megrez-3B-Omni</td>
<td class="tg-0lax">3.50|2.95|25.95|27.03</td>
</tr>
<tr>
<td class="tg-0lax">Lyra-Base</td>
<td class="tg-0lax">3.85|3.50|38.25|49.74</td>
</tr>
<tr>
<td class="tg-0lax">MiniCPM-o</td>
<td class="tg-0lax">4.42|<strong>4.15</strong>|50.72|54.78</td>
</tr>
<tr>
<td class="tg-0lax">Baichuan-Omni-1.5</td>
<td class="tg-0lax">4.50|4.05|43.40|57.25</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">3.74|3.43|35.71|35.72</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">4.32|4.00|49.37|50.23</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax">4.49|3.93|<strong>55.71</strong>|<strong>61.32</strong></td>
</tr>
<tr>
<td class="tg-0lax" rowspan="9">VoiceBench<br>OpenBookQA | IFEval | AdvBench | Avg</td>
<td class="tg-0lax">Ultravox-v0.4.1-LLaMA-3.1-8B</td>
<td class="tg-0lax">65.27|<strong>66.88</strong>|98.46|71.45</td>
</tr>
<tr>
<td class="tg-0lax">MERaLiON</td>
<td class="tg-0lax">27.23|62.93|94.81|62.91</td>
</tr>
<tr>
<td class="tg-0lax">Megrez-3B-Omni</td>
<td class="tg-0lax">28.35|25.71|87.69|46.25</td>
</tr>
<tr>
<td class="tg-0lax">Lyra-Base</td>
<td class="tg-0lax">72.75|36.28|59.62|57.66</td>
</tr>
<tr>
<td class="tg-0lax">MiniCPM-o</td>
<td class="tg-0lax">78.02|49.25|97.69|71.69</td>
</tr>
<tr>
<td class="tg-0lax">Baichuan-Omni-1.5</td>
<td class="tg-0lax">74.51|54.54|97.31|71.14</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">49.45|26.33|96.73|55.35</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">74.73|42.10|98.85|68.81</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>81.10</strong>|52.87|<strong>99.42</strong>|<strong>74.12</strong></td>
</tr>
</tbody></table>
</details>
<details>
<summary>Image -> Text</summary>

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|--------------------------------|--------------|------------|------------|---------------|-------------|
| MMMU<sub>val</sub> | 59.2 | 53.1 | 53.9 | 58.6 | **60.0** |
| MMMU-Pro<sub>overall</sub> | 36.6 | 29.7 | - | **38.3** | 37.6 |
| MathVista<sub>testmini</sub> | 67.9 | 59.4 | **71.9** | 68.2 | 52.5 |
| MathVision<sub>full</sub> | 25.0 | 20.8 | 23.1 | **25.1** | - |
| MMBench-V1.1-EN<sub>test</sub> | 81.8 | 77.8 | 80.5 | **82.6** | 76.0 |
| MMVet<sub>turbo</sub> | 66.8 | 62.1 | **67.5** | 67.1 | 66.9 |
| MMStar | **64.0** | 55.7 | **64.0** | 63.9 | 54.8 |
| MME<sub>sum</sub> | 2340 | 2117 | **2372** | 2347 | 2003 |
| MuirBench | 59.2 | 48.0 | - | **59.2** | - |
| CRPE<sub>relation</sub> | **76.5** | 73.7 | - | 76.4 | - |
| RealWorldQA<sub>avg</sub> | 70.3 | 62.6 | **71.9** | 68.5 | - |
| MME-RealWorld<sub>en</sub> | **61.6** | 55.6 | - | 57.4 | - |
| MM-MT-Bench | 6.0 | 5.0 | - | **6.3** | - |
| AI2D | 83.2 | 79.5 | **85.8** | 83.9 | - |
| TextVQA<sub>val</sub> | 84.4 | 79.8 | 83.2 | **84.9** | - |
| DocVQA<sub>test</sub> | 95.2 | 93.3 | 93.5 | **95.7** | - |
| ChartQA<sub>test Avg</sub> | 85.3 | 82.8 | 84.9 | **87.3** | - |
| OCRBench_V2<sub>en</sub> | **57.8** | 51.7 | - | 56.3 | - |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
|--------------------------|--------------|---------------|---------------|----------------|----------------|
| Refcoco<sub>val</sub> | 90.5 | 88.7 | 90.0 | **90.6** | 73.2 |
| Refcoco<sub>testA</sub> | **93.5** | 91.8 | 92.5 | 93.2 | 72.9 |
| Refcoco<sub>testB</sub> | 86.6 | 84.0 | 85.4 | **88.2** | 74.6 |
| Refcoco+<sub>val</sub> | 85.4 | 81.1 | 84.2 | **88.2** | 62.5 |
| Refcoco+<sub>testA</sub> | **91.0** | 87.5 | 89.1 | 89.0 | 63.9 |
| Refcoco+<sub>testB</sub> | **79.3** | 73.2 | 76.9 | 75.9 | 65.0 |
| Refcocog<sub>val</sub> | **87.4** | 85.0 | 87.2 | 86.1 | 75.2 |
| Refcocog<sub>test</sub> | **87.9** | 85.1 | 87.2 | 87.0 | 76.2 |
| ODinW | 42.4 | 39.2 | 37.3 | **55.0** | 36.7 |
| PointGrounding | 66.5 | 46.2 | **67.3** | - | - |
</details>
<details>
<summary>Video (without audio) -> Text</summary>

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|-----------------------------|--------------|------------|------------|---------------|-------------|
| Video-MME<sub>w/o sub</sub> | 64.3 | 62.0 | 63.9 | **65.1** | 64.8 |
| Video-MME<sub>w sub</sub> | **72.4** | 68.6 | 67.9 | 71.6 | - |
| MVBench | **70.3** | 68.7 | 67.2 | 69.6 | - |
| EgoSchema<sub>test</sub> | **68.6** | 61.4 | 63.2 | 65.0 | - |
</details>
<details>
<summary>Zero-shot Speech Generation</summary>

<table class="tg"><thead>
<tr>
<th class="tg-0lax">Datasets</th>
<th class="tg-0lax">Model</th>
<th class="tg-0lax">Performance</th>
</tr></thead>
<tbody>
<tr>
<td class="tg-9j4x" colspan="3">Content Consistency</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="11">SEED<br>test-zh | test-en | test-hard</td>
<td class="tg-0lax">Seed-TTS_ICL</td>
<td class="tg-0lax">1.11 | 2.24 | 7.58</td>
</tr>
<tr>
<td class="tg-0lax">Seed-TTS_RL</td>
<td class="tg-0lax"><strong>1.00</strong> | 1.94 | <strong>6.42</strong></td>
</tr>
<tr>
<td class="tg-0lax">MaskGCT</td>
<td class="tg-0lax">2.27 | 2.62 | 10.27</td>
</tr>
<tr>
<td class="tg-0lax">E2_TTS</td>
<td class="tg-0lax">1.97 | 2.19 | -</td>
</tr>
<tr>
<td class="tg-0lax">F5-TTS</td>
<td class="tg-0lax">1.56 | <strong>1.83</strong> | 8.67</td>
</tr>
<tr>
<td class="tg-0lax">CosyVoice 2</td>
<td class="tg-0lax">1.45 | 2.57 | 6.83</td>
</tr>
<tr>
<td class="tg-0lax">CosyVoice 2-S</td>
<td class="tg-0lax">1.45 | 2.38 | 8.08</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B_ICL</td>
<td class="tg-0lax">1.95 | 2.87 | 9.92</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B_RL</td>
<td class="tg-0lax">1.58 | 2.51 | 7.86</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B_ICL</td>
<td class="tg-0lax">1.70 | 2.72 | 7.97</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B_RL</td>
<td class="tg-0lax">1.42 | 2.32 | 6.54</td>
</tr>
<tr>
<td class="tg-9j4x" colspan="3">Speaker Similarity</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="11">SEED<br>test-zh | test-en | test-hard</td>
<td class="tg-0lax">Seed-TTS_ICL</td>
<td class="tg-0lax">0.796 | 0.762 | 0.776</td>
</tr>
<tr>
<td class="tg-0lax">Seed-TTS_RL</td>
<td class="tg-0lax"><strong>0.801</strong> | <strong>0.766</strong> | <strong>0.782</strong></td>
</tr>
<tr>
<td class="tg-0lax">MaskGCT</td>
<td class="tg-0lax">0.774 | 0.714 | 0.748</td>
</tr>
<tr>
<td class="tg-0lax">E2_TTS</td>
<td class="tg-0lax">0.730 | 0.710 | -</td>
</tr>
<tr>
<td class="tg-0lax">F5-TTS</td>
<td class="tg-0lax">0.741 | 0.647 | 0.713</td>
</tr>
<tr>
<td class="tg-0lax">CosyVoice 2</td>
<td class="tg-0lax">0.748 | 0.652 | 0.724</td>
</tr>
<tr>
<td class="tg-0lax">CosyVoice 2-S</td>
<td class="tg-0lax">0.753 | 0.654 | 0.732</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B_ICL</td>
<td class="tg-0lax">0.741 | 0.635 | 0.748</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B_RL</td>
<td class="tg-0lax">0.744 | 0.635 | 0.746</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B_ICL</td>
<td class="tg-0lax">0.752 | 0.632 | 0.747</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B_RL</td>
<td class="tg-0lax">0.754 | 0.641 | 0.752</td>
</tr>
</tbody></table>
</details>
<details>
<summary>Text -> Text</summary>

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B |
|-----------------------------------|-----------|------------|------------|------------|------------|-------------|-----------|
| MMLU-Pro | 47.0 | 40.4 | **56.3** | 43.7 | 44.1 | 48.3 | 52.1 |
| MMLU-redux | 71.0 | 60.9 | **75.4** | 64.4 | 67.3 | 67.2 | 72.8 |
| LiveBench<sub>0831</sub> | 29.6 | 22.3 | **35.9** | 26.8 | 29.2 | 26.7 | 30.6 |
| GPQA | 30.8 | 34.3 | **36.4** | 30.3 | 34.3 | 32.8 | 32.8 |
| MATH | 71.5 | 63.6 | **75.5** | 65.9 | 52.9 | 51.9 | 44.3 |
| GSM8K | 88.7 | 82.6 | **91.6** | 86.7 | 85.7 | 84.5 | 76.7 |
| HumanEval | 78.7 | 70.7 | **84.8** | 74.4 | 79.9 | 72.6 | 68.9 |
| MBPP | 73.2 | 70.4 | **79.2** | 72.7 | 67.2 | 69.6 | 74.9 |
| MultiPL-E | 65.8 | 57.6 | **70.4** | 60.2 | 59.1 | 50.7 | 53.4 |
| LiveCodeBench<sub>2305-2409</sub> | 24.6 | 16.5 | **28.7** | 19.9 | 23.9 | 8.3 | 18.9 |
</details>

## Reference

**Original model card**: [Qwen/Qwen2.5-Omni-3B](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)