---
license: other
license_name: qwen-research
license_link: LICENSE
language:
- en
tags:
- multimodal
library_name: transformers
pipeline_tag: any-to-any
---

# NexaAI/Qwen2.5-Omni-3B-GGUF

## Quickstart

Run the quantized builds below directly with [nexa-sdk](https://github.com/NexaAI/nexa-sdk) installed.

In the nexa-sdk CLI:

```bash
# pull (if needed) and run the model; `infer` is the nexa-sdk inference subcommand
nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF
```

#### Available Quantizations

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| [Qwen2.5-Omni-3B-4bit.gguf](https://huggingface.co/NexaAI/Qwen2.5-Omni-3B-GGUF/blob/main/Qwen2.5-Omni-3B-4bit.gguf) | 4bit | 2.1 GB | false | Lightweight 4-bit quant for fast inference. |
| [Qwen2.5-Omni-3B-Q8_0.gguf](https://huggingface.co/NexaAI/Qwen2.5-Omni-3B-GGUF/blob/main/Qwen2.5-Omni-3B-Q8_0.gguf) | Q8_0 | 3.62 GB | false | High-quality 8-bit quantization. |
| [Qwen2.5-Omni-3Bq2_k.gguf](https://huggingface.co/NexaAI/Qwen2.5-Omni-3B-GGUF/blob/main/Qwen2.5-Omni-3Bq2_k.gguf) | Q2_K | 4 Bytes | false | 2-bit quant. Best for extreme low-resource use. |
| [mmproj-Qwen2.5-Omni-3B-Q8_0.gguf](https://huggingface.co/NexaAI/Qwen2.5-Omni-3B-GGUF/blob/main/mmproj-Qwen2.5-Omni-3B-Q8_0.gguf) | Q8_0 | 1.54 GB | false | Required vision adapter for the Q8_0 model. |

## Overview

### Introduction

Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
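The GGUF files in this repo target the nexa-sdk runtime shown in the Quickstart. If you would rather experiment with the full-precision checkpoint, the sketch below follows the usage pattern documented on the upstream Qwen2.5-Omni card; the audio path is a placeholder, and the class and helper names should be verified against your installed `transformers` version.

```python
# Minimal sketch: drive the full-precision checkpoint with 🤗 Transformers,
# following the usage pattern from the upstream Qwen2.5-Omni-3B card.
# "question.wav" is a placeholder input file.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # pip install qwen-omni-utils

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-3B")

# The upstream card notes this system prompt is required for speech output.
conversation = [
    {"role": "system", "content": [{"type": "text", "text": (
        "You are Qwen, a virtual human developed by the Qwen Team, Alibaba "
        "Group, capable of perceiving auditory and visual inputs, as well as "
        "generating text and speech.")}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": "question.wav"},  # placeholder file
        {"type": "text", "text": "Answer the question in the audio."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True, use_audio_in_video=False)
inputs = inputs.to(model.device).to(model.dtype)

# generate() returns text token ids plus a 24 kHz speech waveform
text_ids, audio = model.generate(**inputs, use_audio_in_video=False)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```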

### Key Features

* **Omni and Novel Architecture**: We propose the Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We also propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio (a toy sketch of the idea follows under Model Architecture below).
* **Real-Time Voice and Video Chat**: The architecture is designed for fully real-time interaction, supporting chunked input and immediate output.
* **Natural and Robust Speech Generation**: Surpasses many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation.
* **Strong Performance Across Modalities**: Exhibits exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves performance comparable to Qwen2.5-VL-7B.
* **Excellent End-to-End Speech Instruction Following**: Qwen2.5-Omni's performance on end-to-end speech instruction following rivals its effectiveness with text inputs, as evidenced by benchmarks such as MMLU and GSM8K.

### Model Architecture

*(The architecture diagram is available on the [original model card](https://huggingface.co/Qwen/Qwen2.5-Omni-3B).)*
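To make the TMRoPE bullet above concrete, here is a toy illustration only, not the actual implementation: time-aligned positions mean that audio and video tokens sampled at the same instant share a temporal index. The 40 ms-per-step granularity is an assumption made for this sketch.

```python
# Toy sketch of the time-alignment idea behind TMRoPE: tokens from different
# modalities that correspond to the same instant share a temporal position id.
# The 40 ms step is an assumed granularity for illustration, not the real config.
STEP_SECONDS = 0.04

def temporal_ids(tokens):
    """tokens: list of (modality, timestamp_in_seconds) pairs in stream order."""
    return [(modality, round(t / STEP_SECONDS)) for modality, t in tokens]

# An audio chunk and the video frame sampled at the same moment align:
stream = [("audio", 0.00), ("video", 0.00), ("audio", 0.04), ("video", 0.04)]
print(temporal_ids(stream))
# -> [('audio', 0), ('video', 0), ('audio', 1), ('video', 1)]
```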

### Performance

We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and to strong baselines such as Qwen2.5-VL-7B, Qwen2-Audio, and the closed-source Gemini-1.5-Pro. On tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. In single-modality tasks, it excels in areas including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (seed-tts-eval and subjective naturalness).

#### Multimodality -> Text

**OmniBench**

| Model | Speech | Sound Event | Music | Avg |
|-------|--------|-------------|-------|-----|
| Gemini-1.5-Pro | 42.67% | 42.26% | 46.23% | 42.91% |
| MIO-Instruct | 36.96% | 33.58% | 11.32% | 33.80% |
| AnyGPT (7B) | 17.77% | 20.75% | 13.21% | 18.04% |
| video-SALMONN | 34.11% | 31.70% | 56.60% | 35.64% |
| UnifiedIO2-xlarge | 39.56% | 36.98% | 29.25% | 38.00% |
| UnifiedIO2-xxlarge | 34.24% | 36.98% | 24.53% | 33.98% |
| MiniCPM-o | - | - | - | 40.50% |
| Baichuan-Omni-1.5 | - | - | - | 42.90% |
| Qwen2.5-Omni-3B | 52.14% | 52.08% | 52.83% | 52.19% |
| Qwen2.5-Omni-7B | 55.25% | 60.00% | 52.83% | 56.13% |
#### Audio -> Text

**ASR: Librispeech**

| Model | dev-clean | dev-other | test-clean | test-other |
|-------|-----------|-----------|------------|------------|
| SALMONN | - | - | 2.1 | 4.9 |
| SpeechVerse | - | - | 2.1 | 4.4 |
| Whisper-large-v3 | - | - | 1.8 | 3.6 |
| Llama-3-8B | - | - | - | 3.4 |
| Llama-3-70B | - | - | - | 3.1 |
| Seed-ASR-Multilingual | - | - | 1.6 | 2.8 |
| MiniCPM-o | - | - | 1.7 | - |
| MinMo | - | - | 1.7 | 3.9 |
| Qwen-Audio | 1.8 | 4.0 | 2.0 | 4.2 |
| Qwen2-Audio | 1.3 | 3.4 | 1.6 | 3.6 |
| Qwen2.5-Omni-3B | 2.0 | 4.1 | 2.2 | 4.5 |
| Qwen2.5-Omni-7B | 1.6 | 3.5 | 1.8 | 3.4 |

**ASR: Common Voice 15**

| Model | en | zh | yue | fr |
|-------|----|----|-----|----|
| Whisper-large-v3 | 9.3 | 12.8 | 10.9 | 10.8 |
| MinMo | 7.9 | 6.3 | 6.4 | 8.5 |
| Qwen2-Audio | 8.6 | 6.9 | 5.9 | 9.6 |
| Qwen2.5-Omni-3B | 9.1 | 6.0 | 11.6 | 9.6 |
| Qwen2.5-Omni-7B | 7.6 | 5.2 | 7.3 | 7.5 |

**ASR: Fleurs**

| Model | zh | en |
|-------|----|----|
| Whisper-large-v3 | 7.7 | 4.1 |
| Seed-ASR-Multilingual | - | 3.4 |
| Megrez-3B-Omni | 10.8 | - |
| MiniCPM-o | 4.4 | - |
| MinMo | 3.0 | 3.8 |
| Qwen2-Audio | 7.5 | - |
| Qwen2.5-Omni-3B | 3.2 | 5.4 |
| Qwen2.5-Omni-7B | 3.0 | 4.1 |

**ASR: Wenetspeech**

| Model | test-net | test-meeting |
|-------|----------|--------------|
| Seed-ASR-Chinese | 4.7 | 5.7 |
| Megrez-3B-Omni | - | 16.4 |
| MiniCPM-o | 6.9 | - |
| MinMo | 6.8 | 7.4 |
| Qwen2.5-Omni-3B | 6.3 | 8.1 |
| Qwen2.5-Omni-7B | 5.9 | 7.7 |

**ASR: Voxpopuli-V1.0-en**

| Model | Performance |
|-------|-------------|
| Llama-3-8B | 6.2 |
| Llama-3-70B | 5.7 |
| Qwen2.5-Omni-3B | 6.6 |
| Qwen2.5-Omni-7B | 5.8 |
**S2TT: CoVoST2**

| Model | en-de | de-en | en-zh | zh-en |
|-------|-------|-------|-------|-------|
| SALMONN | 18.6 | - | 33.1 | - |
| SpeechLLaMA | - | 27.1 | - | 12.3 |
| BLSP | 14.1 | - | - | - |
| MiniCPM-o | - | - | 48.2 | 27.2 |
| MinMo | - | 39.9 | 46.7 | 26.0 |
| Qwen-Audio | 25.1 | 33.9 | 41.5 | 15.7 |
| Qwen2-Audio | 29.9 | 35.2 | 45.2 | 24.4 |
| Qwen2.5-Omni-3B | 28.3 | 38.1 | 41.4 | 26.6 |
| Qwen2.5-Omni-7B | 30.2 | 37.7 | 41.4 | 29.4 |
**SER: Meld**

| Model | Performance |
|-------|-------------|
| WavLM-large | 0.542 |
| MiniCPM-o | 0.524 |
| Qwen-Audio | 0.557 |
| Qwen2-Audio | 0.553 |
| Qwen2.5-Omni-3B | 0.558 |
| Qwen2.5-Omni-7B | 0.570 |
**VSC: VocalSound**

| Model | Performance |
|-------|-------------|
| CLAP | 0.495 |
| Pengi | 0.604 |
| Qwen-Audio | 0.929 |
| Qwen2-Audio | 0.939 |
| Qwen2.5-Omni-3B | 0.936 |
| Qwen2.5-Omni-7B | 0.939 |
**Music: GiantSteps Tempo**

| Model | Performance |
|-------|-------------|
| Llark-7B | 0.86 |
| Qwen2.5-Omni-3B | 0.88 |
| Qwen2.5-Omni-7B | 0.88 |

**Music: MusicCaps**

| Model | Performance |
|-------|-------------|
| LP-MusicCaps | 0.291 \| 0.149 \| 0.089 \| 0.061 \| 0.129 \| 0.130 |
| Qwen2.5-Omni-3B | 0.325 \| 0.163 \| 0.093 \| 0.057 \| 0.132 \| 0.229 |
| Qwen2.5-Omni-7B | 0.328 \| 0.162 \| 0.090 \| 0.055 \| 0.127 \| 0.225 |
**Audio Reasoning: MMAU**

| Model | Sound | Music | Speech | Avg |
|-------|-------|-------|--------|-----|
| Gemini-Pro-V1.5 | 56.75 | 49.40 | 58.55 | 54.90 |
| Qwen2-Audio | 54.95 | 50.98 | 42.04 | 49.20 |
| Qwen2.5-Omni-3B | 70.27 | 60.48 | 59.16 | 63.30 |
| Qwen2.5-Omni-7B | 67.87 | 69.16 | 59.76 | 65.60 |
**Voice Chatting: VoiceBench**

| Model | AlpacaEval | CommonEval | SD-QA | MMSU |
|-------|------------|------------|-------|------|
| Ultravox-v0.4.1-LLaMA-3.1-8B | 4.55 | 3.90 | 53.35 | 47.17 |
| MERaLiON | 4.50 | 3.77 | 55.06 | 34.95 |
| Megrez-3B-Omni | 3.50 | 2.95 | 25.95 | 27.03 |
| Lyra-Base | 3.85 | 3.50 | 38.25 | 49.74 |
| MiniCPM-o | 4.42 | 4.15 | 50.72 | 54.78 |
| Baichuan-Omni-1.5 | 4.50 | 4.05 | 43.40 | 57.25 |
| Qwen2-Audio | 3.74 | 3.43 | 35.71 | 35.72 |
| Qwen2.5-Omni-3B | 4.32 | 4.00 | 49.37 | 50.23 |
| Qwen2.5-Omni-7B | 4.49 | 3.93 | 55.71 | 61.32 |

| Model | OpenBookQA | IFEval | AdvBench | Avg |
|-------|------------|--------|----------|-----|
| Ultravox-v0.4.1-LLaMA-3.1-8B | 65.27 | 66.88 | 98.46 | 71.45 |
| MERaLiON | 27.23 | 62.93 | 94.81 | 62.91 |
| Megrez-3B-Omni | 28.35 | 25.71 | 87.69 | 46.25 |
| Lyra-Base | 72.75 | 36.28 | 59.62 | 57.66 |
| MiniCPM-o | 78.02 | 49.25 | 97.69 | 71.69 |
| Baichuan-Omni-1.5 | 74.51 | 54.54 | 97.31 | 71.14 |
| Qwen2-Audio | 49.45 | 26.33 | 96.73 | 55.35 |
| Qwen2.5-Omni-3B | 74.73 | 42.10 | 98.85 | 68.81 |
| Qwen2.5-Omni-7B | 81.10 | 52.87 | 99.42 | 74.12 |
#### Image -> Text

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|--------------------------------|--------------|------------|------------|---------------|-------------|
| MMMU<sub>val</sub> | 59.2 | 53.1 | 53.9 | 58.6 | **60.0** |
| MMMU-Pro<sub>overall</sub> | 36.6 | 29.7 | - | **38.3** | 37.6 |
| MathVista<sub>testmini</sub> | 67.9 | 59.4 | **71.9** | 68.2 | 52.5 |
| MathVision<sub>full</sub> | 25.0 | 20.8 | 23.1 | **25.1** | - |
| MMBench-V1.1-EN<sub>test</sub> | 81.8 | 77.8 | 80.5 | **82.6** | 76.0 |
| MMVet<sub>turbo</sub> | 66.8 | 62.1 | **67.5** | 67.1 | 66.9 |
| MMStar | **64.0** | 55.7 | **64.0** | 63.9 | 54.8 |
| MME<sub>sum</sub> | 2340 | 2117 | **2372** | 2347 | 2003 |
| MuirBench | 59.2 | 48.0 | - | **59.2** | - |
| CRPE<sub>relation</sub> | **76.5** | 73.7 | - | 76.4 | - |
| RealWorldQA<sub>avg</sub> | 70.3 | 62.6 | **71.9** | 68.5 | - |
| MME-RealWorld<sub>en</sub> | **61.6** | 55.6 | - | 57.4 | - |
| MM-MT-Bench | 6.0 | 5.0 | - | **6.3** | - |
| AI2D | 83.2 | 79.5 | **85.8** | 83.9 | - |
| TextVQA<sub>val</sub> | 84.4 | 79.8 | 83.2 | **84.9** | - |
| DocVQA<sub>test</sub> | 95.2 | 93.3 | 93.5 | **95.7** | - |
| ChartQA<sub>test Avg</sub> | 85.3 | 82.8 | 84.9 | **87.3** | - |
| OCRBench_V2<sub>en</sub> | **57.8** | 51.7 | - | 56.3 | - |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
|--------------------------|--------------|---------------|---------------|----------------|----------------|
| Refcoco<sub>val</sub> | 90.5 | 88.7 | 90.0 | **90.6** | 73.2 |
| Refcoco<sub>textA</sub> | **93.5** | 91.8 | 92.5 | 93.2 | 72.9 |
| Refcoco<sub>textB</sub> | 86.6 | 84.0 | 85.4 | **88.2** | 74.6 |
| Refcoco+<sub>val</sub> | 85.4 | 81.1 | 84.2 | **88.2** | 62.5 |
| Refcoco+<sub>textA</sub> | **91.0** | 87.5 | 89.1 | 89.0 | 63.9 |
| Refcoco+<sub>textB</sub> | **79.3** | 73.2 | 76.9 | 75.9 | 65.0 |
| Refcocog<sub>val</sub> | **87.4** | 85.0 | 87.2 | 86.1 | 75.2 |
| Refcocog<sub>test</sub> | **87.9** | 85.1 | 87.2 | 87.0 | 76.2 |
| ODinW | 42.4 | 39.2 | 37.3 | **55.0** | 36.7 |
| PointGrounding | 66.5 | 46.2 | **67.3** | - | - |
#### Video (without audio) -> Text

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|-----------------------------|--------------|------------|------------|---------------|-------------|
| Video-MME<sub>w/o sub</sub> | 64.3 | 62.0 | 63.9 | **65.1** | 64.8 |
| Video-MME<sub>w sub</sub> | **72.4** | 68.6 | 67.9 | 71.6 | - |
| MVBench | **70.3** | 68.7 | 67.2 | 69.6 | - |
| EgoSchema<sub>test</sub> | **68.6** | 61.4 | 63.2 | 65.0 | - |
#### Zero-shot Speech Generation

**Content Consistency: SEED**

| Model | test-zh | test-en | test-hard |
|-------|---------|---------|-----------|
| Seed-TTS_ICL | 1.11 | 2.24 | 7.58 |
| Seed-TTS_RL | 1.00 | 1.94 | 6.42 |
| MaskGCT | 2.27 | 2.62 | 10.27 |
| E2_TTS | 1.97 | 2.19 | - |
| F5-TTS | 1.56 | 1.83 | 8.67 |
| CosyVoice 2 | 1.45 | 2.57 | 6.83 |
| CosyVoice 2-S | 1.45 | 2.38 | 8.08 |
| Qwen2.5-Omni-3B_ICL | 1.95 | 2.87 | 9.92 |
| Qwen2.5-Omni-3B_RL | 1.58 | 2.51 | 7.86 |
| Qwen2.5-Omni-7B_ICL | 1.70 | 2.72 | 7.97 |
| Qwen2.5-Omni-7B_RL | 1.42 | 2.32 | 6.54 |

**Speaker Similarity: SEED**

| Model | test-zh | test-en | test-hard |
|-------|---------|---------|-----------|
| Seed-TTS_ICL | 0.796 | 0.762 | 0.776 |
| Seed-TTS_RL | 0.801 | 0.766 | 0.782 |
| MaskGCT | 0.774 | 0.714 | 0.748 |
| E2_TTS | 0.730 | 0.710 | - |
| F5-TTS | 0.741 | 0.647 | 0.713 |
| CosyVoice 2 | 0.748 | 0.652 | 0.724 |
| CosyVoice 2-S | 0.753 | 0.654 | 0.732 |
| Qwen2.5-Omni-3B_ICL | 0.741 | 0.635 | 0.748 |
| Qwen2.5-Omni-3B_RL | 0.744 | 0.635 | 0.746 |
| Qwen2.5-Omni-7B_ICL | 0.752 | 0.632 | 0.747 |
| Qwen2.5-Omni-7B_RL | 0.754 | 0.641 | 0.752 |
#### Text -> Text

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B |
|-----------------------------------|-----------|------------|------------|------------|------------|-------------|-----------|
| MMLU-Pro | 47.0 | 40.4 | **56.3** | 43.7 | 44.1 | 48.3 | 52.1 |
| MMLU-redux | 71.0 | 60.9 | **75.4** | 64.4 | 67.3 | 67.2 | 72.8 |
| LiveBench<sub>0831</sub> | 29.6 | 22.3 | **35.9** | 26.8 | 29.2 | 26.7 | 30.6 |
| GPQA | 30.8 | 34.3 | **36.4** | 30.3 | 34.3 | 32.8 | 32.8 |
| MATH | 71.5 | 63.6 | **75.5** | 65.9 | 52.9 | 51.9 | 44.3 |
| GSM8K | 88.7 | 82.6 | **91.6** | 86.7 | 85.7 | 84.5 | 76.7 |
| HumanEval | 78.7 | 70.7 | **84.8** | 74.4 | 79.9 | 72.6 | 68.9 |
| MBPP | 73.2 | 70.4 | **79.2** | 72.7 | 67.2 | 69.6 | 74.9 |
| MultiPL-E | 65.8 | 57.6 | **70.4** | 60.2 | 59.1 | 50.7 | 53.4 |
| LiveCodeBench<sub>2305-2409</sub> | 24.6 | 16.5 | **28.7** | 19.9 | 23.9 | 8.3 | 18.9 |
## Reference

**Original model card**: [Qwen/Qwen2.5-Omni-3B](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)