---
pipeline_tag: any-to-any
datasets:
- openbmb/RLAIF-V-Dataset
library_name: transformers
language:
- multilingual
tags:
- minicpm-o
- omni
- vision
- ocr
- multi-image
- video
- custom_code
- audio
- speech
- voice cloning
- live Streaming
- realtime speech conversation
- asr
- tts
---
					
					
						
<h1>A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone</h1>

[GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Online Demo](https://minicpm-omni-webdemo-us.modelbest.cn)


### News

* [2025.01.19] ⭐️⭐️⭐️ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending!
					
					
						
## MiniCPM-o 2.6


**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:

- 🔥 **Leading Visual Capability.**
  MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.

- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.

- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.

- 💪 **Strong OCR Capability and Others.**
  Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
  Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.


- 🚀 **Superior Efficiency.**
  In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.

- 💫 **Easy Usage.**
  MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).
					
					
						


**Model Architecture.**

- **End-to-end Omni-modal Architecture.** The encoders and decoders of different modalities are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
- **Omni-modal Live Streaming Mechanism.** (1) We convert the offline modality encoders/decoders into online ones for **streaming inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential info within small periodic time slices.
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt, and **a new audio system prompt to determine the assistant voice**. This enables flexible voice configurations at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
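
The time-division multiplexing idea above can be illustrated with a toy sketch in plain Python. This is not the model's implementation: the `Chunk` type, the `tdm_interleave` function, the payload strings, and the 1-second slice length are all hypothetical, chosen only to show how parallel streams might be flattened into one sequence grouped by periodic time slice.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Chunk:
    start: float   # timestamp in seconds at which the chunk begins
    payload: str   # stand-in for encoded features (hypothetical)


def tdm_interleave(streams: Dict[str, List[Chunk]], slice_len: float = 1.0) -> List[Chunk]:
    """Flatten parallel streams into one sequence ordered by time slice.

    Chunks whose start time falls in the same slice are emitted together,
    following the order in which the streams appear in `streams`.
    """
    if slice_len <= 0:
        raise ValueError("slice_len must be positive")
    keyed = []
    for rank, chunks in enumerate(streams.values()):
        for chunk in chunks:
            slice_index = int(chunk.start // slice_len)
            keyed.append((slice_index, rank, chunk.start, chunk))
    # Sort by slice first, then by stream order, then by time within the slice.
    keyed.sort(key=lambda item: item[:3])
    return [item[3] for item in keyed]


# Two parallel streams: video sampled once per second, audio twice per second.
streams = {
    "video": [Chunk(0.0, "v0"), Chunk(1.0, "v1")],
    "audio": [Chunk(0.0, "a0"), Chunk(0.5, "a0.5"), Chunk(1.0, "a1")],
}
order = [c.payload for c in tdm_interleave(streams)]
print(order)  # ['v0', 'a0', 'a0.5', 'v1', 'a1']
```

Each one-second slice contributes its video chunk followed by its audio chunks, so a sequential consumer sees all modalities for one slice before moving on to the next.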
					
					
						

<div align="center">
<img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpm-o-26-framework-v2.png" width="80%">
</div>
					
					
						


### Evaluation  <!-- omit in toc -->

<div align="center">
    <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/radar.jpg" width=90% />
</div>

<details>
<summary>Click to view visual understanding results.</summary>

**Image Understanding**

<div align="center">
					
					
						
<table style="margin: 0px auto;">
    <thead>
        <tr>
            <th align="left">Model</th>
            <th>Size</th>
            <th>Token Density<sup>+</sup></th>
            <th>OpenCompass</th>
            <th>OCRBench</th>
            <th>MathVista mini</th>
            <th>ChartQA</th>
            <th>MMVet</th>
            <th>MMStar</th>
            <th>MME</th>
            <th>MMB1.1 test</th>
            <th>AI2D</th>
            <th>MMMU val</th>
            <th>HallusionBench</th>
            <th>TextVQA val</th>
            <th>DocVQA test</th>
            <th>MathVerse mini</th>
            <th>MathVision</th>
            <th>MMHal Score</th>
        </tr>
    </thead>
					
					
						
    <tbody align="center">
        <tr>
            <td colspan="19" align="left"><strong>Proprietary</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
            <td>-</td>
            <td>1088</td>
            <td><u>69.9</u></td>
            <td>736</td>
            <td>61.3</td>
            <td>85.7</td>
            <td><strong>69.1</strong></td>
            <td>63.9</td>
            <td>2328.7</td>
            <td>82.2</td>
            <td>84.6</td>
            <td><strong>69.2</strong></td>
            <td><strong>55.0</strong></td>
            <td>-</td>
            <td>92.8</td>
            <td><strong>50.2</strong></td>
            <td><strong>30.4</strong></td>
            <td><u>3.6</u></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
            <td>-</td>
            <td>750</td>
            <td>67.9</td>
            <td>788</td>
            <td>61.6</td>
            <td><strong>90.8</strong></td>
            <td>66.0</td>
            <td>62.2</td>
            <td>1920.0</td>
            <td>78.5</td>
            <td>80.2</td>
            <td><u>65.9</u></td>
            <td>49.9</td>
            <td>-</td>
            <td><strong>95.2</strong></td>
            <td>-</td>
            <td>-</td>
            <td>3.4</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
            <td>-</td>
            <td>-</td>
            <td>64.4</td>
            <td>754</td>
            <td>57.7</td>
            <td>81.3</td>
            <td>64.0</td>
            <td>59.1</td>
            <td>2110.6</td>
            <td>73.9</td>
            <td>79.1</td>
            <td>60.6</td>
            <td>45.6</td>
            <td>73.5</td>
            <td>86.5</td>
            <td>-</td>
            <td>19.2</td>
            <td>-</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
            <td>-</td>
            <td>1088</td>
            <td>64.1</td>
            <td>785</td>
            <td>52.4</td>
            <td>-</td>
            <td>66.9</td>
            <td>54.8</td>
            <td>2003.4</td>
            <td>76.0</td>
            <td>77.8</td>
            <td>60.0</td>
            <td>46.1</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>3.3</td>
        </tr>
					
					
						
        <tr>
            <td colspan="19" align="left"><strong>Open Source</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Cambrian-34B</td>
            <td>34B</td>
            <td><u>1820</u></td>
            <td>58.3</td>
            <td>591</td>
            <td>50.3</td>
            <td>75.6</td>
            <td>53.2</td>
            <td>54.2</td>
            <td>2049.9</td>
            <td>77.8</td>
            <td>79.5</td>
            <td>50.4</td>
            <td>41.6</td>
            <td>76.7</td>
            <td>75.5</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">GLM-4V-9B</td>
            <td>13B</td>
            <td>784</td>
            <td>59.1</td>
            <td>776</td>
            <td>51.1</td>
            <td>-</td>
            <td>58.0</td>
            <td>54.8</td>
            <td>2018.8</td>
            <td>67.9</td>
            <td>71.2</td>
            <td>46.9</td>
            <td>45.0</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Pixtral-12B</td>
            <td>12B</td>
            <td>256</td>
            <td>61.0</td>
            <td>685</td>
            <td>56.9</td>
            <td>81.8</td>
            <td>58.5</td>
            <td>54.5</td>
            <td>-</td>
            <td>72.7</td>
            <td>79.0</td>
            <td>51.1</td>
            <td>47.0</td>
            <td>75.7</td>
            <td>90.7</td>
            <td>-</td>
            <td>-</td>
            <td>-</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
            <td>27B</td>
            <td>672</td>
            <td>66.4</td>
            <td>809</td>
            <td>63.9</td>
            <td>86.0</td>
            <td>60.0</td>
            <td>61.9</td>
            <td>2253.0</td>
            <td>81.2</td>
            <td>83.8</td>
            <td>54.0</td>
            <td>45.3</td>
            <td><u>84.2</u></td>
            <td>93.3</td>
            <td>-</td>
            <td>-</td>
            <td>3.0</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
            <td>8B</td>
            <td>784</td>
            <td>67.1</td>
            <td><u>866</u></td>
            <td>58.2</td>
            <td>83.0</td>
            <td>62.0</td>
            <td>60.7</td>
            <td>2326.0</td>
            <td>81.8</td>
            <td>83.0</td>
            <td>54.1</td>
            <td>50.6</td>
            <td><strong>84.3</strong></td>
            <td><u>94.5</u></td>
            <td>31.9</td>
            <td>16.3</td>
            <td>3.2</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
            <td>72B</td>
            <td>182</td>
            <td>68.1</td>
            <td>741</td>
            <td>67.5</td>
            <td>83.7</td>
            <td>60.6</td>
            <td><strong>65.8</strong></td>
            <td>2261.0</td>
            <td><strong>85.0</strong></td>
            <td><u>85.6</u></td>
            <td>56.8</td>
            <td>49.0</td>
            <td>80.5</td>
            <td>91.3</td>
            <td>39.1</td>
            <td>-</td>
            <td>3.5</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
            <td>8B</td>
            <td>706</td>
            <td>68.3</td>
            <td>822</td>
            <td><u>64.4</u></td>
            <td>84.8</td>
            <td>62.8</td>
            <td>62.8</td>
            <td>2344.0</td>
            <td><u>83.6</u></td>
            <td>84.5</td>
            <td>56.0</td>
            <td>50.1</td>
            <td>79.1</td>
            <td>93.0</td>
            <td>39.5</td>
            <td>19.7</td>
            <td>3.4</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
            <td>8B</td>
            <td><strong>2822</strong></td>
            <td>65.2</td>
            <td>852*</td>
            <td>60.6</td>
            <td>79.4</td>
            <td>60.0</td>
            <td>57.5</td>
            <td><u>2348.4*</u></td>
            <td>78.0</td>
            <td>82.1</td>
            <td>49.8*</td>
            <td>48.1*</td>
            <td>80.1</td>
            <td>90.8</td>
            <td>25.7</td>
            <td>18.3</td>
            <td>3.6</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
            <td>8B</td>
            <td><strong>2822</strong></td>
            <td><strong>70.2</strong></td>
            <td><strong>897*</strong></td>
            <td><strong>71.9*</strong></td>
            <td><u>86.9*</u></td>
            <td><u>67.5</u></td>
            <td><u>64.0</u></td>
            <td><strong>2372.0*</strong></td>
            <td>80.5</td>
            <td><strong>85.8</strong></td>
            <td>50.4*</td>
            <td><u>51.9</u></td>
            <td>82.0</td>
            <td>93.5</td>
            <td><u>41.4*</u></td>
            <td><u>23.1*</u></td>
            <td><strong>3.8</strong></td>
        </tr>
    </tbody>
</table>
					
					
						
						| 
							 | 
						</div> | 
					
					
						
						| 
							 | 
						* We evaluate this benchmark using chain-of-thought prompting. For MME, we use this technique only for the Cognition set.
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<sup>+</sup> Token Density: the number of pixels encoded into each visual token at maximum resolution, i.e., (# pixels at maximum resolution) / (# visual tokens).
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						Note: For proprietary models, we calculate token density from the image-encoding charging strategy defined in the official API documentation, which yields an upper-bound estimate.
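
The token density definition above is a simple ratio, and a minimal sketch can make it concrete. The function name `token_density` and the example inputs (a 1344x1344 image encoded into 640 visual tokens) are assumptions chosen for illustration; they are not stated in this section.

```python
def token_density(width: int, height: int, num_visual_tokens: int) -> float:
    """Return pixels encoded per visual token at the given resolution."""
    if num_visual_tokens <= 0:
        raise ValueError("num_visual_tokens must be positive")
    return (width * height) / num_visual_tokens


# Assumed inputs for illustration: 1344x1344 pixels, 640 visual tokens.
density = token_density(1344, 1344, 640)
print(round(density))  # 1806336 / 640 = 2822.4, which rounds to 2822
```

A higher value means each visual token covers more pixels, so the same image takes fewer tokens to encode. For proprietary models, the token count would instead be inferred from the documented charging strategy, as the note above describes.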
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						**Multi-image and Video Understanding** | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<div align="center"> | 
					
					
						
						| 
							 | 
						  | 
					
					
						
						| 
							 | 
						<table style="margin: 0px auto;"> | 
					
					
						
						| 
							 | 
						    <thead> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <th align="left">Model</th> | 
					
					
						
						| 
							 | 
						            <th>Size</th> | 
					
					
						
						| 
							 | 
						            <th>BLINK val</th> | 
					
					
						
						| 
							 | 
						            <th>Mantis Eval</th> | 
					
					
						
						| 
							 | 
						            <th>MIRB</th> | 
					
					
						
						| 
							 | 
						            <th>Video-MME (wo / w subs)</th> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						    </thead> | 
					
					
						
						| 
							 | 
						    <tbody align="center"> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td colspan="6" align="left"><strong>Proprietary</strong></td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">GPT-4o-20240513</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td><strong>68.0</strong></td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td><strong>71.9/77.2</strong></td>
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">GPT-4V</td>
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>54.6</td> | 
					
					
						
						| 
							 | 
						            <td>62.7</td> | 
					
					
						
						| 
							 | 
						            <td>53.1</td> | 
					
					
						
						| 
							 | 
						            <td>59.9/63.3</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td colspan="6" align="left"><strong>Open-source</strong></td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td> | 
					
					
						
						| 
							 | 
						            <td>14B</td> | 
					
					
						
						| 
							 | 
						            <td>52.6</td> | 
					
					
						
						| 
							 | 
						            <td>66.4</td> | 
					
					
						
						| 
							 | 
						            <td>30.2</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td> | 
					
					
						
						| 
							 | 
						            <td>72B</td> | 
					
					
						
						| 
							 | 
						            <td>55.4</td> | 
					
					
						
						| 
							 | 
						            <td><strong>77.6</strong></td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td><u>66.2/69.5</u></td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">MANTIS 8B</td> | 
					
					
						
						| 
							 | 
						            <td>8B</td> | 
					
					
						
						| 
							 | 
						            <td>49.1</td> | 
					
					
						
						| 
							 | 
						            <td>59.5</td> | 
					
					
						
						| 
							 | 
						            <td>34.8</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">Qwen2-VL-7B</td> | 
					
					
						
						| 
							 | 
						            <td>8B</td> | 
					
					
						
						| 
							 | 
						            <td>53.2</td> | 
					
					
						
						| 
							 | 
						            <td>69.6*</td> | 
					
					
						
						| 
							 | 
						            <td><strong>67.6*</strong></td> | 
					
					
						
						| 
							 | 
						            <td>63.3/69.0</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">InternVL2.5-8B</td> | 
					
					
						
						| 
							 | 
						            <td>8B</td> | 
					
					
						
						| 
							 | 
						            <td>54.8</td> | 
					
					
						
						| 
							 | 
						            <td>67.7</td> | 
					
					
						
						| 
							 | 
						            <td>52.5</td> | 
					
					
						
						| 
							 | 
						            <td>64.2/66.9</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td> | 
					
					
						
						| 
							 | 
						            <td>8B</td> | 
					
					
						
						| 
							 | 
						            <td>53.0</td> | 
					
					
						
						| 
							 | 
						            <td>69.1</td> | 
					
					
						
						| 
							 | 
						            <td>53.8</td> | 
					
					
						
						| 
							 | 
						            <td>60.9/63.6</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td> | 
					
					
						
						| 
							 | 
						            <td>8B</td> | 
					
					
						
						| 
							 | 
						            <td><u>56.7</u></td> | 
					
					
						
						| 
							 | 
						            <td><u>71.9</u></td> | 
					
					
						
						| 
							 | 
						            <td><u>58.6</u></td> | 
					
					
						
						| 
							 | 
						            <td>63.9/67.9</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						    </tbody> | 
					
					
						
						| 
							 | 
						</table> | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						</div> | 
					
					
						
						| 
							 | 
						* We evaluated the officially released checkpoints ourselves.
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						</details> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<details> | 
					
					
						
						| 
							 | 
						<summary>Click to view audio understanding and speech conversation results.</summary> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						**Audio Understanding** | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<div align="center"> | 
					
					
						
						| 
							 | 
						<table style="margin: 0px auto;"> | 
					
					
						
						| 
							 | 
						    <thead> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <th align="left">Task</th> | 
					
					
						
						| 
							 | 
						            <th>Size</th> | 
					
					
						
						| 
							 | 
						            <th colspan="3">ASR (zh)</th> | 
					
					
						
						| 
							 | 
						            <th colspan="3">ASR (en)</th> | 
					
					
						
						| 
							 | 
						            <th colspan="2">AST</th> | 
					
					
						
						| 
							 | 
						            <th>Emotion</th> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <th align="left">Metric</th> | 
					
					
						
						| 
							 | 
						            <td></td> | 
					
					
						
						| 
							 | 
						            <th colspan="3">CER↓</th> | 
					
					
						
						| 
							 | 
						            <th colspan="3">WER↓</th> | 
					
					
						
						| 
							 | 
						            <th colspan="2">BLEU↑</th> | 
					
					
						
						| 
							 | 
						            <th>ACC↑</th> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <th align="left">Dataset</th> | 
					
					
						
						| 
							 | 
						            <td></td> | 
					
					
						
						| 
							 | 
						            <th>AISHELL-1</th> | 
					
					
						
						| 
							 | 
						            <th>Fleurs zh</th> | 
					
					
						
						| 
							 | 
						            <th>WenetSpeech test-net</th> | 
					
					
						
						| 
							 | 
						            <th>LibriSpeech test-clean</th> | 
					
					
						
						| 
							 | 
						            <th>GigaSpeech</th> | 
					
					
						
						| 
							 | 
						            <th>TED-LIUM</th> | 
					
					
						
						| 
							 | 
						            <th>CoVoST en2zh</th> | 
					
					
						
						| 
							 | 
						            <th>CoVoST zh2en</th> | 
					
					
						
						| 
							 | 
						            <th>MELD emotion</th> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						    </thead> | 
					
					
						
						| 
							 | 
						    <tbody align="center"> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td colspan="11" align="left"><strong>Proprietary</strong></td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">GPT-4o-Realtime</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>7.3*</td> | 
					
					
						
						| 
							 | 
						            <td><u>5.4*</u></td> | 
					
					
						
						| 
							 | 
						            <td>28.9*</td> | 
					
					
						
						| 
							 | 
						            <td>2.6*</td> | 
					
					
						
						| 
							 | 
						            <td>12.9*</td> | 
					
					
						
						| 
							 | 
						            <td>4.8*</td> | 
					
					
						
						| 
							 | 
						            <td>37.1*</td> | 
					
					
						
						| 
							 | 
						            <td>15.7*</td> | 
					
					
						
						| 
							 | 
						            <td>33.2*</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>4.5*</td> | 
					
					
						
						| 
							 | 
						            <td>5.9*</td> | 
					
					
						
						| 
							 | 
						            <td>14.3*</td> | 
					
					
						
						| 
							 | 
						            <td>2.9*</td> | 
					
					
						
						| 
							 | 
						            <td>10.6*</td> | 
					
					
						
						| 
							 | 
						            <td><strong>3.0*</strong></td> | 
					
					
						
						| 
							 | 
						            <td><u>47.3*</u></td> | 
					
					
						
						| 
							 | 
						            <td>22.6*</td> | 
					
					
						
						| 
							 | 
						            <td>48.4*</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td colspan="11" align="left"><strong>Open-Source</strong></td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">Qwen2-Audio-7B</td> | 
					
					
						
						| 
							 | 
						            <td>8B</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>7.5</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td><strong>1.6</strong></td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>45.2</td> | 
					
					
						
						| 
							 | 
						            <td><u>24.4</u></td> | 
					
					
						
						| 
							 | 
						            <td><strong>55.3</strong></td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">Qwen2-Audio-7B-Instruct</td> | 
					
					
						
						| 
							 | 
						            <td>8B</td> | 
					
					
						
						| 
							 | 
						            <td>2.6*</td> | 
					
					
						
						| 
							 | 
						            <td>6.9*</td> | 
					
					
						
						| 
							 | 
						            <td><u>10.3*</u></td> | 
					
					
						
						| 
							 | 
						            <td>3.1*</td> | 
					
					
						
						| 
							 | 
						            <td><u>9.7*</u></td>
					
					
						
						| 
							 | 
						            <td>5.9*</td> | 
					
					
						
						| 
							 | 
						            <td>39.5*</td> | 
					
					
						
						| 
							 | 
						            <td>22.9*</td> | 
					
					
						
						| 
							 | 
						            <td>17.4*</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">GLM-4-Voice-Base</td> | 
					
					
						
						| 
							 | 
						            <td>9B</td> | 
					
					
						
						| 
							 | 
						            <td><u>2.5</u></td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>2.8</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						            <td>-</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td> | 
					
					
						
						| 
							 | 
						            <td>8B</td> | 
					
					
						
						| 
							 | 
						            <td><strong>1.6</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>4.4</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>6.9</strong></td> | 
					
					
						
						| 
							 | 
						            <td><u>1.7</u></td> | 
					
					
						
						| 
							 | 
						            <td><strong>8.7</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>3.0</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>48.2</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>27.2</strong></td> | 
					
					
						
						| 
							 | 
						            <td><u>52.4</u></td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						    </tbody> | 
					
					
						
						| 
							 | 
						</table> | 
					
					
						
						| 
							 | 
						</div> | 
					
					
						
						| 
							 | 
						* We evaluated the officially released checkpoints ourselves.<br><br>
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						**Speech Generation** | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<div align="center"> | 
					
					
						
						| 
							 | 
						<table style="margin: 0px auto;"> | 
					
					
						
						| 
							 | 
						    <thead> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <th align="left">Task</th> | 
					
					
						
						| 
							 | 
						            <th>Size</th> | 
					
					
						
						| 
							 | 
						            <th colspan="9">SpeechQA</th> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <th align="left">Metric</th> | 
					
					
						
						| 
							 | 
						            <th></th> | 
					
					
						
						| 
							 | 
						            <th colspan="3">ACC↑</th> | 
					
					
						
						| 
							 | 
						            <th>G-Eval (10 point)↑</th> | 
					
					
						
						| 
							 | 
						            <th>Semantic ELO score↑</th> | 
					
					
						
						| 
							 | 
						            <th>Acoustic ELO score↑</th> | 
					
					
						
						| 
							 | 
						            <th>Overall ELO score↑</th> | 
					
					
						
						| 
							 | 
						            <th>UTMOS↑</th> | 
					
					
						
						| 
							 | 
						            <th>ASR-WER↓</th> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <th align="left">Dataset</th> | 
					
					
						
						| 
							 | 
						            <th></th> | 
					
					
						
						| 
							 | 
						            <th>Speech Llama Q.</th> | 
					
					
						
						| 
							 | 
						            <th>Speech Web Q.</th> | 
					
					
						
						| 
							 | 
						            <th>Speech Trivia QA</th> | 
					
					
						
						| 
							 | 
						            <th>Speech AlpacaEval</th> | 
					
					
						
						| 
							 | 
						            <th colspan="5">AudioArena</th> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						    </thead> | 
					
					
						
						| 
							 | 
						    <tbody align="center"> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td colspan="11" align="left"><strong>Proprietary</strong></td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">GPT-4o-Realtime</td> | 
					
					
						
						| 
							 | 
						            <td>-</td>
					
					
						
						| 
							 | 
						            <td><strong>71.7</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>51.6</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>69.7</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>7.4</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>1157</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>1203</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>1200</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>4.2</strong></td> | 
					
					
						
						| 
							 | 
						            <td><strong>2.3</strong></td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td colspan="11" align="left"><strong>Open-Source</strong></td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">GLM-4-Voice</td> | 
					
					
						
						| 
							 | 
						            <td>9B</td> | 
					
					
						
						| 
							 | 
						            <td>50.0</td> | 
					
					
						
						| 
							 | 
						            <td>32.0</td> | 
					
					
						
						| 
							 | 
						            <td>36.4</td> | 
					
					
						
						| 
							 | 
						            <td><u>5.1</u></td> | 
					
					
						
						| 
							 | 
						            <td>999</td> | 
					
					
						
						| 
							 | 
						            <td>1147</td> | 
					
					
						
						| 
							 | 
						            <td>1035</td> | 
					
					
						
						| 
							 | 
						            <td><u>4.1</u></td> | 
					
					
						
						| 
							 | 
						            <td><u>11.7</u></td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">Llama-Omni</td> | 
					
					
						
						| 
							 | 
						            <td>8B</td> | 
					
					
						
						| 
							 | 
						            <td>45.3</td> | 
					
					
						
						| 
							 | 
						            <td>22.9</td> | 
					
					
						
						| 
							 | 
						            <td>10.7</td> | 
					
					
						
						| 
							 | 
						            <td>3.9</td> | 
					
					
						
						| 
							 | 
						            <td>960</td> | 
					
					
						
						| 
							 | 
						            <td>878</td> | 
					
					
						
						| 
							 | 
						            <td>897</td> | 
					
					
						
						| 
							 | 
						            <td>3.2</td> | 
					
					
						
						| 
							 | 
						            <td>24.3</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">Moshi</td> | 
					
					
						
						| 
							 | 
						            <td>7B</td> | 
					
					
						
						| 
							 | 
						            <td>43.7</td> | 
					
					
						
						| 
							 | 
						            <td>23.8</td> | 
					
					
						
						| 
							 | 
						            <td>16.7</td> | 
					
					
						
						| 
							 | 
						            <td>2.4</td> | 
					
					
						
						| 
							 | 
						            <td>871</td> | 
					
					
						
						| 
							 | 
						            <td>808</td> | 
					
					
						
						| 
							 | 
						            <td>875</td> | 
					
					
						
						| 
							 | 
						            <td>2.8</td> | 
					
					
						
						| 
							 | 
						            <td>8.2</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">Mini-Omni</td> | 
					
					
						
						| 
							 | 
						            <td>1B</td> | 
					
					
						
						| 
							 | 
						            <td>22.0</td> | 
					
					
						
						| 
							 | 
						            <td>12.8</td> | 
					
					
						
						| 
							 | 
						            <td>6.9</td> | 
					
					
						
						| 
							 | 
						            <td>2.5</td> | 
					
					
						
						| 
							 | 
						            <td>926</td> | 
					
					
						
						| 
							 | 
						            <td>803</td> | 
					
					
						
						| 
							 | 
						            <td>865</td> | 
					
					
						
						| 
							 | 
						            <td>3.4</td> | 
					
					
						
						| 
							 | 
						            <td>10.0</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td> | 
					
					
						
						| 
							 | 
						            <td>8B</td> | 
					
					
						
						| 
							 | 
						            <td><u>61.0</u></td> | 
					
					
						
						| 
							 | 
						            <td><u>40.0</u></td> | 
					
					
						
						| 
							 | 
						            <td><u>40.2</u></td> | 
					
					
						
						| 
							 | 
						            <td><u>5.1</u></td> | 
					
					
						
						| 
							 | 
						            <td><u>1088</u></td> | 
					
					
						
						| 
							 | 
						            <td><u>1163</u></td> | 
					
					
						
						| 
							 | 
						            <td><u>1131</u></td> | 
					
					
						
						| 
							 | 
						            <td><strong>4.2</strong></td> | 
					
					
						
						| 
							 | 
						            <td>9.8</td> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						    </tbody> | 
					
					
						
						| 
							 | 
						</table> | 
					
					
						
						| 
							 | 
						</div> | 
					
					
						
						| 
							 | 
						All results are from AudioEvals; the evaluation methods and further details are available in <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">UltraEval-Audio</a>.<br><br>
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						**End-to-end Voice Cloning** | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<div align="center"> | 
					
					
						
						| 
							 | 
						<table style="margin: 0px auto;"> | 
					
					
						
						| 
							 | 
						    <thead> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <th align="left">Task</th> | 
					
					
						
						| 
							 | 
						            <th colspan="2">Voice cloning</th> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <th align="left">Metric</th> | 
					
					
						
						| 
							 | 
						            <th>SIMO↑</th> | 
					
					
						
						| 
							 | 
						            <th>SIMO↑</th> | 
					
					
						
						| 
							 | 
						        </tr> | 
					
					
						
						| 
							 | 
						        <tr> | 
					
					
						
						| 
							 | 
						            <th align="left">Dataset</th> | 
					
					
						
						| 
							 | 
            <th>Seed-TTS test-zh</th>
            <th>Seed-TTS test-en</th>
        </tr>
    </thead>
    <tbody align="center">
        <tr>
            <td nowrap="nowrap" align="left">F5-TTS</td>
            <td><strong>76</strong></td>
            <td><strong>67</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">CosyVoice</td>
            <td><u>75</u></td>
            <td><u>64</u></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">FireRedTTS</td>
            <td>63</td>
            <td>46</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
            <td>57</td>
            <td>47</td>
        </tr>
    </tbody>
</table>
</div>

</details>

<details>
<summary>Click to view multimodal live streaming results.</summary>

**Multimodal Live Streaming**: results on StreamingBench

<table style="margin: 0px auto;">
    <thead>
        <tr>
            <th align="left">Model</th>
            <th>Size</th>
            <th>Real-Time Video Understanding</th>
            <th>Omni-Source Understanding</th>
            <th>Contextual Understanding</th>
            <th>Overall</th>
        </tr>
    </thead>
    <tbody align="center">
        <tr>
            <td colspan="6" align="left"><strong>Proprietary</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
            <td>-</td>
            <td><u>77.4</u></td>
            <td><strong>67.8</strong></td>
            <td><strong>51.1</strong></td>
            <td><strong>70.3</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">GPT-4o-202408</td>
            <td>-</td>
            <td>74.5</td>
            <td>51.0</td>
            <td><u>48.0</u></td>
            <td>64.1</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
            <td>-</td>
            <td>74.0</td>
            <td>41.4</td>
            <td>37.8</td>
            <td>59.7</td>
        </tr>
        <tr>
            <td colspan="6" align="left"><strong>Open-source</strong></td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">VILA-1.5</td>
            <td>8B</td>
            <td>61.5</td>
            <td>37.5</td>
            <td>26.7</td>
            <td>49.5</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">LongVA</td>
            <td>7B</td>
            <td>63.1</td>
            <td>35.9</td>
            <td>30.2</td>
            <td>50.7</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
            <td>34B</td>
            <td>69.8</td>
            <td>41.7</td>
            <td>34.3</td>
            <td>56.7</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
            <td>8B</td>
            <td>71.2</td>
            <td>40.7</td>
            <td>33.1</td>
            <td>57.0</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">InternVL2-8B</td>
            <td>8B</td>
            <td>70.1</td>
            <td>42.7</td>
            <td>34.1</td>
            <td>57.0</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">VITA-1.5</td>
            <td>8B</td>
            <td>70.9</td>
            <td>40.8</td>
            <td>35.8</td>
            <td>57.4</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
            <td>8B</td>
            <td>74.3</td>
            <td>40.8</td>
            <td>31.0</td>
            <td>58.4</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
            <td>8B</td>
            <td>75.4</td>
            <td>46.2</td>
            <td>33.6</td>
            <td>60.8</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
            <td>8B</td>
            <td>72.4</td>
            <td>40.2</td>
            <td>33.4</td>
            <td>57.7</td>
        </tr>
        <tr>
            <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
            <td>8B</td>
            <td><strong>79.9</strong></td>
            <td><u>53.4</u></td>
            <td>38.5</td>
            <td><u>66.0</u></td>
        </tr>
    </tbody>
</table>

</details>


### Examples <!-- omit in toc -->

We deploy MiniCPM-o 2.6 on end devices. The demo video is a raw-speed recording (not sped up) made on an iPad Pro and with the web demo.

<div align="center">
  <a href="https://youtu.be/JFJg9KZ_iZk"><img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/o-2dot6-demo-video-preview.png" width="70%"></a>
</div>

<br>


<div style="display: flex; flex-direction: column; align-items: center;">
  <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
</div>


## Online Demo
Click here to try the online demo of [MiniCPM-o 2.6](https://minicpm-omni-webdemo-us.modelbest.cn).


## Usage
Inference uses Hugging Face Transformers on NVIDIA GPUs. Please ensure that `transformers==4.44.2` is installed, as other versions may have compatibility issues; we are investigating this. The following requirements were tested on Python 3.10:
```
Pillow==10.1.0
torch==2.3.1
torchaudio==2.3.1
torchvision==0.18.1
transformers==4.44.2
librosa==0.9.0
soundfile==0.12.1
vector-quantize-pytorch==1.18.5
vocos==0.1.0
decord
moviepy
```

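Since other `transformers` versions may be incompatible, it can help to compare the installed packages against the pins before loading the model. The following is a minimal sketch using only the standard library; the `PINNED` dictionary and the `installed_version` and `find_mismatches` helpers are illustrative names introduced here, not part of MiniCPM-o or Transformers.

```python
from importlib import metadata

# Subset of the pins listed above (illustrative, not exhaustive).
PINNED = {
    "transformers": "4.44.2",
    "torch": "2.3.1",
    "torchaudio": "2.3.1",
    "torchvision": "0.18.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (wanted, found)."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        # A local build tag such as "2.3.1+cu121" still counts as matching "2.3.1".
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {wanted}, found {found}")
```

The `lookup` parameter exists only so the check can be exercised without touching the real environment; by default it queries the installed distributions.
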
### Model initialization
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the omni model by default; init_vision, init_audio and init_tts all default to True.
# To load a vision-only model, set init_audio=False and init_tts=False.
# To load an audio-only model, set init_vision=False.
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa', # sdpa or flash_attention_2
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=True,
    init_tts=True
)

model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

# Except in vision-only mode, the TTS processor and vocos also need to be initialized
model.init_tts()
```

If you are using an older version of PyTorch, you might encounter the error `"weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'`. In that case, convert the TTS module to float32:
```python
model.tts.float()
```

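The comments in the loading snippet describe three configurations: omni, vision-only and audio-only. They can be summarized as a small lookup. This is only a sketch: `init_flags` is a hypothetical helper introduced here for illustration, and only the three `init_*` keyword names come from the snippet above.

```python
def init_flags(mode="omni"):
    """Return the init_* keyword arguments for a loading mode.

    'omni' keeps every module, 'vision' drops audio and TTS, 'audio' drops vision.
    """
    flags = {"init_vision": True, "init_audio": True, "init_tts": True}
    if mode == "vision":
        flags["init_audio"] = False
        flags["init_tts"] = False
    elif mode == "audio":
        flags["init_vision"] = False
    elif mode != "omni":
        raise ValueError(f"unknown mode: {mode!r}")
    return flags


print(init_flags("vision"))
```

The returned dictionary can be unpacked as `**init_flags('vision')` in the `AutoModel.from_pretrained` call, alongside the other arguments shown above.
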
### Omni mode
We provide two inference modes: chat and streaming.

#### Chat inference
```python
import math
import numpy as np
from PIL import Image
# moviepy 1.x import path (moviepy 2.x dropped the moviepy.editor module)
from moviepy.editor import VideoFileClip
import tempfile
import librosa
import soundfile as sf

def get_video_chunk_content(video_path, flatten=True):
    """Split a video into per-second units of ["<unit>", frame image, 1 s of 16 kHz audio]."""
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)

    # extract the audio track to a temporary 16 kHz mono WAV file and load it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)

    # 1 frame + 1s audio chunk
    contents = []
    for i in range(num_units):
        frame = video.get_frame(i+1)
        image = Image.fromarray((frame).astype(np.uint8))
        audio = audio_np[sr*i:sr*(i+1)]
        if flatten:
            contents.extend(["<unit>", image, audio])
        else:
            contents.append(["<unit>", image, audio])

    return contents

video_path = "assets/Skiing.mp4"
# to use a voice-cloning prompt, set ref_audio
ref_audio_path = 'assets/demo.wav'
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
# or use the default prompt
# sys_msg = model.get_sys_prompt(mode='omni', language='en')

contents = get_video_chunk_content(video_path)
msg = {"role":"user", "content": contents}
msgs = [sys_msg, msg]

# set generate_audio=True and output_audio_path to save the TTS result
generate_audio = True
output_audio_path = 'output.wav'

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True, # set omni_input=True for omni inference
    use_tts_template=True,
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)

# You will get an answer like: The person in the picture is skiing down a snowy slope.
# import IPython
# IPython.display.Audio('output.wav')

```

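To make the chunking in `get_video_chunk_content` concrete, the sketch below reproduces only its slicing logic on plain Python lists, with strings standing in for frames and integers for audio samples. `chunk_units` is a hypothetical name used for illustration; no video is decoded and no model is involved.

```python
import math


def chunk_units(frames, samples, sample_rate, duration, flatten=True):
    """Pair one frame with one second of audio per unit.

    frames: one stand-in frame per unit; samples: flat list of audio samples.
    """
    num_units = math.ceil(duration)
    contents = []
    for i in range(num_units):
        audio = samples[sample_rate * i:sample_rate * (i + 1)]
        unit = ["<unit>", frames[i], audio]
        if flatten:
            contents.extend(unit)   # one flat list, as used for chat inference
        else:
            contents.append(unit)   # one list per unit, as used for streaming
    return contents


# 2.5 seconds of toy "audio" at 4 samples per second.
samples = list(range(10))
frames = ["f0", "f1", "f2"]
flat = chunk_units(frames, samples, sample_rate=4, duration=2.5)
nested = chunk_units(frames, samples, sample_rate=4, duration=2.5, flatten=False)
print(len(flat), len(nested))
print(nested[-1])
```

Because the duration is rounded up, the last unit carries a shorter audio slice whenever the clip does not end on a whole second.
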
#### Streaming inference
```python
# a new conversation must reset the session first; this resets the kv-cache
model.reset_session()

contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'
generate_audio = True

# 1. prefill system prompt
res = model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg],
    tokenizer=tokenizer
)

# 2. prefill video/audio chunks
for content in contents:
    msgs = [{"role":"user", "content": content}]
    res = model.streaming_prefill(
        session_id=session_id,
        msgs=msgs,
        tokenizer=tokenizer
    )

# 3. generate
res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=generate_audio
)

audios = []
text = ""

if generate_audio:
    # each streamed result carries an audio chunk and the matching text
    for r in res:
        audio_wav = r.audio_wav
        sampling_rate = r.sampling_rate
        txt = r.text

        audios.append(audio_wav)
        text += txt

    res = np.concatenate(audios)
    sf.write("output.wav", res, samplerate=sampling_rate)
    print("text:", text)
    print("audio saved to output.wav")
else:
    for r in res:
        text += r['text']
    print("text:", text)

```


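The collection loop above can be factored into a small helper. This is a sketch under stated assumptions: `StreamChunk` is a stand-in defined here to imitate the objects yielded by `streaming_generate` (with the `text`, `audio_wav` and `sampling_rate` fields accessed above), the sampling rate is a placeholder value, and audio is represented as plain lists rather than NumPy arrays.

```python
from dataclasses import dataclass, field


@dataclass
class StreamChunk:
    """Stand-in for one streamed result; not the model's real return type."""
    text: str
    audio_wav: list = field(default_factory=list)
    sampling_rate: int = 24000  # placeholder value for this demo only


def collect_stream(chunks):
    """Accumulate text and audio samples from an iterable of streamed chunks."""
    text_parts = []
    audio = []
    sampling_rate = None
    for chunk in chunks:
        text_parts.append(chunk.text)
        audio.extend(chunk.audio_wav)
        sampling_rate = chunk.sampling_rate
    return "".join(text_parts), audio, sampling_rate


def fake_stream():
    # Generator imitating incremental output from the model.
    yield StreamChunk("The person ", [0.1, 0.2])
    yield StreamChunk("is skiing.", [0.3])


text, audio, sr = collect_stream(fake_stream())
print(text, len(audio), sr)
```

Consuming the generator chunk by chunk is what allows playback to start before generation has finished; the helper simply gathers everything for saving afterwards.
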
#### Speech and Audio Mode

Model initialization

```python
import torch
import librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()
model.tts.float()
```

<hr/>

##### Mimick

The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original audio, the stronger the model's foundational capability in end-to-end speech modeling.

<details> <summary>Click here to demonstrate the capability of end-to-end audio understanding and generation. </summary>

```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output.wav', # save the tts result to output_audio_path
)
```

</details>

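The description above does not say how similarity between the original and reconstructed audio is measured. One common way to compare two fixed-length feature vectors (for example, embeddings produced by a separate speaker model) is cosine similarity. The sketch below is an assumption for illustration, not the evaluation used for this model, and the vectors are toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


original = [1.0, 2.0, 3.0]        # toy embedding of the input audio
reconstructed = [2.0, 4.0, 6.0]   # toy embedding of the generated audio
print(cosine_similarity(original, reconstructed))
```

Values near 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated vectors.
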
						<hr/> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						##### General Speech Conversation with Configurable Voices | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						A general usage scenario of MiniCPM-o 2.6 is role-playing a specific character based on the audio prompt. It will mimic the voice of the character to some extent and act like the character in text, including language style. In this mode, MiniCPM-o-2.6 will sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the voice of the character in an end-to-end manner. | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						```python | 
					
					
						
						| 
							 | 
						ref_audio, _ = librosa.load('./assets/voice_01.wav', sr=16000, mono=True) # load the reference audio | 
					
					
						
						| 
							 | 
						sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en') | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						# round one | 
					
					
						
						| 
							 | 
						user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} | 
					
					
						
						| 
							 | 
						msgs = [sys_prompt, user_question] | 
					
					
						
						| 
							 | 
						res = model.chat( | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer, | 
					
					
						
						| 
							 | 
						    sampling=True, | 
					
					
						
						| 
							 | 
						    max_new_tokens=128, | 
					
					
						
						| 
							 | 
						    use_tts_template=True, | 
					
					
						
						| 
							 | 
						    generate_audio=True, | 
					
					
						
						| 
							 | 
						    temperature=0.3, | 
					
					
						
						| 
							 | 
						    output_audio_path='result.wav', | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						# round two | 
					
					
						
						| 
							 | 
						msgs.append({'role': 'assistant', 'content': res})  # list.append mutates in place and returns None | 
					
					
						
						| 
							 | 
						user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} | 
					
					
						
						| 
							 | 
						msgs.append(user_question) | 
					
					
						
						| 
							 | 
						res = model.chat( | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer, | 
					
					
						
						| 
							 | 
						    sampling=True, | 
					
					
						
						| 
							 | 
						    max_new_tokens=128, | 
					
					
						
						| 
							 | 
						    use_tts_template=True, | 
					
					
						
						| 
							 | 
						    generate_audio=True, | 
					
					
						
						| 
							 | 
						    temperature=0.3, | 
					
					
						
						| 
							 | 
						    output_audio_path='result_round_2.wav', | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						print(res) | 
					
					
						
						| 
							 | 
						``` | 
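
For longer conversations, the history bookkeeping can be wrapped in a small helper. The sketch below is illustrative only: `extend_history` is a hypothetical name, not part of the MiniCPM-o API, and it assumes the reply returned by `model.chat` can be stored directly as the assistant content, as in the example above. Note that `list.append` mutates the list in place and returns `None`, so its return value should not be assigned to a variable.

```python
def extend_history(msgs, assistant_reply, next_user_content):
    """Append the assistant reply and the next user turn to msgs in place.

    Hypothetical helper, not part of the MiniCPM-o API.
    """
    msgs.append({'role': 'assistant', 'content': assistant_reply})
    msgs.append({'role': 'user', 'content': [next_user_content]})
    return msgs


# Placeholder strings stand in for the system prompt, audio arrays and model reply.
msgs = [
    {'role': 'user', 'content': ['<system prompt>']},
    {'role': 'user', 'content': ['<first audio question>']},
]
msgs = extend_history(msgs, '<model reply>', '<second audio question>')
print([m['role'] for m in msgs])
```

The returned list can then be passed as `msgs=msgs` to the next `model.chat` call.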
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<hr/> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						##### Speech Conversation as an AI Assistant | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						MiniCPM-o 2.6 can also act as an AI assistant, although only a limited choice of voices is available. In this mode, MiniCPM-o 2.6 sounds **less human-like and more like a voice assistant**, but it follows instructions more closely. | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						```python | 
					
					
						
						| 
							 | 
						sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')  | 
					
					
						
						| 
							 | 
						user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						# round one | 
					
					
						
						| 
							 | 
						msgs = [sys_prompt, user_question] | 
					
					
						
						| 
							 | 
						res = model.chat( | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer, | 
					
					
						
						| 
							 | 
						    sampling=True, | 
					
					
						
						| 
							 | 
						    max_new_tokens=128, | 
					
					
						
						| 
							 | 
						    use_tts_template=True, | 
					
					
						
						| 
							 | 
						    generate_audio=True, | 
					
					
						
						| 
							 | 
						    temperature=0.3, | 
					
					
						
						| 
							 | 
						    output_audio_path='result.wav', | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						# round two | 
					
					
						
						| 
							 | 
						msgs.append({'role': 'assistant', 'content': res})  # list.append mutates in place and returns None | 
					
					
						
						| 
							 | 
						user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} | 
					
					
						
						| 
							 | 
						msgs.append(user_question) | 
					
					
						
						| 
							 | 
						res = model.chat( | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer, | 
					
					
						
						| 
							 | 
						    sampling=True, | 
					
					
						
						| 
							 | 
						    max_new_tokens=128, | 
					
					
						
						| 
							 | 
						    use_tts_template=True, | 
					
					
						
						| 
							 | 
						    generate_audio=True, | 
					
					
						
						| 
							 | 
						    temperature=0.3, | 
					
					
						
						| 
							 | 
						    output_audio_path='result_round_2.wav', | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						print(res) | 
					
					
						
						| 
							 | 
						``` | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<hr/> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						##### Instruction-to-Speech | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						MiniCPM-o 2.6 also supports Instruction-to-Speech, also known as **Voice Creation**: describe a voice in detail, and the model generates speech in a voice that matches the description. For more sample instructions, see https://voxinstruct.github.io/VoxInstruct/. | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<details> | 
					
					
						
						| 
							 | 
						<summary> Click to view Python code running MiniCPM-o 2.6 with Instruction-to-Speech. </summary> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						```python | 
					
					
						
						| 
							 | 
						instruction = 'Delighting in a surprised tone, an adult male with low pitch and low volume comments:"One even gave my little dog a biscuit" This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context. ' | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
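						instruction = '在新闻中,一个年轻男性兴致勃勃地说:“祝福亲爱的祖国母亲美丽富强!”他用低音调和低音量,慢慢地说出了这句话。'  # Chinese alternative, overrides the English instruction above: a young man on the news enthusiastically wishes the motherland beauty and prosperity, speaking slowly at low pitch and low volume | 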
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						msgs = [{'role': 'user', 'content': [instruction]}] | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						res = model.chat( | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer, | 
					
					
						
						| 
							 | 
						    sampling=True, | 
					
					
						
						| 
							 | 
						    max_new_tokens=128, | 
					
					
						
						| 
							 | 
						    use_tts_template=True, | 
					
					
						
						| 
							 | 
						    generate_audio=True, | 
					
					
						
						| 
							 | 
						    temperature=0.3, | 
					
					
						
						| 
							 | 
						    output_audio_path='result.wav', | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						``` | 
					
					
						
						| 
							 | 
						</details> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<hr/> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						##### Voice Cloning | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						MiniCPM-o 2.6 also supports zero-shot text-to-speech, also known as **Voice Cloning**. In this mode, the model acts like a TTS model. | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<details> | 
					
					
						
						| 
							 | 
						<summary> Click to show Python code running MiniCPM-o 2.6 with voice cloning. </summary> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						```python | 
					
					
						
						| 
							 | 
						sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en') | 
					
					
						
						| 
							 | 
						text_prompt = "Please read the text below." | 
					
					
						
						| 
							 | 
						user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]} | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						msgs = [sys_prompt, user_question] | 
					
					
						
						| 
							 | 
						res = model.chat( | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer, | 
					
					
						
						| 
							 | 
						    sampling=True, | 
					
					
						
						| 
							 | 
						    max_new_tokens=128, | 
					
					
						
						| 
							 | 
						    use_tts_template=True, | 
					
					
						
						| 
							 | 
						    generate_audio=True, | 
					
					
						
						| 
							 | 
						    temperature=0.3, | 
					
					
						
						| 
							 | 
						    output_audio_path='result.wav', | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						``` | 
					
					
						
						| 
							 | 
						</details> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						<hr/> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						##### Addressing Various Audio Understanding Tasks | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						MiniCPM-o-2.6 can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging. | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						For audio-to-text tasks, you can use the following prompts: | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
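						- ASR with ZH (same as AST en2zh): `请仔细听这段音频片段,并将其内容逐字记录。` (Chinese for "Please listen to this audio clip carefully and transcribe its content word for word.") | 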
					
					
						
						| 
							 | 
						- ASR with EN (same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.` | 
					
					
						
						| 
							 | 
						- Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.` | 
					
					
						
						| 
							 | 
						- General Audio Caption: `Summarize the main content of the audio.` | 
					
					
						
						| 
							 | 
						- General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.` | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						```python | 
					
					
						
						| 
							 | 
						task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n" # can change to other prompts. | 
					
					
						
						| 
							 | 
						audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True) | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}] | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						res = model.chat( | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer, | 
					
					
						
						| 
							 | 
						    sampling=True, | 
					
					
						
						| 
							 | 
						    max_new_tokens=128, | 
					
					
						
						| 
							 | 
						    use_tts_template=True, | 
					
					
						
						| 
							 | 
						    generate_audio=True, | 
					
					
						
						| 
							 | 
						    temperature=0.3, | 
					
					
						
						| 
							 | 
						    output_audio_path='result.wav', | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						print(res) | 
					
					
						
						| 
							 | 
						``` | 
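
To switch between tasks without retyping prompts, the prompts listed above can be kept in a lookup table. This is a minimal sketch: the dictionary keys and the `build_audio_msgs` helper are hypothetical names chosen for illustration, and a plain list of zeros stands in for the waveform that `librosa.load` would return.

```python
# Prompts copied from the list above; the dictionary keys are illustrative names.
AUDIO_TASK_PROMPTS = {
    'asr_zh': '请仔细听这段音频片段,并将其内容逐字记录。',
    'asr_en': 'Please listen to the audio snippet carefully and transcribe the content.',
    'speaker_analysis': "Based on the speaker's content, speculate on their gender, condition, age range, and health status.",
    'audio_caption': 'Summarize the main content of the audio.',
    'scene_tagging': "Utilize one keyword to convey the audio's content or the associated scene.",
}


def build_audio_msgs(task, audio_input):
    """Return a single-turn msgs list for the given task (hypothetical helper)."""
    task_prompt = AUDIO_TASK_PROMPTS[task] + '\n'
    return [{'role': 'user', 'content': [task_prompt, audio_input]}]


# A list of zeros stands in for one second of 16 kHz audio.
msgs = build_audio_msgs('audio_caption', [0.0] * 16000)
print(msgs[0]['content'][0])
```

The resulting `msgs` has the same shape as in the example above and can be passed to `model.chat` unchanged.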
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						### Vision-Only mode | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						`MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6`. | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						#### Chat with single image | 
					
					
						
						| 
							 | 
						```python | 
					
					
						
						| 
							 | 
						# test.py | 
					
					
						
						| 
							 | 
						image = Image.open('xx.jpg').convert('RGB') | 
					
					
						
						| 
							 | 
						question = 'What is in the image?' | 
					
					
						
						| 
							 | 
						msgs = [{'role': 'user', 'content': [image, question]}] | 
					
					
						
						| 
							 | 
						res = model.chat( | 
					
					
						
						| 
							 | 
						    image=None, | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						print(res) | 
					
					
						
						| 
							 | 
						 | 
					
					
						
						| 
							 | 
						# To stream the output, set sampling=True and stream=True; | 
					
					
						
						| 
							 | 
						# model.chat then returns a generator. | 
					
					
						
						| 
							 | 
						res = model.chat( | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer, | 
					
					
						
						| 
							 | 
						    sampling=True, | 
					
					
						
						| 
							 | 
						    stream=True | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						generated_text = "" | 
					
					
						
						| 
							 | 
						for new_text in res: | 
					
					
						
						| 
							 | 
						    generated_text += new_text | 
					
					
						
						| 
							 | 
						    print(new_text, flush=True, end='') | 
					
					
						
						| 
							 | 
						``` | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						#### Chat with multiple images | 
					
					
						
						| 
							 | 
						<details> | 
					
					
						
						| 
							 | 
						<summary> Click to show Python code running MiniCPM-o 2.6 with multiple images input. </summary> | 
					
					
						
						| 
							 | 
						   | 
					
					
						
						| 
							 | 
						```python | 
					
					
						
						| 
							 | 
						image1 = Image.open('image1.jpg').convert('RGB') | 
					
					
						
						| 
							 | 
						image2 = Image.open('image2.jpg').convert('RGB') | 
					
					
						
						| 
							 | 
						question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.' | 
					
					
						
						| 
							 | 
						msgs = [{'role': 'user', 'content': [image1, image2, question]}] | 
					
					
						
						| 
							 | 
						answer = model.chat( | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						print(answer) | 
					
					
						
						| 
							 | 
						``` | 
					
					
						
						| 
							 | 
						</details> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						#### In-context few-shot learning | 
					
					
						
						| 
							 | 
						<details> | 
					
					
						
						| 
							 | 
						<summary> Click to view Python code running MiniCPM-o 2.6 with few-shot input. </summary> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						```python | 
					
					
						
						| 
							 | 
						question = "production date"  | 
					
					
						
						| 
							 | 
						image1 = Image.open('example1.jpg').convert('RGB') | 
					
					
						
						| 
							 | 
						answer1 = "2023.08.04" | 
					
					
						
						| 
							 | 
						image2 = Image.open('example2.jpg').convert('RGB') | 
					
					
						
						| 
							 | 
						answer2 = "2007.04.24" | 
					
					
						
						| 
							 | 
						image_test = Image.open('test.jpg').convert('RGB') | 
					
					
						
						| 
							 | 
						msgs = [ | 
					
					
						
						| 
							 | 
						    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]}, | 
					
					
						
						| 
							 | 
						    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]}, | 
					
					
						
						| 
							 | 
						    {'role': 'user', 'content': [image_test, question]} | 
					
					
						
						| 
							 | 
						] | 
					
					
						
						| 
							 | 
						answer = model.chat( | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						print(answer) | 
					
					
						
						| 
							 | 
						``` | 
					
					
						
						| 
							 | 
						</details> | 
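
The few-shot message layout above alternates user and assistant turns and ends with the unanswered query. A sketch of a builder for that layout follows; `build_few_shot_msgs` is a hypothetical helper, not part of the MiniCPM-o API, and strings stand in for PIL images so that the snippet runs without image files.

```python
def build_few_shot_msgs(examples, query_image, question):
    """Build a few-shot msgs list from (image, answer) pairs (hypothetical helper)."""
    msgs = []
    for image, answer in examples:
        msgs.append({'role': 'user', 'content': [image, question]})
        msgs.append({'role': 'assistant', 'content': [answer]})
    # The final user turn carries the image to be answered.
    msgs.append({'role': 'user', 'content': [query_image, question]})
    return msgs


# Strings stand in for PIL images here.
msgs = build_few_shot_msgs(
    examples=[('<image1>', '2023.08.04'), ('<image2>', '2007.04.24')],
    query_image='<image_test>',
    question='production date',
)
print(len(msgs))  # 5
```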
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						#### Chat with video | 
					
					
						
						| 
							 | 
						<details> | 
					
					
						
						| 
							 | 
						<summary> Click to view Python code running MiniCPM-o 2.6 with video input. </summary> | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						```python | 
					
					
						
						| 
							 | 
						from decord import VideoReader, cpu  # pip install decord | 
						MAX_NUM_FRAMES = 64  # use a smaller number if CUDA runs out of memory | 
					
					
						
						| 
							 | 
						def encode_video(video_path): | 
					
					
						
						| 
							 | 
						    def uniform_sample(l, n): | 
					
					
						
						| 
							 | 
						        gap = len(l) / n | 
					
					
						
						| 
							 | 
						        idxs = [int(i * gap + gap / 2) for i in range(n)] | 
					
					
						
						| 
							 | 
						        return [l[i] for i in idxs] | 
					
					
						
						| 
							 | 
						    vr = VideoReader(video_path, ctx=cpu(0)) | 
					
					
						
						| 
							 | 
						    sample_fps = round(vr.get_avg_fps() / 1)  # FPS | 
					
					
						
						| 
							 | 
						    frame_idx = [i for i in range(0, len(vr), sample_fps)] | 
					
					
						
						| 
							 | 
						    if len(frame_idx) > MAX_NUM_FRAMES: | 
					
					
						
						| 
							 | 
						        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES) | 
					
					
						
						| 
							 | 
						    frames = vr.get_batch(frame_idx).asnumpy() | 
					
					
						
						| 
							 | 
						    frames = [Image.fromarray(v.astype('uint8')) for v in frames] | 
					
					
						
						| 
							 | 
						    print('num frames:', len(frames)) | 
					
					
						
						| 
							 | 
						    return frames | 
					
					
						
						| 
							 | 
						video_path = "video_test.mp4" | 
					
					
						
						| 
							 | 
						frames = encode_video(video_path) | 
					
					
						
						| 
							 | 
						question = "Describe the video" | 
					
					
						
						| 
							 | 
						msgs = [ | 
					
					
						
						| 
							 | 
						    {'role': 'user', 'content': frames + [question]},  | 
					
					
						
						| 
							 | 
						] | 
					
					
						
						| 
							 | 
						# Set decode params for video | 
					
					
						
						| 
							 | 
						params={} | 
					
					
						
						| 
							 | 
						params["use_image_id"] = False | 
					
					
						
						| 
							 | 
						params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution >  448*448 | 
					
					
						
						| 
							 | 
						answer = model.chat( | 
					
					
						
						| 
							 | 
						    msgs=msgs, | 
					
					
						
						| 
							 | 
						    tokenizer=tokenizer, | 
					
					
						
						| 
							 | 
						    **params | 
					
					
						
						| 
							 | 
						) | 
					
					
						
						| 
							 | 
						print(answer) | 
					
					
						
						| 
							 | 
						``` | 
					
					
						
						| 
							 | 
						</details> | 
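
To see which frames `encode_video` keeps, the index selection can be reproduced without opening a video. The sketch below mirrors the logic above: take one frame per second, then sample uniformly if the count exceeds the cap. `select_frame_indices` is a hypothetical name introduced for illustration, and the frame counts and frame rates are made-up inputs.

```python
def select_frame_indices(num_frames, avg_fps, max_num_frames=64):
    """Mirror the index selection in encode_video without reading a video."""
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    step = round(avg_fps)  # one frame per second of video
    frame_idx = list(range(0, num_frames, step))
    if len(frame_idx) > max_num_frames:
        frame_idx = uniform_sample(frame_idx, max_num_frames)
    return frame_idx


# A 10-second clip at 30 fps stays below the cap: one frame per second.
print(select_frame_indices(300, 30.0))
# A 100-second clip at 30 fps, capped to 4 frames.
print(select_frame_indices(3000, 30.0, max_num_frames=4))  # [360, 1110, 1860, 2610]
```

Lowering the cap reduces the number of images passed to `model.chat`, which is why a smaller `MAX_NUM_FRAMES` helps when GPU memory runs out.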
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						See [GitHub](https://github.com/OpenBMB/MiniCPM-o) for more usage details. | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						## Inference with llama.cpp<a id="llamacpp"></a> | 
					
					
						
						| 
							 | 
						MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-omni) and its [readme](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) for more details. | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						## Int4 quantized version | 
					
					
						
						| 
							 | 
						Download the int4 quantized version for lower GPU memory usage (7GB): [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4). | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						## License | 
					
					
						
						| 
							 | 
						#### Model License | 
					
					
						
						| 
							 | 
						* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.  | 
					
					
						
						| 
							 | 
						* The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md). | 
					
					
						
						| 
							 | 
						* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use. | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						#### Statement | 
					
					
						
						| 
							 | 
						* As an LMM, MiniCPM-o 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers. | 
					
					
						
						| 
							 | 
						* We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model. | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						## Key Techniques and Other Multimodal Projects | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						👏 You are welcome to explore the key techniques of MiniCPM-o 2.6 and our team's other multimodal projects: | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD)  | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						## Citation | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️! | 
					
					
						
						| 
							 | 
						
 | 
					
					
						
						| 
							 | 
						```bib | 
					
					
						
						| 
							 | 
						@article{yao2024minicpm, | 
					
					
						
						| 
							 | 
						  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone}, | 
					
					
						
						| 
							 | 
						  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others}, | 
					
					
						
						| 
							 | 
						  journal={arXiv preprint arXiv:2408.01800}, | 
					
					
						
						| 
							 | 
						  year={2024} | 
					
					
						
						| 
							 | 
						} | 
					
					
						
						| 
							 | 
						``` | 
					
					
						
						| 
							 | 
						
 |