---
pipeline_tag: visual-question-answering
language:
- en
- zh
datasets:
- HaoyeZhang/RLHF-V-Dataset
- Yirany/UniMM-Chat
---

[GitHub](https://github.com/OpenBMB/MiniCPM-V) | [Demo](http://120.92.209.146:8889/)

## MiniCPM-Llama3-V 2.5

**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. Built on SigLip-400M and Llama3-8B-Instruct, it has 8B parameters in total and delivers a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

- 🔥 **Leading Performance.** MiniCPM-Llama3-V 2.5 achieves an average score of 65.0 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models such as GPT-4V-1106, Gemini Pro and Claude 3**, greatly outperforming other multimodal large models built on Llama 3.
- 💪 **Strong OCR Capabilities.** MiniCPM-Llama3-V 2.5 can process images of any aspect ratio up to 1.8 million pixels, achieving a **700+ score on OCRBench and surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, MiniCPM-Llama3-V 2.5 now offers enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex-reasoning abilities, improving the multimodal interaction experience.
- 🏆 **Trustworthy Behavior.** Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits trustworthy multimodal behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), the best level within the open-source community.
- 🌏 **Multilingual Support.** Thanks to Llama 3's robust multilingual capabilities and VisCPM's cross-lingual generalization technology, MiniCPM-Llama3-V 2.5 extends its foundational bilingual (Chinese-English) multimodal capabilities to support **30+ languages, including German, French, Spanish, Italian, Russian, etc.** We achieve this extension through minimal instruction tuning on translated multimodal data. [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).
- 🚀 **Efficient Deployment.** MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations** as acceleration techniques, achieving high-efficiency deployment on edge devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 achieves a **150-fold speedup in edge-side multimodal image encoding** and a **3-fold increase in language decoding speed**.

### Evaluation
Model | Size | OCRBench | TextVQA val | DocVQA test | OpenCompass | MME | MMB dev (en) | MMB dev (zh) | MMMU val | MathVista | LLaVA Bench | RealWorld QA | Object HalBench |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Proprietary | |||||||||||||
Gemini Pro | - | 680 | 74.6 | 88.1 | 63.8 | 2148.9 | 75.2 | 74.0 | 48.9 | 45.8 | 79.9 | 60.4 | - |
GPT-4V (2023.11.06) | - | 645 | 78.0 | 88.4 | 63.2 | 1771.5 | 75.1 | 75.0 | 53.8 | 47.8 | 93.1 | 63.0 | 86.4 |
Open-source | |||||||||||||
DeepSeek-VL-1.3B | 1.7B | 413 | 58.4* | 37.9* | 46.0 | 1531.6 | 64.0 | 61.2 | 33.8 | 29.4 | 51.1 | 49.7 | - |
Mini-Gemini | 2.2B | - | 56.2 | 34.2* | - | 1653.0 | 59.8 | - | 31.7 | - | - | - | - |
Yi-VL-6B | 6.7B | 290 | 45.5* | 17.1* | 49.3 | 1915.1 | 68.6 | 68.3 | 40.3 | 28.8 | 51.9 | 53.5 | - |
Qwen-VL-Chat | 9.6B | 488 | 61.5 | 62.6 | 52.1 | 1860.0 | 60.6 | 56.7 | 37.0 | 33.8 | 67.7 | 49.3 | 56.2 / 80.0 |
DeepSeek-VL-7B | 7.3B | 435 | 64.7* | 47.0* | 55.6 | 1765.4 | 74.1 | 72.8 | 38.3 | 36.8 | 77.8 | 54.2 | 91.5 / 95.3 |
Yi-VL-34B | 34B | 290 | 43.4* | 16.9* | 52.6 | 2050.2 | 71.1 | 71.4 | 45.1 | 30.7 | 62.3 | 54.8 | 79.3 / 86.0 |
CogVLM-Chat | 17.4B | 590 | 70.4 | 33.3* | 52.5 | 1736.6 | 63.7 | 53.8 | 37.3 | 34.7 | 73.9 | 60.3 | 73.6 / 87.4 |
TextMonkey | 9.7B | 558 | 64.3 | 66.7 | - | - | - | - | - | - | - | - | - |
IDEFICS2-8B | 8.0B | - | - | - | 57.2 | 1847.6 | 75.7 | 68.6 | 45.2 | 52.2 | 49.1 | 60.7 | - |
Bunny-Llama-3-8B | 8.4B | - | - | - | 54.3 | 1920.3 | 77.0 | 73.9 | 41.3 | 31.5 | 61.2 | 58.8 | - |
XTuner-Llama-3-8B-v1.1 | 8.4B | - | - | - | 53.3 | 1818.0 | 71.7 | 63.2 | 39.2 | 40.0 | 69.2 | - | - |
LLaVA-NeXT Llama-3-8B | 8.4B | - | - | 78.2 | - | 1971.5 | 72.1 | - | 41.7 | 37.5 | 80.1 | 60.0 | - |
MiniCPM-V 1.0 | 2.8B | 366 | 60.6 | 38.2 | 47.6 | 1650.2 | 67.9 | 65.3 | 38.3 | 28.9 | 51.3 | 51.2 | 78.4 / 88.5 |
MiniCPM-V 2.0 | 2.8B | 605 | 74.1 | 71.9 | 55.0 | 1808.6 | 69.6 | 68.1 | 38.2 | 38.7 | 69.2 | 55.8 | 85.5 / 92.2 |
MiniCPM-Llama3-V 2.5 | 8.5B | 725 | 76.6 | 84.8 | 65.0 | 2024.6 | 76.7 | 73.4 | 45.8 | 54.3 | 86.7 | 63.5 | 89.7 / 95.0 |
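
The OCR bullet above notes that the model accepts images of any aspect ratio up to 1.8 million pixels. As an illustration of what that budget means for pre-processing (a hypothetical helper sketch, not part of the model's official API), the following scales oversized images down while preserving the aspect ratio:

```python
import math

# Pixel budget from the model card: images up to 1.8 million pixels.
MAX_PIXELS = 1_800_000

def fit_within_budget(width: int, height: int,
                      max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return (new_width, new_height) with new_width * new_height <= max_pixels,
    keeping the original aspect ratio (hypothetical helper for illustration)."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    # Uniform scale factor so that the scaled area equals the budget.
    scale = math.sqrt(max_pixels / pixels)
    return max(1, int(width * scale)), max(1, int(height * scale))

# A 4000x3000 photo (12 MP) is scaled down; a 1280x960 one already fits.
print(fit_within_budget(4000, 3000))  # scaled to fit within 1.8 MP
print(fit_within_budget(1280, 960))   # (1280, 960)
```

Because the scale factor is applied uniformly to both dimensions, any aspect ratio is preserved, matching the "any aspect ratio" claim above.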