Commit 182d1fb (verified) by tc-mb · Parent(s): a4edfa5

Initial commit: MiniCPM-V-4_5 model

Files changed (1): README.md (+267 -3)

---
pipeline_tag: image-text-to-text
datasets:
- openbmb/RLAIF-V-Dataset
library_name: transformers
language:
- multilingual
tags:
- minicpm-v
- vision
- ocr
- multi-image
- video
- custom_code
---

<h1>A GPT-4o Level MLLM for Single Image, Multi Image and Video Understanding on Your Phone</h1>

[GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Demo](http://101.126.42.235:30910/)


## MiniCPM-V 4.5

**MiniCPM-V 4.5** is the latest and most capable model in the MiniCPM-V series. Built on Qwen3-8B and SigLIP2-400M with 8B parameters in total, it delivers a significant performance improvement over previous MiniCPM-V and MiniCPM-o models and introduces useful new features. Notable features of MiniCPM-V 4.5 include:

- 🔥 **State-of-the-art Vision-Language Capability.**
MiniCPM-V 4.5 achieves an average score of 77.2 on OpenCompass, a comprehensive evaluation covering 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-latest and Gemini-2.0 Pro, and strong open-source models such as Qwen2.5-VL 72B** in vision-language capability, making it the most performant MLLM under 30B parameters.

- 🎬 **Efficient High-Refresh-Rate and Long Video Understanding.** Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 achieves a 96x compression rate for video tokens: six 448x448 video frames are jointly compressed into 64 video tokens (versus the roughly 1,536 tokens most MLLMs would use). This means the model can perceive significantly more video frames without increasing LLM inference cost, bringing efficient, state-of-the-art high-refresh-rate (up to 10 FPS) and long video understanding on Video-MME, LVBench, MLVU, MotionBench, FavorBench, etc. (A rough sanity check of the compression arithmetic follows this list.)

- ⚙️ **Controllable Hybrid Fast/Deep Thinking.** MiniCPM-V 4.5 supports both fast thinking, for efficient everyday use with competitive performance, and deep thinking, for more complex problem solving. The two modes can be switched explicitly, covering the efficiency/performance trade-offs of different user scenarios.

- 💪 **Strong OCR, Document Parsing and More.**
Based on the [LLaVA-UHD](https://arxiv.org/pdf/2403.11703) architecture, MiniCPM-V 4.5 can process high-resolution images of any aspect ratio with up to 1.8 million pixels (e.g., 1344x1344), using 4x fewer visual tokens than most MLLMs. The model achieves **leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5**, and state-of-the-art PDF document parsing on OmniDocBench among general MLLMs. Building on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o-latest on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.

- 💫 **Easy Usage.**
MiniCPM-V 4.5 can be used in many ways: (1) [llama.cpp](https://github.com/tc-mb/llama.cpp/blob/Support-MiniCPM-V-4.5/docs/multimodal/minicpmv4.5.md) and [ollama](https://github.com/tc-mb/ollama/tree/MIniCPM-V) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-V-4_5-int4), [GGUF](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) and [AWQ](https://github.com/tc-mb/AutoAWQ) quantized models in 16 sizes, (3) [SGLang](https://github.com/tc-mb/sglang/tree/main) and [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [Transformers](https://github.com/tc-mb/transformers/tree/main) and [LLaMA-Factory](./docs/llamafactory_train_and_infer.md), (5) a quick [local WebUI demo](#chat-with-our-demo-on-gradio), (6) an optimized [local iOS app](https://github.com/tc-mb/MiniCPM-o-demo-iOS) for iPhone and iPad, and (7) an online web demo on our [server](http://101.126.42.235:30910/). See our [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook) for full usage guides!

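For intuition, here is a rough, back-of-the-envelope reading of the 96x figure above. Only the 6-frame, 64-token, and 1,536-token numbers come from the description; the per-frame patch count (a 14-pixel patch size for the vision encoder) is an illustrative assumption, not an official specification.

```python
# Illustrative sanity check of the quoted 96x video-token compression rate.
# Assumption (not official): each 448x448 frame is split into 14x14-pixel patches.
patches_per_frame = (448 // 14) ** 2  # 1024 patch features per frame (assumed)
frames_packed = 6                     # frames jointly compressed by the 3D-Resampler
video_tokens = 64                     # tokens produced for the packed frames

print(frames_packed * patches_per_frame / video_tokens)  # 96.0x compression
print(6 * 256)  # 1536: a typical MLLM spending ~256 tokens per frame on the same clip
```
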

### Evaluation

<div align="center">
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/radar_minicpm_v45.png" width="60%">
</div>
<div align="center">
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv_4_5_evaluation_results.jpg" width="100%">
</div>

### Examples

<div align="center">
  <a href="https://youtu.be/SCtimvC3Qfk"><img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/MiniCPM-V%204.5-8.26_img.jpeg" width="70%"></a>
</div>

<div style="display: flex; flex-direction: column; align-items: center;">
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case1.png" alt="en_case1" style="margin-bottom: 5px;">
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case2.png" alt="en_case2" style="margin-bottom: 5px;">
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/en_case3.jpeg" alt="en_case3" style="margin-bottom: 5px;">
</div>

We deployed MiniCPM-V 4.5 on an iPad M4 with our [iOS demo](https://github.com/tc-mb/MiniCPM-o-demo-iOS). The demo video is a raw screen recording without editing.

<div align="center">
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_en_handwriting.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_en_cot.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
</div>

<div align="center">
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_cn_handwriting.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
  <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4_5/v45_cn_travel.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
</div>


## Usage

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

torch.manual_seed(100)

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,  # or openbmb/MiniCPM-o-2_6
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)  # or openbmb/MiniCPM-o-2_6

image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')

enable_thinking = False  # If `enable_thinking=True`, the long-thinking mode is enabled.

# First round chat
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=enable_thinking
)
print(answer)

# Second round chat: pass the history of the multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```

You will get the following output:

```shell
# round1
The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here. These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion.

This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising dramatically above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views.

# round2
When traveling to a karst landscape like this, here are some important tips:

1. Wear comfortable shoes: The terrain can be uneven and hilly.
2. Bring water and snacks for energy during hikes or boat rides.
3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots.
4. Respect local customs and nature regulations by not littering or disturbing wildlife.

By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.
```
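
The hybrid thinking mode described above is toggled per call. Below is a minimal sketch of the deep ("long-thinking") mode, reusing the `model`, `tokenizer`, and `image` objects from the example above; the question is only an illustrative placeholder.

```python
# Deep / long-thinking mode: same chat interface, with enable_thinking=True.
# Reuses `model`, `tokenizer` and `image` from the example above; the question is a placeholder.
question = "Which province is this landscape most likely located in, and why?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=True  # switch to the long-thinking (deep reasoning) mode
)
print(answer)
```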


#### Chat with Video

The following Python code runs MiniCPM-V 4.5 with video input, using the 3D-Resampler to pack frames.

```python
# The 3D-Resampler compresses multiple frames into 64 tokens by introducing temporal_ids.
# To use it, organize your video data into two corresponding sequences:
#   frames: List[Image]
#   temporal_ids: List[List[int]]

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord
from scipy.spatial import cKDTree
import numpy as np
import math

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,  # or openbmb/MiniCPM-o-2_6
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True)  # or openbmb/MiniCPM-o-2_6

MAX_NUM_FRAMES = 180  # maximum number of frames after packing; the actual maximum number of valid frames is MAX_NUM_FRAMES * MAX_NUM_PACKING
MAX_NUM_PACKING = 3   # maximum packing number of video frames; valid range: 1-6
TIME_SCALE = 0.1


def map_to_nearest_scale(values, scale):
    # Snap each timestamp to the nearest point on the discrete time scale.
    tree = cKDTree(np.asarray(scale)[:, None])
    _, indices = tree.query(np.asarray(values)[:, None])
    return np.asarray(scale)[indices]


def group_array(arr, size):
    # Split `arr` into consecutive groups of `size` elements (the packing groups).
    return [arr[i:i + size] for i in range(0, len(arr), size)]


def encode_video(video_path, choose_fps=3, force_packing=None):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    fps = vr.get_avg_fps()
    video_duration = len(vr) / fps

    if choose_fps * int(video_duration) <= MAX_NUM_FRAMES:
        packing_nums = 1
        choose_frames = round(min(choose_fps, round(fps)) * min(MAX_NUM_FRAMES, video_duration))
    else:
        packing_nums = math.ceil(video_duration * choose_fps / MAX_NUM_FRAMES)
        if packing_nums <= MAX_NUM_PACKING:
            choose_frames = round(video_duration * choose_fps)
        else:
            choose_frames = round(MAX_NUM_FRAMES * MAX_NUM_PACKING)
            packing_nums = MAX_NUM_PACKING

    frame_idx = [i for i in range(0, len(vr))]
    frame_idx = np.array(uniform_sample(frame_idx, choose_frames))

    if force_packing:
        packing_nums = min(force_packing, MAX_NUM_PACKING)

    print(video_path, ' duration:', video_duration)
    print(f'get video frames={len(frame_idx)}, packing_nums={packing_nums}')

    frames = vr.get_batch(frame_idx).asnumpy()

    # Map each sampled frame to a discrete timestamp id on the TIME_SCALE grid.
    frame_idx_ts = frame_idx / fps
    scale = np.arange(0, video_duration, TIME_SCALE)

    frame_ts_id = map_to_nearest_scale(frame_idx_ts, scale) / TIME_SCALE
    frame_ts_id = frame_ts_id.astype(np.int32)

    assert len(frames) == len(frame_ts_id)

    frames = [Image.fromarray(v.astype('uint8')).convert('RGB') for v in frames]
    frame_ts_id_group = group_array(frame_ts_id, packing_nums)

    return frames, frame_ts_id_group


video_path = "video_test.mp4"
fps = 5  # sampling fps for the video
force_packing = None  # set force_packing to force 3D packing; otherwise encode_video chooses the packing number from the video duration
frames, frame_ts_id_group = encode_video(video_path, fps, force_packing=force_packing)

question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,
    max_slice_nums=1,
    temporal_ids=frame_ts_id_group
)
print(answer)
```

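#### Chat with Multiple Images

Multi-image input uses the same `model.chat` interface as above: all images of a user turn go into the `content` list, just as the video frames do. This is a minimal sketch reusing `model` and `tokenizer` from the previous examples; the image paths are placeholders.

```python
# Multi-image chat: several PIL images in one user turn (paths are placeholders).
# Reuses `model` and `tokenizer` from the examples above.
from PIL import Image

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, and describe their differences.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
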

## License
#### Model License
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-V series model weights must strictly follow the [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM-o/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-V 4.5 weights are also available for free commercial use.


#### Statement
* As an LMM, MiniCPM-V 4.5 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgments. Anything generated by MiniCPM-V 4.5 does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misapplication of the model.

## Key Techniques and Other Multimodal Projects

👏 Welcome to explore the key techniques behind MiniCPM-V 4.5 and other multimodal projects from our team:

[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)

## Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!

```bib
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={Nature Communications},
  volume={16},
  pages={5509},
  year={2025}
}
```