BigDong funmaker committed on
Commit
0231958
·
verified ·
1 Parent(s): cd838a2

Update README.md (#4)

- Update README.md (322a25f1e9c88c6560e9a52c62d6e5ddc67e9660)


Co-authored-by: funcy <[email protected]>

Files changed (1)
  1. README.md +310 -277
README.md CHANGED
@@ -1,277 +1,310 @@
1
- ---
2
- license: apache-2.0
3
- language:
4
- - zh
5
- - en
6
- pipeline_tag: text-generation
7
- library_name: transformers
8
- ---
9
- <div align="center">
10
- <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
11
- </div>
12
-
13
- <p align="center">
14
- <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
15
- <a href="https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf" target="_blank">Technical Report</a>
16
- </p>
17
- <p align="center">
18
- 👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
19
- </p>
20
-
21
- ## What's New
22
- - [2025.06.06] The **MiniCPM4** series is released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf).🔥🔥🔥
23
-
24
- ## MiniCPM4 Series
25
- The MiniCPM4 series consists of highly efficient large language models (LLMs) designed explicitly for end-side devices. It achieves this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
26
- - [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens. (**<-- you are here**)
27
- - [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens.
28
- - [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
29
- - [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Eagle head trained with QAT for FRSpec, efficiently integrating speculation and quantization to achieve ultra acceleration for MiniCPM4-8B.
30
- - [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
31
- - [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head for vLLM format, accelerating speculative inference for MiniCPM4-8B.
32
- - [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Extreme ternary quantization applied to MiniCPM4-0.5B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
33
- - [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Extreme ternary quantization applied to MiniCPM3-1B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
34
- - [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, accepts users' queries as input and autonomously generates trustworthy, long-form survey papers.
35
- - [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, accepts users' queries and available MCP tools as input and autonomously calls relevant MCP tools to satisfy users' requirements.
36
-
37
- ## Introduction
38
- MiniCPM 4 is an extremely efficient edge-side large model that has undergone efficient optimization across four dimensions: model architecture, learning algorithms, training data, and inference systems, achieving ultimate efficiency improvements.
39
-
40
- - 🏗️ **Efficient Model Architecture:**
41
- - InfLLM v2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention mechanism architecture where each token only needs to compute relevance with less than 5% of tokens in 128K long text processing, significantly reducing computational overhead for long texts
42
-
43
- - 🧠 **Efficient Learning Algorithms:**
44
- - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for performance of downstream tasks, enabling more precise model training configuration search
45
- - BitCPM -- Ultimate Ternary Quantization: Compresses model parameters into ternary values (three possible values per weight), achieving a 90% reduction in bit width
46
- - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing technology combined with Multi-token Prediction training strategy
47
-
48
- - 📚 **High-Quality Training Data:**
49
- - UltraClean -- High-quality Pre-training Data Filtering and Generation: Builds iterative data cleaning strategies based on efficient data verification, open-sourcing the high-quality Chinese and English pre-training dataset [Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
50
- - UltraChat v2 -- High-quality Supervised Fine-tuning Data Generation: Constructs large-scale high-quality supervised fine-tuning datasets covering multiple dimensions including knowledge-intensive data, reasoning-intensive data, instruction-following data, long text understanding data, and tool calling data
51
-
52
- - ⚡ **Efficient Inference System:**
53
- - CPM.cu -- Lightweight and Efficient CUDA Inference Framework: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding
54
- - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities
55
-
56
- ## Usage
57
-
58
- ### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
59
-
60
- We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for MiniCPM4 inference. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates efficient sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4.
61
-
62
- You can install CPM.cu by running the following command:
63
-
64
- ```bash
65
- git clone https://github.com/OpenBMB/cpm.cu.git --recursive
66
- cd cpm.cu
67
- python3 setup.py install
68
- ```
69
-
70
- MiniCPM4 natively supports context lengths of up to 32,768 tokens. To reproduce the long-text acceleration effect in the paper, we recommend using the validated LongRoPE factors. Change the `rope_scaling` field in `config.json` as follows to enable LongRoPE.
71
- ```json
72
- {
73
- ...,
74
- "rope_scaling": {
75
- "rope_type": "longrope",
76
- "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
77
- "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
78
- "original_max_position_embeddings": 32768
79
- }
80
- }
81
- ```
82
-
83
- After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from Hugging Face):
84
- ```bash
85
- python3 tests/test_generate.py
86
- ```
87
-
88
- For more details about CPM.cu, please refer to [the repo CPM.cu](https://github.com/OpenBMB/cpm.cu).
89
-
90
- ### Inference with Transformers
91
- ```python
92
- from transformers import AutoModelForCausalLM, AutoTokenizer
93
- import torch
94
- torch.manual_seed(0)
95
-
96
- path = 'openbmb/MiniCPM4-8B'
97
- device = "cuda"
98
- tokenizer = AutoTokenizer.from_pretrained(path)
99
- model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
100
-
101
- # User can directly use the chat interface
102
- # responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
103
- # print(responds)
104
-
105
- # User can also use the generate interface
106
- messages = [
107
- {"role": "user", "content": "Write an article about Artificial Intelligence."},
108
- ]
109
- model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)
110
-
111
- model_outputs = model.generate(
112
- model_inputs,
113
- max_new_tokens=1024,
114
- top_p=0.7,
115
- temperature=0.7
116
- )
117
- output_token_ids = [
118
- model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
119
- ]
120
-
121
- responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
122
- print(responses)
123
- ```
124
-
125
- MiniCPM4-8B supports `InfLLM v2`, a sparse attention mechanism designed for efficient long-sequence inference. It requires the [infllmv2_cuda_impl](https://github.com/OpenBMB/infllmv2_cuda_impl) library.
126
-
127
- You can install it by running the following command:
128
- ```bash
129
- git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
130
- cd infllmv2_cuda_impl
131
- git submodule update --init --recursive
132
- pip install -e . # or python setup.py install
133
- ```
134
-
135
- To enable InfLLM v2, you need to add the `sparse_config` field in `config.json`:
136
- ```json
137
- {
138
- ...,
139
- "sparse_config": {
140
- "kernel_size": 32,
141
- "kernel_stride": 16,
142
- "init_blocks": 1,
143
- "block_size": 64,
144
- "window_size": 2048,
145
- "topk": 64,
146
- "use_nope": false,
147
- "dense_len": 8192
148
- }
149
- }
150
- ```
151
-
152
- These parameters control the behavior of InfLLM v2:
153
- * `kernel_size` (default: 32): The size of semantic kernels.
154
- * `kernel_stride` (default: 16): The stride between adjacent kernels.
155
- * `init_blocks` (default: 1): The number of initial blocks that every query token attends to. This ensures attention to the beginning of the sequence.
156
- * `block_size` (default: 64): The block size for key-value blocks.
157
- * `window_size` (default: 2048): The size of the local sliding window.
158
- * `topk` (default: 64): Specifies that each token computes attention with only the top-k most relevant key-value blocks.
159
- * `use_nope` (default: false): Whether to use the NOPE technique in block selection for improved performance.
160
- * `dense_len` (default: 8192): Since Sparse Attention offers limited benefits for short sequences, the model can use standard (dense) attention for shorter texts. The model will use dense attention for sequences with a token length below `dense_len` and switch to sparse attention for sequences exceeding this length. Set this to `-1` to always use sparse attention regardless of sequence length.
161
-
162
- MiniCPM4 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques for effective handling of long texts. We have validated the model's performance on context lengths of up to 131,072 tokens by modifying the LongRoPE factor.
163
-
164
- You can apply the LongRoPE factor modification by modifying the model files. Specifically, in the `config.json` file, adjust the `rope_scaling` fields.
165
- ```json
166
- {
167
- ...,
168
- "rope_scaling": {
169
- "rope_type": "longrope",
170
- "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
171
- "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
172
- "original_max_position_embeddings": 32768
173
- }
174
- }
175
- ```
176
-
177
- ### Inference with [SGLang](https://github.com/sgl-project/sglang)
178
-
179
- For now, you need to install our forked version of SGLang.
180
- ```bash
181
- git clone -b openbmb https://github.com/OpenBMB/sglang.git
182
- cd sglang
183
-
184
- pip install --upgrade pip
185
- pip install -e "python[all]"
186
- ```
187
-
188
- You can start the inference server by running the following command:
189
- ```bash
190
- python -m sglang.launch_server --model openbmb/MiniCPM4-8B --trust-remote-code --port 30000 --chat-template chatml
191
- ```
192
-
193
- Then you can use the chat interface by running the following command:
194
- ```python
195
- import openai
196
-
197
- client = openai.Client(base_url=f"http://localhost:30000/v1", api_key="None")
198
-
199
- response = client.chat.completions.create(
200
- model="openbmb/MiniCPM4-8B",
201
- messages=[
202
- {"role": "user", "content": "Write an article about Artificial Intelligence."},
203
- ],
204
- temperature=0.7,
205
- max_tokens=1024,
206
- )
207
-
208
- print(response.choices[0].message.content)
209
- ```
210
-
211
- ### Inference with [vLLM](https://github.com/vllm-project/vllm)
212
- For now, you need to install the latest version of vLLM.
213
- ```
214
- pip install -U vllm \
215
- --pre \
216
- --extra-index-url https://wheels.vllm.ai/nightly
217
- ```
218
-
219
- Then you can run inference on MiniCPM4-8B with vLLM:
220
- ```python
221
- from transformers import AutoTokenizer
222
- from vllm import LLM, SamplingParams
223
-
224
- model_name = "openbmb/MiniCPM4-8B"
225
- prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]
226
-
227
- tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
228
- input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
229
-
230
- llm = LLM(
231
- model=model_name,
232
- trust_remote_code=True,
233
- max_num_batched_tokens=32768,
234
- dtype="bfloat16",
235
- gpu_memory_utilization=0.8,
236
- )
237
- sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
238
-
239
- outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
240
-
241
- print(outputs[0].outputs[0].text)
242
- ```
243
-
244
- ## Evaluation Results
245
- On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed compared to similar-size models in long text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 achieves approximately 7x decoding speed improvement.
246
-
247
- ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/efficiency.png?raw=true)
248
-
249
- #### Comprehensive Evaluation
250
- MiniCPM4 launches end-side versions with 8B and 0.5B parameter scales, both achieving best-in-class performance in their respective categories.
251
-
252
- ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/benchmark.png?raw=true)
253
-
254
- #### Long Text Evaluation
255
- MiniCPM4 is pre-trained on 32K long texts and achieves length extension through YaRN technology. In the 128K long text needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance.
256
-
257
- ![long-niah](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/128k-niah.png?raw=true)
258
-
259
- ## Statement
260
- - As a language model, MiniCPM generates content by learning from a vast amount of text.
261
- - However, it does not possess the ability to comprehend or express personal opinions or value judgments.
262
- - Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
263
- - Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.
264
-
265
- ## LICENSE
266
- - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
267
-
268
- ## Citation
269
- - Please cite our [paper](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf) if you find our work valuable.
270
-
271
- ```bibtex
272
- @article{minicpm4,
273
- title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
274
- author={MiniCPM Team},
275
- year={2025}
276
- }
277
- ```
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - zh
5
+ - en
6
+ pipeline_tag: text-generation
7
+ library_name: transformers
8
+ ---
9
+ <div align="center">
10
+ <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
11
+ </div>
12
+
13
+ <p align="center">
14
+ <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
15
+ <a href="https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf" target="_blank">Technical Report</a>
16
+ </p>
17
+ <p align="center">
18
+ 👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
19
+ </p>
20
+
21
+ ## What's New
22
+ - [2025.06.06] The **MiniCPM4** series is released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf).🔥🔥🔥
23
+
24
+ ## MiniCPM4 Series
25
+ The MiniCPM4 series consists of highly efficient large language models (LLMs) designed explicitly for end-side devices. It achieves this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
26
+ - [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens. (**<-- you are here**)
27
+ - [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens.
28
+ - [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
29
+ - [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Eagle head trained with QAT for FRSpec, efficiently integrating speculation and quantization to achieve ultra acceleration for MiniCPM4-8B.
30
+ - [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
31
+ - [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head for vLLM format, accelerating speculative inference for MiniCPM4-8B.
32
+ - [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Extreme ternary quantization applied to MiniCPM4-0.5B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
33
+ - [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Extreme ternary quantization applied to MiniCPM3-1B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
34
+ - [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, accepts users' queries as input and autonomously generates trustworthy, long-form survey papers.
35
+ - [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, accepts users' queries and available MCP tools as input and autonomously calls relevant MCP tools to satisfy users' requirements.
36
+
37
+ ## Introduction
38
+ MiniCPM 4 is an extremely efficient edge-side large model that has undergone efficient optimization across four dimensions: model architecture, learning algorithms, training data, and inference systems, achieving ultimate efficiency improvements.
39
+
40
+ - 🏗️ **Efficient Model Architecture:**
41
+ - InfLLM v2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention mechanism architecture where each token only needs to compute relevance with less than 5% of tokens in 128K long text processing, significantly reducing computational overhead for long texts
42
+
43
+ - 🧠 **Efficient Learning Algorithms:**
44
+ - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for performance of downstream tasks, enabling more precise model training configuration search
45
+ - BitCPM -- Ultimate Ternary Quantization: Compresses model parameters into ternary values (three possible values per weight), achieving a 90% reduction in bit width
46
+ - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing technology combined with Multi-token Prediction training strategy
47
+
48
+ - 📚 **High-Quality Training Data:**
49
+ - UltraClean -- High-quality Pre-training Data Filtering and Generation: Builds iterative data cleaning strategies based on efficient data verification, open-sourcing the high-quality Chinese and English pre-training dataset [Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
50
+ - UltraChat v2 -- High-quality Supervised Fine-tuning Data Generation: Constructs large-scale high-quality supervised fine-tuning datasets covering multiple dimensions including knowledge-intensive data, reasoning-intensive data, instruction-following data, long text understanding data, and tool calling data
51
+
52
+ - ⚡ **Efficient Inference System:**
53
+ - CPM.cu -- Lightweight and Efficient CUDA Inference Framework: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding
54
+ - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities
55
+
56
+ ## Usage
57
+
58
+ ### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
59
+
60
+ We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for MiniCPM4 inference. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates efficient sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4.
61
+
62
+ You can install CPM.cu by running the following command:
63
+
64
+ ```bash
65
+ git clone https://github.com/OpenBMB/cpm.cu.git --recursive
66
+ cd cpm.cu
67
+ python3 setup.py install
68
+ ```
69
+
70
+ MiniCPM4 natively supports context lengths of up to 32,768 tokens. To reproduce the long-text acceleration effect in the paper, we recommend using the validated LongRoPE factors. Change the `rope_scaling` field in `config.json` as follows to enable LongRoPE.
71
+ ```json
72
+ {
73
+ ...,
74
+ "rope_scaling": {
75
+ "rope_type": "longrope",
76
+ "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
77
+ "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
78
+ "original_max_position_embeddings": 32768
79
+ }
80
+ }
81
+ ```
82
+
83
+ After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from Hugging Face):
84
+ ```bash
85
+ python3 tests/test_generate.py
86
+ ```
87
+
88
+ For more details about CPM.cu, please refer to [the repo CPM.cu](https://github.com/OpenBMB/cpm.cu).
89
+
90
+ ### Inference with Transformers
91
+ ```python
92
+ from transformers import AutoModelForCausalLM, AutoTokenizer
93
+ import torch
94
+ torch.manual_seed(0)
95
+
96
+ path = 'openbmb/MiniCPM4-8B'
97
+ device = "cuda"
98
+ tokenizer = AutoTokenizer.from_pretrained(path)
99
+ model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
100
+
101
+ # User can directly use the chat interface
102
+ # responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
103
+ # print(responds)
104
+
105
+ # User can also use the generate interface
106
+ messages = [
107
+ {"role": "user", "content": "Write an article about Artificial Intelligence."},
108
+ ]
109
+ prompt_text = tokenizer.apply_chat_template(
110
+ messages,
111
+ tokenize=False,
112
+ add_generation_prompt=True,
113
+ )
114
+ model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)
115
+
116
+ model_outputs = model.generate(
117
+ **model_inputs,
118
+ max_new_tokens=1024,
119
+ top_p=0.7,
120
+ temperature=0.7
121
+ )
122
+ output_token_ids = [
123
+ model_outputs[i][len(model_inputs['input_ids'][i]):] for i in range(len(model_inputs['input_ids']))
124
+ ]
125
+
126
+ responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
127
+ print(responses)
128
+ ```
129
+
130
+ MiniCPM4-8B supports `InfLLM v2`, a sparse attention mechanism designed for efficient long-sequence inference. It requires the [infllmv2_cuda_impl](https://github.com/OpenBMB/infllmv2_cuda_impl) library.
131
+
132
+ You can install it by running the following command:
133
+ ```bash
134
+ git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
135
+ cd infllmv2_cuda_impl
136
+ git submodule update --init --recursive
137
+ pip install -e . # or python setup.py install
138
+ ```
139
+
140
+ To enable InfLLM v2, you need to add the `sparse_config` field in `config.json`:
141
+ ```json
142
+ {
143
+ ...,
144
+ "sparse_config": {
145
+ "kernel_size": 32,
146
+ "kernel_stride": 16,
147
+ "init_blocks": 1,
148
+ "block_size": 64,
149
+ "window_size": 2048,
150
+ "topk": 64,
151
+ "use_nope": false,
152
+ "dense_len": 8192
153
+ }
154
+ }
155
+ ```
156
+
157
+ These parameters control the behavior of InfLLM v2:
158
+ * `kernel_size` (default: 32): The size of semantic kernels.
159
+ * `kernel_stride` (default: 16): The stride between adjacent kernels.
160
+ * `init_blocks` (default: 1): The number of initial blocks that every query token attends to. This ensures attention to the beginning of the sequence.
161
+ * `block_size` (default: 64): The block size for key-value blocks.
162
+ * `window_size` (default: 2048): The size of the local sliding window.
163
+ * `topk` (default: 64): Specifies that each token computes attention with only the top-k most relevant key-value blocks.
164
+ * `use_nope` (default: false): Whether to use the NOPE technique in block selection for improved performance.
165
+ * `dense_len` (default: 8192): Since Sparse Attention offers limited benefits for short sequences, the model can use standard (dense) attention for shorter texts. The model will use dense attention for sequences with a token length below `dense_len` and switch to sparse attention for sequences exceeding this length. Set this to `-1` to always use sparse attention regardless of sequence length.
166
+
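The `dense_len` rule described above can be sketched in a few lines of Python. This is an illustration only, not the model's actual implementation: the function name `uses_sparse_attention` is hypothetical, and treating a sequence of exactly `dense_len` tokens as sparse is an assumption, since the description only covers lengths below and above the threshold.

```python
def uses_sparse_attention(seq_len: int, dense_len: int = 8192) -> bool:
    """Hypothetical helper mirroring the documented `dense_len` rule."""
    if dense_len == -1:
        # -1 means: always use sparse attention, regardless of length.
        return True
    # Assumption: the boundary case seq_len == dense_len counts as sparse.
    return seq_len >= dense_len


for n in (1024, 8191, 16384):
    mode = "sparse" if uses_sparse_attention(n) else "dense"
    print(f"{n} tokens -> {mode} attention")
```

With the default `dense_len` of 8192, short prompts stay on dense attention and only long inputs pay the block-selection overhead of sparse attention.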
167
+ MiniCPM4 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques for effective handling of long texts. We have validated the model's performance on context lengths of up to 131,072 tokens by modifying the LongRoPE factor.
168
+
169
+ You can apply the LongRoPE factor modification by modifying the model files. Specifically, in the `config.json` file, adjust the `rope_scaling` fields.
170
+ ```json
171
+ {
172
+ ...,
173
+ "rope_scaling": {
174
+ "rope_type": "longrope",
175
+ "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
176
+ "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
177
+ "original_max_position_embeddings": 32768
178
+ }
179
+ }
180
+ ```
181
+
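If you would rather not edit `config.json` by hand, the change above could be scripted. The following is a minimal sketch under stated assumptions: `apply_rope_scaling` is a hypothetical helper (not part of Transformers or the MiniCPM repository), it expects a local copy of `config.json`, and the equal-length check is only a sanity check added for this example.

```python
import json
from pathlib import Path


def apply_rope_scaling(config_path, long_factor, short_factor, original_max=32768):
    """Hypothetical helper: write a LongRoPE `rope_scaling` entry into a local config.json."""
    if len(long_factor) != len(short_factor):
        # Sanity check added for this sketch, not taken from the model code.
        raise ValueError("long_factor and short_factor must have the same length")
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["rope_scaling"] = {
        "rope_type": "longrope",
        "long_factor": list(long_factor),
        "short_factor": list(short_factor),
        "original_max_position_embeddings": original_max,
    }
    # Overwrite the file in place; all other fields are preserved.
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

You would call it as `apply_rope_scaling("path/to/config.json", factors, factors)`, where `factors` is the 64-element list shown in the JSON snippet above and the path points to your downloaded model directory.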
182
+ ### Inference with [SGLang](https://github.com/sgl-project/sglang)
183
+
184
+ For now, you need to install our forked version of SGLang.
185
+ ```bash
186
+ git clone -b openbmb https://github.com/OpenBMB/sglang.git
187
+ cd sglang
188
+
189
+ pip install --upgrade pip
190
+ pip install -e "python[all]"
191
+ ```
192
+
193
+ You can start the inference server by running the following command:
194
+ ```bash
195
+ python -m sglang.launch_server --model openbmb/MiniCPM4-8B --trust-remote-code --port 30000 --chat-template chatml
196
+ ```
197
+
198
+ Then you can use the chat interface by running the following command:
199
+ ```python
200
+ import openai
201
+
202
+ client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")
203
+
204
+ response = client.chat.completions.create(
205
+ model="openbmb/MiniCPM4-8B",
206
+ messages=[
207
+ {"role": "user", "content": "Write an article about Artificial Intelligence."},
208
+ ],
209
+ temperature=0.7,
210
+ max_tokens=1024,
211
+ )
212
+
213
+ print(response.choices[0].message.content)
214
+ ```
215
+
216
+ ### Inference with [vLLM](https://github.com/vllm-project/vllm)
217
+ For now, you need to install the latest version of vLLM.
218
+ ```bash
219
+ pip install -U vllm \
220
+ --pre \
221
+ --extra-index-url https://wheels.vllm.ai/nightly
222
+ ```
223
+
224
+ Then you can run inference on MiniCPM4-8B with vLLM:
225
+ ```python
226
+ from transformers import AutoTokenizer
227
+ from vllm import LLM, SamplingParams
228
+
229
+ model_name = "openbmb/MiniCPM4-8B"
230
+ prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]
231
+
232
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
233
+ input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
234
+
235
+ llm = LLM(
236
+ model=model_name,
237
+ trust_remote_code=True,
238
+ max_num_batched_tokens=32768,
239
+ dtype="bfloat16",
240
+ gpu_memory_utilization=0.8,
241
+ )
242
+ sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
243
+
244
+ outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
245
+
246
+ print(outputs[0].outputs[0].text)
247
+ ```
248
+
249
+ Also, you can start the inference server by running the following command:
250
+ > **Note**: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens—such as the beginning-of-sequence (BOS) token—will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.
251
+
252
+ ```bash
253
+ vllm serve openbmb/MiniCPM4-8B
254
+ ```
255
+
256
+ Then you can use the chat interface by running the following code:
257
+
258
+ ```python
259
+ import openai
260
+
261
+ client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
262
+
263
+ response = client.chat.completions.create(
264
+ model="openbmb/MiniCPM4-8B",
265
+ messages=[
266
+ {"role": "user", "content": "Write an article about Artificial Intelligence."},
267
+ ],
268
+ temperature=0.7,
269
+ max_tokens=1024,
270
+ extra_body=dict(add_special_tokens=True), # Ensures special tokens are added for chat template
271
+
272
+ )
273
+
274
+ print(response.choices[0].message.content)
275
+ ```
276
+
277
+ ## Evaluation Results
278
+ On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed compared to similar-size models in long text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 achieves approximately 7x decoding speed improvement.
279
+
280
+ ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/efficiency.png?raw=true)
281
+
282
+ #### Comprehensive Evaluation
283
+ MiniCPM4 is available in two end-side versions, with 8B and 0.5B parameters, both achieving best-in-class performance in their respective categories.
284
+
285
+ ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/benchmark.png?raw=true)
286
+
287
+ #### Long Text Evaluation
288
+ MiniCPM4 is pre-trained on 32K long texts and achieves length extension through YaRN technology. In the 128K long text needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance.
289
+
290
+ ![long-niah](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/128k-niah.png?raw=true)
291
+
292
+ ## Statement
293
+ - As a language model, MiniCPM generates content by learning from a vast amount of text.
294
+ - However, it does not possess the ability to comprehend or express personal opinions or value judgments.
295
+ - Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
296
+ - Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.
297
+
298
+ ## LICENSE
299
+ - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
300
+
301
+ ## Citation
302
+ - Please cite our [paper](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf) if you find our work valuable.
303
+
304
+ ```bibtex
305
+ @article{minicpm4,
306
+ title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
307
+ author={MiniCPM Team},
308
+ year={2025}
309
+ }
310
+ ```