---
pipeline_tag: text-generation
license: mit
library_name: transformers
base_model:
- MiniMaxAI/MiniMax-M2
---
# Building and Running the Experimental `minimax` Branch of `llama.cpp`

**Note:**  
This setup is experimental. The `minimax` branch is not compatible with standard `llama.cpp` builds. Use it only for testing GGUF models with experimental features.

---

## System Requirements (the build commands below target Ubuntu; any supported platform can be used)

- Ubuntu 22.04  
- NVIDIA GPU with CUDA support  
- CUDA Toolkit 12.8 or later  
- CMake  

---

## Installation Steps

### 1. Install CUDA Toolkit 12.8

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
```
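
To confirm the toolkit installed correctly, you can check the compiler version and the driver status. This is a minimal sanity check; the full path is used because CUDA is not added to `PATH` until the next step:

```bash
# Verify the CUDA compiler that was just installed
/usr/local/cuda/bin/nvcc --version

# Verify the NVIDIA driver is present (the toolkit package typically does not install the driver)
nvidia-smi
```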

### 2. Set Environment Variables

```bash
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
```
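
These exports only apply to the current shell session. To make them persistent, you can append the same lines to your shell profile, for example:

```bash
# Persist the CUDA environment variables across sessions (appends to ~/.bashrc)
cat >> ~/.bashrc <<'EOF'
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
EOF
```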

### 3. Install Build Tools

```bash
sudo apt install cmake
```
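
CMake alone is not enough to compile the project; if a C/C++ toolchain is not already present on your system, install one as well (shown here via Ubuntu's `build-essential` metapackage):

```bash
# Installs gcc, g++, make, and related tools needed to compile llama.cpp
sudo apt install build-essential
```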

### 4. Clone the Experimental Branch

```bash
git clone --branch minimax --single-branch https://github.com/cturan/llama.cpp.git
cd llama.cpp
```

### 5. Build the Project

```bash
mkdir build
cd build
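
# Note: recent upstream llama.cpp renamed the CUDA build option to GGML_CUDA;
# if configuration fails or warns about LLAMA_CUDA, try -DGGML_CUDA=ON instead.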
cmake .. -DLLAMA_CUDA=ON -DLLAMA_CURL=OFF
cmake --build . --config Release --parallel $(nproc --all)
```

---

## Build Output

After the build is complete, the binaries will be located in:

```
llama.cpp/build/bin
```
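
As a quick sanity check, you can list the binaries and confirm the server binary runs (exact binary names may differ slightly depending on the branch):

```bash
# From the llama.cpp/build directory: list the built binaries
ls bin/

# Print the server's help text to confirm the binary runs
./bin/llama-server --help | head -n 20
```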

---

## Running the Model

Example command:

```bash
./llama-server -m minimax-m2-Q4_K.gguf -ngl 999 --cpu-moe --jinja -fa on -c 32000 --reasoning-format auto
```

This configuration offloads the experts to the CPU, so approximately 16 GB of VRAM is sufficient.
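
Once the server is up, it exposes an OpenAI-compatible HTTP API, by default on `http://127.0.0.1:8080` (change with `--host` / `--port`). A minimal request looks like this:

```bash
# Minimal chat completion request against the local llama-server
# (assumes the default host and port)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Hello, who are you?"}
        ],
        "max_tokens": 128
      }'
```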

---

## Notes

- `--cpu-moe` keeps the mixture-of-experts (expert) weights on the CPU instead of offloading them to the GPU.  
- `--jinja` enables Jinja-based chat-template processing.  
- Adjust `-c` (context length) and `-ngl` (number of GPU layers) to match your hardware; see the example below.  
- Ensure the model file (`minimax-m2-Q4_K.gguf`) is available in the working directory.
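
For example, if your GPU has enough VRAM to hold the expert weights as well, you could drop `--cpu-moe` and raise the context window. The values below are illustrative only and should be tuned to your hardware:

```bash
# Full GPU offload with a larger context window (illustrative values)
./llama-server -m minimax-m2-Q4_K.gguf -ngl 999 --jinja -fa on -c 65536 --reasoning-format auto
```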

---

Once all steps are complete, the experimental CUDA-enabled build of `llama.cpp` is ready to use.