---
pipeline_tag: text-generation
license: mit
library_name: transformers
base_model:
- MiniMaxAI/MiniMax-M2
---
# Building and Running the Experimental `minimax` Branch of `llama.cpp`
**Note:**
This setup is experimental. These GGUF files require the `minimax` branch and will not load in mainline `llama.cpp`. Use this build only for testing GGUF models with experimental features.
---
## System Requirements (any supported platform works; the commands below are for Ubuntu)
- Ubuntu 22.04
- NVIDIA GPU with CUDA support
- CUDA Toolkit 12.8 or later
- CMake
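A quick way to confirm these prerequisites before building (assuming the NVIDIA driver is already installed, since `nvidia-smi` ships with it):
```bash
# Ubuntu release (expect 22.04)
lsb_release -rs

# NVIDIA driver and GPU visibility; the reported "CUDA Version" is the
# maximum the driver supports, not the installed toolkit version
nvidia-smi

# CMake availability (installed in step 3 below if missing)
cmake --version
```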
---
## Installation Steps
### 1. Install CUDA Toolkit 12.8
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
```
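To verify that the toolkit landed where the next step expects it (a sketch; paths assume the default package layout, which normally installs under `/usr/local/cuda-12.8` with a `/usr/local/cuda` symlink):
```bash
# List the installed CUDA directories and symlink
ls -d /usr/local/cuda*

# Compiler version reported by the freshly installed toolkit
/usr/local/cuda/bin/nvcc --version
```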
### 2. Set Environment Variables
```bash
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
```
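These exports only affect the current shell. To make them persistent and confirm that `nvcc` is now on the `PATH` (a minimal sketch appending to `~/.bashrc`):
```bash
cat >> ~/.bashrc <<'EOF'
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
EOF
source ~/.bashrc

# Should print the CUDA 12.8 compiler version
nvcc --version
```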
### 3. Install Build Tools
```bash
sudo apt install -y cmake build-essential   # build-essential provides the gcc/g++ toolchain needed to compile llama.cpp
```
### 4. Clone the Experimental Branch
```bash
git clone --branch minimax --single-branch https://github.com/cturan/llama.cpp.git
cd llama.cpp
```
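To confirm that the checkout is actually on the experimental branch rather than `master`:
```bash
# Should print "minimax"
git branch --show-current

# Most recent commit on the branch, useful when reporting issues
git log --oneline -1
```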
### 5. Build the Project
```bash
mkdir build
cd build
cmake .. -DLLAMA_CUDA=ON -DLLAMA_CURL=OFF
cmake --build . --config Release --parallel $(nproc --all)
```
---
## Build Output
After the build is complete, the binaries will be located in:
```
llama.cpp/build/bin
```
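From the repository root, a quick sanity check that the binaries were produced and that the CUDA backend was linked in (a sketch; `ldd` output details vary with the build configuration):
```bash
ls build/bin/ | grep llama-server

# Prints the usage text and exits
./build/bin/llama-server --help | head -n 20

# Should mention CUDA libraries (e.g. libcudart) if the CUDA backend was compiled in
ldd build/bin/llama-server | grep -i cud
```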
---
## Running the Model
Example command:
```bash
./llama-server -m minimax-m2-Q4_K.gguf -ngl 999 --cpu-moe --jinja -fa on -c 32000 --reasoning-format auto
```
This configuration offloads the experts to the CPU, so approximately 16 GB of VRAM is sufficient.
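`llama-server` exposes an OpenAI-compatible HTTP API, listening on `127.0.0.1:8080` by default. A minimal request against the chat endpoint might look like this (the prompt and `max_tokens` value are illustrative):
```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Briefly explain what a mixture-of-experts model is."}
        ],
        "max_tokens": 256
      }'
```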
---
## Notes
- `--cpu-moe` keeps the mixture-of-experts expert weights on the CPU instead of in VRAM.
- `--jinja` enables the Jinja chat-template engine so the model's embedded chat template is applied.
- Adjust `-c` (context length) and `-ngl` (number of GPU layers) to match your hardware; see the example after this list.
- Ensure the model file (`minimax-m2-Q4_K.gguf`) is available in the working directory.
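For example, on a GPU with less free VRAM, a smaller context window and partial layer offload can be used. The numbers below are illustrative starting points, not tuned values:
```bash
# Smaller context and fewer GPU layers for tighter VRAM budgets
./llama-server -m minimax-m2-Q4_K.gguf -ngl 40 --cpu-moe --jinja -fa on -c 16000 --reasoning-format auto
```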
---
All steps complete. The experimental CUDA-enabled build of `llama.cpp` is ready to use.