---
pipeline_tag: text-generation
license: mit
library_name: transformers
base_model:
- MiniMaxAI/MiniMax-M2
---
# Building and Running the Experimental `minimax` Branch of `llama.cpp`
**Note:**
This setup is experimental. These GGUF files require the `minimax` branch and will not load in mainline `llama.cpp`. Use this build only for testing GGUF models with experimental features.
---
## System Requirements (any supported platform works; the commands below are for Ubuntu)
- Ubuntu 22.04
- NVIDIA GPU with CUDA support
- CUDA Toolkit 12.8 or later
- CMake
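A quick way to confirm these prerequisites before building (assuming the NVIDIA driver is already installed, since `nvidia-smi` ships with it):
```bash
# Ubuntu release (expect 22.04)
lsb_release -rs

# NVIDIA driver and GPU visibility; the reported "CUDA Version" is the
# maximum the driver supports, not the installed toolkit version
nvidia-smi

# CMake availability (installed in step 3 below if missing)
cmake --version
```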
---
## Installation Steps
### 1. Install CUDA Toolkit 12.8
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
```
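To verify that the toolkit landed where the next step expects it (a sketch; paths assume the default package layout, which normally installs under `/usr/local/cuda-12.8` with a `/usr/local/cuda` symlink):
```bash
# List the installed CUDA directories and symlink
ls -d /usr/local/cuda*

# Compiler version reported by the freshly installed toolkit
/usr/local/cuda/bin/nvcc --version
```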
### 2. Set Environment Variables
```bash
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
```
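These exports only affect the current shell. To make them persistent and confirm that `nvcc` is now on the `PATH` (a minimal sketch appending to `~/.bashrc`):
```bash
cat >> ~/.bashrc <<'EOF'
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
EOF
source ~/.bashrc

# Should print the CUDA 12.8 compiler version
nvcc --version
```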
### 3. Install Build Tools
```bash
sudo apt install -y cmake build-essential   # build-essential provides the gcc/g++ toolchain needed to compile llama.cpp
```
### 4. Clone the Experimental Branch
```bash
git clone --branch minimax --single-branch https://github.com/cturan/llama.cpp.git
cd llama.cpp
```
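To confirm that the checkout is actually on the experimental branch rather than `master`:
```bash
# Should print "minimax"
git branch --show-current

# Most recent commit on the branch, useful when reporting issues
git log --oneline -1
```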
### 5. Build the Project
```bash
mkdir build
cd build
cmake .. -DLLAMA_CUDA=ON -DLLAMA_CURL=OFF
cmake --build . --config Release --parallel $(nproc --all)
```
---
## Build Output
After the build is complete, the binaries will be located in:
```
llama.cpp/build/bin
```
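From the repository root, a quick sanity check that the binaries were produced and that the CUDA backend was linked in (a sketch; `ldd` output details vary with the build configuration):
```bash
ls build/bin/ | grep llama-server

# Prints the usage text and exits
./build/bin/llama-server --help | head -n 20

# Should mention CUDA libraries (e.g. libcudart) if the CUDA backend was compiled in
ldd build/bin/llama-server | grep -i cud
```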
---
## Running the Model
Example command:
```bash
./llama-server -m minimax-m2-Q4_K.gguf -ngl 999 --cpu-moe --jinja -fa on -c 32000 --reasoning-format auto
```
This configuration offloads the experts to the CPU, so approximately 16 GB of VRAM is sufficient.
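`llama-server` exposes an OpenAI-compatible HTTP API, listening on `127.0.0.1:8080` by default. A minimal request against the chat endpoint might look like this (the prompt and `max_tokens` value are illustrative):
```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Briefly explain what a mixture-of-experts model is."}
        ],
        "max_tokens": 256
      }'
```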
---
## Notes
- `--cpu-moe` keeps the mixture-of-experts expert weights on the CPU instead of in VRAM.
- `--jinja` enables the Jinja chat-template engine so the model's embedded chat template is applied.
- Adjust `-c` (context length) and `-ngl` (number of GPU layers) to match your hardware; see the example after this list.
- Ensure the model file (`minimax-m2-Q4_K.gguf`) is available in the working directory.
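For example, on a GPU with less free VRAM, a smaller context window and partial layer offload can be used. The numbers below are illustrative starting points, not tuned values:
```bash
# Smaller context and fewer GPU layers for tighter VRAM budgets
./llama-server -m minimax-m2-Q4_K.gguf -ngl 40 --cpu-moe --jinja -fa on -c 16000 --reasoning-format auto
```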
---
All steps complete. The experimental CUDA-enabled build of `llama.cpp` is ready to use.