Ling‑Mini‑2.0 — ChatLLM.cpp Quantizations (Q4_0 and Q8_0)
Author and distribution: Riverkan
This repository provides CPU/GPU-friendly quantized builds of Ling‑Mini‑2.0 for ChatLLM.cpp. It is not a LLaMA model, is not affiliated with Meta, and does not use the LLaMA license. Files are distributed in ChatLLM.cpp’s GGML-based format (.bin), ready for local inference.
- Available quantizations: Q4_0 (int4), Q8_0 (int8)
- Tested runtime: ChatLLM.cpp
- Target use: real-time chat/instruction-following on commodity hardware
Notes:
- The model is architecturally distinct from LLaMA-family models.
ChatLLM.cpp Quantizations of Ling‑Mini‑2.0
Quantized with the ChatLLM.cpp toolchain for GGML-format inference (.bin). These builds are intended for the ChatLLM.cpp runtime (CPU, with optional GPU acceleration as provided by ChatLLM.cpp’s GGML backends). Use ChatLLM.cpp’s convert-and-run flow described below.
Original (float) model: see the base model listed in the model tree at the end of this card (inclusionAI/Ling-mini-base-2.0).
Run them with ChatLLM.cpp or your preferred ChatLLM-based UI.
Prompt format
Ling‑Mini‑2.0 does not require a special role-tag chat template. Plain prompts work well. If your tooling prefers an explicit chat structure, you can use this neutral format:
[System]
You are Ling‑Mini‑2.0, a helpful, concise assistant.
[User]
{your question}
[Assistant]
Example:
[System]
You are Ling‑Mini‑2.0, a helpful, concise assistant.
[User]
List three tips to speed up CPU inference.
[Assistant]
No special tokens are required by the model itself; most UIs can just send user text.
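If your tooling builds prompts programmatically, the neutral structure above is easy to assemble in code. The helper below is a minimal, hypothetical sketch (not part of ChatLLM.cpp or the model’s tokenizer); it simply concatenates the [System]/[User]/[Assistant] sections shown above.
```python
# Hypothetical helper: assemble the neutral chat structure shown above.
# The model itself does not require these tags; this is only for tooling
# that expects an explicit system/user/assistant layout.
def build_prompt(
    user_message: str,
    system_message: str = "You are Ling-Mini-2.0, a helpful, concise assistant.",
) -> str:
    return (
        "[System]\n"
        f"{system_message}\n"
        "[User]\n"
        f"{user_message}\n"
        "[Assistant]\n"
    )

if __name__ == "__main__":
    print(build_prompt("List three tips to speed up CPU inference."))
```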
Download a file (not the whole branch) from below
Filename | Quant type | File Size | Split | Description |
---|---|---|---|---|
Ling-Mini-2.0-Q8_0.bin | Q8_0 | 16 GB | false | Highest quality quant provided here; best for quality, moderate speed. |
Ling-Mini-2.0-Q4_0.bin | Q4_0 | 8.52 GB | false | Great balance of speed and memory; recommended for CPU-only setups. |
Notes:
- File sizes depend on the base model; the figures above are approximate, so check the release or hosting page for exact sizes.
- These are GGML-format (.bin) files for ChatLLM.cpp, not GGUF.
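For a rough feel of how the two quant types trade size for precision, the back-of-the-envelope sketch below estimates on-disk size from bits per weight. It assumes ggml-style block quantization (32-weight blocks with an fp16 scale, about 4.5 bits/weight for Q4_0 and 8.5 bits/weight for Q8_0) and an illustrative parameter count; real files differ because not every tensor is quantized the same way and there is metadata overhead.
```python
# Back-of-the-envelope size estimate (assumption: ggml-style block quants,
# 32 weights per block plus one fp16 scale -> ~4.5 bits/weight for Q4_0
# and ~8.5 bits/weight for Q8_0).
BITS_PER_WEIGHT = {"q4_0": 4.5, "q8_0": 8.5}

def estimated_size_gb(n_params: float, quant: str) -> float:
    """Approximate on-disk size in GB for n_params weights at the given quant type."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

if __name__ == "__main__":
    # 16e9 parameters is an illustrative assumption, not an official figure.
    for quant in ("q4_0", "q8_0"):
        print(f"{quant}: ~{estimated_size_gb(16e9, quant):.1f} GB")
```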
How to use with ChatLLM.cpp
- Clone and build ChatLLM.cpp (follow upstream docs for optional GPU backends):
```
git clone --recursive https://github.com/foldl/chatllm.cpp.git
cd chatllm.cpp
cmake -B build
cmake --build build -j --config Release
```
- Place the quantized model file (e.g., Ling-Mini-2.0-Q4_0.bin) somewhere accessible.
- Run interactive chat:
```
# Linux / macOS
rlwrap ./build/bin/main -m /path/to/Ling-Mini-2.0-Q4_0.bin -i

# Windows (PowerShell)
.\build\bin\Release\main.exe -m C:\path\to\Ling-Mini-2.0-Q4_0.bin -i
```
- Single-shot example:
```
./build/bin/main -m /path/to/Ling-Mini-2.0-Q4_0.bin --prompt "Explain memory-bound vs compute-bound."
```
Tip: run ./build/bin/main -h to see all options (context size, threads, GPU offload where applicable, etc.).
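If you drive the binary from scripts rather than a terminal, a thin Python wrapper around the same command works. The sketch below is a convenience under assumptions (the binary path and --prompt flag simply mirror the commands above), not an official ChatLLM.cpp API.
```python
# Hedged sketch: call the ChatLLM.cpp binary from Python for one-shot prompts.
# The binary path and flags mirror the commands above; adjust for your build.
import subprocess

def ask(model_path: str, prompt: str, binary: str = "./build/bin/main") -> str:
    """Run one single-shot generation and return the raw stdout."""
    result = subprocess.run(
        [binary, "-m", model_path, "--prompt", prompt],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(ask("/path/to/Ling-Mini-2.0-Q4_0.bin",
              "Explain memory-bound vs compute-bound."))
```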
Example usage
Prompt:
[System]
You are Ling‑Mini‑2.0, a helpful, concise assistant.
[User]
Give me a 1‑paragraph summary of what quantization does for LLMs.
[Assistant]
Running:
```
./build/bin/main -m Ling-Mini-2.0-Q8_0.bin -i --prompt "Give me a 1-paragraph summary of what quantization does for LLMs."
```
In interactive mode (-i), simply paste your question and press Enter. The chat history is used as context for subsequent turns.
Performance (CPU)
```
./build/bin/main -m Ling-Mini-2.0-Q4_0.bin --seed 1
```
- Q4_0 on an AMD Ryzen 5 5600G with Radeon Graphics (3.90 GHz): ~35 tokens/second of output, measured in a typical chat generation scenario.
```
./build/bin/main -m Ling-Mini-2.0-Q8_0.bin --seed 1
```
- Q8_0 on an AMD Ryzen 5 5600G with Radeon Graphics (3.90 GHz): ~20 tokens/second of output, measured in a typical chat generation scenario.
Notes:
- Actual throughput varies with prompt length, context size, threads, OS, and build flags.
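To sanity-check throughput on your own machine, you can time a single-shot run and divide an approximate output-token count by the elapsed time. The sketch below is only indicative: it counts whitespace-separated words rather than real tokens and the timing includes model load, so prefer ChatLLM.cpp's own timing output where available.
```python
# Rough throughput check. Assumptions: whitespace-split words approximate
# tokens, and the measured time includes model loading, so the figure is
# only a lower bound on steady-state generation speed.
import subprocess
import time

def rough_tokens_per_second(model_path: str, prompt: str,
                            binary: str = "./build/bin/main") -> float:
    start = time.perf_counter()
    result = subprocess.run(
        [binary, "-m", model_path, "--prompt", prompt, "--seed", "1"],
        capture_output=True, text=True, check=True,
    )
    elapsed = time.perf_counter() - start
    approx_tokens = len(result.stdout.split())
    return approx_tokens / elapsed

if __name__ == "__main__":
    print(rough_tokens_per_second(
        "Ling-Mini-2.0-Q4_0.bin",
        "Give me a 1-paragraph summary of what quantization does for LLMs."))
```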
Which file should I choose?
- Want the fastest CPU experience and smallest memory footprint? Choose Q4_0.
- Want maximum response quality on CPU (or if you have headroom)? Choose Q8_0.
- If you’re offloading to GPU via ChatLLM.cpp backends, both will work; Q8_0 usually provides slightly better output fidelity at the cost of more memory.
Downloading using huggingface-cli
If hosted on Hugging Face, you can fetch specific files with the CLI:
Install:
```
pip install -U "huggingface_hub[cli]"
```
Download a specific file:
```
huggingface-cli download RiverkanIT/Ling-mini-2.0-Quantized --include "Ling-Mini-2.0-Q4_0.bin" --local-dir ./
```
Or the Q8_0 build:
```
huggingface-cli download RiverkanIT/Ling-mini-2.0-Quantized --include "Ling-Mini-2.0-Q8_0.bin" --local-dir ./
```
Replace the model repo path with the actual hosting path if different.
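The same download can also be done from Python with huggingface_hub's hf_hub_download. The snippet below assumes the repo path and filename shown above; adjust both to match the actual file listing on the hosting page.
```python
# Python alternative to the CLI commands above, using huggingface_hub.
# repo_id and filename are copied from this card; verify them against the
# actual file listing before running.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="RiverkanIT/Ling-mini-2.0-Quantized",
    filename="Ling-Mini-2.0-Q4_0.bin",  # or "Ling-Mini-2.0-Q8_0.bin"
    local_dir="./",
)
print("Downloaded to:", path)
```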
Building your own quant (optional)
If you have the float/base weights and want to generate your own GGMM quantized file for ChatLLM.cpp:
- Install Python deps for ChatLLM.cpp’s conversion pipeline:
```
pip install -r requirements.txt
```
- Convert to Q8_0:
```
python convert.py -i /path/to/base/model -t q8_0 -o Ling-Mini-2.0-Q8_0.bin --name "Ling-Mini-2.0"
```
- Convert to Q4_0:
```
python convert.py -i /path/to/base/model -t q4_0 -o Ling-Mini-2.0-Q4_0.bin --name "Ling-Mini-2.0"
```
Notes:
- ChatLLM.cpp uses GGML-based .bin files (not GGUF).
- See ChatLLM.cpp docs for model-specific flags and supported architectures.
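If you produce both quant types regularly, a short driver script can loop over them. The sketch below simply shells out to the same convert.py invocations shown above; the flag names are assumed to match your ChatLLM.cpp checkout, so verify them against its docs first.
```python
# Convenience sketch: run ChatLLM.cpp's convert.py once per quant type.
# Run this from the chatllm.cpp checkout; the flags mirror the commands
# above and should be verified against the version you have installed.
import subprocess

BASE_MODEL = "/path/to/base/model"  # placeholder path

for quant in ("q4_0", "q8_0"):
    out_file = f"Ling-Mini-2.0-{quant.upper()}.bin"
    subprocess.run(
        ["python", "convert.py", "-i", BASE_MODEL, "-t", quant,
         "-o", out_file, "--name", "Ling-Mini-2.0"],
        check=True,
    )
    print("Wrote", out_file)
```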
Credits
- Model and quantized distributions by Riverkan
- Runtime and tooling: ChatLLM.cpp (thanks to the maintainers and the GGML community)
- Thanks to the InclusionAI team for their foundational work and support!
- Everyone in the open-source LLM community who provided benchmarks, ideas, and tools
For issues, feature requests, or contributions, please open a discussion or pull request in this repo.
Model tree for RiverkanIT/Ling-mini-2.0-Quantized
- Base model: inclusionAI/Ling-mini-base-2.0