Thanks for the mainline llama.cpp PR effort!
I'm excitedly following your mainline llama.cpp PR https://github.com/ggml-org/llama.cpp/pull/14654
Curious if @anikifoss or anyone else has had success with it! I'm just getting access to a big RAM rig again and might give it a try.
Cheers!
This model is huge! I have to quantize on the HDD, which is painfully slow. Also, not enough RAM to do full perplexity tests with Q8_0.
@ubergarm if you manage to get your hands on a big rig, could you share your Q8_0 perplexity results along with the command line (if you end up testing with Q8_0)? That should give me a baseline to compare the perplexity of Q2/DQ2/Q4/DQ4 quants against.
Yeah, this beast is difficult to maneuver! So it sounds like folks are able to get bf16 GGUFs and start quantizing, which is great. For ik's fork it might be tricky given the different MLA handling; I'll have to dig into it deeper.
With some luck I can get that Q8_0 through imatrix and then also do my usual perplexity treatment on a dual-socket system with 768GB RAM in each NUMA node, compiled CPU-only. The command will likely be something like this (the -mla 3 -fmoe flags exist only on ik's fork; omit them for mainline):
$ echo 0 | sudo tee /proc/sys/kernel/numa_balancing
$ numactl --interleave=all \
./build/bin/llama-perplexity \
--model /mnt/raid/Kimi-K2-Instruct-Q8_0.gguf \
-f wiki.test.raw \
-ctk fp16 \
-fa \
-mla 3 -fmoe \
--ctx-size 512 \
--ubatch-size 512 \
--seed 1337 \
--numa distribute \
--threads 384
Downloading the fp8 safetensors now!
For anyone following along at home, I'm going to try the mainline llama.cpp method (so far so good) to cast the fp8 safetensors to bf16 safetensors:
# get script
wget https://raw.githubusercontent.com/deepseek-ai/DeepSeek-V3/refs/heads/main/inference/fp8_cast_bf16.py
# edit for CPU usage `cuda` -> `cpu`
# loaded_files[file_name] = load_file(file_path, device="cpu")
# current_state_dict = load_file(safetensor_file, device="cpu")
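# a possible way to make that edit non-interactively, assuming the only GPU-specific
# bits are the two `device="cuda"` arguments noted above (double-check the script first)
sed -i 's/device="cuda"/device="cpu"/g' fp8_cast_bf16.py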
# install OS deps
apt-get install zlib1g-dev # cmake build-essential and all that jazz too of course
# install python deps
# https://docs.astral.sh/uv/getting-started/installation/
uv venv ./venv --python 3.12 --python-preference=only-managed
source ./venv/bin/activate
uv pip install ninja cmake wheel setuptools pybind11
uv pip install tqdm torch safetensors numpy
uv pip uninstall triton
# we'll use triton-cpu instead, so no GPU with >= sm89 is required
# runs entirely in RAM; the high-water mark seems to be around 100GB for Kimi-K2-Instruct
# install triton-cpu
$ git clone https://github.com/triton-lang/triton-cpu --recursive
$ cd triton-cpu
# build it
MAX_JOBS=32 uv pip install -e python --no-build-isolation
# now cast it
$ python fp8_cast_bf16.py --help
usage: fp8_cast_bf16.py [-h] --input-fp8-hf-path INPUT_FP8_HF_PATH --output-bf16-hf-path OUTPUT_BF16_HF_PATH
options:
-h, --help show this help message and exit
--input-fp8-hf-path INPUT_FP8_HF_PATH
--output-bf16-hf-path OUTPUT_BF16_HF_PATH
$ python fp8_cast_bf16.py \
--input-fp8-hf-path /mnt/raid/models/moonshotai/Kimi-K2-Instruct/ \
--output-bf16-hf-path /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/ 2>&1 | tee -a logs/fp8_cast_bf16-Kimi-K2-Instruct.log
49%|█████     | 30/61 [14:42<10:19, 19.97s/it]
.
.
.
95%|██████████| 58/61 [1:12:17<09:02, 180.72s/it]
# slowing down as it gets closer to the end...
# it finished!
100%|██████████| 61/61 [1:21:53<00:00, 80.55s/it]
# now we have ~1TB of fp8 safetensors plus the new bf16 safetensors:
$ du -h /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
1.9T /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
# next step is make another ~2T bf16 GGUF
# thank u kioxia and wendell for all this fast disk space
Okay let's continue...
convert
$ cd llama.cpp
$ git remote add gabriellarson [email protected]:gabriellarson/llama.cpp.git
$ git fetch gabriellarson
$ git checkout kimi-k2
$ git rev-parse --short HEAD
273ea092b
# compile CPU only
cmake -B build -DGGML_CUDA=OFF
cmake --build build --config Release -j $(nproc)
# more dependencies
uv pip install transformers protobuf sentencepiece tiktoken blobfile
# copy over additional files
# !!!*careful* don't accidentally overwrite the output model index file `model.safetensors.index.json`!!!
cp /mnt/raid/models/moonshotai/Kimi-K2-Instruct/config.json /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
cp /mnt/raid/models/moonshotai/Kimi-K2-Instruct/generation_config.json /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
cp /mnt/raid/models/moonshotai/Kimi-K2-Instruct/tokenizer_config.json /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
cp /mnt/raid/models/moonshotai/Kimi-K2-Instruct/*.py /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
cp /mnt/raid/models/moonshotai/Kimi-K2-Instruct/*.model /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
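# optional sanity check that the cast script's bf16 output index wasn't clobbered by the copies above:
# I believe the bf16 index should no longer reference any *_scale_inv tensors, while the fp8 one has many
grep -c scale_inv /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/model.safetensors.index.json
grep -c scale_inv /mnt/raid/models/moonshotai/Kimi-K2-Instruct/model.safetensors.index.json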
# convert
python \
convert_hf_to_gguf.py \
--outtype bf16 \
--split-max-size 50G \
--outfile /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/ \
/mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
...
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x14B-Instruct-safetensors-BF16-00001-of-00045.gguf: n_tensors = 34, total_size = 48.7G
...
INFO:gguf.gguf_writer:/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x14B-Instruct-safetensors-BF16-00044-of-00045.gguf: n_tensors = 19, total_size = 45.4G
INFO:gguf.gguf_writer:/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x14B-Instruct-safetensors-BF16-00045-of-00045.gguf: n_tensors = 33, total_size = 45.7G
Shard (1/45): 30%|███       | 14.6G/48.7G [01:20<02:38, 216Mbyte/s]
Writing: 1%|          | 14.6G/2.05T [01:20<2:37:28, 216Mbyte/s]
...
Writing: 100%|██████████| 2.05T/2.05T [2:34:32<00:00, 221Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/
$ du -hc /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/
1.9T /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/
quantize
make a Q8_0 for generating the imatrix, running on 1TB RAM
# [0,60] Layers
# First Layer has dense ffn_(gate|up|down) vs DeepSeek's three dense layers
# Remaining layers have 384x exps and 1x shexp vs deepseek's 256 routed exps
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
--pure \
/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x14B-Instruct-safetensors-BF16-00001-of-00045.gguf \
/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf \
Q8_0 \
192
...
llama_model_quantize_impl: model size = 1958035.30 MB
llama_model_quantize_impl: quant size = 1040503.41 MB
main: quantize time = 1458161.43 ms
main: total time = 1458161.43 ms
# subsequent runs can use an imatrix generated with the llama-imatrix command shown further below, e.g.
# --imatrix /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-Kimi-K2-Instruct-BF16.dat
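For reference, a later imatrix-guided run would look roughly like this (the Q4_K_M output here is just an example; the imatrix filename is the placeholder from the comment above):
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --imatrix /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-Kimi-K2-Instruct-BF16.dat \
    /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x14B-Instruct-safetensors-BF16-00001-of-00045.gguf \
    /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q4_K_M.gguf \
    Q4_K_M \
    192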
vibe check
Always gotta make sure it seems legit in a few rounds of multi-turn instruct chat.
# the Q8_0 is ~1017GiB so requires RAM from two NUMA nodes (BIOS in NPS1, one node per socket each with ~768GB RAM)
$ echo 0 | sudo tee /proc/sys/kernel/numa_balancing
$ sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
$ export model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf
$ numactl --interleave=all \
./build/bin/llama-server \
--model "$model"\
--alias ubergarm/Kimi-K2-Instruct \
--ctx-size 32768 \
-ctk q8_0 \
-fa \
--parallel 1 \
--threads 192 \
--threads-batch 384 \
--numa distribute \
--host 127.0.0.1 \
--port 8080
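Then I just poke it over the OpenAI-compatible chat endpoint for a few rounds of multi-turn chat, e.g. something along these lines (the payload is just an example):
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ubergarm/Kimi-K2-Instruct",
        "messages": [{"role": "user", "content": "Briefly introduce yourself."}]
      }'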
prompt eval time = 2838.15 ms / 45 tokens ( 63.07 ms per token, 15.86 tokens per second)
eval time = 7819.60 ms / 44 tokens ( 177.72 ms per token, 5.63 tokens per second)
total time = 10657.75 ms / 89 tokens
prompt eval time = 3378.50 ms / 29 tokens ( 116.50 ms per token, 8.58 tokens per second)
eval time = 107740.53 ms / 584 tokens ( 184.49 ms per token, 5.42 tokens per second)
total time = 111119.04 ms / 613 tokens
First test is looking good!
generate imatrix
model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf
numactl --interleave=all \
./build/bin/llama-imatrix \
-m "$model" \
-f ubergarm-imatrix-calibration-corpus-v02.txt \
-o /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-Kimi-K2-Instruct-Q8_0.dat \
--verbosity 1 \
--ctx-size 512 \
--numa distribute \
--threads 384
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 535.444 ms
compute_imatrix: computing over 827 chunks with batch_size 512
compute_imatrix: 23.00 seconds per pass - ETA 5 hours 17.07 minutes
[1]74.5010,[2]13.7811,[3]6.6587,[4]4.1525,[5]3.2189,[6]2.6832,[7]2.3490,[8]2.1323,[9]2.0879,
save_imatrix: entry ' blk.59.ffn_down_exps.weight' has partial data (99.74%) - skipping
save_imatrix: entry ' blk.59.ffn_up_exps.weight' has partial data (99.74%) - skipping
.
# *OOF* note it skipped a lot; might need to look into
# https://github.com/ggml-org/llama.cpp/pull/9400
.
save_imatrix: entry ' blk.39.ffn_gate_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.11.ffn_up_exps.weight' has partial data (99.48%) - skipping
save_imatrix: entry ' blk.41.ffn_down_exps.weight' has partial data (99.74%) - skipping
save_imatrix: storing only 630 out of 789 entries
save_imatrix: stored collected data after 10 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-Kimi-K2-Instruct-Q8_0.dat
[10]2.0340,[11]2.0793,[12]2.2799,[13]2.3202,[14]2.3551,[15]2.2266,
# I cancelled this as I'm not happy with skipping so many routed exps
# see next try for another approach
imatrix take 2
# https://github.com/ggml-org/llama.cpp/pull/9400
$ git fetch upstream
$ git checkout compilade/imatrix-batched-chunks
$ git checkout kimi-k2
$ git checkout -b testing
$ git rebase compilade/imatrix-batched-chunks
Successfully rebased and updated refs/heads/testing.
# recompile CPU only
$ ./build/bin/llama-imatrix --version
version: 5916 (942c55cd5)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf
numactl --interleave=all \
./build/bin/llama-imatrix \
-m "$model" \
-f ubergarm-imatrix-calibration-corpus-v02.txt \
-o /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-Kimi-K2-Instruct-Q8_0.gguf \
--verbosity 1 \
--ctx-size 512 \
--numa distribute \
--threads 384
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
system_info: n_threads = 384 (n_threads_batch = 384) / 768 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 532.134 ms
compute_imatrix: computing over 827 chunks, n_ctx=512, batch_size=2048, n_seq=4
compute_imatrix: 91.40 seconds per pass - ETA 5 hours 14.95 minutes
[1]74.5010,[2]13.7811,[3]6.6587,[4]4.1525,[5]3.2189,[6]2.6832,[7]2.3490,[8]2.1323,
save_imatrix: entry ' blk.59.ffn_down_exps.weight' has partial data (99.74%)
save_imatrix: entry ' blk.59.ffn_up_exps.weight' has partial data (99.74%)
save_imatrix: entry ' blk.58.ffn_down_exps.weight' has partial data (99.48%)
.
.
.
save_imatrix: entry ' blk.11.ffn_up_exps.weight' has partial data (99.48%)
save_imatrix: entry ' blk.41.ffn_down_exps.weight' has partial data (99.74%)
save_imatrix: stored collected data after 10 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf
[9]2.0879,[10]2.0340,[11]2.0793,[12]2.2799,
...
[817]3.2695,[818]3.2709,[819]3.2721,[820]3.2731,[821]3.2742,[822]3.2773,[823]3.2794,[824]3.2803,[825]3.2820,[826]3.2833,[827]3.2850,
Final estimate: PPL = 3.2850 +/- 0.01492
save_imatrix: stored collected data after 827 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf
llama_perf_context_print: load time = 92197.30 ms
llama_perf_context_print: prompt eval time = 12318983.59 ms / 423424 tokens ( 29.09 ms per token, 34.37 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 12348393.31 ms / 423425 tokens
$ du -h /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf
1.5G /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf
Cool, so I have an experimental mainline imatrix "gguf" made using two unmerged PRs, haha... If anyone wants it, holler at me and I'll upload it to huggingface here: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF
Hi, I'm currently testing your PR!!
I'm running the Q4_K_M on a dual EPYC 9654 system; prefill took ~15 mins, and the full run on wikitext-2-raw/wiki.test.raw will take ~8 hrs...
@ubergarm I saw you mentioned somewhere that ik_llama needs `mla` code patching to work with kimi-k2. Do you have something in progress? I can take a stab at tweaking some constants to make it work for kimi-k2, but I'm not fluent in cpp :)
So I'm pretty fuzzy on the details, but I know that models I've converted via the mainline fp8 cast then convert method print a warning in ik_llama.cpp about a missing `wkv_b` tensor, which restricts us to the slower `-mla 1` implementation. However, the evshiron + triton-cpu method of going directly from fp8 safetensors to bf16 GGUFs documented here seems to somehow preserve those, maybe??
But I haven't tried with ik_llama.cpp yet; it might "just work" to quantize these bf16 GGUFs as-is. I'd say try that first and see if any warnings are printed. If it does print warnings, then it might be a task of (rough sketch after this list):
- Merge the changes of gabriellarson's python patches into the evshiron fork.
- Merge the changes of gabriellarson's cpp code into ik_llama.cpp (for the unicode stuff).
- Use the patched evshiron+gabriellarson convert script to do the one-step fp8 safetensors -> bf16 GGUF conversion.
- Hope all the tensors are laid out the og fairydreaming way so ik_llama.cpp can use the `-mla 2` and `-mla 3` optimizations.
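A rough sketch of what that merge-and-convert workflow might look like; the repo URL, branch name, and file list below are my guesses and aren't verified:
# rough sketch only: remote URLs, branch names, and which files carry the Kimi changes are guesses
git clone https://github.com/evshiron/llama.cpp evshiron-llama.cpp
cd evshiron-llama.cpp
git remote add gabriellarson https://github.com/gabriellarson/llama.cpp
git fetch gabriellarson
git checkout -b kimi-k2-direct
# pull over just the python-side convert changes from the kimi-k2 branch
# (the convert script name/layout may differ in the evshiron fork)
git checkout gabriellarson/kimi-k2 -- convert_hf_to_gguf.py gguf-py/
# then convert straight from the fp8 safetensors to bf16 GGUFs
python convert_hf_to_gguf.py --outtype bf16 \
    --outfile /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/ \
    /mnt/raid/models/moonshotai/Kimi-K2-Instruct/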
@ubergarm Thanks for the extra details! I'm porting @gabriellarson's patch to ik_llama because I can't wait any longer (operation CWAL).
Duplicating the patch to ik_llama seems to do the trick: it complains, but it works, and the speed seems OK. I'll post a PR and keep digging.
So I'm pretty fuzzy on the details, but I know that models I've converted via the mainline fp8 cast then convert method print a warning in ik_llama.cpp about a missing `wkv_b` tensor, which restricts us to the slower `-mla 1` implementation. However, the evshiron + triton-cpu method of going directly from fp8 safetensors to bf16 GGUFs documented here seems to somehow preserve those, maybe??
So the reason is that the evshiron fork dates from when ik and mainline shared an MLA implementation; the convert code has stayed the same in ik even as its MLA has developed, whereas mainline's has changed, from what I know.
This comment has links to a few different models with differing MLA implementations from different processes and goes over some of the differences (even if the compatibility part is outdated).
You could in theory use the 2TB of safetensors with ik's convert script and it should work from what I know; alternatively, you could just add the Python changes from the Kimi PR (relatively minor) to the evshiron fork.
Either way, I may request a custom quant from you, since unlike the DeepSeek models I don't know if I want to go through the process myself given how much larger this one is.
Something seems off about BF16 00001. llama-quantize doesn't seem to identify it as part of a split sequence; it processes 00001 and then terminates with success.
I believe you were asking somewhere else about refusals and such; there are some people trying it out on a Discord who might know more about that if you are interested: https://huggingface.co/BeaverAI in the #671b-xxl channel. The folks on that discord might be interested in mikupad too, though most seem to use silly tavern.
Something seems off about BF16 00001. llama-quantize doesn't seem to identify it as part of a split sequence; it processes 00001 and then terminates with success.
The BF16 safetensors or the BF16 GGUF? I'm assuming you mean the output of the next step, convert_hf_to_gguf.py, and not the output of fp8_cast_bf16, right?
I'd need some more details, e.g. I'm assuming you created them yourself and didn't download one of the available ones on HF?
@usrlocalben
If you were using my GGUFs, I'd recommend trying someone else's; mine were made before a few different changes landed, and it seems like many people are having problems with them.
I believe you were asking somewhere else about refusals and such; there are some people trying it out on a Discord who might know more about that if you are interested: https://huggingface.co/BeaverAI in the #671b-xxl channel.
Thanks for the link, though I'm not sure I'll go (discord is really not my scene).
The folks on that discord might be interested in mikupad too, though most seem to use silly tavern.
Although the Mikupad branch/PR is public, it isn't yet in a state where I feel it's ready to be broadcast (mostly because I still plan to make more changes that will require database migrations for people following along [there will still end up being one for people migrating from the node server, but that's unavoidable]).
Final estimate: PPL = 3.2089 +/- 0.01657
llama_perf_context_print: load time = 328477.51 ms
llama_perf_context_print: prompt eval time = 25102851.70 ms / 290816 tokens ( 86.32 ms per token, 11.58 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 25109517.17 ms / 290817 tokens
Q4_K_M on dual EPYC 9654, OpenBLAS build.
Also built BLIS with AOCL/AOCC; minimal change in ETA, so I did not run it.
Testing lower quants
Final estimate: PPL = 3.2089 +/- 0.01657
I just tested my new SOTA ik_llama.cpp quant, IQ2_KL at 345.687 GiB (2.892 BPW), which I clocked at:
Final estimate: PPL = 3.2741 +/- 0.01689
The IQ2_KL quant type is less than a week old, but it's looking strong for that size range. Hoping to get the full Q8_0 value as a baseline soon.
Just got the baseline q8_0: Final estimate: PPL = 2.9507 +/- 0.01468
cd ik_llama.cpp
model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf
numactl --interleave=all \
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
-mla 3 \
--ctx-size 512 \
--numa distribute \
--threads 192 \
--threads-batch 384
Hm... I think my Q4 quant may not be the best either, as I did not use an imatrix. On the same setup, bartowski's calibration data takes 27 hours for fp16!
Maybe it's time to use unsloth's quants.
Nevertheless, my original goal was to test performance on EPYC; r/LocalLLaMA is interested as well.
If you want to make your own mainline quants I've released a mainline imatrix computed with my usual imatrix corpus. https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF/blob/main/mainline/imatrix-mainline-pr9400-plus-kimi-k2-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf
It requires the unmerged PR https://github.com/ggml-org/llama.cpp/pull/9400 (I put a comment towards the bottom with some more info).
This imatrix should be superior to what is already available given it doesn't drop data for MoEs and also properly handles MLA tensors so you don't have to keep them all at Q8_0 incurring some TG speed penalty.
Also if you want to use ik's fork I have some high quality quants in that size range with full perplexity data available in the same hf repo above.
I also have sweep-bench data on ik's fork showing good CPU-only performance with "only" ~256GB/s of RAM bandwidth in a single NUMA node, posted on reddit: https://www.reddit.com/r/LocalLLaMA/comments/1m0uoqo/comment/n3dqg0b/
PP really flies on Zen 5 with an unmerged experimental AVX512 optimization.
Cheers!
@ubergarm wow! your data at https://github.com/ikawrakow/ik_llama.cpp/pull/612#issuecomment-3076539817 looks great! awesome results, thanks for the work!