Thanks for the mainline llama.cpp PR effort!
I'm excitedly following your mainline llama.cpp PR https://github.com/ggml-org/llama.cpp/pull/14654
Curious if @anikifoss or anyone else has had success with it! I'm just getting access to a big RAM rig again and might give it a try.
Cheers!
This model is huge! I have to quantize on the HDD, which is painfully slow. Also, not enough RAM to do full perplexity tests with Q8_0.
@ubergarm if you manage to get your hands on a big rig, could you share your Q8_0 perplexity results along with the command line (if you end up testing with Q8_0)? That should give me a baseline to compare the perplexity of Q2/DQ2/Q4/DQ4 quants against.
Yeah, this beast is difficult to maneuver! So it sounds like folks are able to get bf16 GGUFs and start quantizing, which is great. For ik's fork it might be tricky given the different MLA handling; I'll have to dig into it deeper.
With some luck I can get that Q8_0 through imatrix and then also do my usual perplexity treatment on a dual-socket system with 768GB RAM in each NUMA node, compiled CPU-only. The command will likely be something like this (the -mla 3 -fmoe flags exist only on ik's fork; omit them for mainline):
$ echo 0 | sudo tee /proc/sys/kernel/numa_balancing
$ numactl --interleave=all \
./build/bin/llama-perplexity \
--model /mnt/raid/Kimi-K2-Instruct-Q8_0.gguf \
-f wiki.test.raw \
-ctk fp16 \
-fa \
-mla 3 -fmoe \
--ctx-size 512 \
--ubatch-size 512 \
--seed 1337 \
--numa distribute \
--threads 384
Downloading the fp8 safetensors now!
For anyone following along at home, I'm going to try the mainline llama.cpp method (so far so good) to cast the fp8 safetensors to bf16 safetensors:
# get script
wget https://raw.githubusercontent.com/deepseek-ai/DeepSeek-V3/refs/heads/main/inference/fp8_cast_bf16.py
# edit for CPU usage `cuda` -> `cpu`
# loaded_files[file_name] = load_file(file_path, device="cpu")
# current_state_dict = load_file(safetensor_file, device="cpu")
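# a possible way to make that edit non-interactively, assuming the only GPU-specific
# bits are the two `device="cuda"` arguments noted above (double-check the script first)
sed -i 's/device="cuda"/device="cpu"/g' fp8_cast_bf16.py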
# install OS deps
apt-get install zlib1g-dev # cmake build-essential and all that jazz too of course
# install python deps
# https://docs.astral.sh/uv/getting-started/installation/
uv venv ./venv --python 3.12 --python-preference=only-managed
source ./venv/bin/activate
uv pip install ninja cmake wheel setuptools pybind11
uv pip install tqdm torch safetensors numpy
uv pip uninstall triton
# we'll use triton-cpu instead, so no GPU with >= sm89 is required
# runs entirely in RAM; the high-water mark seems to be around 100GB for Kimi-K2-Instruct
# install triton-cpu
$ git clone https://github.com/triton-lang/triton-cpu --recursive
$ cd triton-cpu
# build it
MAX_JOBS=32 uv pip install -e python --no-build-isolation
# now cast it
$ python fp8_cast_bf16.py --help
usage: fp8_cast_bf16.py [-h] --input-fp8-hf-path INPUT_FP8_HF_PATH --output-bf16-hf-path OUTPUT_BF16_HF_PATH
options:
-h, --help show this help message and exit
--input-fp8-hf-path INPUT_FP8_HF_PATH
--output-bf16-hf-path OUTPUT_BF16_HF_PATH
$ python fp8_cast_bf16.py \
--input-fp8-hf-path /mnt/raid/models/moonshotai/Kimi-K2-Instruct/ \
--output-bf16-hf-path /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/ 2>&1 | tee -a logs/fp8_cast_bf16-Kimi-K2-Instruct.log
49%|█████     | 30/61 [14:42<10:19, 19.97s/it]
.
.
.
95%|██████████| 58/61 [1:12:17<09:02, 180.72s/it]
# slowing down as it gets closer to the end...
# it finished!
100%|██████████| 61/61 [1:21:53<00:00, 80.55s/it]
# now we have ~1TB of fp8 safetensors plus the new bf16 safetensors:
$ du -h /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
1.9T /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
# next step is make another ~2T bf16 GGUF
# thank u kioxia and wendell for all this fast disk space
Okay let's continue...
convert
$ cd llama.cpp
$ git remote add gabriellarson [email protected]:gabriellarson/llama.cpp.git
$ git fetch gabriellarson
$ git checkout kimi-k2
$ git rev-parse --short HEAD
273ea092b
# compile CPU only
cmake -B build -DGGML_CUDA=OFF
cmake --build build --config Release -j $(nproc)
# more dependencies
uv pip install transformers protobuf sentencepiece tiktoken blobfile
# copy over additional files
# !!!*careful* don't accidentally overwrite the output model index file `model.safetensors.index.json`!!!
cp /mnt/raid/models/moonshotai/Kimi-K2-Instruct/config.json /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
cp /mnt/raid/models/moonshotai/Kimi-K2-Instruct/generation_config.json /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
cp /mnt/raid/models/moonshotai/Kimi-K2-Instruct/tokenizer_config.json /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
cp /mnt/raid/models/moonshotai/Kimi-K2-Instruct/*.py /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
cp /mnt/raid/models/moonshotai/Kimi-K2-Instruct/*.model /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
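# optional sanity check that the cast script's bf16 output index wasn't clobbered by the copies above:
# I believe the bf16 index should no longer reference any *_scale_inv tensors, while the fp8 one has many
grep -c scale_inv /mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/model.safetensors.index.json
grep -c scale_inv /mnt/raid/models/moonshotai/Kimi-K2-Instruct/model.safetensors.index.json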
# convert
python \
convert_hf_to_gguf.py \
--outtype bf16 \
--split-max-size 50G \
--outfile /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/ \
/mnt/raid/models/ubergarm/Kimi-K2-Instruct-bf16-safetensors/
...
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x14B-Instruct-safetensors-BF16-00001-of-00045.gguf: n_tensors = 34, total_size = 48.7G
...
INFO:gguf.gguf_writer:/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x14B-Instruct-safetensors-BF16-00044-of-00045.gguf: n_tensors = 19, total_size = 45.4G
INFO:gguf.gguf_writer:/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x14B-Instruct-safetensors-BF16-00045-of-00045.gguf: n_tensors = 33, total_size = 45.7G
Shard (1/45): 30%|███       | 14.6G/48.7G [01:20<02:38, 216Mbyte/s]
Writing: 1%|          | 14.6G/2.05T [01:20<2:37:28, 216Mbyte/s]
...
Writing: 100%|██████████| 2.05T/2.05T [2:34:32<00:00, 221Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/
$ du -hc /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/
1.9T /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/
quantize
make a Q8_0 for generating the imatrix, running on 1TB RAM
# [0,60] Layers
# First Layer has dense ffn_(gate|up|down) vs DeepSeek's three dense layers
# Remaining layers have 384x exps and 1x shexp vs deepseek's 256 routed exps
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
--pure \
/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x14B-Instruct-safetensors-BF16-00001-of-00045.gguf \
/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf \
Q8_0 \
192
...
llama_model_quantize_impl: model size = 1958035.30 MB
llama_model_quantize_impl: quant size = 1040503.41 MB
main: quantize time = 1458161.43 ms
main: total time = 1458161.43 ms
# subsequent runs can use an imatrix generated with the llama-imatrix command shown further below, e.g.
# --imatrix /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-Kimi-K2-Instruct-BF16.dat
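For reference, a later imatrix-guided run would look roughly like this (the Q4_K_M output here is just an example; the imatrix filename is the placeholder from the comment above):
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --imatrix /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-Kimi-K2-Instruct-BF16.dat \
    /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x14B-Instruct-safetensors-BF16-00001-of-00045.gguf \
    /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q4_K_M.gguf \
    Q4_K_M \
    192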
vibe check
Always gotta make sure it seems legit in a few rounds of multi-turn instruct chat.
# the Q8_0 is ~1017GiB so requires RAM from two NUMA nodes (BIOS in NPS1, one node per socket each with ~768GB RAM)
$ echo 0 | sudo tee /proc/sys/kernel/numa_balancing
$ sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
$ export model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf
$ numactl --interleave=all \
./build/bin/llama-server \
--model "$model"\
--alias ubergarm/Kimi-K2-Instruct \
--ctx-size 32768 \
-ctk q8_0 \
-fa \
--parallel 1 \
--threads 192 \
--threads-batch 384 \
--numa distribute \
--host 127.0.0.1 \
--port 8080
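Then I just poke it over the OpenAI-compatible chat endpoint for a few rounds of multi-turn chat, e.g. something along these lines (the payload is just an example):
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ubergarm/Kimi-K2-Instruct",
        "messages": [{"role": "user", "content": "Briefly introduce yourself."}]
      }'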
prompt eval time = 2838.15 ms / 45 tokens ( 63.07 ms per token, 15.86 tokens per second)
eval time = 7819.60 ms / 44 tokens ( 177.72 ms per token, 5.63 tokens per second)
total time = 10657.75 ms / 89 tokens
prompt eval time = 3378.50 ms / 29 tokens ( 116.50 ms per token, 8.58 tokens per second)
eval time = 107740.53 ms / 584 tokens ( 184.49 ms per token, 5.42 tokens per second)
total time = 111119.04 ms / 613 tokens
First test is looking good!
generate imatrix
model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf
numactl --interleave=all \
./build/bin/llama-imatrix \
-m "$model" \
-f ubergarm-imatrix-calibration-corpus-v02.txt \
-o /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-Kimi-K2-Instruct-Q8_0.dat \
--verbosity 1 \
--ctx-size 512 \
--numa distribute \
--threads 384
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 535.444 ms
compute_imatrix: computing over 827 chunks with batch_size 512
compute_imatrix: 23.00 seconds per pass - ETA 5 hours 17.07 minutes
[1]74.5010,[2]13.7811,[3]6.6587,[4]4.1525,[5]3.2189,[6]2.6832,[7]2.3490,[8]2.1323,[9]2.0879,
save_imatrix: entry ' blk.59.ffn_down_exps.weight' has partial data (99.74%) - skipping
save_imatrix: entry ' blk.59.ffn_up_exps.weight' has partial data (99.74%) - skipping
.
# *OOF* note it skipped a lot; might need to look into
# https://github.com/ggml-org/llama.cpp/pull/9400
.
save_imatrix: entry ' blk.39.ffn_gate_exps.weight' has partial data (99.22%) - skipping
save_imatrix: entry ' blk.11.ffn_up_exps.weight' has partial data (99.48%) - skipping
save_imatrix: entry ' blk.41.ffn_down_exps.weight' has partial data (99.74%) - skipping
save_imatrix: storing only 630 out of 789 entries
save_imatrix: stored collected data after 10 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-Kimi-K2-Instruct-Q8_0.dat
[10]2.0340,[11]2.0793,[12]2.2799,[13]2.3202,[14]2.3551,[15]2.2266,
# I cancelled this as I'm not happy with skipping so many routed exps
# see next try for another approach
imatrix take 2
# https://github.com/ggml-org/llama.cpp/pull/9400
$ git fetch upstream
$ git checkout compilade/imatrix-batched-chunks
$ git checkout kimi-k2
$ git checkout -b testing
$ git rebase compilade/imatrix-batched-chunks
Successfully rebased and updated refs/heads/testing.
# recompile CPU only
$ ./build/bin/llama-imatrix --version
version: 5916 (942c55cd5)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf
numactl --interleave=all \
./build/bin/llama-imatrix \
-m "$model" \
-f ubergarm-imatrix-calibration-corpus-v02.txt \
-o /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-Kimi-K2-Instruct-Q8_0.gguf \
--verbosity 1 \
--ctx-size 512 \
--numa distribute \
--threads 384
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
system_info: n_threads = 384 (n_threads_batch = 384) / 768 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 532.134 ms
compute_imatrix: computing over 827 chunks, n_ctx=512, batch_size=2048, n_seq=4
compute_imatrix: 91.40 seconds per pass - ETA 5 hours 14.95 minutes
[1]74.5010,[2]13.7811,[3]6.6587,[4]4.1525,[5]3.2189,[6]2.6832,[7]2.3490,[8]2.1323,
save_imatrix: entry ' blk.59.ffn_down_exps.weight' has partial data (99.74%)
save_imatrix: entry ' blk.59.ffn_up_exps.weight' has partial data (99.74%)
save_imatrix: entry ' blk.58.ffn_down_exps.weight' has partial data (99.48%)
.
.
.
save_imatrix: entry ' blk.11.ffn_up_exps.weight' has partial data (99.48%)
save_imatrix: entry ' blk.41.ffn_down_exps.weight' has partial data (99.74%)
save_imatrix: stored collected data after 10 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf
[9]2.0879,[10]2.0340,[11]2.0793,[12]2.2799,
...
[817]3.2695,[818]3.2709,[819]3.2721,[820]3.2731,[821]3.2742,[822]3.2773,[823]3.2794,[824]3.2803,[825]3.2820,[826]3.2833,[827]3.2850,
Final estimate: PPL = 3.2850 +/- 0.01492
save_imatrix: stored collected data after 827 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf
llama_perf_context_print: load time = 92197.30 ms
llama_perf_context_print: prompt eval time = 12318983.59 ms / 423424 tokens ( 29.09 ms per token, 34.37 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 12348393.31 ms / 423425 tokens
$ du -h /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf
1.5G /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-mainline-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf
Cool, so I have an experimental mainline imatrix "gguf" made using two unmerged PRs, haha... If anyone wants it, holler at me and I'll upload it to huggingface here: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF
Hi, I'm currently testing your PR!!
I'm running the Q4_K_M on a dual EPYC 9654 system; prefill took ~15 mins, and the full run on wikitext-2-raw/wiki.test.raw will take ~8 hrs...
@ubergarm I saw you mentioned somewhere that ik_llama needs `mla` code patching to work with kimi-k2. Do you have something in progress? I can take a stab at tweaking some constants to make it work for kimi-k2, but I'm not fluent in cpp :)
So I'm pretty fuzzy on the details, but I know that models I've converted via the mainline fp8 cast then convert method print a warning in ik_llama.cpp about a missing `wkv_b` tensor, which restricts us to the slower `-mla 1` implementation. However, the evshiron + triton-cpu method of going directly from fp8 safetensors to bf16 GGUFs documented here seems to somehow preserve those, maybe??
But I haven't tried with ik_llama.cpp yet; it might "just work" to quantize these bf16 GGUFs as-is. I'd say try that first and see if any warnings are printed. If it does print warnings, then it might be a task of (rough sketch after this list):
- Merge the changes of gabriellarson's python patches into the evshiron fork.
- Merge the changes of gabriellarson's cpp code into ik_llama.cpp (for the unicode stuff).
- Use the patched evshiron+gabriellarson convert script to do the one-step fp8 safetensors -> bf16 GGUF conversion.
- Hope all the tensors are laid out the og fairydreaming way so ik_llama.cpp can use the `-mla 2` and `-mla 3` optimizations.
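A rough sketch of what that merge-and-convert workflow might look like; the repo URL, branch name, and file list below are my guesses and aren't verified:
# rough sketch only: remote URLs, branch names, and which files carry the Kimi changes are guesses
git clone https://github.com/evshiron/llama.cpp evshiron-llama.cpp
cd evshiron-llama.cpp
git remote add gabriellarson https://github.com/gabriellarson/llama.cpp
git fetch gabriellarson
git checkout -b kimi-k2-direct
# pull over just the python-side convert changes from the kimi-k2 branch
# (the convert script name/layout may differ in the evshiron fork)
git checkout gabriellarson/kimi-k2 -- convert_hf_to_gguf.py gguf-py/
# then convert straight from the fp8 safetensors to bf16 GGUFs
python convert_hf_to_gguf.py --outtype bf16 \
    --outfile /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/ \
    /mnt/raid/models/moonshotai/Kimi-K2-Instruct/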
@ubergarm Thanks for the extra details! I'm porting @gabriellarson's patch to ik_llama because I can't wait any longer (operation CWAL).
Duplicating the patch to ik_llama seems to do the trick: it complains, but it works, and the speed seems OK. I'll post a PR and keep digging.
So I'm pretty fuzzy on the details, but I know that models I've converted via the mainline fp8 cast then convert method print a warning in ik_llama.cpp about a missing `wkv_b` tensor, which restricts us to the slower `-mla 1` implementation. However, the evshiron + triton-cpu method of going directly from fp8 safetensors to bf16 GGUFs documented here seems to somehow preserve those, maybe??
So the reason is that the evshiron fork dates from when ik and mainline shared an MLA implementation; the convert code has stayed the same in ik even as its MLA has developed, whereas mainline's has changed, from what I know.
This comment has links to a few different models with differing MLA implementations from different processes and goes over some of the differences (even if the compatibility part is outdated).
You could in theory use the 2TB of safetensors with ik's convert script and it should work from what I know; alternatively, you could just add the Python changes from the Kimi PR (relatively minor) to the evshiron fork.
Either way, I may request a custom quant from you, since unlike the DeepSeek models I don't know if I want to go through the process myself given how much larger this one is.
Something seems off about BF16 00001. llama-quantize doesn't seem to identify it as part of a split sequence; it processes 00001 and then terminates with success.
I believe you were asking somewhere else about refusals and such; there are some people trying it out on a Discord who might know more about that if you are interested: https://huggingface.co/BeaverAI in the #671b-xxl channel. The folks on that discord might be interested in mikupad too, though most seem to use silly tavern.
Something seems off about BF16 00001. llama-quantize doesn't seem to identify it as part of a split sequence; it processes 00001 and then terminates with success.
The BF16 safetensors or the BF16 GGUF? I'm assuming you mean the output of the next step, convert_hf_to_gguf.py, and not the output of fp8_cast_bf16, right?
I'd need some more details, e.g. I'm assuming you created them yourself and didn't download one of the available ones on HF?
@usrlocalben
If you were using my GGUFs, I'd recommend trying someone else's; mine were made before a few different changes landed, and it seems like many people are having problems with them.
I believe you were asking somewhere else about refusals and such; there are some people trying it out on a Discord who might know more about that if you are interested: https://huggingface.co/BeaverAI in the #671b-xxl channel.
Thanks for the link, though I'm not sure I'll go (discord is really not my scene).
The folks on that discord might be interested in mikupad too, though most seem to use silly tavern.
Although the Mikupad branch/PR is public, it isn't yet in a state where I feel it's ready to be broadcast (mostly because I still plan to make more changes that will require database migrations for people following along [there will still end up being one for people migrating from the node server, but that's unavoidable]).
Final estimate: PPL = 3.2089 +/- 0.01657
llama_perf_context_print: load time = 328477.51 ms
llama_perf_context_print: prompt eval time = 25102851.70 ms / 290816 tokens ( 86.32 ms per token, 11.58 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 25109517.17 ms / 290817 tokens
Q4_K_M on dual EPYC 9654, OpenBLAS build.
Also built BLIS with AOCL/AOCC; minimal change in ETA, so I did not run it.
Testing lower quants
Final estimate: PPL = 3.2089 +/- 0.01657
I just tested my new SOTA ik_llama.cpp quant, IQ2_KL at 345.687 GiB (2.892 BPW), which I clocked at:
Final estimate: PPL = 3.2741 +/- 0.01689
The IQ2_KL quant type is less than a week old, but it's looking strong for that size range. Hoping to get the full Q8_0 value as a baseline soon.
Just got the baseline q8_0: Final estimate: PPL = 2.9507 +/- 0.01468
cd ik_llama.cpp
model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf
numactl --interleave=all \
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
-mla 3 \
--ctx-size 512 \
--numa distribute \
--threads 192 \
--threads-batch 384
Hm... I think my Q4 quant may not be the best either, as I did not use an imatrix. On the same setup, bartowski's calibration data takes 27 hours for fp16!
Maybe it's time to use unsloth's quants.
Nevertheless, my original goal was to test performance on EPYC; r/LocalLLaMA is interested as well.
If you want to make your own mainline quants I've released a mainline imatrix computed with my usual imatrix corpus. https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF/blob/main/mainline/imatrix-mainline-pr9400-plus-kimi-k2-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf
It requires the unmerged PR https://github.com/ggml-org/llama.cpp/pull/9400 (I put a comment towards the bottom with some more info).
This imatrix should be superior to what is already available given it doesn't drop data for MoEs and also properly handles MLA tensors so you don't have to keep them all at Q8_0 incurring some TG speed penalty.
Also if you want to use ik's fork I have some high quality quants in that size range with full perplexity data available in the same hf repo above.
I also have sweep-bench data on ik's fork showing good CPU-only performance with "only" ~256GB/s of RAM bandwidth in a single NUMA node, posted on reddit: https://www.reddit.com/r/LocalLLaMA/comments/1m0uoqo/comment/n3dqg0b/
PP really flies on Zen 5 with an unmerged experimental AVX512 optimization.
Cheers!
@ubergarm wow! your data at https://github.com/ikawrakow/ik_llama.cpp/pull/612#issuecomment-3076539817 looks great! awesome results, thanks for the work!