mlx-community/Kimi-K2-Instruct-0905-mlx-DQ3_K_M

This model mlx-community/Kimi-K2-Instruct-0905-mlx-DQ3_K_M was converted to MLX format from moonshotai/Kimi-K2-Instruct-0905 using mlx-lm version 0.26.3.

This quantization was created for people running a single Apple Mac Studio M3 Ultra with 512 GB of unified memory, where the 4-bit version of Kimi K2 does not fit. Following published research, the aim is to get 4-bit performance from a slightly smaller and smarter quantization, while keeping it small enough to leave memory free for a useful context window.

pip install mlx-lm

mlx_lm.generate --model mlx-community/Kimi-K2-Instruct-0905-mlx-DQ3_K_M --temp 0.6 --min-p 0.01 --max-tokens 4096 --trust-remote-code --prompt "Hello"
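
You can also call the model from Python. The snippet below is a minimal sketch using the standard mlx_lm helpers load() and generate(); it assumes only the usual mlx-lm Python API and the model id above, and leaves the sampling settings at their defaults for brevity.

# Minimal Python usage sketch (standard mlx_lm load()/generate() helpers).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Kimi-K2-Instruct-0905-mlx-DQ3_K_M")

prompt = "Hello"
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=4096, verbose=True)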

What is this DQ3_K_M?

In the arXiv paper Quantitative Analysis of Performance Drop in DeepSeek Model Quantization (arXiv:2505.02390), the authors write,

We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms traditional Q3_K_M variant on various benchmarks, which is also comparable with 4-bit quantization (Q4_K_M) approach in most tasks.

and

dynamic 3-bit quantization method (DQ3_K_M) that outperforms the 3-bit quantization implementation in llama.cpp and achieves performance comparable to 4-bit quantization across multiple benchmarks.

The resulting multi-bit-width quantization scheme is tested and documented in detail in that paper.


How can you create your own DQ3_K_M quants?

In the convert.py file of mlx-lm on your system (you can see the original code here), replace the code inside def mixed_quant_predicate() with something like the following:

        index = (
            int(path.split(".")[layer_location])
            if len(path.split(".")) > layer_location
            else 0
        )
        # Build a mixed quant like "DQ3" of the arXiv paper https://arxiv.org/abs/2505.02390
        #    Quantitative Analysis of Performance Drop in DeepSeek Model Quantization
        q_bits = 4  # default for anything not matched below
        if "lm_head" in path:
            q_bits = 6
        # if "tokens" in path:
        #     q_bits = 4
        if "attn.kv" in path:
            q_bits = 6
        # if "o_proj" in path:
        #     q_bits = 4
        # if "attn.q" in path:
        #     q_bits = 4
        # For all "mlp" and "shared experts"
        if "down_proj" in path:
            q_bits = 6
        # if "up_proj" in path:
        #     q_bits = 4
        # if "gate_proj" in path:
        #     q_bits = 4
        # For "switch experts"
        if "switch_mlp.up_proj" in path:
            q_bits = 3
        if "switch_mlp.gate_proj" in path:
            q_bits = 3
        if "switch_mlp.down_proj" in path:
            q_bits = 3
            # Blocks up to 5 are higher quality
            if index < 5:
                q_bits = 6
            # Every 5th block is "medium" quality
            if (index % 5) == 0:
                q_bits = 4
        # print("path:", path, "index:", index, "q_bits:", q_bits)
        return {"group_size": group_size, "bits": q_bits}

Should you wish to squeeze a bit more quality out of your quant, and you do not need a larger context window, you can change the last part of the above code to:

        if "switch_mlp.down_proj" in path:
           q_bits = 4
           # Blocks up to 5 are higher quality
           if index < 5:
              q_bits = 6
        #print("path:", path, "index:", index, "q_bits:", q_bits)
        return {"group_size": group_size, "bits": q_bits}

Then create your DQ3_K_M quant with

mlx_lm.convert --hf-path moonshotai/Kimi-K2-Instruct-0905 --mlx-path your-model-DQ3_K_M -q --quant-predicate mixed_3_4 --trust-remote-code
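
If you prefer not to patch convert.py at all, the same recipe can also be passed programmatically. The sketch below assumes that mlx_lm.convert() accepts a callable quant_predicate(path, module, config) returning either False or a dict of quantization parameters, as recent mlx-lm versions do; the dq3_k_m name and the hasattr guard for non-quantizable modules are illustrative additions, not part of this model card.

# Programmatic sketch: pass a custom predicate to mlx_lm.convert() instead of
# editing convert.py. Assumes quant_predicate may be a callable taking
# (path, module, config) and returning False or {"group_size": ..., "bits": ...}.
from mlx_lm import convert

GROUP_SIZE = 64
LAYER_LOCATION = 2  # layer index position in names like "model.layers.N...."

def dq3_k_m(path, module, config):
    # Skip modules that cannot be quantized (norms, etc.).
    if not hasattr(module, "to_quantized"):
        return False
    parts = path.split(".")
    index = int(parts[LAYER_LOCATION]) if len(parts) > LAYER_LOCATION else 0
    q_bits = 4
    if "lm_head" in path or "attn.kv" in path or "down_proj" in path:
        q_bits = 6
    if "switch_mlp.up_proj" in path or "switch_mlp.gate_proj" in path:
        q_bits = 3
    if "switch_mlp.down_proj" in path:
        q_bits = 3
        if index < 5:
            q_bits = 6
        if index % 5 == 0:
            q_bits = 4
    return {"group_size": GROUP_SIZE, "bits": q_bits}

convert(
    hf_path="moonshotai/Kimi-K2-Instruct-0905",
    mlx_path="your-model-DQ3_K_M",
    quantize=True,
    quant_predicate=dq3_k_m,
)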

Enjoy!
