🌙 Kimi K2 Instruct - MLX 4-bit

State-of-the-Art 1-Trillion-Parameter MoE Model, Optimized for Apple Silicon



📖 What is This?

This is a high-performance 4-bit quantized version of Kimi K2 Instruct, optimized to run on Apple Silicon (M1/M2/M3/M4) Macs using the MLX framework. The 4-bit version offers an excellent balance between quality and efficiency - the sweet spot for most practical deployments!

✨ Why You'll Love It

  • 🚀 Massive Context Window - Handle up to 262,144 tokens (~200,000 words!)
  • 🧠 1T Total Parameters - ~32B active per token; one of the most capable open-weight models available
  • Apple Silicon Native - Fully optimized for M-series chips with Metal acceleration
  • 🎯 4-bit Sweet Spot - Excellent balance of quality, speed, and size
  • Efficient Inference - only a fraction of the model runs per token, so generation is fast for a model this size
  • 🌏 Bilingual - Fluent in both English and Chinese
  • 💬 Instruction-Tuned - Ready for conversations, coding, analysis, and more

🎯 Quick Start

Hardware Requirements

Kimi K2 is a massive MoE model: ~1 trillion total parameters, with only ~32B active per token. Choose your quantization based on available unified memory:

Quantization Model Size Min Unified Memory Quality
2-bit ~320 GB ~360 GB Acceptable - noticeable quality loss
3-bit ~420 GB ~470 GB Good - recommended minimum
4-bit ~540 GB ~600 GB Very Good - best quality/size balance
5-bit ~660 GB ~730 GB Excellent
6-bit ~800 GB ~880 GB Near original
8-bit ~1 TB ~1.1 TB Closest to original

Since no single Mac currently ships with more than 512 GB of unified memory, 4-bit and above generally require sharding the weights across multiple Macs using MLX's distributed inference support.

Recommended Configurations

Setup Unified Memory Recommended Quantization
Mac Studio M3 Ultra 512 GB 2-bit or 3-bit
2× Mac Studio M3 Ultra (MLX distributed) 1 TB combined 4-bit to 6-bit
3× Mac Studio M3 Ultra (MLX distributed) 1.5 TB combined 8-bit

These are approximate: leave headroom for the KV cache and the OS.

Performance Notes

  • Inference Speed: expect roughly 5-15 tokens/sec, depending on quantization and hardware
  • Load Time: the first generation waits on model loading, which can take minutes while hundreds of GB of weights stream from disk
  • Context Window: the full 262,144-token (256K) context is supported
  • Active Parameters: only ~32B parameters are active per token (MoE architecture)

Installation

pip install mlx-lm

Your First Generation (3 lines of code!)

from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-4bit")
print(generate(model, tokenizer, prompt="Explain quantum entanglement simply:", max_tokens=200))

That's it! 🎉

💻 System Requirements

Component Minimum Recommended
Mac Apple Silicon (M series) Mac Studio M3 Ultra
Memory ~600 GB unified (typically split across multiple Macs) 1 TB+ combined
Storage 600 GB free Fast SSD with 1 TB+ free
macOS 13.5+ Latest version

Note: The 4-bit version is a popular choice for practical deployments - a good balance of quality, speed, and size.

📚 Usage Examples

Command Line Interface

mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-4bit \
  --prompt "Write a Python script to analyze CSV files." \
  --max-tokens 500

Chat Conversation

from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-4bit")

# Build the prompt with the model's own chat template rather than
# hand-writing special tokens (Kimi K2 does not use plain ChatML).
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant specialized in coding and problem-solving.",
    },
    {"role": "user", "content": "Can you help me optimize this Python code?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)

Advanced: Streaming Output

from mlx_lm import load, stream_generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-4bit")

# generate() returns the full completion at once; use stream_generate()
# to receive tokens as they are produced.
for response in stream_generate(
    model,
    tokenizer,
    prompt="Tell me about the future of AI:",
    max_tokens=500,
):
    print(response.text, end="", flush=True)

🏗️ Architecture Highlights


Model Specifications

Feature Value
Total Parameters ~1 Trillion
Architecture DeepSeek V3 (MoE)
Experts 384 routed + 1 shared
Active Experts 8 per token
Hidden Size 7168
Layers 61
Heads 64
Context Length 262,144 tokens
Quantization 4.502 bits per weight

Advanced Features

  • 🎯 YaRN RoPE Scaling - 64× factor for extended context (4,096 base × 64 = 262,144 tokens)
  • 🗜️ KV Compression - LoRA-based (rank 512)
  • ⚡ Query Compression - Q-LoRA (rank 1536)
  • 🧮 MoE Routing - Top-8 expert selection with sigmoid scoring
  • 🔧 FP8 Training - Pre-quantized with e4m3 precision
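
To make the routing step concrete, here is a toy Python sketch of top-k selection with sigmoid scoring. The logits are made up, and the real model routes each token over 384 experts (plus the always-on shared expert); this only illustrates the select-and-renormalize pattern, not Kimi K2's actual kernel:

```python
import math

def route_topk(router_logits, k=8):
    """Toy top-k MoE routing: sigmoid-score each expert, keep the top k,
    and renormalize their scores into mixture weights."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in router_logits]
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in topk)
    return {i: scores[i] / total for i in topk}

# One token's (made-up) router logits over 16 experts:
logits = [0.3, -1.2, 2.0, 0.1, -0.5, 1.5, 0.0, -2.0,
          0.7, 1.1, -0.3, 0.4, -1.0, 2.3, 0.2, -0.8]
weights = route_topk(logits, k=8)
print(sorted(weights))                  # indices of the 8 selected experts
print(round(sum(weights.values()), 6))  # mixture weights sum to 1
```

Each token's hidden state is then sent only to the selected experts, and their outputs are combined with these weights - which is why only ~32B of the ~1T parameters do work per token.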

🎨 Other Quantization Options

Choose the right balance for your needs:

Quantization Size Quality Speed Best For
8-bit ~1 TB ⭐⭐⭐⭐⭐ ⭐⭐⭐ Production, best quality
6-bit ~800 GB ⭐⭐⭐⭐ ⭐⭐⭐⭐ Sweet spot for most users
5-bit ~660 GB ⭐⭐⭐⭐ ⭐⭐⭐⭐ Great quality/size balance
4-bit (you are here) ~540 GB ⭐⭐⭐ ⭐⭐⭐⭐⭐ Faster inference, practical
3-bit ~420 GB ⭐⭐ ⭐⭐⭐⭐⭐ Very fast, compact
2-bit ~320 GB ⭐⭐ ⭐⭐⭐⭐⭐ Fastest, most compact
Original (FP8) ~1 TB ⭐⭐⭐⭐⭐ ⭐⭐ Research, re-quantization
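
The sizes above follow from simple arithmetic. A back-of-the-envelope sketch, assuming ~1 trillion total parameters and ~0.5 bits/weight of quantization metadata (MLX's default group size of 64 stores an fp16 scale and bias per group); the exact figures vary with which layers stay in higher precision:

```python
# Approximate on-disk size for each quantization level.
TOTAL_PARAMS = 1.0e12   # ~1T total parameters (assumption)
METADATA_BITS = 0.5     # fp16 scale + bias per group of 64 weights

def model_size_gb(bits_per_weight: float) -> float:
    """Approximate model size in GB at a given nominal bit width."""
    total_bits = TOTAL_PARAMS * (bits_per_weight + METADATA_BITS)
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

for bits in (2, 3, 4, 5, 6, 8):
    print(f"{bits}-bit: ~{model_size_gb(bits):.0f} GB")
```

Remember to budget extra unified memory beyond the weights for the KV cache, especially at long context lengths.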

🔧 How It Was Made

This model was quantized using MLX's built-in quantization:

mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-4bit \
  -q --q-bits 4 \
  --trust-remote-code

Result: 4.502 bits per weight (the overhead above 4.0 comes from quantization metadata)
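
The extra ~0.5 bits are easy to account for, assuming MLX's default affine quantization settings (group size 64, fp16 scale and bias per group):

```python
# Each group of 64 4-bit weights carries one fp16 scale and one fp16
# bias, adding (16 + 16) / 64 = 0.5 bits per weight on top of the
# nominal 4 bits. The remaining ~0.002 bits come from a few tensors
# kept in higher precision.
bits = 4
group_size = 64
overhead = (16 + 16) / group_size  # fp16 scale + fp16 bias per group

effective_bpw = bits + overhead
print(effective_bpw)  # 4.5
```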

⚡ Performance Tips

Getting the best performance
  1. Close other applications - free up as much unified memory as possible
  2. Use a fast SSD - load time is dominated by disk throughput
  3. Monitor memory - watch Activity Monitor during inference
  4. Trim the context - long prompts and large max_tokens values grow the KV cache; reduce them if you hit out-of-memory errors
  5. Keep your Mac cool - good airflow avoids thermal throttling during sustained inference
  6. Iterate at 4-bit - fast enough for development loops while retaining good quality

⚠️ Known Limitations

  • 🍎 Apple Silicon Only - Won't work on Intel Macs or NVIDIA GPUs
  • 💾 Storage Needs - Make sure you have 600+ GB free
  • 🐏 Memory Needs - the ~540 GB of 4-bit weights must fit in unified memory, which exceeds any single Mac sold today
  • 🎯 Quality Trade-off - Some quality loss vs 8-bit (still very usable!)
  • 🌐 Bilingual Focus - Optimized for English and Chinese

💡 Why 4-bit: Most popular choice for practical deployments! Excellent balance between size, speed, and quality. Most users won't notice significant degradation while enjoying much faster speeds.

📄 License

Apache 2.0 - Same as the original model. Free for commercial use!

🙏 Acknowledgments

  • Original Model: Moonshot AI for creating Kimi K2
  • Framework: Apple's MLX team for the amazing framework
  • Inspiration: DeepSeek V3 architecture

📚 Citation

If you use this model in your research or product, please cite:

@misc{kimi-k2-2025,
  title={Kimi K2: Advancing Long-Context Language Models},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}



Quantized with ❤️ by richardyoung

If you find this useful, please ⭐ star the repo and share with others!

Created: October 2025 | Format: MLX 4-bit
