🌙 Kimi K2 Instruct - MLX 4-bit
State-of-the-Art 1T-Parameter MoE Model, Optimized for Apple Silicon
Original Model | MLX Framework | More Quantizations
📖 What is This?
This is a high-performance 4-bit quantized version of Kimi K2 Instruct, optimized to run on Apple Silicon (M1/M2/M3/M4) Macs using the MLX framework. The 4-bit version offers an excellent balance between quality and efficiency - the sweet spot for most practical deployments!
✨ Why You'll Love It
- 🚀 Massive Context Window - Handle up to 262,144 tokens (~200,000 words!)
- 🧠 1T Total Parameters - ~32B active per token; one of the most capable open models available
- ⚡ Apple Silicon Native - Fully optimized for M-series chips with Metal acceleration
- 🎯 4-bit Sweet Spot - Excellent balance of quality, speed, and size
- ⚡ Fast Inference - Lightning-quick generation speeds
- 🌏 Bilingual - Fluent in both English and Chinese
- 💬 Instruction-Tuned - Ready for conversations, coding, analysis, and more
🎯 Quick Start
Hardware Requirements
Kimi-K2 is a massive ~1T-parameter MoE model (only ~32B parameters are active per token). Choose your quantization based on available unified memory:
| Quantization | Model Size | Min Unified Memory | Quality |
|---|---|---|---|
| 2-bit | ~320 GB | 384 GB | Acceptable - some quality loss |
| 3-bit | ~420 GB | 512 GB | Good - largest size that fits a single 512 GB Mac |
| 4-bit | ~540 GB | >512 GB (sharded across machines) | Very Good - best quality/size balance |
| 5-bit | ~660 GB | >512 GB (sharded) | Excellent |
| 6-bit | ~800 GB | >512 GB (sharded) | Near original |
| 8-bit | ~1 TB | >512 GB (sharded) | Original quality |
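Model size scales roughly linearly with bits per weight, so you can sanity-check figures like these yourself. A minimal estimator in pure Python (the parameter count and effective bits-per-weight are inputs you supply, not claims about any particular checkpoint):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of a quantized model in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Toy example: a 1e9-parameter model at an effective 4.5 bits per weight
print(round(model_size_gb(1e9, 4.5), 4))  # → 0.5625
```

Note that real checkpoints carry extra weight from unquantized tensors (embeddings, norms) and file metadata, so actual sizes run slightly above this estimate.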
Recommended Configurations
| Setup | Unified Memory | Recommended Quantization |
|---|---|---|
| Mac Studio M3 Ultra | 512 GB | 3-bit |
| 2× Mac Studio M3 Ultra (sharded, e.g. via MLX distributed inference) | 1 TB combined | 4-bit to 6-bit |
| MacBook Pro / smaller Studio configs | ≤192 GB | Not enough memory for any quantization of this model |
Performance Notes
- Inference Speed: Expect ~5-15 tokens/sec depending on quantization and hardware
- First Token Latency - Dominated by weight loading: reading hundreds of GB from SSD can take minutes on a cold start
- Context Window - Full 256K (262,144-token) context supported
- Active Parameters - Only ~32B parameters active per token (MoE architecture)
Installation
```shell
pip install mlx-lm
```
Your First Generation (3 lines of code!)
```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-4bit")
print(generate(model, tokenizer, prompt="Explain quantum entanglement simply:", max_tokens=200))
```
That's it! 🎉
💻 System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| Mac | Apple Silicon (M1 or newer) | Mac Studio M3 Ultra |
| Memory | Enough unified memory for your chosen quantization (see table above) | 512 GB unified |
| Storage | 600 GB free | Fast SSD (1+ TB free) |
| macOS | 13.5+ | Latest version |
⚡ Note: The 4-bit version is a popular choice for serious deployments - a strong balance of quality, speed, and size.
📚 Usage Examples
Command Line Interface
```shell
mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-4bit \
  --prompt "Write a Python script to analyze CSV files." \
  --max-tokens 500
```
Chat Conversation
```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-4bit")

# Let the tokenizer apply the chat template shipped with the model,
# rather than hand-writing special tokens
messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in coding and problem-solving."},
    {"role": "user", "content": "Can you help me optimize this Python code?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
```
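Under the hood, the chat template flattens the message list into a single prompt string. As a rough illustration only - the exact template Kimi K2 ships may differ, and in practice you should always call `tokenizer.apply_chat_template` - a ChatML-style flattening looks like this:

```python
def build_chatml_prompt(messages):
    """Flatten {role, content} dicts into a ChatML-style prompt string.

    Illustration only - real code should use tokenizer.apply_chat_template,
    which uses the template shipped with the model.
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you help me optimize this Python code?"},
]
print(build_chatml_prompt(messages))
```

For multi-turn chat, append the model's reply as an `assistant` message, add the next `user` message, and rebuild the prompt from the full history.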
Advanced: Streaming Output
```python
from mlx_lm import load, stream_generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-4bit")

# generate() returns the full completion at once;
# stream_generate() yields it chunk by chunk
for response in stream_generate(
    model,
    tokenizer,
    prompt="Tell me about the future of AI:",
    max_tokens=500,
):
    print(response.text, end="", flush=True)
print()
```
🏗️ Architecture Highlights
Click to expand technical details
Model Specifications
| Feature | Value |
|---|---|
| Total Parameters | ~1.04 Trillion |
| Architecture | DeepSeek V3 (MoE) |
| Experts | 384 routed + 1 shared |
| Active Experts | 8 per token |
| Hidden Size | 7168 |
| Layers | 61 |
| Heads | 64 |
| Context Length | 262,144 tokens |
| Quantization | 4.502 bits per weight |
Advanced Features
- 🎯 YaRN Rope Scaling - 64x factor for extended context
- 🗜️ KV Compression - LoRA-based (rank 512)
- ⚡ Query Compression - Q-LoRA (rank 1536)
- 🧮 MoE Routing - Top-8 expert selection with sigmoid scoring
- 🔧 FP8 Training - Pre-quantized with e4m3 precision
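The routing step can be sketched in a few lines of plain Python. This is only an illustration of top-k selection with sigmoid scoring, not the model's actual implementation (which additionally applies per-expert bias terms for load balancing):

```python
import math

def route_top_k(logits, k=8):
    """Score experts with a sigmoid, keep the top k, renormalize their weights."""
    scores = [1 / (1 + math.exp(-x)) for x in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}  # expert index -> gating weight

# Toy example with 16 "experts" instead of 384
weights = route_top_k([0.1 * i for i in range(16)], k=8)
print(sorted(weights))                   # the 8 highest-scoring expert indices
print(round(sum(weights.values()), 6))   # → 1.0
```

Only the selected experts' feed-forward blocks run for that token, which is why ~32B of the ~1T parameters are active per step.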
🎨 Other Quantization Options
Choose the right balance for your needs:
| Quantization | Size | Quality | Speed | Best For |
|---|---|---|---|---|
| 8-bit | ~1 TB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Production, best quality |
| 6-bit | ~800 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Sweet spot for most users |
| 5-bit | ~660 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Great quality/size balance |
| 4-bit (you are here) | ~540 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Faster inference, practical |
| 3-bit | ~420 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Very fast, compact |
| 2-bit | ~320 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Fastest, most compact |
| Original (FP8) | ~1 TB | ⭐⭐⭐⭐⭐ | ⭐⭐ | Research, reference outputs |
🔧 How It Was Made
This model was quantized using MLX's built-in quantization:
```shell
mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-4bit \
  -q --q-bits 4 \
  --trust-remote-code
```
Result: 4.502 bits per weight (includes metadata overhead)
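The extra ~0.5 bits over the nominal 4 come from the per-group scale and bias that affine quantization stores alongside the weights. Assuming MLX's default group size of 64 with float16 scales and biases, the arithmetic works out as:

```python
def effective_bits_per_weight(q_bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Nominal quantization bits plus per-group scale/bias overhead."""
    return q_bits + (scale_bits + bias_bits) / group_size

print(effective_bits_per_weight())  # → 4.5
```

The remaining ~0.002 bits come from tensors left unquantized (e.g. norms) and file metadata.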
⚡ Performance Tips
Getting the best performance
- Close other applications - Free up as much RAM as possible
- Use an external SSD - If your internal drive is full
- Monitor memory - Watch Activity Monitor during inference
- Trim generation length - If you hit out-of-memory errors, reduce max_tokens and shorten your prompt
- Keep your Mac cool - Good airflow helps maintain peak performance
- Use 4-bit for development - Fast iteration with good quality retention
⚠️ Known Limitations
- 🍎 Apple Silicon Only - Won't work on Intel Macs or NVIDIA GPUs
- 💾 Storage Needs - Make sure you have 600+ GB free
- 🐏 RAM Needs - The ~540 GB of quantized weights must fit in unified memory (across one or more machines)
- 🎯 Quality Trade-off - Some quality loss vs 8-bit (still very usable!)
- 🌐 Bilingual Focus - Optimized for English and Chinese
💡 Why 4-bit: Most popular choice for practical deployments! Excellent balance between size, speed, and quality. Most users won't notice significant degradation while enjoying much faster speeds.
📄 License
Apache 2.0 - Same as the original model. Free for commercial use!
🙏 Acknowledgments
- Original Model: Moonshot AI for creating Kimi K2
- Framework: Apple's MLX team for the amazing framework
- Inspiration: DeepSeek V3 architecture
📚 Citation
If you use this model in your research or product, please cite:
```bibtex
@misc{kimi-k2-2025,
  title={Kimi K2: Advancing Long-Context Language Models},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}
```
🔗 Useful Links
- 📦 Original Model: moonshotai/Kimi-K2-Instruct-0905
- 🛠️ MLX Framework: GitHub
- 📖 MLX LM Docs: GitHub
- 💬 Discussions: Ask questions here!
Quantized with ❤️ by richardyoung
If you find this useful, please ⭐ star the repo and share with others!
Created: October 2025 | Format: MLX 4-bit