Can you provide machine specs?
How many H100s are required to run this model locally, and what other parameters matter for hardware optimization?
From the deployment guide:
The smallest deployment unit for Kimi-K2 FP8 weights with 128k seqlen on mainstream H200 or H20 platform is a cluster with 16 GPUs with either Tensor Parallel (TP) or "data parallel + expert parallel" (DP+EP).
https://github.com/MoonshotAI/Kimi-K2/blob/main/docs/deploy_guidance.md
At minimum, 16 H100s are needed, and only with a very short sequence length (for simple testing). For a normal experience, 32 H100s are required.
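A rough back-of-the-envelope check of these numbers, as a sketch only. It assumes about 1 trillion total parameters (the figure on the model card), 1 byte per parameter for FP8 weights, and 80 GB per H100 or 141 GB per H200. It ignores activations, runtime overhead, and anything else that competes with the KV cache, so real headroom is smaller. The function name is made up for illustration.

```python
# Assumptions (not from this thread): ~1e12 total parameters, FP8 = 1 byte each.
TOTAL_PARAMS = 1.0e12
BYTES_PER_PARAM_FP8 = 1.0


def headroom_per_gpu_gb(num_gpus: int, gpu_mem_gb: float) -> float:
    """Memory left per GPU after the FP8 weights are sharded evenly.

    Whatever is left has to hold the KV cache, activations and runtime
    overhead, so it bounds the usable sequence length and batch size.
    """
    weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM_FP8 / 1e9
    return (num_gpus * gpu_mem_gb - weights_gb) / num_gpus


if __name__ == "__main__":
    clusters = [
        ("16x H100 (80 GB)", 16, 80.0),
        ("16x H200 (141 GB)", 16, 141.0),
        ("32x H100 (80 GB)", 32, 80.0),
    ]
    for name, count, mem_gb in clusters:
        left = headroom_per_gpu_gb(count, mem_gb)
        print(f"{name}: about {left:.2f} GB per GPU left after weights")
```

Under these assumptions, 16 H100s leave well under a quarter of each GPU for everything other than weights, which is consistent with the "very short sequence length" caveat, while 16 H200s or 32 H100s leave far more room.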
If someone can actually test this model, tell me if it's good.
Can you provide an sglang example with 32 H100s? :)
In SGLang, we recommend deploying K2 with P-D disaggregation (separate prefill and decode nodes) plus DP+EP. This needs at least 2 prefill nodes and 4 decode nodes. In our simple testing, a 32-H100 DP+EP deployment without P-D disaggregation had some problems (though I may be wrong). You could also ask the SGLang community for suggestions.
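For anyone who wants a starting point for a plain multi-node launch, here is a small sketch that prints one `sglang.launch_server` command per node. It is not the P-D disaggregation plus DP+EP setup recommended above, and it has not been verified on 32 H100s. It only uses the basic multi-node flags (`--tp`, `--nnodes`, `--node-rank`, `--dist-init-addr`, `--trust-remote-code`); the master address `10.0.0.1:50000`, the 4-node by 8-GPU layout, and the helper name are assumptions for illustration.

```python
import shlex


def sglang_launch_cmd(model_path: str, tp_size: int, nnodes: int,
                      node_rank: int, dist_init_addr: str,
                      port: int = 30000) -> str:
    """Build the launch command to run on one node of a multi-node group."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp_size),            # tensor-parallel size across all nodes
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),   # 0 on the master node
        "--dist-init-addr", dist_init_addr,
        "--trust-remote-code",
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # Hypothetical layout: 4 nodes x 8 H100s = 32 GPUs, pure tensor parallel.
    for rank in range(4):
        print(sglang_launch_cmd("moonshotai/Kimi-K2-Instruct", 32, 4, rank,
                                "10.0.0.1:50000"))
```

Each printed command is meant to be run on the node whose rank it names. Whether pure tensor parallelism at this size performs acceptably for K2 is an open question, so treat it as a baseline to compare against, not a recommendation.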
Can I deploy this setup on 4 nodes, each with an RTX 4000 Ada, 64 GB of RAM, and an ultra-low-latency 10 Gbps network?
I don't think so. Wait for a quantized version of the model.
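To put a number on how aggressive that quantization would have to be, here is a quick estimate. It assumes about 1 trillion total parameters and 20 GB of VRAM per RTX 4000 Ada (both assumptions, not stated in this thread), and it counts weights only, with no room for KV cache or activations. The function name is made up for illustration.

```python
def max_bits_per_weight(memory_budget_gb: float,
                        total_params: float = 1.0e12) -> float:
    """Largest average bits per weight that fits the given memory budget.

    Weights only: KV cache, activations and runtime overhead are ignored,
    so a real deployment would need an even smaller figure.
    """
    return memory_budget_gb * 1e9 * 8 / total_params


if __name__ == "__main__":
    vram_only_gb = 4 * 20.0            # 4 nodes x 20 GB VRAM (assumed)
    vram_plus_ram_gb = 4 * (20.0 + 64.0)  # also counting 64 GB system RAM per node
    print(f"VRAM only: {max_bits_per_weight(vram_only_gb):.2f} bits/weight")
    print(f"VRAM + RAM: {max_bits_per_weight(vram_plus_ram_gb):.2f} bits/weight")
```

Under these assumptions the weights would need to average well under 1 bit each to fit in VRAM, and under 3 bits even if all system RAM were used for offloading, before accounting for the 10 Gbps interconnect.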
Would you recommend other packages for inference with H100 nodes?
Approximately how many tokens per minute can the recommended minimum setup (16x H200) process?