Can you provide Machine Specs

#2
by kingabzpro - opened

How many H100s are required to run this model locally, and what other hardware parameters should we tune for optimization?

From the deployment guide:

The smallest deployment unit for Kimi-K2 FP8 weights with 128k seqlen on mainstream H200 or H20 platform is a cluster with 16 GPUs with either Tensor Parallel (TP) or "data parallel + expert parallel" (DP+EP).

https://github.com/MoonshotAI/Kimi-K2/blob/main/docs/deploy_guidance.md
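For the TP route on 16 GPUs, a multi-node SGLang launch might look roughly like the sketch below. The hostname, port, and interconnect setup are placeholders, and the exact flags for your SGLang version should be checked against the deployment guide above.

```shell
# Sketch: 2 nodes x 8 GPUs = TP 16. Addresses are placeholders (10.0.0.1 assumed master).
# On node 0:
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --tp 16 \
  --nnodes 2 --node-rank 0 \
  --dist-init-addr 10.0.0.1:5000 \
  --trust-remote-code

# On node 1 (same command, different rank):
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --tp 16 \
  --nnodes 2 --node-rank 1 \
  --dist-init-addr 10.0.0.1:5000 \
  --trust-remote-code
```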

Moonshot AI org

At minimum, 16 H100s are needed, and only with a very short sequence length (suitable only for simple testing). For a normal experience, 32 H100s are required.

If someone can actually test this model, tell me if it's good.

Moonshot AI org

@vpakarinen it's really good, you should try it!

At minimum, 16 H100s are needed, and only with a very short sequence length (suitable only for simple testing). For a normal experience, 32 H100s are required.

Can you provide an SGLang example with 32 H100s? :)

Moonshot AI org

Can you provide an SGLang example with 32 H100s? :)

In SGLang, the recommended way to deploy K2 is Prefill-Decode (P-D) Disaggregation with DP+EP. It needs at least 2 prefill nodes and 4 decode nodes. In our simple testing, a 32-H100 DP+EP deployment without P-D Disaggregation had some problems (though I may be wrong). You could also ask for suggestions in the SGLang community.
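The P-D-disaggregated layout above might be sketched as follows. The flag names follow recent SGLang releases, and the parallelism sizes, addresses, and node counts here are assumptions for illustration; verify everything against the SGLang docs before using it.

```shell
# Sketch: 2 prefill nodes + 4 decode nodes (8 GPUs each), DP+EP. All addresses
# and parallelism sizes are placeholder assumptions, not tested values.

# On each prefill node ($RANK = 0..1 within the 2-node prefill group):
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --disaggregation-mode prefill \
  --tp 16 --dp 16 --enable-dp-attention --enable-ep-moe \
  --nnodes 2 --node-rank $RANK \
  --dist-init-addr 10.0.0.1:5000 \
  --trust-remote-code

# On each decode node ($RANK = 0..3 within the 4-node decode group):
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --disaggregation-mode decode \
  --tp 32 --dp 32 --enable-dp-attention --enable-ep-moe \
  --nnodes 4 --node-rank $RANK \
  --dist-init-addr 10.0.0.2:5000 \
  --trust-remote-code
```

In practice a router/load balancer also has to sit in front of the prefill and decode groups to hand requests between them; SGLang's PD-disaggregation documentation covers that piece.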

Can I deploy this setup on 4 nodes that each have an RTX 4000 Ada + 64 GB RAM + a 10 Gbps ultra-low-latency network?


Can I deploy this setup on 4 nodes that each have an RTX 4000 Ada + 64 GB RAM + a 10 Gbps ultra-low-latency network?

I don't think so. Wait for a quantized version of the model.

Would you recommend other packages for inference with H100 nodes?

Approximately how many tokens per minute can the recommended minimum setup (16× H200) process?
