Can you provide Machine Specs
How many H100s are required to run this model locally, and what other parameters matter for hardware optimization?
From the deployment guide:
The smallest deployment unit for Kimi-K2 FP8 weights with 128k seqlen on mainstream H200 or H20 platform is a cluster with 16 GPUs with either Tensor Parallel (TP) or "data parallel + expert parallel" (DP+EP).
https://github.com/MoonshotAI/Kimi-K2/blob/main/docs/deploy_guidance.md
At least 16 H100s are needed, and that only supports a very short sequence length (simple testing only). For a normal experience, 32 H100s are required.
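For reference, a minimal multi-node tensor-parallel launch with SGLang looks roughly like the sketch below. This is modeled on the Kimi-K2 deploy guide's style; the IP address, port, and model path are placeholders you would replace for your own cluster, and flag support should be checked against your installed SGLang version.

```shell
# Hypothetical sketch: 16-GPU tensor-parallel deployment across two 8-GPU nodes.
# Replace 10.0.0.1 with your master node's IP and adjust the model path as needed.

# On node 0 (master):
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --tp 16 \
  --dist-init-addr 10.0.0.1:50000 \
  --nnodes 2 \
  --node-rank 0 \
  --trust-remote-code

# On node 1:
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --tp 16 \
  --dist-init-addr 10.0.0.1:50000 \
  --nnodes 2 \
  --node-rank 1 \
  --trust-remote-code
```

For 32 H100s you would scale this to `--nnodes 4` with the tensor-parallel degree raised accordingly; see the deploy guide linked above for the exact configurations the authors tested.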
If someone has actually tested this model, tell me if it's good.
Can you provide an sglang example with 32 H100s? :)
In SGLang, the way we recommend deploying K2 is prefill-decode disaggregation (P-D disaggregation) with DP+EP. It needs at least 2 prefill nodes and 4 decode nodes. In our simple testing, a 32-H100 DP+EP deployment without P-D disaggregation had some problems (though I may be wrong). You could also ask for suggestions in the SGLang community.
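The P-D disaggregation setup described above can be sketched as follows. This is an assumption-laden outline, not a tested recipe: the `--disaggregation-mode` flag follows SGLang's disaggregation feature, but the exact flags, the DP+EP options, and the router setup vary by SGLang version, so verify everything against the official Kimi-K2 deploy guide and SGLang docs.

```shell
# Hypothetical sketch of P-D disaggregation: prefill and decode run on
# separate node groups, fronted by a router. DP+EP flags are omitted here;
# consult the deploy guide for the tested configuration.

# On each of the 2 prefill nodes:
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --disaggregation-mode prefill \
  --tp 8 \
  --trust-remote-code

# On each of the 4 decode nodes:
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --disaggregation-mode decode \
  --tp 8 \
  --trust-remote-code

# A load balancer / router then pairs the prefill fleet with the decode
# fleet; see SGLang's P-D disaggregation documentation for that piece.
```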
Can I deploy this setup to 4 nodes that each have an RTX 4000 Ada + 64 GB RAM + a 10 Gbps ultra-low-latency network?
I don't think so. Wait for a quantized version of the model.
Would you recommend other packages for inference with H100 nodes?