Run a 1T-param model on A100/H100 (80G x 8) using FP4
Docker Instructions (from https://hub.docker.com/r/tutelgroup/deepseek-671b):
# For A100/A800/H100/H800/H20/H200 (80G x 8):
# Step-1: Download 1TB Model
huggingface-cli download moonshotai/Kimi-K2-Instruct --local-dir ./moonshotai/Kimi-K2-Instruct
# Step-2: Run with A100/H100 (80G x 8):
docker run -it --rm --ipc=host --net=host --shm-size=8g --ulimit memlock=-1 \
    --ulimit stack=67108864 --gpus=all -v /:/host -w /host$(pwd) \
    tutelgroup/deepseek-671b:a100x8-chat-20250712 \
        --try_path ./moonshotai/Kimi-K2-Instruct \
        --serve --listen_port 8000 \
        --prompt "Calculate the indefinite integral of 1/sin(x) + x"
Great work! Thanks a lot.
Could you please explain what framework is used for reasoning?
Do you mean inference framework?
We integrate several well-tuned MoE operators (e.g., Kimi's fused gating, low-precision MoE FFN forwarding, etc., all of which are compatible with inexpensive GPUs) into Tutel, a library providing a collection of efficient MoE computation and communication operators. These replace the publicly available but unoptimized implementations that otherwise dominate the slow execution phases, which is what makes the overall inference throughput effective even on A100s.
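For intuition, the unfused version of top-k expert gating — one of the steps such fused kernels accelerate — looks roughly like the PyTorch sketch below. This is illustrative only, not Tutel's or Kimi's actual implementation; the tensor names and shapes are mine:

```python
# Unfused reference for top-k MoE gating (illustrative sketch only;
# a fused kernel computes the same outputs without materializing the
# full softmax and intermediate top-k tensors in global memory).
import torch
import torch.nn.functional as F

def topk_gate(hidden: torch.Tensor, gate_weight: torch.Tensor, k: int = 2):
    """hidden: [tokens, dim]; gate_weight: [dim, num_experts]."""
    logits = hidden @ gate_weight                 # score every expert per token
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)  # keep the k best experts
    topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)  # renormalize
    return topk_probs, topk_idx                   # combine weights + expert ids

tokens = torch.randn(4, 16)
w = torch.randn(16, 8)
weights, experts = topk_gate(tokens, w, k=2)
```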
This is FP4? I think you mean int4?
It quantizes the weights to FP4 inline, so that 8 A100s (80 GB each) can run this 1T-parameter model.
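The memory arithmetic checks out: 1T parameters at 4 bits each is roughly 500 GB of weights, which fits in 8 x 80 GB = 640 GB with headroom for activations and the KV cache. Below is a minimal sketch of per-tensor FP4 (e2m1) quantization, assuming a simple symmetric max-scaling scheme; the image's actual inline-quantization kernels are not shown here and likely use finer-grained (e.g., group-wise) scales:

```python
# Illustrative per-tensor FP4 (e2m1) quantization sketch. This is NOT the
# kernel the docker image uses; it only shows why 4-bit codes plus a scale
# are enough to reconstruct approximate weights at inference time.
import torch

# The 16 values representable by the e2m1 FP4 format (8 magnitudes, +/- sign).
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-FP4_GRID.flip(0), FP4_GRID])

def fp4_quantize(w: torch.Tensor):
    scale = w.abs().max().clamp(min=1e-12) / 6.0  # map max magnitude to FP4's max (6.0)
    scaled = w / scale
    # Snap every weight to the nearest representable FP4 value.
    idx = (scaled.unsqueeze(-1) - FP4_GRID).abs().argmin(-1)
    return idx.to(torch.uint8), scale             # 4-bit codes (held in uint8) + scale

def fp4_dequantize(idx: torch.Tensor, scale: torch.Tensor):
    return FP4_GRID[idx.long()] * scale

w = torch.randn(4, 8)
codes, s = fp4_quantize(w)
w_hat = fp4_dequantize(codes, s)                  # lossy reconstruction of w
```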