Run 1T-param model on A100/H100 (80G x 8) using FP4

#9
by ghostplant - opened

Docker Instructions (from https://hub.docker.com/r/tutelgroup/deepseek-671b):

```shell
# For A100/A800/H100/H800/H20/H200 (80G x 8):

# Step-1: Download 1TB Model
huggingface-cli download moonshotai/Kimi-K2-Instruct --local-dir ./moonshotai/Kimi-K2-Instruct

# Step-2: Run with A100/H100 (80G x 8):
docker run -it --rm --ipc=host --net=host --shm-size=8g --ulimit memlock=-1 \
      --ulimit stack=67108864 --gpus=all -v /:/host -w /host$(pwd) \
      tutelgroup/deepseek-671b:a100x8-chat-20250712 \
        --try_path ./moonshotai/Kimi-K2-Instruct \
        --serve --listen_port 8000 \
        --prompt "Calculate the indefinite integral of 1/sin(x) + x"
```
Moonshot AI org

Great work! Thanks a lot.

Could you please share which framework is used for reasoning?

Do you mean inference framework?

@ghostplant Yes!

We integrate several well-tuned MoE operators (e.g. Kimi's fused gating and low-precision MoE FFN forwarding, all of which run on inexpensive GPUs) into Tutel, a library of efficient MoE computation and communication operators. These replace the publicly available, unoptimized code paths that dominate execution time, yielding strong overall inference throughput even on A100s.
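
For context, the core an MoE layer that such operators accelerate is top-k gating followed by sparse expert dispatch. Below is a naive, unfused PyTorch sketch of that logic, purely illustrative: Tutel's fused kernels and Kimi's actual gating differ, and the function names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, gate_w, experts, k=8):
    """Naive top-k MoE forward pass; fused/low-precision operators
    replace exactly these memory-bound steps.

    x:        (tokens, hidden) activations
    gate_w:   (hidden, num_experts) router weights
    experts:  list of per-expert FFN modules
    """
    logits = x @ gate_w                                     # router scores
    weights, idx = torch.topk(F.softmax(logits, dim=-1), k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize top-k
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):    # fused kernels avoid this loop
        token_ids, slot = (idx == e).nonzero(as_tuple=True)
        if token_ids.numel():
            out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
    return out
```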

This is FP4? I think you mean int4?

It quantizes the weights inline to FP4, so that eight A100 (80 GB) GPUs can run this 1T-parameter model.
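
The memory arithmetic works out: 1T parameters at 4 bits is roughly 500 GB of weights plus per-block scale factors, which fits across 8 x 80 GB = 640 GB with headroom for activations and the KV cache. Below is a minimal sketch of blockwise 4-bit quantization onto an FP4-style grid; the grid values, block size, and function names are illustrative assumptions, not the image's actual kernels.

```python
import torch

# Signed 16-value grid resembling FP4/E2M1 ({0, +/-0.5, ..., +/-6});
# the format actually used by the image is an assumption.
FP4_GRID = torch.tensor([0., .5, 1., 1.5, 2., 3., 4., 6.])
FP4_GRID = torch.cat([-FP4_GRID.flip(0), FP4_GRID])  # 16 codes, incl. +/-0

def quantize_fp4(w, block=128):
    """Quantize weights (numel divisible by block) to 4-bit codes
    plus one FP16 scale per block."""
    w = w.reshape(-1, block)
    scale = (w.abs().amax(dim=1, keepdim=True) / 6.0).clamp(min=1e-8)
    # Nearest grid point for each scaled weight.
    codes = (w / scale - FP4_GRID[:, None, None]).abs().argmin(dim=0)
    return codes.to(torch.uint8), scale.half()

def dequantize_fp4(codes, scale):
    """Reconstruct FP16 weights from codes and per-block scales."""
    return (FP4_GRID[codes.long()] * scale).half()
```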
