Run DeepSeek-V4-Flash on more hardware: FP8/BF16 adapted versions for 8 AI chips (ready to download)

#20
by Yonghua - opened

Hi everyone! 👋

If you've been excited to try DeepSeek-V4-Flash but don't have a Blackwell GPU (B200+), good news – we've got you covered.

The FlagOS community has completed Day-0 adaptation of DeepSeek-V4-Flash across 8 different AI chip platforms. We converted the original FP4+FP8 weights to FP8/BF16, so you can run V4-Flash on a much wider range of hardware.

No conversion needed on your side – just download and run!


📦 Full deployment guide & source code:

→ github.com/flagos-ai/DeepSeek-V4-FlagOS

📦 Pre-converted Models (ready to use)

NVIDIA (H100/H20 – FP8)

→ HuggingFace
→ ModelScope

Huawei Ascend (BF16)

→ HuggingFace
→ ModelScope

Hygon DCU (BF16)

→ HuggingFace
→ ModelScope

MetaX GPU (BF16)

→ HuggingFace
→ ModelScope

Moore Threads MTT (FP8)

→ HuggingFace
→ ModelScope

Kunlunxin XPU (BF16)

→ HuggingFace
→ ModelScope

T-Head Zhenwu (BF16)

→ HuggingFace
→ ModelScope

Iluvatar GPU (BF16)

→ HuggingFace
→ ModelScope


⚙️ What we did (technical summary)

  1. FP4 → FP8/BF16 weight conversion
    The original V4-Flash checkpoint ships in FP4+FP8, a format that only Blackwell GPUs run natively. We dequantized it to FP8/BF16 and checked the converted weights against standard benchmarks, where we saw no quality loss.

  2. Full Triton operator coverage via FlagGems
    FlagGems is the world's largest Triton operator library (400+ ops) and part of the PyTorch Foundation ecosystem. Every op in V4-Flash runs with zero CUDA-specific dependencies.

  3. Extended tensor parallelism (bypass o_groups limit)
    V4’s o_groups=8 normally caps tensor parallelism at 8-way. We implemented independent o-group communication, allowing you to scale beyond 8-way (e.g. 16-way) – perfect for chips with 32GB/64GB VRAM.
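
For anyone curious what step 1 looks like in practice, here is a minimal NumPy sketch of block-scaled FP4 dequantization. The E2M1 value table is the standard FP4 encoding, but the packing order (low nibble first), the block size, and the name `dequantize_fp4` are assumptions for illustration; the actual conversion script in the repo may lay things out differently.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit),
# indexed by the low three bits of each 4-bit code.
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def dequantize_fp4(packed, scales, block_size=32):
    """Expand packed FP4 codes to float32 and apply one scale per block.

    Assumed (hypothetical) layout: every uint8 holds two codes, low nibble
    first, and scales[i] covers elements [i * block_size, (i + 1) * block_size).
    """
    packed = np.asarray(packed, dtype=np.uint8).ravel()
    scales = np.asarray(scales, dtype=np.float32).ravel()

    # Unpack two 4-bit codes from each byte.
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2] = packed & 0x0F
    codes[1::2] = packed >> 4

    # Bit 3 is the sign, bits 0..2 select the magnitude.
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    values = sign * E2M1_MAGNITUDES[codes & 0x7]

    if values.size != scales.size * block_size:
        raise ValueError("number of scales does not match number of blocks")

    # Broadcast each block's scale over its elements.
    return (values.reshape(-1, block_size) * scales[:, None]).reshape(-1)


# Two bytes -> four codes (1, 2, 9, 15), one block of four with scale 2.0.
example = dequantize_fp4(np.array([0x21, 0xF9], dtype=np.uint8), [2.0], block_size=4)
# example holds 1.0, 2.0, -1.0, -12.0
```

NumPy has no BF16 dtype, so the sketch stops at float32; a real pipeline would cast the result to BF16 (or requantize to FP8) afterwards. Each E2M1 value has at most two significant bits, so when the block scales are powers of two the product fits in BF16 exactly, barring overflow or underflow.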
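
To make the o-group point in step 3 concrete, here is a small sketch of how ranks could be assigned to o-groups once the tensor-parallel size exceeds `o_groups`. The post only says that o-group communication is made independent; the contiguous rank-to-group mapping and the helper name `o_group_layout` below are assumptions, not the FlagOS implementation.

```python
def o_group_layout(tp_size: int, o_groups: int = 8) -> dict[int, list[int]]:
    """Map each o-group index to the tensor-parallel ranks that hold part of it.

    Hypothetical mapping: ranks are assigned to groups contiguously.
    """
    if tp_size <= o_groups:
        # Classic case: every rank owns one or more whole o-groups.
        if o_groups % tp_size:
            raise ValueError("o_groups must be divisible by tp_size")
        groups_per_rank = o_groups // tp_size
        return {g: [g // groups_per_rank] for g in range(o_groups)}

    # Extended case: several ranks share one o-group, so each group needs
    # its own communication sub-group (for example, to reduce partial results).
    if tp_size % o_groups:
        raise ValueError("tp_size must be a multiple of o_groups")
    ranks_per_group = tp_size // o_groups
    return {
        g: list(range(g * ranks_per_group, (g + 1) * ranks_per_group))
        for g in range(o_groups)
    }


layout = o_group_layout(16)
# With 16-way TP, o-group 0 is shared by ranks [0, 1] and o-group 7 by [14, 15].
```

In a torch.distributed setup, each of those rank lists would typically become its own process group, so that collectives for one o-group never involve ranks from another.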


📈 Performance (NVIDIA H20, V4-Flash FP8)

| Implementation | Tokens/s |
|---|---|
| FlagGems (C++ Wrapper + Triton) | 70.7 |
| DeepSeek’s native TileLang | 62.99 |

🚀 12.24% faster than the official implementation.
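
If you want to check the percentage yourself, it is the relative throughput gain computed from the two tokens/s figures above. The helper name `speedup_percent` is just for illustration.

```python
def speedup_percent(new_tps: float, baseline_tps: float) -> float:
    """Relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


flaggems_tps = 70.7    # FlagGems (C++ Wrapper + Triton)
tilelang_tps = 62.99   # DeepSeek's native TileLang

print(f"{speedup_percent(flaggems_tps, tilelang_tps):.2f}% faster")  # prints "12.24% faster"
```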


🧪 Quick Start (one command to run after setup)

pip install flag-gems==5.0.2
git clone https://github.com/flagos-ai/DeepSeek-V4-FlagOS
cd DeepSeek-V4-FlagOS
pip install -r requirements.txt

USE_FLAGGEMS=1 torchrun --nproc-per-node 8 generate.py \
  --ckpt-path /path/to/model_bf16_mp8 \
  --config config_from_bf16.json \
  --input-file prompt.txt \
  --max-new-tokens 64
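
If you would rather launch from a Python script (say, to sweep over prompts), the same command can be assembled as below. This is only a sketch: it uses exactly the flags and the `USE_FLAGGEMS=1` variable shown above, the helper name `build_launch` is made up, and the actual run is left commented out because it needs the cloned repo and the downloaded checkpoint.

```python
import os
import shlex


def build_launch(ckpt_path, config="config_from_bf16.json", nproc=8,
                 input_file="prompt.txt", max_new_tokens=64):
    """Return (argv, env) equivalent to the torchrun command in the Quick Start."""
    argv = [
        "torchrun", "--nproc-per-node", str(nproc), "generate.py",
        "--ckpt-path", ckpt_path,
        "--config", config,
        "--input-file", input_file,
        "--max-new-tokens", str(max_new_tokens),
    ]
    # Copy the current environment and switch on the FlagGems path.
    env = {**os.environ, "USE_FLAGGEMS": "1"}
    return argv, env


argv, env = build_launch("/path/to/model_bf16_mp8")
print(shlex.join(argv))  # inspect the command before running it

# Run from inside the DeepSeek-V4-FlagOS checkout:
# import subprocess
# subprocess.run(argv, env=env, check=True)
```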
