Do you know which nightly it worked with? Because it does not work with the current one

#1
by willfalco - opened

https://hub.docker.com/r/vllm/vllm-openai/tags

nvm, it is vllm/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68

Let me know if it works for you with that image.

It does, same performance as AWQ, maybe an extra 1% on tests.
We've got to get to the bottom of what is going on with vLLM NVFP4 and Blackwell SM120.

How is support on non-Blackwell GPUs?

Really wanted to try this model, but it looks like that nightly is no longer available :(

The commit is:
https://github.com/vllm-project/vllm/commit/11fd69dd54060a59c6f62a6d217e1ecc47d74a68

The commit is in v0.11.2 (doesn't work), v0.11.1 (doesn't work), v0.11.1rc7 (not available), and v0.11.1rc6 (not available).
I also tried a bunch of nightlies with that commit, but nothing worked.
I haven't tried to build a Dockerfile for it, but if someone has and it works, that would be great!
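For reference, an untested sketch of what a source build pinned to that commit could look like (assumes a working CUDA toolchain and a compatible PyTorch already installed):

# Untested: check out vLLM at the commit from the working nightly and build from source.
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 11fd69dd54060a59c6f62a6d217e1ecc47d74a68
pip install -e .   # compiles the CUDA kernels, can take a long time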

Unfortunately NVFP4, despite being pretty amazing (on paper), doesn't receive enough love yet. Hopefully soon.

In any case, thank you for quantizing this awesome model!

Try running with the latest vLLM compiled from source, not the Docker image; there is currently a CUDA image regression in the Docker builds.
You want ENABLE_CUTLASS_MOE_SM120=1, see https://github.com/vllm-project/vllm/pull/29242
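A minimal sketch of that, with the flag name copied verbatim from the comment/PR link above (verify it against the PR before relying on it) and the model/TP settings taken from later in this thread:

ENABLE_CUTLASS_MOE_SM120=1 vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --tensor-parallel-size 2 \
  --trust-remote-code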

I was getting errors like this:

ValueError: CutlassExpertsFp4 doesn't support DP. Use flashinfer CUTLASS Fused MoE backend instead (set VLLM_USE_FLASHINFER_MOE_FP4=1)

I set VLLM_USE_FLASHINFER_MOE_FP4=1 but the errors remain.

Examining the vLLM code, I found this error is thrown because only two VLLM_FLASHINFER_MOE_BACKEND values are supported: masked_gemm and throughput.

I set export VLLM_FLASHINFER_MOE_BACKEND=throughput and the model loaded.

This seems to work reliably on 2x NVIDIA RTX PRO 6000 Blackwell (2x 96 GB VRAM):

# FlashInfer attention backend plus FlashInfer MoE kernels; per the error above,
# VLLM_USE_FLASHINFER_MOE_FP4=1 and VLLM_FLASHINFER_MOE_BACKEND=throughput are the key ones for NVFP4
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

# Run on 2 GPUs with tensor parallelism
CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --host 0.0.0.0 \
  --port 8345 \
  --served-model-name default-model lukealonso/MiniMax-M2-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --pipeline-parallel-size 1 \
  --enable-expert-parallel \
  --tensor-parallel-size 2 \
  --max-model-len 196608 \
  --max-num-seqs 32 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m2_append_think \
  --tool-call-parser minimax_m2 \
  --all2all-backend pplx \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --dtype auto \
  --kv-cache-dtype fp8
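
Once it is up, a quick smoke test against the OpenAI-compatible endpoint (port and served model name match the command above):

curl http://localhost:8345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default-model", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 64}'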

My environment:

  Python: 3.12.3
  vLLM: 0.11.2.dev360+g8e7a89160
  PyTorch: 2.9.0+cu130
  CUDA: 13.0
  GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  GPU 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  Triton: 3.5.0
  FlashInfer: 0.5.3
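
For anyone trying to reproduce a similar environment, this is one way to pull a recent nightly wheel plus FlashInfer (the index URL and package name are, to the best of my knowledge, the ones vLLM and FlashInfer document; the exact dev version you get will differ from the one above):

pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install flashinfer-python   # FlashInfer backend used by the exports above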

Works with the above, though with no gains over vllm/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68: ~72 tps at the high end, ~332 tps with 10 concurrent streams (2x RTX PRO 6000 @ 50k context).
python -c "import vllm; print(vllm.__version__); import torch; print(f'Torch Version: {torch.__version__}'); print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'CUDA Version: {torch.version.cuda}'); import triton; print(triton.__version__); import flashinfer; print(flashinfer.__version__)"
0.11.2.dev403+gb9d0504a3
Torch Version: 2.9.0+cu130
CUDA Available: True
CUDA Version: 13.0
3.5.0
0.5.3

That vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68 image seems to be the last one that actually ran NVFP4 out of the box at all.
For those who stumble on this, just use it via
docker pull lavd/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68
or
docker pull lavd/vllm-openai:nvfp4
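
A hedged example of serving the model with that image, assuming it keeps the standard vllm-openai entrypoint (arguments after the image name are passed straight to the server):

docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lavd/vllm-openai:nvfp4 \
  --model lukealonso/MiniMax-M2-NVFP4 \
  --tensor-parallel-size 2 \
  --trust-remote-code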

Thanks, finally it worked. But as others confirmed, it's the same speed as AWQ at the moment. :(

Same speed, but better perplexity.
It would be good if someone could measure against a2s-ai/MiniMax-M2-AWQ, for example.

Yeah, AWQ was never very good quality, it seems; it failed every time in testing vs Q4 GGUFs. Hopefully NVFP4 is a much higher-quality quantization method.

Interesting. My experience with the AWQ model was pretty bad. Has anyone tried a Qwen3-VL 235B NVFP4? I was looking for a quantized version of that one.

Like this? RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4

AWQ losing 1 point here

lm_eval --model local-completions --model_args model=mmm2,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8001/v1/completions,tokenizer=/data1/MiniMax-M2-AWQ/,max_tokens=4096 --tasks gsm8k_cot,humaneval --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/ --batch_size 20

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------|--------:|------------------|-------:|-------------|-------:|----------|
| gsm8k_cot | 3 | flexible-extract | 8 | exact_match | 0.9189 | ± 0.0075 |
| | | strict-match | 8 | exact_match | 0.9143 | ± 0.0077 |
| humaneval | 1 | create_test | 0 | pass@1 | 0.5000 | ± 0.0392 |

Requesting API: 100% 1319/1319 [06:37<00:00, 3.32it/s]

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|----------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9272 | ± 0.0072 |
| | | strict-match | 5 | exact_match | 0.9265 | ± 0.0072 |

Requesting API: 100% 56168/56168 [18:22<00:00, 50.94it/s]

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|--------------------|--------:|--------|--------|--------|-------:|----------|
| mmlu | 2 | none | | acc | 0.8164 | ± 0.0031 |
| - humanities | 2 | none | | acc | 0.7488 | ± 0.0061 |
| - other | 2 | none | | acc | 0.8616 | ± 0.0060 |
| - social sciences | 2 | none | | acc | 0.8921 | ± 0.0055 |
| - stem | 2 | none | | acc | 0.7989 | ± 0.0069 |

==========================================================================================================================
lm_eval --model local-completions --model_args model=mmm2nv,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8003/v1/completions,tokenizer=/data1/MiniMax-M2-NVFP4/,max_tokens=4096 --tasks gsm8k_cot,humaneval --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/ --batch_size 20

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------|--------:|------------------|-------:|-------------|-------:|----------|
| gsm8k_cot | 3 | flexible-extract | 8 | exact_match | 0.9257 | ± 0.0072 |
| | | strict-match | 8 | exact_match | 0.9212 | ± 0.0074 |
| humaneval | 1 | create_test | 0 | pass@1 | 0.5854 | ± 0.0386 |

Requesting API: 100% 1319/1319 [06:31<00:00, 3.37it/s]

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|----------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9378 | ± 0.0067 |
| | | strict-match | 5 | exact_match | 0.9363 | ± 0.0067 |

Requesting API: 100% 56168/56168 [16:58<00:00, 55.17it/s]

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|--------------------|--------:|--------|--------|--------|-------:|----------|
| mmlu | 2 | none | | acc | 0.8173 | ± 0.0031 |
| - humanities | 2 | none | | acc | 0.7564 | ± 0.0060 |
| - other | 2 | none | | acc | 0.8622 | ± 0.0060 |
| - social sciences | 2 | none | | acc | 0.8859 | ± 0.0056 |
| - stem | 2 | none | | acc | 0.7967 | ± 0.0069 |

============================================================================================================================

The full (unquantized) model, just for comparison:
lm_eval --model local-completions --model_args model=mmm2,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8001/v1/completions,tokenizer=/data1/MiniMax-M2/,num_concurrent=40,max_tokens=8168 --tasks mmlu --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|--------------------|--------:|--------|--------|--------|-------:|----------|
| mmlu | 2 | none | | acc | 0.8329 | ± 0.0030 |
| - humanities | 2 | none | | acc | 0.7747 | ± 0.0059 |
| - other | 2 | none | | acc | 0.8774 | ± 0.0057 |
| - social sciences | 2 | none | | acc | 0.9022 | ± 0.0053 |
| - stem | 2 | none | | acc | 0.8081 | ± 0.0068 |


It could also be that modern Q4 GGUFs are mostly dynamic, using imatrix, so a lot of the important layers are kept at higher precision. I think I've seen some AWQ quants that are an FP4/FP8 mix? However, for DeepSeek the only one I can run in 4x 96 GB is the smallest 4-bit AWQ; the other 4-bit variations I found did not fit.

The new vLLM 0.12.0 seems to have better support for NVFP4.

If anyone tries it, it would be nice to see if there is any improvement in speed.

For this model, 0.11.0 vs 0.12.0 is a ~1-2% difference in tps at 50k context (2x RTX PRO 6000), catching up 1:1 with SGLang at 73 tps @ 50k context.
