Do you know which nightly it worked with? Because it doesn't with the current one.
https://hub.docker.com/r/vllm/vllm-openai/tags
nvm, it's vllm/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68
let me know if it works for you with that image
It does; same performance as AWQ, maybe an extra 1% on tests.
We've got to get to the bottom of what's going on with vLLM NVFP4 and Blackwell SM120.
How is support on non-Blackwell GPUs?
Really wanted to try this model, but it looks like that nightly is no longer available :(
The commit is:
https://github.com/vllm-project/vllm/commit/11fd69dd54060a59c6f62a6d217e1ecc47d74a68
The commit is in v0.11.2 (doesn't work), v0.11.1 (doesn't work), v0.11.1rc7 (not available), v0.11.1rc6 (not available).
I also tried a bunch of nightlies around that commit, but nothing.
I haven't tried building a Dockerfile for it, but if someone has and it works, that would be great!
Unfortunately NVFP4, despite being pretty amazing (on paper), doesn't receive enough love yet. Hopefully soon.
In any case, thank you for quantizing this awesome model!
Try running with the latest vLLM compiled from source, not Docker. There is currently a CUDA image regression in the Docker images.
You want ENABLE_CUTLASS_MOE_SM120=1: https://github.com/vllm-project/vllm/pull/29242
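A rough sketch of what "latest vLLM compiled" plus that flag could look like; the build steps and launch flags below are assumptions, not a verified recipe:

# Build vLLM from source (compiles the CUDA kernels, takes a while)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

# Launch with the SM120 CUTLASS MoE path enabled (flag from the PR above)
ENABLE_CUTLASS_MOE_SM120=1 vllm serve lukealonso/MiniMax-M2-NVFP4 \
    --tensor-parallel-size 2 \
    --trust-remote-code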
I was getting errors like this:
ValueError: CutlassExpertsFp4 doesn't support DP. Use flashinfer CUTLASS Fused MoE backend instead (set VLLM_USE_FLASHINFER_MOE_FP4=1)
I set VLLM_USE_FLASHINFER_MOE_FP4=1, but the error remained.
Examining the vLLM code, this error is thrown because only two VLLM_FLASHINFER_MOE_BACKEND values are supported: masked_gemm and throughput.
I set export VLLM_FLASHINFER_MOE_BACKEND=throughput and the model loaded.
This seems to work reliably on 2x NVIDIA RTX PRO 6000 Blackwell (2x 96 GB VRAM):
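# Assumption: these exports force the FlashInfer attention backend and opt in to the
# FlashInfer fused-MoE kernels (throughput mode) for the FP16/FP8/FP4/MXFP4 paths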
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# Run on 2 GPUs with tensor parallelism
CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2-NVFP4 \
--host 0.0.0.0 \
--port 8345 \
--served-model-name default-model lukealonso/MiniMax-M2-NVFP4 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--pipeline-parallel-size 1 \
--enable-expert-parallel \
--tensor-parallel-size 2 \
--max-model-len 196608 \
--max-num-seqs 32 \
--enable-auto-tool-choice \
--reasoning-parser minimax_m2_append_think \
--tool-call-parser minimax_m2 \
--all2all-backend pplx \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384 \
--dtype auto \
--kv-cache-dtype fp8
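Once it's up, a quick smoke test against the OpenAI-compatible endpoint; the port and served model name match the command above, and the prompt is just an example:

curl -s http://localhost:8345/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default-model", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}'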
My environment:
Python: 3.12.3
vLLM: 0.11.2.dev360+g8e7a89160
PyTorch: 2.9.0+cu130
CUDA: 13.0
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Triton: 3.5.0
FlashInfer: 0.5.3
Works with the above, though with no gains over vllm/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68: peaks around ~72 tps single-stream and ~332 tps with 10 concurrent requests (2x RTX PRO 6000 @ 50k context).
python -c "import vllm; print(vllm.version); import torch; print(f'Torch Version: {torch.version}'); print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'CUDA Version: {torch.version.cuda}'); import triton; print(triton.version); import flashinfer; print(flashinfer.version)"
0.11.2.dev403+gb9d0504a3
Torch Version: 2.9.0+cu130
CUDA Available: True
CUDA Version: 13.0
3.5.0
0.5.3
That vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68 seems to be the last one that actually ran NVFP4 out of the box at all.
For those who stumble on this, just use it via:
docker pull lavd/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68
or
docker pull lavd/vllm-openai:nvfp4
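A minimal docker run sketch for that image, assuming it keeps the standard vllm-openai entrypoint; the port, cache mount, and flags are illustrative:

docker run --gpus all --ipc=host -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    lavd/vllm-openai:nvfp4 \
    --model lukealonso/MiniMax-M2-NVFP4 \
    --tensor-parallel-size 2 \
    --trust-remote-code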
Thanks, it finally worked. But as others confirmed, it's the same speed as AWQ at the moment. :(
Same speed, but better perplexity.
It would be great if someone could measure against a2s-ai/MiniMax-M2-AWQ, for example.
Yeah, AWQ was never very good quality, it seems; it failed every time in testing vs Q4 GGUFs. Hopefully NVFP4 is a much higher quality quantization method.
Interesting. My experience with the AWQ model was pretty bad. Has anyone tried a Qwen3 VL 235B NVFP4? I was looking for a quantized version of that one.
Like this? RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4
AWQ losing about 1 point here (AWQ run first, then the NVFP4 run below):
lm_eval --model local-completions --model_args model=mmm2,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8001/v1/completions,tokenizer=/data1/MiniMax-M2-AWQ/,max_tokens=4096 --tasks gsm8k_cot,humaneval --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/ --batch_size 20
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| gsm8k_cot | 3 | flexible-extract | 8 | exact_match | ↑ | 0.9189 | ± | 0.0075 |
| | | strict-match | 8 | exact_match | ↑ | 0.9143 | ± | 0.0077 |
| humaneval | 1 | create_test | 0 | pass@1 | ↑ | 0.5000 | ± | 0.0392 |

Requesting API: 100% 1319/1319 [06:37<00:00, 3.32it/s]

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.9272 | ± | 0.0072 |
| | | strict-match | 5 | exact_match | ↑ | 0.9265 | ± | 0.0072 |

Requesting API: 100% 56168/56168 [18:22<00:00, 50.94it/s]

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---|---|---|---:|---|---:|
| mmlu | 2 | none | | acc | ↑ | 0.8164 | ± | 0.0031 |
| - humanities | 2 | none | | acc | ↑ | 0.7488 | ± | 0.0061 |
| - other | 2 | none | | acc | ↑ | 0.8616 | ± | 0.0060 |
| - social sciences | 2 | none | | acc | ↑ | 0.8921 | ± | 0.0055 |
| - stem | 2 | none | | acc | ↑ | 0.7989 | ± | 0.0069 |
==========================================================================================================================
lm_eval --model local-completions --model_args model=mmm2nv,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8003/v1/completions,tokenizer=/data1/MiniMax-M2-NVFP4/,max_tokens=4096 --tasks gsm8k_cot,humaneval --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/ --batch_size 20
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| gsm8k_cot | 3 | flexible-extract | 8 | exact_match | ↑ | 0.9257 | ± | 0.0072 |
| | | strict-match | 8 | exact_match | ↑ | 0.9212 | ± | 0.0074 |
| humaneval | 1 | create_test | 0 | pass@1 | ↑ | 0.5854 | ± | 0.0386 |

Requesting API: 100% 1319/1319 [06:31<00:00, 3.37it/s]

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.9378 | ± | 0.0067 |
| | | strict-match | 5 | exact_match | ↑ | 0.9363 | ± | 0.0067 |

Requesting API: 100% 56168/56168 [16:58<00:00, 55.17it/s]

| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---|---|---|---:|---|---:|
| mmlu | 2 | none | | acc | ↑ | 0.8173 | ± | 0.0031 |
| - humanities | 2 | none | | acc | ↑ | 0.7564 | ± | 0.0060 |
| - other | 2 | none | | acc | ↑ | 0.8622 | ± | 0.0060 |
| - social sciences | 2 | none | | acc | ↑ | 0.8859 | ± | 0.0056 |
| - stem | 2 | none | | acc | ↑ | 0.7967 | ± | 0.0069 |
============================================================================================================================
Full (unquantized) model, just for comparison:
lm_eval --model local-completions --model_args model=mmm2,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8001/v1/completions,tokenizer=/data1/MiniMax-M2/,num_concurrent=40,max_tokens=8168 --tasks mmlu --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/
| Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---|---|---|---:|---|---:|
| mmlu | 2 | none | | acc | ↑ | 0.8329 | ± | 0.0030 |
| - humanities | 2 | none | | acc | ↑ | 0.7747 | ± | 0.0059 |
| - other | 2 | none | | acc | ↑ | 0.8774 | ± | 0.0057 |
| - social sciences | 2 | none | | acc | ↑ | 0.9022 | ± | 0.0053 |
| - stem | 2 | none | | acc | ↑ | 0.8081 | ± | 0.0068 |
Could also be that modern Q4 GGUFs are mostly dynamic, using imatrix, so a lot of the important layers are kept at higher precision. I think I've seen some AWQ quants that are an FP4/FP8 mix? However, for DeepSeek the only 4-bit AWQ I can run in 4x 96 GB is the smallest one; the other 4-bit variants I found did not fit.
The new vLLM 0.12.0 seems to have better support for NVFP4.
If anyone tries it, it would be nice to see whether there is any speed improvement.
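If you're on a pip-managed install, a minimal upgrade sketch (the version pin is just for illustration):

pip install -U "vllm>=0.12.0"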
For this model, 0.11.0 vs 0.12.0 is a ~1-2% difference in tps at 50k context (2x RTX PRO 6000), now roughly on par with SGLang at 73 tps @ 50k context.