Do you know which nightly it worked with? Because it does not work with the current one

#1
by willfalco - opened

https://hub.docker.com/r/vllm/vllm-openai/tags

nvm, it is vllm/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68

Let me know if it works for you with that image.

It does, same performance as AWQ, maybe an extra 1% on tests.
We've got to get to the bottom of what is going on with vLLM NVFP4 and Blackwell SM120.

How is support on non-Blackwell GPUs?

Really wanted to try this model, but it looks like that nightly is no longer available :(

The commit is:
https://github.com/vllm-project/vllm/commit/11fd69dd54060a59c6f62a6d217e1ecc47d74a68

The commit is in v0.11.2 (doesn't work), v0.11.1 (doesn't work), v0.11.1rc7 (not available), and v0.11.1rc6 (not available).
I also tried a bunch of nightlies with that commit, but nothing worked.
I haven't tried to build a Dockerfile for it, but if someone has and it works, that would be great!
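For reference, an untested sketch of what a source build pinned to that commit could look like (assumes a working CUDA toolchain and a compatible PyTorch already installed):

# Untested: check out vLLM at the commit from the working nightly and build from source.
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 11fd69dd54060a59c6f62a6d217e1ecc47d74a68
pip install -e .   # compiles the CUDA kernels, can take a long time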

Unfortunately NVFP4, despite being pretty amazing (on paper), doesn't receive enough love yet. Hopefully soon.

In any case, thank you for quantizing this awesome model!

Try running with the latest vLLM compiled from source, not the Docker image; there is currently a CUDA image regression in the Docker builds.
You want ENABLE_CUTLASS_MOE_SM120=1, see https://github.com/vllm-project/vllm/pull/29242
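A minimal sketch of that, with the flag name copied verbatim from the comment/PR link above (verify it against the PR before relying on it) and the model/TP settings taken from later in this thread:

ENABLE_CUTLASS_MOE_SM120=1 vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --tensor-parallel-size 2 \
  --trust-remote-code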

I was getting errors like this:

ValueError: CutlassExpertsFp4 doesn't support DP. Use flashinfer CUTLASS Fused MoE backend instead (set VLLM_USE_FLASHINFER_MOE_FP4=1)

I set VLLM_USE_FLASHINFER_MOE_FP4=1 but the errors remain.

Examining the vLLM code, I found this error is thrown because only two VLLM_FLASHINFER_MOE_BACKEND values are supported: masked_gemm and throughput.

I set export VLLM_FLASHINFER_MOE_BACKEND=throughput and the model loaded.

This seems to work reliably on 2x NVIDIA RTX PRO 6000 Blackwell (2x 96 GB VRAM):

# FlashInfer attention backend plus FlashInfer MoE kernels; per the error above,
# VLLM_USE_FLASHINFER_MOE_FP4=1 and VLLM_FLASHINFER_MOE_BACKEND=throughput are the key ones for NVFP4
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

# Run on 2 GPUs with tensor parallelism
CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --host 0.0.0.0 \
  --port 8345 \
  --served-model-name default-model lukealonso/MiniMax-M2-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --pipeline-parallel-size 1 \
  --enable-expert-parallel \
  --tensor-parallel-size 2 \
  --max-model-len 196608 \
  --max-num-seqs 32 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m2_append_think \
  --tool-call-parser minimax_m2 \
  --all2all-backend pplx \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --dtype auto \
  --kv-cache-dtype fp8
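
Once it is up, a quick smoke test against the OpenAI-compatible endpoint (port and served model name match the command above):

curl http://localhost:8345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default-model", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 64}'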

My environment:

  Python: 3.12.3
  vLLM: 0.11.2.dev360+g8e7a89160
  PyTorch: 2.9.0+cu130
  CUDA: 13.0
  GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  GPU 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  Triton: 3.5.0
  FlashInfer: 0.5.3
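
For anyone trying to reproduce a similar environment, this is one way to pull a recent nightly wheel plus FlashInfer (the index URL and package name are, to the best of my knowledge, the ones vLLM and FlashInfer document; the exact dev version you get will differ from the one above):

pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install flashinfer-python   # FlashInfer backend used by the exports above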

Works with the above, though with no gains over vllm/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68: ~72 tps at the high end, ~332 tps with 10 concurrent streams (2x RTX PRO 6000 @ 50k context).
python -c "import vllm; print(vllm.__version__); import torch; print(f'Torch Version: {torch.__version__}'); print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'CUDA Version: {torch.version.cuda}'); import triton; print(triton.__version__); import flashinfer; print(flashinfer.__version__)"
0.11.2.dev403+gb9d0504a3
Torch Version: 2.9.0+cu130
CUDA Available: True
CUDA Version: 13.0
3.5.0
0.5.3

That vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68 image seems to be the last one that actually ran NVFP4 out of the box at all.
For those who stumble on this, just use it via
docker pull lavd/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68
or
docker pull lavd/vllm-openai:nvfp4
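
A hedged example of serving the model with that image, assuming it keeps the standard vllm-openai entrypoint (arguments after the image name are passed straight to the server):

docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lavd/vllm-openai:nvfp4 \
  --model lukealonso/MiniMax-M2-NVFP4 \
  --tensor-parallel-size 2 \
  --trust-remote-code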

Thanks, finally it worked. But as others confirmed, it's the same speed as AWQ at the moment. :(

Same speed, but better perplexity.
It would be good if someone could measure against a2s-ai/MiniMax-M2-AWQ, for example.

Yeah, AWQ was never very good quality, it seems; it failed every time in testing vs Q4 GGUFs. Hopefully NVFP4 is a much higher-quality quantization method.

Interesting. My experience with the AWQ model was pretty bad. Has anyone tried a Qwen3-VL 235B NVFP4? I was looking for a quantized version of that one.

Like this? RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4

AWQ losing 1 point here

lm_eval --model local-completions --model_args model=mmm2,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8001/v1/completions,tokenizer=/data1/MiniMax-M2-AWQ/,max_tokens=4096 --tasks gsm8k_cot,humaneval --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/ --batch_size 20

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------|--------:|------------------|-------:|-------------|-------:|----------|
| gsm8k_cot | 3 | flexible-extract | 8 | exact_match | 0.9189 | ± 0.0075 |
| | | strict-match | 8 | exact_match | 0.9143 | ± 0.0077 |
| humaneval | 1 | create_test | 0 | pass@1 | 0.5000 | ± 0.0392 |

Requesting API: 100% 1319/1319 [06:37<00:00, 3.32it/s]

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|----------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9272 | ± 0.0072 |
| | | strict-match | 5 | exact_match | 0.9265 | ± 0.0072 |

Requesting API: 100% 56168/56168 [18:22<00:00, 50.94it/s]

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|--------------------|--------:|--------|--------|--------|-------:|----------|
| mmlu | 2 | none | | acc | 0.8164 | ± 0.0031 |
| - humanities | 2 | none | | acc | 0.7488 | ± 0.0061 |
| - other | 2 | none | | acc | 0.8616 | ± 0.0060 |
| - social sciences | 2 | none | | acc | 0.8921 | ± 0.0055 |
| - stem | 2 | none | | acc | 0.7989 | ± 0.0069 |

==========================================================================================================================
lm_eval --model local-completions --model_args model=mmm2nv,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8003/v1/completions,tokenizer=/data1/MiniMax-M2-NVFP4/,max_tokens=4096 --tasks gsm8k_cot,humaneval --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/ --batch_size 20

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------|--------:|------------------|-------:|-------------|-------:|----------|
| gsm8k_cot | 3 | flexible-extract | 8 | exact_match | 0.9257 | ± 0.0072 |
| | | strict-match | 8 | exact_match | 0.9212 | ± 0.0074 |
| humaneval | 1 | create_test | 0 | pass@1 | 0.5854 | ± 0.0386 |

Requesting API: 100% 1319/1319 [06:31<00:00, 3.37it/s]

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|----------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9378 | ± 0.0067 |
| | | strict-match | 5 | exact_match | 0.9363 | ± 0.0067 |

Requesting API: 100% 56168/56168 [16:58<00:00, 55.17it/s]

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|--------------------|--------:|--------|--------|--------|-------:|----------|
| mmlu | 2 | none | | acc | 0.8173 | ± 0.0031 |
| - humanities | 2 | none | | acc | 0.7564 | ± 0.0060 |
| - other | 2 | none | | acc | 0.8622 | ± 0.0060 |
| - social sciences | 2 | none | | acc | 0.8859 | ± 0.0056 |
| - stem | 2 | none | | acc | 0.7967 | ± 0.0069 |

============================================================================================================================

The full (unquantized) model, just for comparison:
lm_eval --model local-completions --model_args model=mmm2,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8001/v1/completions,tokenizer=/data1/MiniMax-M2/,num_concurrent=40,max_tokens=8168 --tasks mmlu --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|--------------------|--------:|--------|--------|--------|-------:|----------|
| mmlu | 2 | none | | acc | 0.8329 | ± 0.0030 |
| - humanities | 2 | none | | acc | 0.7747 | ± 0.0059 |
| - other | 2 | none | | acc | 0.8774 | ± 0.0057 |
| - social sciences | 2 | none | | acc | 0.9022 | ± 0.0053 |
| - stem | 2 | none | | acc | 0.8081 | ± 0.0068 |


It could also be that modern Q4 GGUFs are mostly dynamic, using imatrix, so a lot of the important layers are kept at higher precision. I think I've seen some AWQ quants that are an FP4/FP8 mix? However, for DeepSeek the only one I can run in 4x 96 GB is the smallest 4-bit AWQ; the other 4-bit variations I found did not fit.

The new vLLM 0.12.0 seems to have better support for NVFP4.

If anyone tries it, it would be nice to see if there is any improvement in speed.

For this model, 0.11.0 vs 0.12.0 is a ~1-2% difference in tps at 50k context (2x RTX PRO 6000), catching up 1:1 with SGLang at 73 tps @ 50k context.
