NVFP4 quantization of MiniMax-M2, produced with NVIDIA TensorRT Model Optimizer (modelopt).
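The exact quantization recipe for this checkpoint is not documented here; the sketch below only illustrates what a typical modelopt NVFP4 post-training-quantization pass looks like. The calibration data, the `mtq.NVFP4_DEFAULT_CFG` config constant, and the export call are assumptions drawn from modelopt's published examples, not a record of what was actually run.

```python
# Hypothetical sketch of an NVFP4 PTQ pass with nvidia-modelopt.
# Not the exact recipe used for this checkpoint; calibration data and
# config are placeholders.
import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniMaxAI/MiniMax-M2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Tiny placeholder calibration set; real runs use a few hundred representative samples.
calib_texts = ["The quick brown fox jumps over the lazy dog."] * 16

def forward_loop(m):
    # modelopt collects activation statistics while this loop runs.
    with torch.no_grad():
        for text in calib_texts:
            inputs = tokenizer(text, return_tensors="pt").to(m.device)
            m(**inputs)

# Quantize to NVFP4 and export an HF-style checkpoint that vLLM can serve.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="MiniMax-M2-NVFP4")
```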
Tested (but not extensively validated) on 2x RTX Pro 6000 Blackwell GPUs with the following Docker Compose service:
```yaml
services:
  inference:
    image: vllm/vllm-openai:nightly
    container_name: inference
    ports:
      - "0.0.0.0:8000:8000"
    gpus: "all"
    shm_size: "32g"
    ipc: "host"
    ulimits:
      memlock: -1
      nofile: 1048576
    environment:
      # NCCL settings for a single-node, 2-GPU setup (no InfiniBand, no NVLink SHARP).
      - NCCL_IB_DISABLE=1
      - NCCL_NVLS_ENABLE=0
      - NCCL_P2P_DISABLE=0
      - NCCL_SHM_DISABLE=0
      - VLLM_USE_V1=1
      # Use FlashInfer FP4 MoE kernels for the NVFP4 experts.
      - VLLM_USE_FLASHINFER_MOE_FP4=1
      - OMP_NUM_THREADS=8
      - SAFETENSORS_FAST_GPU=1
    volumes:
      - /dev/shm:/dev/shm
    command:
      # Arguments passed to the vLLM OpenAI-compatible server; the model ID comes first.
      - lukealonso/MiniMax-M2-NVFP4
      - --enable-auto-tool-choice
      - --tool-call-parser
      - minimax_m2
      - --reasoning-parser
      - minimax_m2_append_think
      - --all2all-backend
      - pplx
      - --enable-expert-parallel
      - --enable-prefix-caching
      - --enable-chunked-prefill
      - --served-model-name
      - "MiniMax-M2"
      - --tensor-parallel-size
      - "2"
      - --gpu-memory-utilization
      - "0.95"
      - --max-num-batched-tokens
      - "16384"
      - --dtype
      - "auto"
      - --max-num-seqs
      - "8"
      - --kv-cache-dtype
      - fp8
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
```
Base model: [MiniMaxAI/MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2)