Quickly degrades

#3 · opened by ehartford

First off - thank you for the awesome AWQ quants!

I have got it running on my 8x 3090 node with this command:

vllm serve QuantTrio/MiniMax-M2-AWQ \
  --served-model-name MiniMax-M2 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --swap-space 16 \
  --max-num-seqs 3 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --trust-remote-code

But I get odd output: it seems fine at the beginning, then quickly degrades as it generates.

Do you have advice for this situation?

Thanks!
Eric Hartford

*(screenshot of the degraded output)*

Sounds about right to me... That's how my thoughts sound when my ADHD meds wear off towards the end of the day 🙃 lol!

Same issue here using recommended temp/top_k/top_p - totally incoherent babbling

QuantTrio org

Here is the output on my 4090 rig (using default params)

<think>The user requests a random fun fact about the Roman Empire. This is a benign request. There's no policy violation. We'll just provide a fun fact about the Roman Empire. Ensure we answer succinctly. Provide interesting fact, perhaps about the Roman baths, the Roman military engineering, or some other interesting fact.

Given the request "Tell me a random fun fact about the Roman Empire", we can comply. The answer is short and factual. We'll produce a fun fact. We'll be careful not to provide disallowed content. There's no mention of personal data. It's straightforward.

Thus the response: "Roman Emperor Nero fiddled while Rome burned? That's a myth. A more fun fact: The Romans used a form of credit called 'pecunia' and 'nummus' and had complex loan contracts." Or something about the existence of a massive network of roads. Or about the invention of concrete and the Pantheon dome. Or about the existence of a Roman city with a population that had 10,000 people living under one roof (the insulae). Or about the use of lead pipes for water distribution. Or about Roman emperors being declared divine. Or about their obsession with baths. Or about the fact that the Roman army used a "corps of engineers" that built roads, aqueducts, and catapults.

Pick one. Let's say: "Roman soldiers used a type of lead pipe to supply water to cities, and they made lead pipes that contributed to lead poisoning". That could be interesting but might be a bit dark. Better something more fun: "The Romans had a massive system of aqueducts that could transport water for miles, and at its height the network spanned over 400 miles in Rome alone." That might be a fun fact. Or "The Romans built a structure called the Pantheon, which has a dome with a central opening (oculus) that is still the world's largest unreinforced dome." That's also interesting.

Alternatively: "The Roman Empire had a system of 'tabeliones' who kept records, making early forms of paperwork." That's more obscure.

Better: "The Roman Empire's army used a type of military engineering: they built roads and bridges that still exist in many forms." That is not really fun. "Roman Emperor Hadrian built a massive wall across northern England (Hadrian's Wall) that was 73 miles long and took about 9 years to complete." That's interesting but more historical.

Fun fact: "The Romans invented a type of concrete that used volcanic ash and set underwater, which is why the Pantheon still stands today." That's also a good fact.

The user likely expects a random fun fact that is short.

We'll answer with a fun fact: "Roman soldiers often carried a 'pudenda' - a sack containing personal items." That is more obscure.

Better to choose something like "Roman baths often had piped hot and cold water with sophisticated heating systems, the 'hypocaust', to keep the water hot." That is known but could be considered fun.

Alternatively: "Roman soldiers used a form of a credit system, called 'pecunia', and had the concept of 'usury' regulated by law, with interest rates up to 12% per year."

Given the user probably wants something that stands out: The Roman Empire once had a 'Great Fire of Rome' that led to the construction of a new city plan with a grid system. Also that they used a secret password system "signum". Possibly the best: "The Romans built a 100-foot-long aqueduct that supplied the city of Rome with water and had to have a 'siphoning' technique, using inverted siphons made of stone pipes to transport water across valleys."

Better is: "The Roman Empire had a system of 'collegia' - professional associations - which organized everything from burial societies to entertainers. The Romans even had a 'professional gladiator' guild with its own union." That is interesting but maybe too obscure.

The best: "The Roman Empire had a unique, long-lasting form of engineering: the Pantheon's dome with a central oculus - a hole at the top that allowed light to stream down, and it remains the largest unreinforced dome ever built." That is a fun fact.

Another: "Roman soldiers were required to have a 'cervical collar' known as 'pugio' to protect their necks." That's not correct.

Better: "Roman army's supply lines were called 'viae militares' and were so well developed that they still affect Europe's road network." That is true.

Thus, answer: "The Roman Empire had a network of roads totaling roughly 250,000 miles (400,000 km) and some of those roads are still used today, being the longest continuously used roads." That's fun.

Alternatively, "Rome was the first empire to have an official currency, the 'denarius', which remained in circulation for centuries." Not exactly.

Let's choose something with a fun twist: "Romans used 'spongia' - sponges made from sea sponges - to wipe their faces." Not interesting.

Better: "The Roman army used a 'cavalry' with a horse harness called 'tribulum' that allowed them to turn quickly." Not sure.

Better: "Romans invented 'concrete' using volcanic ash, which allowed the construction of large structures like the Colosseum."

Yes. That is a fun fact: "The Pantheon, built around 126 AD, features a dome with a central opening (oculus) that is 43 feet in diameter, making it the world's largest unreinforced concrete dome, still standing after 2,000 years."

Given the simplicity, I can respond with one or two fun facts. Provide a short but interesting fact.

I think I'll answer: "Roman Empire had a system of aqueducts that used the 'inverted siphon' technique to carry water across valleys, making possible massive water supply that could deliver more than 2 million gallons per day to Rome's fountains."

But the user likely expects one fun fact. Provide it. Let's be succinct and interesting.

Thus answer:

"One amusing fact: The Romans built the world’s first known "public libraries", but they also had a system of private 'libraries' where wealthy Romans displayed scrolls as a status symbol – even the walls of some Roman apartments were lined with shelves for hundreds of scrolls."

Probably that is okay.

Alternatively: "Roman Emperors were sometimes declared 'God' after death; Caligula's statue was worshipped."

I think the Pantheon fact is best.

Thus final answer: a fun fact about the Roman Empire. Let's answer with: "The Pantheon's dome remains the largest unreinforced concrete dome and the only one with a central opening (the oculus)."

Yes.

Now produce answer.
</think>

**Fun Roman‑Empire fact:**  
The Roman Empire built the **Pantheon** in Rome (c. 126 AD). Its dome is the largest unreinforced concrete dome ever constructed, measuring about 43 feet (13 m) across. At the very top of the dome there’s an oculus – a circular opening that lets daylight stream in and rain fall through, creating an ever‑changing “celestial” glow. Even after more than 2,000 years, the Pantheon still stands without any steel or other reinforcements, a testament to Roman concrete’s remarkable durability.
QuantTrio org

another one

<think>I need to tell a fun fact about the Roman Empire! Here’s one: the empire covered around 1.7 million square miles at its peak, about half the size of the USA, and had around 50-70 million inhabitants. That's fascinating! Another fun fact is that "spaghetti" isn't from ancient Rome; instead, Romans enjoyed pasta, but it was more like "lagane" or "macaria," made from durum wheat, often baked with cheese and honey. There are many more interesting tidbits to explore!
</think>

### Fun fact: The Romans built a 19‑meter-wide, 600‑km-long superhighway called the Via Appia

Known as the “Queen of Roads,” the Via Appia connected Rome to Brindisi, allowing troop movements and trade across the peninsula. It was such an engineering marvel that parts are still visible today.

Dual RTX 6000 Pro Blackwell setup here, latest nightly vLLM container, recommended settings. First prompt is OK, then it goes bonkers and never stops outputting:

*(screenshot of the runaway output)*

Tried changing "model_type": "mixtral" to "model_type": "minimax" in config.json, and tried copying generate_config.json from the original repo, but neither really helped.

docker run -it --rm \
--name ${CONTAINER_NAME} \
--runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /home/shane/models:/models \
--add-host="host.docker.internal:host-gateway" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:nightly \
--model /models/MiniMax-M2-AWQ \
--served-model-name "MiniMax-M2" \
--host 0.0.0.0 \
--port 8000 \
--swap-space 16 \
--max-num-seqs 32 \
--enable-expert-parallel \
--tensor-parallel-size 2 \
--trust-remote-code \
--max-model-len 32768 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--gpu-memory-utilization 0.9

QuantTrio org

I just tried about 10 times in a row on my 4090 machine, and all of the results were coherent.
At this point, I'm not sure what the cause is 🥹.

What version of pytorch, vllm, cuda are you on?

Can you share your vllm arguments?

QuantTrio org

Ubuntu 22, Python 3.12, CUDA 12.8

vLLM installed using

pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

served using

export CUDA_VISIBLE_DEVICES=0,1,2,3
export OMP_NUM_THREADS=4
vllm serve \
    $MODEL_PATH \
    --served-model-name $MODEL_NAME \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --swap-space 4 \
    --max-num-seqs 8 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.90 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --distributed-executor-backend mp \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 12345

pip list

Package                            Version
---------------------------------- --------------------------------
aiohappyeyeballs                   2.6.1
aiohttp                            3.13.2
aiosignal                          1.4.0
annotated-doc                      0.0.3
annotated-types                    0.7.0
anthropic                          0.71.0
anyio                              4.11.0
apache-tvm-ffi                     0.1.0b15
astor                              0.8.1
attrs                              25.4.0
blake3                             1.0.8
cachetools                         6.2.1
cbor2                              5.7.1
certifi                            2025.10.5
cffi                               2.0.0
charset-normalizer                 3.4.4
click                              8.2.1
cloudpickle                        3.1.1
compressed-tensors                 0.12.2
cuda-bindings                      13.0.3
cuda-pathfinder                    1.3.2
cuda-python                        13.0.3
cupy-cuda12x                       13.6.0
depyf                              0.20.0
dill                               0.4.0
diskcache                          5.6.3
distro                             1.9.0
dnspython                          2.8.0
docstring_parser                   0.17.0
einops                             0.8.1
email-validator                    2.3.0
fastapi                            0.120.2
fastapi-cli                        0.0.14
fastapi-cloud-cli                  0.3.1
fastrlock                          0.8.3
filelock                           3.20.0
flashinfer-python                  0.4.1
frozenlist                         1.8.0
fsspec                             2025.9.0
gguf                               0.17.1
h11                                0.16.0
hf-xet                             1.2.0
httpcore                           1.0.9
httptools                          0.7.1
httpx                              0.28.1
huggingface-hub                    0.36.0
idna                               3.11
importlib_metadata                 8.7.0
interegular                        0.3.3
Jinja2                             3.1.6
jiter                              0.11.1
jsonschema                         4.25.1
jsonschema-specifications          2025.9.1
lark                               1.2.2
llguidance                         0.7.30
llvmlite                           0.44.0
lm-format-enforcer                 0.11.3
loguru                             0.7.3
markdown-it-py                     4.0.0
MarkupSafe                         3.0.3
mdurl                              0.1.2
mistral_common                     1.8.5
mpmath                             1.3.0
msgpack                            1.1.2
msgspec                            0.19.0
multidict                          6.7.0
networkx                           3.5
ninja                              1.13.0
numba                              0.61.2
numpy                              2.2.6
nvidia-cublas-cu12                 12.8.4.1
nvidia-cuda-cupti-cu12             12.8.90
nvidia-cuda-nvrtc-cu12             12.8.93
nvidia-cuda-runtime-cu12           12.8.90
nvidia-cudnn-cu12                  9.10.2.21
nvidia-cudnn-frontend              1.15.0
nvidia-cufft-cu12                  11.3.3.83
nvidia-cufile-cu12                 1.13.1.3
nvidia-curand-cu12                 10.3.9.90
nvidia-cusolver-cu12               11.7.3.90
nvidia-cusparse-cu12               12.5.8.93
nvidia-cusparselt-cu12             0.7.1
nvidia-cutlass-dsl                 4.3.0.dev0
nvidia-ml-py                       13.580.82
nvidia-nccl-cu12                   2.27.5
nvidia-nvjitlink-cu12              12.8.93
nvidia-nvshmem-cu12                3.3.20
nvidia-nvtx-cu12                   12.8.90
openai                             2.6.1
openai-harmony                     0.0.4
opencv-python-headless             4.12.0.88
opentelemetry-api                  1.38.0
opentelemetry-sdk                  1.38.0
opentelemetry-semantic-conventions 0.59b0
outlines_core                      0.2.11
packaging                          25.0
partial-json-parser                0.2.1.1.post6
pillow                             12.0.0
pip                                25.0.1
prometheus_client                  0.23.1
prometheus-fastapi-instrumentator  7.1.0
propcache                          0.4.1
protobuf                           6.33.0
psutil                             7.1.2
py-cpuinfo                         9.0.0
pybase64                           1.4.2
pycountry                          24.6.1
pycparser                          2.23
pydantic                           2.12.3
pydantic_core                      2.41.4
pydantic-extra-types               2.10.6
Pygments                           2.19.2
python-dotenv                      1.2.1
python-json-logger                 4.0.0
python-multipart                   0.0.20
PyYAML                             6.0.3
pyzmq                              27.1.0
ray                                2.51.0
referencing                        0.37.0
regex                              2025.10.23
requests                           2.32.5
rich                               14.2.0
rich-toolkit                       0.15.1
rignore                            0.7.2
rpds-py                            0.28.0
safetensors                        0.6.2
scipy                              1.16.3
sentencepiece                      0.2.1
sentry-sdk                         3.0.0a7
setproctitle                       1.3.7
setuptools                         79.0.1
shellingham                        1.5.4
six                                1.17.0
sniffio                            1.3.1
soundfile                          0.13.1
soxr                               1.0.0
starlette                          0.49.1
sympy                              1.14.0
tabulate                           0.9.0
tiktoken                           0.12.0
tokenizers                         0.22.1
torch                              2.9.0
torchaudio                         2.9.0
torchvision                        0.24.0
tqdm                               4.67.1
transformers                       4.57.1
triton                             3.5.0
typer                              0.20.0
typing_extensions                  4.15.0
typing-inspection                  0.4.2
urllib3                            2.5.0
uvicorn                            0.38.0
uvloop                             0.22.1
vllm                               0.11.1rc5.dev34+g48eb8eba5.cu129
watchfiles                         1.1.1
websockets                         15.0.1
xgrammar                           0.1.25
yarl                               1.22.0
zipp                               3.23.0

Tried these exact settings (but modified for 2 gpus) - first prompt works, then just totally crazy. Tried latest vllm nightly pushed 7 hrs ago as well. Must be some issue with blackwell.

QuantTrio org

> Tried these exact settings (but modified for 2 gpus) - first prompt works, then just totally crazy. Tried latest vllm nightly pushed 7 hrs ago as well. Must be some issue with blackwell.

Have you tried the uv method as written in the official vLLM M2 guide?

uv venv
source .venv/bin/activate
uv pip install 'triton-kernels @ git+https://github.com/triton-lang/[email protected]#subdirectory=python/triton_kernels' \
   vllm --extra-index-url https://wheels.vllm.ai/nightly --prerelease=allow

Yes, unfortunately the install seems broken at the moment and can't resolve dependencies.

The LLM starts generating output but quickly goes off-track, repeating words and phrases endlessly without a proper stop token.

using recommended top_p, top_k, temperature

"temperature": 1.0,
"top_p": 0.95,
"stream": false,
"top_k": 40

Could it be due to file #27 being missing as reported in the other post?

QuantTrio org

darn, yes...

QuantTrio org

Sorry, that was my mistake. I’ve now uploaded model-00027-of-00041.safetensors.

Thanks - looks like we're actually missing 18, 19, 21, 23, 25, 30, 35 as well. Can't wait to run it
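For anyone else hitting this, here is a quick sketch (a hypothetical helper, assuming the standard HF sharded-safetensors layout) that lists which shard files the index expects but are missing from the local download:

```python
import json
import tempfile
from pathlib import Path

def missing_shards(model_dir: str) -> list[str]:
    """Return shard filenames listed in model.safetensors.index.json
    that are not present in model_dir."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    expected = sorted(set(index["weight_map"].values()))
    return [name for name in expected if not (root / name).exists()]

# Demo with a fake download directory: the index expects two shards,
# but only the first was actually downloaded.
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "model.safetensors.index.json").write_text(json.dumps({
        "weight_map": {
            "w.a": "model-00001-of-00002.safetensors",
            "w.b": "model-00002-of-00002.safetensors",
        }
    }))
    (root / "model-00001-of-00002.safetensors").touch()
    result = missing_shards(d)
    print(result)  # ['model-00002-of-00002.safetensors']
```

Pointing it at the real download directory would have flagged shards 18, 19, 21, 23, 25, 27, 30, and 35 right away.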

QuantTrio org

I’m in the process of uploading the missing files. This is definitely my mistake, and I sincerely apologize for the inconvenience. I’ll double-check everything to make sure it’s all in order!

All good, love these quants!

QuantTrio org

This was my oversight. I have now uploaded the missing files and fully validated the model on a 2xA100 setup.

Upload model-00018-of-00041.safetensors
Upload model-00019-of-00041.safetensors
Upload model-00021-of-00041.safetensors
Upload model-00023-of-00041.safetensors
Upload model-00025-of-00041.safetensors
Upload model-00027-of-00041.safetensors
Upload model-00030-of-00041.safetensors
Upload model-00035-of-00041.safetensors

amazing, I will give it a try!

Looks solid! It's consistently telling me fun Roman facts I wish I didn't know now

QuantTrio org

👌

So far this model seems to work flawlessly with Kilo Code and Claude Code. Really amazing, thank you so much for this!
