Can't stop it from spewing garbage after about 1000 tokens...
#2 by SuperbEmphasis - opened
My vllm command:
vllm serve /models/Llama-3_3-Nemotron-Super-49B-v1-FP8 --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --quantization modelopt --trust-remote-code --max-model-len 32786
I also tried adding a Llama 3 Jinja chat template:
--chat-template /models/templates/llama3.jinja
But every response seems to stream forever and never stops correctly.
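For reference, a minimal request that reproduces the behavior might look like this; this is a sketch assuming the server is on vLLM's default port 8000 and uses the standard OpenAI-compatible `/v1/chat/completions` endpoint that vLLM exposes:

```shell
# Minimal chat completion request against the vLLM OpenAI-compatible server
# (assumes the default port 8000; "model" must match the served model path)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Llama-3_3-Nemotron-Super-49B-v1-FP8",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 2048
  }'
```

With `max_tokens` set, the stream at least terminates at the cap, but the model never emits an end-of-turn token on its own.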