Broken results with vLLM

#4
by meiragat - opened

Hi,
I'm getting nonsensical results and hope you can help me.
I'm running an RTX 5090 on Runpod (I also get similar results with an L40).
I also tried without specifying top_k, min_p, temperature, etc., and got even worse results.

vllm/vllm-openai:latest
--model gaunernst/gemma-3-27b-it-qat-autoawq --served-model-name gemma-3-27b --max-model-len 20072 --trust-remote-code
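
Runpod launches the container from a template, but this is roughly the equivalent docker run if you want to reproduce it locally (the port mapping and cache mount path are placeholders from my setup):

# Sketch of the launch command; port and volume path are assumptions from my Runpod template
docker run --gpus all \
  -p 8000:8000 \
  -v /workspace/hf-cache:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model gaunernst/gemma-3-27b-it-qat-autoawq \
  --served-model-name gemma-3-27b \
  --max-model-len 20072 \
  --trust-remote-code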

{
  "model": "gemma-3-27b",
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 64,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Give me a short introduction to Large language models"
        }
      ]
    }
  ],
  "max_tokens": 1000
}

{
"id": "chatcmpl-459b8693ea39461698d0425665881c50",
"object": "chat.completion",
"created": 1752996762,
"model": "gemma-3-27b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": null,
"content": "## Introduction to the Large language models are large language models (LLMs.\n\nLarge language models (LLMs are artificial intelligence models (LLMs) are artificial intelligence (LLMs) are artificial intelligence systems that are a type of artificial intelligence (AI) are a type of of are a type of AI that are designed to understand and generate human-like text, translate, summarize, generate, and generate, and process, and text, and text. They are trained on large amounts of data. They are trained on large amounts of data, large datasets, often based on massive of data, and trained large of text data.\n\nLarge language. of text data, and they can perform tasks, and a variety of text. These models are used for a wide range of tasks, such as, such as translation, such as text, of tasks tasks text, and text, and translation, summarization, of text, and and question-answering, and coding, and text, and and translation, and and code, and text, and coding, and of text. They are a wide range of tasks. are, of tasks.\n\nLarge language.\n\nHere's, such as and of and text, such as question-answering, and of text, translation, summarization, and, coding, text, text, and code.\n\nLarge are and, and coding.\n\nHere are the of text, and text, and text, and text and code, and summarization, text, text, and are a range of tasks, and text, text and coding, and text and text and text, text, and text, text, and text, and text, and text, text, text, and text, and text, text, text, and text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text, text,",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "length",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 18,
"total_tokens": 1018,
"completion_tokens": 1000,
"prompt_tokens_details": null
},
"prompt_logprobs": null,
"kv_transfer_params": null
}

Logs:

2025-07-20T07:05:31.740083903Z INFO 07-20 00:05:31 [__init__.py:244] Automatically detected platform cuda.
2025-07-20T07:05:35.064061639Z INFO 07-20 00:05:35 [api_server.py:1395] vLLM API server version 0.9.2
2025-07-20T07:05:35.065934621Z INFO 07-20 00:05:35 [cli_args.py:325] non-default args: {'model': 'gaunernst/gemma-3-27b-it-qat-autoawq', 'trust_remote_code': True, 'max_model_len': 20072, 'served_model_name': ['gemma-3-27b']}
2025-07-20T07:05:40.834146359Z INFO 07-20 00:05:40 [config.py:841] This model supports multiple tasks: {'generate', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
2025-07-20T07:05:40.834722871Z INFO 07-20 00:05:40 [config.py:1472] Using max model len 20072
2025-07-20T07:05:41.108038153Z INFO 07-20 00:05:41 [awq_marlin.py:116] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
2025-07-20T07:05:41.445748630Z INFO 07-20 00:05:41 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=2048.
2025-07-20T07:05:48.560100137Z INFO 07-20 00:05:48 [__init__.py:244] Automatically detected platform cuda.
2025-07-20T07:05:50.002781161Z INFO 07-20 00:05:50 [core.py:526] Waiting for init message from front-end.
2025-07-20T07:05:50.011532241Z INFO 07-20 00:05:50 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='gaunernst/gemma-3-27b-it-qat-autoawq', speculative_config=None, tokenizer='gaunernst/gemma-3-27b-it-qat-autoawq', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=20072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=gemma-3-27b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
2025-07-20T07:05:51.417619288Z INFO 07-20 00:05:51 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
2025-07-20T07:05:53.053094227Z Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
2025-07-20T07:05:58.732675992Z INFO 07-20 00:05:58 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
2025-07-20T07:05:58.741920089Z INFO 07-20 00:05:58 [gpu_model_runner.py:1770] Starting to load model gaunernst/gemma-3-27b-it-qat-autoawq...
2025-07-20T07:05:58.985243348Z INFO 07-20 00:05:58 [gpu_model_runner.py:1775] Loading model from scratch...
2025-07-20T07:05:58.988289424Z INFO 07-20 00:05:58 [cuda.py:287] Using FlexAttention backend on V1 engine.
2025-07-20T07:05:59.053760347Z INFO 07-20 00:05:59 [cuda.py:284] Using Flash Attention backend on V1 engine.
2025-07-20T07:05:59.364090756Z INFO 07-20 00:05:59 [weight_utils.py:292] Using model weights format ['*.safetensors']
2025-07-20T07:06:22.014475905Z INFO 07-20 00:06:22 [weight_utils.py:308] Time spent downloading weights for gaunernst/gemma-3-27b-it-qat-autoawq: 22.648438 seconds
2025-07-20T07:06:22.305408208Z
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
2025-07-20T07:06:22.558428602Z
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:00, 3.95it/s]
2025-07-20T07:06:22.817926284Z
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:00<00:00, 3.89it/s]
2025-07-20T07:06:23.082282084Z
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:00<00:00, 3.84it/s]
2025-07-20T07:06:23.283935862Z
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00, 4.22it/s]
2025-07-20T07:06:23.283972987Z
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00, 4.09it/s]
2025-07-20T07:06:23.287605691Z INFO 07-20 00:06:23 [default_loader.py:272] Loading weights took 0.98 seconds
2025-07-20T07:06:26.139166409Z INFO 07-20 00:06:26 [gpu_model_runner.py:1801] Model loading took 17.5712 GiB and 26.748016 seconds
2025-07-20T07:06:26.335375223Z INFO 07-20 00:06:26 [gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
2025-07-20T07:06:42.237965400Z INFO 07-20 00:06:42 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/dd76174038/rank_0_0/backbone for vLLM's torch.compile
2025-07-20T07:06:42.237995906Z INFO 07-20 00:06:42 [backends.py:519] Dynamo bytecode transform time: 14.72 s
2025-07-20T07:06:50.210074899Z INFO 07-20 00:06:50 [backends.py:181] Cache the graph of shape None for later use
2025-07-20T07:07:51.980818740Z INFO 07-20 00:07:51 [backends.py:193] Compiling a graph for general shape takes 68.81 s
2025-07-20T07:09:33.874965419Z INFO 07-20 00:09:33 [monitor.py:34] torch.compile takes 83.54 s in total
2025-07-20T07:09:33.947406998Z /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
2025-07-20T07:09:33.947433057Z If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
2025-07-20T07:09:33.947442982Z warnings.warn(
2025-07-20T07:09:35.090758048Z INFO 07-20 00:09:35 [gpu_worker.py:232] Available KV cache memory: 9.57 GiB
2025-07-20T07:09:35.605174614Z INFO 07-20 00:09:35 [kv_cache_utils.py:716] GPU KV cache size: 20,224 tokens
2025-07-20T07:09:35.605211369Z INFO 07-20 00:09:35 [kv_cache_utils.py:720] Maximum concurrency for 20,072 tokens per request: 6.55x
2025-07-20T07:10:17.211346641Z
Capturing CUDA graph shapes: 0%| | 0/67 [00:00<?, ?it/s]
Capturing CUDA graph shapes: 1%|▏ | 1/67 [00:00<00:43, 1.50it/s]
Capturing CUDA graph shapes: 3%|▎ | 2/67 [00:01<00:43, 1.50it/s]
Capturing CUDA graph shapes: 4%|▍ | 3/67 [00:01<00:42, 1.50it/s]
Capturing CUDA graph shapes: 6%|▌ | 4/67 [00:02<00:41, 1.50it/s]
Capturing CUDA graph shapes: 7%|▋ | 5/67 [00:03<00:41, 1.51it/s]
Capturing CUDA graph shapes: 9%|▉ | 6/67 [00:03<00:40, 1.51it/s]
Capturing CUDA graph shapes: 10%|█ | 7/67 [00:04<00:39, 1.52it/s]
Capturing CUDA graph shapes: 12%|█▏ | 8/67 [00:05<00:38, 1.52it/s]
Capturing CUDA graph shapes: 13%|█▎ | 9/67 [00:05<00:37, 1.53it/s]
Capturing CUDA graph shapes: 15%|█▍ | 10/67 [00:06<00:37, 1.53it/s]
Capturing CUDA graph shapes: 16%|█▋ | 11/67 [00:07<00:36, 1.53it/s]
Capturing CUDA graph shapes: 18%|█▊ | 12/67 [00:07<00:35, 1.53it/s]
Capturing CUDA graph shapes: 19%|█▉ | 13/67 [00:08<00:35, 1.53it/s]
Capturing CUDA graph shapes: 21%|██ | 14/67 [00:09<00:34, 1.53it/s]
Capturing CUDA graph shapes: 22%|██▏ | 15/67 [00:09<00:33, 1.54it/s]
Capturing CUDA graph shapes: 24%|██▍ | 16/67 [00:10<00:33, 1.54it/s]
Capturing CUDA graph shapes: 25%|██▌ | 17/67 [00:11<00:32, 1.54it/s]
Capturing CUDA graph shapes: 27%|██▋ | 18/67 [00:11<00:31, 1.55it/s]
Capturing CUDA graph shapes: 28%|██▊ | 19/67 [00:12<00:30, 1.55it/s]
Capturing CUDA graph shapes: 30%|██▉ | 20/67 [00:13<00:30, 1.55it/s]
Capturing CUDA graph shapes: 31%|███▏ | 21/67 [00:13<00:29, 1.56it/s]
Capturing CUDA graph shapes: 33%|███▎ | 22/67 [00:14<00:28, 1.56it/s]
Capturing CUDA graph shapes: 34%|███▍ | 23/67 [00:14<00:28, 1.56it/s]
Capturing CUDA graph shapes: 36%|███▌ | 24/67 [00:15<00:27, 1.57it/s]
Capturing CUDA graph shapes: 37%|███▋ | 25/67 [00:16<00:26, 1.58it/s]
Capturing CUDA graph shapes: 39%|███▉ | 26/67 [00:16<00:26, 1.58it/s]
Capturing CUDA graph shapes: 40%|████ | 27/67 [00:17<00:25, 1.58it/s]
Capturing CUDA graph shapes: 42%|████▏ | 28/67 [00:18<00:24, 1.58it/s]
Capturing CUDA graph shapes: 43%|████▎ | 29/67 [00:18<00:23, 1.58it/s]
Capturing CUDA graph shapes: 45%|████▍ | 30/67 [00:19<00:23, 1.59it/s]
Capturing CUDA graph shapes: 46%|████▋ | 31/67 [00:20<00:22, 1.59it/s]
Capturing CUDA graph shapes: 48%|████▊ | 32/67 [00:20<00:21, 1.59it/s]
Capturing CUDA graph shapes: 49%|████▉ | 33/67 [00:21<00:21, 1.60it/s]
Capturing CUDA graph shapes: 51%|█████ | 34/67 [00:21<00:20, 1.61it/s]
Capturing CUDA graph shapes: 52%|█████▏ | 35/67 [00:22<00:19, 1.61it/s]
Capturing CUDA graph shapes: 54%|█████▎ | 36/67 [00:23<00:19, 1.61it/s]
Capturing CUDA graph shapes: 55%|█████▌ | 37/67 [00:23<00:18, 1.61it/s]
Capturing CUDA graph shapes: 57%|█████▋ | 38/67 [00:24<00:17, 1.61it/s]
Capturing CUDA graph shapes: 58%|█████▊ | 39/67 [00:24<00:17, 1.62it/s]
Capturing CUDA graph shapes: 60%|█████▉ | 40/67 [00:25<00:16, 1.62it/s]
Capturing CUDA graph shapes: 61%|██████ | 41/67 [00:26<00:15, 1.64it/s]
Capturing CUDA graph shapes: 63%|██████▎ | 42/67 [00:26<00:15, 1.63it/s]
Capturing CUDA graph shapes: 64%|██████▍ | 43/67 [00:27<00:14, 1.63it/s]
Capturing CUDA graph shapes: 66%|██████▌ | 44/67 [00:28<00:14, 1.63it/s]
Capturing CUDA graph shapes: 67%|██████▋ | 45/67 [00:28<00:13, 1.64it/s]
Capturing CUDA graph shapes: 69%|██████▊ | 46/67 [00:29<00:12, 1.64it/s]
Capturing CUDA graph shapes: 70%|███████ | 47/67 [00:29<00:12, 1.64it/s]
Capturing CUDA graph shapes: 72%|███████▏ | 48/67 [00:30<00:11, 1.65it/s]
Capturing CUDA graph shapes: 73%|███████▎ | 49/67 [00:31<00:10, 1.66it/s]
Capturing CUDA graph shapes: 75%|███████▍ | 50/67 [00:31<00:10, 1.66it/s]
Capturing CUDA graph shapes: 76%|███████▌ | 51/67 [00:32<00:09, 1.66it/s]
Capturing CUDA graph shapes: 78%|███████▊ | 52/67 [00:32<00:09, 1.66it/s]
Capturing CUDA graph shapes: 79%|███████▉ | 53/67 [00:33<00:08, 1.66it/s]
Capturing CUDA graph shapes: 81%|████████ | 54/67 [00:34<00:07, 1.67it/s]
Capturing CUDA graph shapes: 82%|████████▏ | 55/67 [00:34<00:07, 1.67it/s]
Capturing CUDA graph shapes: 84%|████████▎ | 56/67 [00:35<00:06, 1.68it/s]
Capturing CUDA graph shapes: 85%|████████▌ | 57/67 [00:35<00:05, 1.68it/s]
Capturing CUDA graph shapes: 87%|████████▋ | 58/67 [00:36<00:05, 1.68it/s]
Capturing CUDA graph shapes: 88%|████████▊ | 59/67 [00:36<00:04, 1.69it/s]
Capturing CUDA graph shapes: 90%|████████▉ | 60/67 [00:37<00:04, 1.70it/s]
Capturing CUDA graph shapes: 91%|█████████ | 61/67 [00:38<00:03, 1.71it/s]
Capturing CUDA graph shapes: 93%|█████████▎| 62/67 [00:38<00:02, 1.71it/s]
Capturing CUDA graph shapes: 94%|█████████▍| 63/67 [00:39<00:02, 1.72it/s]
Capturing CUDA graph shapes: 96%|█████████▌| 64/67 [00:39<00:01, 1.73it/s]
Capturing CUDA graph shapes: 97%|█████████▋| 65/67 [00:40<00:01, 1.73it/s]
Capturing CUDA graph shapes: 99%|█████████▊| 66/67 [00:41<00:00, 1.73it/s]
Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:41<00:00, 1.73it/s]
Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:41<00:00, 1.61it/s]
2025-07-20T07:10:17.211532318Z INFO 07-20 00:10:17 [gpu_model_runner.py:2326] Graph capturing finished in 42 secs, took 1.26 GiB
2025-07-20T07:10:17.273730035Z INFO 07-20 00:10:17 [core.py:172] init engine (profile, create kv cache, warmup model) took 231.13 seconds
2025-07-20T07:10:18.826781985Z INFO 07-20 00:10:18 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 1264
2025-07-20T07:10:19.253876833Z WARNING 07-20 00:10:19 [config.py:1392] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
2025-07-20T07:10:19.253904815Z INFO 07-20 00:10:19 [serving_chat.py:125] Using default chat sampling params from model: {'top_k': 64, 'top_p': 0.95}
2025-07-20T07:10:19.418755151Z INFO 07-20 00:10:19 [serving_completion.py:72] Using default completion sampling params from model: {'top_k': 64, 'top_p': 0.95}
2025-07-20T07:10:19.418792728Z INFO 07-20 00:10:19 [api_server.py:1457] Starting vLLM API server 0 on http://0.0.0.0:8000
2025-07-20T07:10:19.418884084Z INFO 07-20 00:10:19 [launcher.py:29] Available routes are:
2025-07-20T07:10:19.418890494Z INFO 07-20 00:10:19 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
2025-07-20T07:10:19.418941991Z INFO 07-20 00:10:19 [launcher.py:37] Route: /docs, Methods: GET, HEAD
2025-07-20T07:10:19.418977494Z INFO 07-20 00:10:19 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
2025-07-20T07:10:19.419030874Z INFO 07-20 00:10:19 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
2025-07-20T07:10:19.419056582Z INFO 07-20 00:10:19 [launcher.py:37] Route: /health, Methods: GET
2025-07-20T07:10:19.419061840Z INFO 07-20 00:10:19 [launcher.py:37] Route: /load, Methods: GET
2025-07-20T07:10:19.419113628Z INFO 07-20 00:10:19 [launcher.py:37] Route: /ping, Methods: POST
2025-07-20T07:10:19.419120548Z INFO 07-20 00:10:19 [launcher.py:37] Route: /ping, Methods: GET
2025-07-20T07:10:19.419172496Z INFO 07-20 00:10:19 [launcher.py:37] Route: /tokenize, Methods: POST
2025-07-20T07:10:19.419185054Z INFO 07-20 00:10:19 [launcher.py:37] Route: /detokenize, Methods: POST
2025-07-20T07:10:19.419215179Z INFO 07-20 00:10:19 [launcher.py:37] Route: /v1/models, Methods: GET
2025-07-20T07:10:19.419255780Z INFO 07-20 00:10:19 [launcher.py:37] Route: /version, Methods: GET
2025-07-20T07:10:19.419283371Z INFO 07-20 00:10:19 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
2025-07-20T07:10:19.419311924Z INFO 07-20 00:10:19 [launcher.py:37] Route: /v1/completions, Methods: POST
2025-07-20T07:10:19.419343251Z INFO 07-20 00:10:19 [launcher.py:37] Route: /v1/embeddings, Methods: POST
2025-07-20T07:10:19.419369510Z INFO 07-20 00:10:19 [launcher.py:37] Route: /pooling, Methods: POST
2025-07-20T07:10:19.419398233Z INFO 07-20 00:10:19 [launcher.py:37] Route: /classify, Methods: POST
2025-07-20T07:10:19.419428439Z INFO 07-20 00:10:19 [launcher.py:37] Route: /score, Methods: POST
2025-07-20T07:10:19.419455018Z INFO 07-20 00:10:19 [launcher.py:37] Route: /v1/score, Methods: POST
2025-07-20T07:10:19.419483351Z INFO 07-20 00:10:19 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
2025-07-20T07:10:19.419517362Z INFO 07-20 00:10:19 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
2025-07-20T07:10:19.419544482Z INFO 07-20 00:10:19 [launcher.py:37] Route: /rerank, Methods: POST
2025-07-20T07:10:19.419577511Z INFO 07-20 00:10:19 [launcher.py:37] Route: /v1/rerank, Methods: POST
2025-07-20T07:10:19.419605884Z INFO 07-20 00:10:19 [launcher.py:37] Route: /v2/rerank, Methods: POST
2025-07-20T07:10:19.419638553Z INFO 07-20 00:10:19 [launcher.py:37] Route: /invocations, Methods: POST
2025-07-20T07:10:19.419660966Z INFO 07-20 00:10:19 [launcher.py:37] Route: /metrics, Methods: GET
2025-07-20T07:10:20.388408360Z INFO: Started server process [20]
2025-07-20T07:10:20.388438856Z INFO: Waiting for application startup.
2025-07-20T07:10:20.614299734Z INFO: Application startup complete.
2025-07-20T07:11:55.274704509Z INFO: 100.64.0.37:43858 - "GET / HTTP/1.1" 404 Not Found
2025-07-20T07:11:57.314886098Z INFO: 100.64.0.28:51616 - "GET / HTTP/1.1" 404 Not Found
2025-07-20T07:11:57.804051452Z INFO: 100.64.0.28:51616 - "GET /favicon.ico HTTP/1.1" 404 Not Found
2025-07-20T07:12:07.202306451Z INFO 07-20 00:12:07 [logger.py:43] Received request cmpl-626b66a24c364c38b27e09df4a845aee-0: prompt: 'Give me an introduction to large language models.', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=64, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [2, 46762, 786, 614, 12650, 531, 2455, 5192, 4681, 236761], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
2025-07-20T07:12:07.202589194Z INFO 07-20 00:12:07 [async_llm.py:270] Added request cmpl-626b66a24c364c38b27e09df4a845aee-0.
2025-07-20T07:12:30.618077642Z INFO 07-20 00:12:30 [loggers.py:118] Engine 000: Avg prompt throughput: 1.0 tokens/s, Avg generation throughput: 20.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.2%, Prefix cache hit rate: 0.0%
2025-07-20T07:12:40.618054189Z INFO 07-20 00:12:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 56.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.0%, Prefix cache hit rate: 0.0%
2025-07-20T07:12:50.618263594Z INFO 07-20 00:12:50 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 25.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.1%, Prefix cache hit rate: 0.0%
2025-07-20T07:13:00.619117894Z INFO 07-20 00:13:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.1%, Prefix cache hit rate: 0.0%
2025-07-20T07:13:20.618737679Z INFO 07-20 00:13:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.2%, Prefix cache hit rate: 0.0%
2025-07-20T07:13:30.619661723Z INFO 07-20 00:13:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.2%, Prefix cache hit rate: 0.0%
2025-07-20T07:13:40.620195894Z INFO 07-20 00:13:40 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.2%, Prefix cache hit rate: 0.0%
2025-07-20T07:13:46.666448724Z INFO 07-20 00:13:46 [async_llm.py:431] Aborted request cmpl-626b66a24c364c38b27e09df4a845aee-0.
2025-07-20T07:13:46.666490977Z INFO 07-20 00:13:46 [async_llm.py:339] Request cmpl-626b66a24c364c38b27e09df4a845aee-0 aborted.
2025-07-20T07:13:50.621300278Z INFO 07-20 00:13:50 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.2%, Prefix cache hit rate: 0.0%
2025-07-20T07:14:00.621959818Z INFO 07-20 00:14:00 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.2%, Prefix cache hit rate: 0.0%
2025-07-20T07:15:00.190002891Z INFO 07-20 00:15:00 [logger.py:43] Received request cmpl-3cc731590f684fea80a1461c4ae7fdab-0: prompt: 'Give me a short introduction to large language models.', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=64, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [2, 46762, 786, 496, 2822, 12650, 531, 2455, 5192, 4681, 236761], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
2025-07-20T07:15:00.190631021Z INFO 07-20 00:15:00 [async_llm.py:270] Added request cmpl-3cc731590f684fea80a1461c4ae7fdab-0.
2025-07-20T07:15:00.622192342Z INFO 07-20 00:15:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
2025-07-20T07:15:10.621752238Z INFO 07-20 00:15:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.1%, Prefix cache hit rate: 0.0%
2025-07-20T07:15:18.030356790Z INFO: 100.64.0.28:51762 - "POST /v1/completions HTTP/1.1" 200 OK
2025-07-20T07:15:20.621826153Z INFO 07-20 00:15:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
2025-07-20T07:15:30.622538432Z INFO 07-20 00:15:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
2025-07-20T07:25:00.914066128Z INFO 07-20 00:25:00 [logger.py:43] Received request cmpl-2015349cde934ea8b53c6bcafbf69e3b-0: prompt: 'Give me a short introduction to large language models.', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=0.95, top_k=64, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [2, 46762, 786, 496, 2822, 12650, 531, 2455, 5192, 4681, 236761], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
2025-07-20T07:25:00.914504794Z INFO 07-20 00:25:00 [async_llm.py:270] Added request cmpl-2015349cde934ea8b53c6bcafbf69e3b-0.
2025-07-20T07:25:10.640807125Z INFO 07-20 00:25:10 [loggers.py:118] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 55.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.9%, Prefix cache hit rate: 0.0%
2025-07-20T07:25:18.729321858Z INFO: 100.64.0.25:60450 - "POST /v1/completions HTTP/1.1" 200 OK
2025-07-20T07:25:20.641386439Z INFO 07-20 00:25:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
2025-07-20T07:25:30.642593413Z INFO 07-20 00:25:30 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
2025-07-20T07:31:11.963993041Z INFO: 100.64.0.28:59172 - "POST /v1/completions HTTP/1.1" 400 Bad Request
2025-07-20T07:31:20.248181148Z Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
2025-07-20T07:31:22.766465899Z INFO 07-20 00:31:22 [chat_utils.py:444] Detected the chat template content format to be 'openai'. You can set --chat-template-content-format to override this.
2025-07-20T07:31:22.768093133Z INFO 07-20 00:31:22 [logger.py:43] Received request chatcmpl-88a3b6bdb5864cc595982552e867ed80: prompt: 'user\nGive me a short introduction to Large language models\nmodel\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=0.95, top_k=64, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
2025-07-20T07:31:22.768343868Z INFO 07-20 00:31:22 [async_llm.py:270] Added request chatcmpl-88a3b6bdb5864cc595982552e867ed80.
2025-07-20T07:31:32.768393358Z INFO 07-20 00:31:32 [loggers.py:118] Engine 000: Avg prompt throughput: 1.8 tokens/s, Avg generation throughput: 57.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%
2025-07-20T07:31:40.580917047Z INFO: 100.64.0.20:35558 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2025-07-20T07:31:42.769111212Z INFO 07-20 00:31:42 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
2025-07-20T07:31:52.768166840Z INFO 07-20 00:31:52 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
2025-07-20T07:32:42.454820225Z INFO 07-20 00:32:42 [logger.py:43] Received request chatcmpl-459b8693ea39461698d0425665881c50: prompt: 'user\nGive me a short introduction to Large language models\nmodel\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=64, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
2025-07-20T07:32:42.455284079Z INFO 07-20 00:32:42 [async_llm.py:270] Added request chatcmpl-459b8693ea39461698d0425665881c50.
2025-07-20T07:32:42.769729447Z INFO 07-20 00:32:42 [loggers.py:118] Engine 000: Avg prompt throughput: 1.8 tokens/s, Avg generation throughput: 1.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 23.5%
2025-07-20T07:32:52.769856087Z INFO 07-20 00:32:52 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.1%, Prefix cache hit rate: 23.5%
2025-07-20T07:33:00.350529316Z INFO: 100.64.0.29:33770 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2025-07-20T07:33:02.769826924Z INFO 07-20 00:33:02 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 23.5%
2025-07-20T07:33:12.769930290Z INFO 07-20 00:33:12 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 23.5%
2025-07-20T07:34:49.807665962Z INFO 07-20 00:34:49 [logger.py:43] Received request chatcmpl-0c62a3df3bdb45fb8e27b9dc89a94e31: prompt: 'user\nGive me a short introduction to Large language models\nmodel\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=0.95, top_k=64, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
2025-07-20T07:34:49.808081313Z INFO 07-20 00:34:49 [async_llm.py:270] Added request chatcmpl-0c62a3df3bdb45fb8e27b9dc89a94e31.
2025-07-20T07:34:52.771077704Z INFO 07-20 00:34:52 [loggers.py:118] Engine 000: Avg prompt throughput: 1.8 tokens/s, Avg generation throughput: 17.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 37.2%
2025-07-20T07:35:02.770970214Z INFO 07-20 00:35:02 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 56.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.9%, Prefix cache hit rate: 37.2%
2025-07-20T07:35:07.651820086Z INFO: 100.64.0.28:59134 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2025-07-20T07:35:12.770766831Z INFO 07-20 00:35:12 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 26.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 37.2%
2025-07-20T07:35:22.770790918Z INFO 07-20 00:35:22 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 37.2%

It works fine on my 5090

uv venv --python=3.12 --managed-python
source .venv/bin/activate
uv pip install vllm

# reinstall PyTorch 2.7.1 with CUDA 12.8 to enable sm120 support
uv pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

vllm serve gaunernst/gemma-3-27b-it-qat-autoawq --max-model-len 16000

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"gaunernst/gemma-3-27b-it-qat-autoawq","messages":[{"role":"user","content":"Give me a short introduction to Large language models"}]}'

{"id":"chatcmpl-e5e6147cee2940e2acdd09b72cfd3356","object":"chat.completion","created":1753020318,"model":"gaunernst/gemma-3-27b-it-qat-autoawq","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"## Large Language Models: A Quick Intro\n\nLarge Language Models (LLMs) are the powerhouse behind many of the AI tools you're hearing about today, like ChatGPT, Bard, and even features in Microsoft Word. \n\nEssentially, they're incredibly powerful computer programs trained on massive amounts of text data. Think almost the entire internet, books, articles, code - you name it. \n\nWhat do they do? LLMs learn to predict the next word in a sequence, but this seemingly simple task allows them to do amazing things:\n\n* Generate text: Write essays, poems, code, emails, scripts, and more.\n* Translate languages: Quickly and accurately convert text between different languages.\n* Answer questions: Provide informative responses based on the knowledge they've absorbed.\n* Summarize text: Condense long documents into shorter, more digestible versions.\n* Chat: Engage in conversations, mimicking human-like dialogue.\n\n*Key things to know:\n\n "Large" is important: The more data they're trained on, the better they perform.\n They don't "think" or "understand": They are pattern-matching machines, statistically predicting what comes next.\n They can make mistakes: LLMs can sometimes produce inaccurate, biased, or nonsensical outputs ("hallucinations"). \n\n\n\nLLMs are constantly evolving and are rapidly changing the way we interact with technology. They represent a significant leap forward in artificial intelligence!\n\n\n\n","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":106}],"usage":{"prompt_tokens":18,"total_tokens":342,"completion_tokens":324,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

I noticed this in your logs:

Using FlashInfer for top-p & top-k sampling.

Can you try not using FlashInfer?
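
If I remember correctly, vLLM exposes an environment variable for this; something along these lines should force the fallback sampler (the variable name is from memory, so please double-check it against your vLLM version):

# Assumed: VLLM_USE_FLASHINFER_SAMPLER=0 disables the FlashInfer top-p/top-k sampler
docker run --gpus all -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 \
  vllm/vllm-openai:latest \
  --model gaunernst/gemma-3-27b-it-qat-autoawq --served-model-name gemma-3-27b \
  --max-model-len 20072 --trust-remote-code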

Thanks for the suggestions. I tried disabling FlashInfer and it didn't solve the issue.
I also updated PyTorch in the container to 2.7.1 (roughly as sketched below).
I then removed the --trust-remote-code flag, and that didn't help either.
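
For reference, this is roughly how I updated PyTorch inside the running container, mirroring the cu128 reinstall from your commands:

# Run inside the vllm/vllm-openai container; mirrors the uv reinstall from the host setup above
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128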

I ran your code and it did indeed give a good result, but I'm hoping to use Docker so I can deploy it easily to customers.
Any other thoughts on what I could do to fix this issue?

{
"id": "chatcmpl-3e6d810b07c94e4a9b26eb9adc45d93b",
"object": "chat.completion",
"created": 1753098468,
"model": "gemma-3-27b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": null,
"content": "## Large Language models are becoming increasingly sophisticated AI models (LLMs are complex and powerful, with increasing\n\nLarge language models (LLMs are becoming powerful, and capable of\n## increasingly powerful, but also are also increasingly sophisticated.\n\nLarge language models are becoming increasingly capable of understanding, also capable. They’resembling to generate, generating human-text. also of generating and also, they are also complex and of text, and human-like text. are capable of generating. generating human-like human-like\n## Large language models language increasingly human-like text and text, but text.\n\nLarge increasingly powerful capable of generating human-like text.\n\n## becoming human-like and increasingly language text. They are also generating and more and increasingly and more capable of are also capable text-like generating text.\n\nHere is of performing and performing increasingly capable of and of performing a are also text are performing a diverse tasks, are performing tasks like. here are also tasks such also capable text-like tasks, of are increasing\n\nHere and generating and diverse tasks like.\n## are tasks diverse tasks, performing a also increasingly performing a more tasks.\n\nHere tasks. performing of a\n## Large language are increasingly\n## are increasingly tasks, tasks.\n\nHere are performing tasks.\npython\ntasks arealso powerful of tasks.Here tasks.\n\n##\n\n## performing tasks.\nHere are modelsLarge language models are also\n\n\nThis here\npython\nThe large and also tasks\npython\n\nWhat is a of diverse.\nHere is the language performing a.\n\nPlease revise and powerful generating text are\n\nI’mething language\n\n## models are\nmodels Large also tasks are\npython\nHere are also tasks.python\n\n\nHere models increasingly\nare complex and are performing tasks.\ntasksHere tasks.\n\nHere are also diverse.\n\nHere tasks are also\n are\npython\n## Large\n\nThe are performing human-like\n tasks.\n\n**Large language are\n\n## models\n\n\n\nThis is python\npython\n\nThis is a tasks.Here are\n\n\nare complex language models are tasks.\nHere\n models\n\nAlso tasks complex.Here are tasks.\n\nmodels are.\n\n\nHere are\nThis language tasks are.\n\n\n\n also.\n\nPlease provide me\n\nAlso\n\n\n\n## language\n\nThis text tasks.\n\nmodels are are\nHere are\n also\n\n language\n\nThis\n\n\n.\n\nPlease write a\nAre tasks.\n\n\n\n.\n\n\nmodels are also\nare\ntasks also\nAlso\nHere\nAlso models\n\n\n language. tasks.\n\nare\n",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": 106
}
],
"usage": {
"prompt_tokens": 18,
"total_tokens": 605,
"completion_tokens": 587,
"prompt_tokens_details": null
},
"prompt_logprobs": null,
"kv_transfer_params": null
}

Logs:

INFO 07-21 04:48:14 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 07-21 04:48:24 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
INFO 07-21 04:55:15 [logger.py:43] Received request chatcmpl-1ff818a756d34250888b63963af6344c: prompt: 'user\nGive me a short introduction to Large language models\nmodel\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=0.95, top_k=64, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=20054, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-21 04:55:15 [async_llm.py:270] Added request chatcmpl-1ff818a756d34250888b63963af6344c.
INFO 07-21 04:55:24 [loggers.py:118] Engine 000: Avg prompt throughput: 1.8 tokens/s, Avg generation throughput: 62.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.9%, Prefix cache hit rate: 44.4%
INFO 07-21 04:55:34 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 37.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.1%, Prefix cache hit rate: 44.4%
INFO 07-21 04:55:44 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.1%, Prefix cache hit rate: 44.4%
INFO 07-21 04:56:04 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 55.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.2%, Prefix cache hit rate: 44.4%
INFO 07-21 04:56:14 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.2%, Prefix cache hit rate: 44.4%
INFO 07-21 04:56:24 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.2%, Prefix cache hit rate: 44.4%
INFO 07-21 04:56:34 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.2%, Prefix cache hit rate: 44.4%
INFO 07-21 04:56:44 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 61.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.2%, Prefix cache hit rate: 44.4%
INFO 07-21 04:56:54 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 60.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.2%, Prefix cache hit rate: 44.4%
INFO 07-21 04:56:54 [async_llm.py:431] Aborted request chatcmpl-1ff818a756d34250888b63963af6344c.
INFO 07-21 04:56:54 [async_llm.py:339] Request chatcmpl-1ff818a756d34250888b63963af6344c aborted.
INFO 07-21 04:57:04 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.2%, Prefix cache hit rate: 44.4%
INFO 07-21 04:57:14 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.2%, Prefix cache hit rate: 44.4%

Sorry for the late reply, I didn't see the notification. It looks like there's some issue with the vLLM Docker environment. I will check if I have time.

vLLM has now released version 0.10.0 of its official Docker image, and it resolved this issue.
Thanks for all your help!
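
For anyone else hitting this, this is roughly what I switched to. I believe the release tag follows the usual vX.Y.Z pattern, so double-check it on Docker Hub:

# Pin the image instead of :latest (tag name assumed; verify on Docker Hub)
docker pull vllm/vllm-openai:v0.10.0
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:v0.10.0 \
  --model gaunernst/gemma-3-27b-it-qat-autoawq --served-model-name gemma-3-27b \
  --max-model-len 20072 --trust-remote-code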

meiragat changed discussion status to closed
