Model no longer working with vLLM > v0.8.5

#13
by stev236 - opened

These 7B/14B models work fantastically well in my application. I use them with vLLM v0.8.5 and the default flash attention backend, and it works great with context up to 256k tokens.
However, if I try to serve the model or run it offline with newer vLLM versions, it fails with:
TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'layer_idx'
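For reference, the error shows up at engine initialization. A minimal offline sketch of the setup described above looks roughly like this (the model path and context length are placeholders, not taken from this thread):

```python
# Minimal offline reproduction sketch; model path and max_model_len are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/the-7b-model",  # local checkpoint whose config.json contains dual_chunk_attention_config
    max_model_len=262144,           # long-context setting similar to the one described above
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

On vLLM > 0.8.5 this fails during LLM construction with the TypeError shown above.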

A similar error happens when using FlashInfer, and the 'dual_chunk_attn' backend also fails.
Has anybody else tried running this model with a newer version of vLLM?
Many thanks for any useful comments.

The vLLM chatbot gave me the solution.
If you encounter this problem, edit the model's config.json file and remove the "dual_chunk_attention_config" entry; the model will then work with flash attention or FlashInfer.
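If you want to script the edit rather than do it by hand, a minimal sketch looks like this (the model directory path is a placeholder):

```python
# Sketch of the workaround: drop the "dual_chunk_attention_config" entry
# from the model's config.json. The path below is a placeholder.
import json
from pathlib import Path

config_path = Path("/path/to/the-model/config.json")
config = json.loads(config_path.read_text())
config.pop("dual_chunk_attention_config", None)  # remove the entry if present
config_path.write_text(json.dumps(config, indent=2) + "\n")
```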

stev236 changed discussion status to closed
