Model no longer working with vLLM > v0.8.5

#13
by stev236 - opened

These 7B/14B models work fantastically well in my application. I use them with vLLM v0.8.5 and the default flash attention backend, and it works great with context up to 256k tokens.
However, if I try to serve the model or run it offline with newer vLLM versions, it fails with:
TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'layer_idx'
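For reference, the error shows up at engine initialization. A minimal offline sketch of the setup described above looks roughly like this (the model path and context length are placeholders, not taken from this thread):

```python
# Minimal offline reproduction sketch; model path and max_model_len are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/the-7b-model",  # local checkpoint whose config.json contains dual_chunk_attention_config
    max_model_len=262144,           # long-context setting similar to the one described above
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

On vLLM > 0.8.5 this fails during LLM construction with the TypeError shown above.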

A similar error happens when using FlashInfer, and the 'dual_chunk_attn' backend also fails.
Has anybody else tried running this model with a newer version of vLLM?
Many thanks for any useful comments.

The vLLM chatbot gave me the solution.
If you encounter this problem, edit the model's config.json file and remove the "dual_chunk_attention_config" entry; the model will then work with flash attention or FlashInfer.
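If you want to script the edit rather than do it by hand, a minimal sketch looks like this (the model directory path is a placeholder):

```python
# Sketch of the workaround: drop the "dual_chunk_attention_config" entry
# from the model's config.json. The path below is a placeholder.
import json
from pathlib import Path

config_path = Path("/path/to/the-model/config.json")
config = json.loads(config_path.read_text())
config.pop("dual_chunk_attention_config", None)  # remove the entry if present
config_path.write_text(json.dumps(config, indent=2) + "\n")
```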

stev236 changed discussion status to closed
