Huge VRAM consumption with 32k context

#9 · opened by fimbulvntr

If I try to embed a 32k-token document, the model wants to allocate almost 50 GiB of VRAM as scratch space, even with the Q8_0 quant.

I tried both llama-server and the llama-embedding example, and it's the same story.

Is this normal?
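
For a rough sense of where a number like that could come from, here is a back-of-the-envelope sketch. All the values are assumptions for illustration (including the head count), not this model's actual config; the point is just that a single attention-score tensor for a 32k micro-batch against a 32k context is already in that ballpark:

#include <cstdint>
#include <cstdio>

int main() {
    // Assumed values, purely illustrative:
    //   n_ubatch = 32768 (the whole 32k document in one micro-batch)
    //   n_kv     = 32768 (full 32k context)
    //   n_heads  = 12    (hypothetical head count)
    //   f32 attention scores (4 bytes per element)
    const uint64_t n_ubatch = 32768;
    const uint64_t n_kv     = 32768;
    const uint64_t n_heads  = 12;
    const uint64_t bytes    = n_ubatch * n_kv * n_heads * sizeof(float);

    // One KQ tensor of that shape is ~48 GiB -- the same order of magnitude
    // as the "almost 50 GiB" scratch allocation.
    std::printf("KQ scratch: %.1f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}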

I doubt it.
But llama.cpp also doesn't give me the same embedding output as Python, and I get different output on Metal compared to CUDA.
I'm pretty sure this model is not working properly with llama.cpp at the moment.

The huge memory usage is related to the fact that llama-embedding has this in the code:

// For non-causal models, batch size must be equal to ubatch size
params.n_ubatch = params.n_batch;

But I'm pretty sure this is a causal model.
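
If that's the case, a minimal sketch of how the override could be made conditional looks like this. The struct and the causality flag below are stand-ins for illustration, not llama.cpp's actual types or API:

#include <cstdint>

// Stand-in for the relevant fields of the params struct; not the real type,
// just enough to show the idea.
struct params_t {
    int32_t n_batch  = 2048; // logical batch size (-b)
    int32_t n_ubatch = 512;  // physical micro-batch size (-ub)
};

int main() {
    params_t params;

    // Assumption: we can tell whether the loaded model uses causal attention.
    const bool non_causal = false;

    if (non_causal) {
        // For non-causal models, batch size must be equal to ubatch size
        params.n_ubatch = params.n_batch;
    }
    // For a causal model, n_ubatch keeps its (much smaller) default, so the
    // attention scratch, which grows with n_ubatch * n_ctx, stays bounded.
    return 0;
}

As far as I can tell, with the code as it stands, passing -ub has no effect for embeddings, since the value is overwritten with n_batch right after argument parsing.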
