Huge VRAM consumption with 32k context

#9 · opened by fimbulvntr

If I try to embed a 32k-token document, the model wants to allocate almost 50 GiB of VRAM as scratch space, even with the Q8_0 quant.

I tried both llama-server and the llama-embedding example, and it's the same story.

Is this normal?
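
For a rough sense of where a number like that could come from, here is a back-of-the-envelope sketch. All the values are assumptions for illustration (including the head count), not this model's actual config; the point is just that a single attention-score tensor for a 32k micro-batch against a 32k context is already in that ballpark:

#include <cstdint>
#include <cstdio>

int main() {
    // Assumed values, purely illustrative:
    //   n_ubatch = 32768 (the whole 32k document in one micro-batch)
    //   n_kv     = 32768 (full 32k context)
    //   n_heads  = 12    (hypothetical head count)
    //   f32 attention scores (4 bytes per element)
    const uint64_t n_ubatch = 32768;
    const uint64_t n_kv     = 32768;
    const uint64_t n_heads  = 12;
    const uint64_t bytes    = n_ubatch * n_kv * n_heads * sizeof(float);

    // One KQ tensor of that shape is ~48 GiB -- the same order of magnitude
    // as the "almost 50 GiB" scratch allocation.
    std::printf("KQ scratch: %.1f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}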

I doubt it.
But llama.cpp also doesn't give me the same embedding output as Python, and I get different output on Metal compared to CUDA.
I'm pretty sure this model is not working properly with llama.cpp at the moment.

The huge memory usage is related to the fact that llama-embedding has this in the code:

// For non-causal models, batch size must be equal to ubatch size
params.n_ubatch = params.n_batch;

But I'm pretty sure this is a causal model.
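
If that's the case, a minimal sketch of how the override could be made conditional looks like this. The struct and the causality flag below are stand-ins for illustration, not llama.cpp's actual types or API:

#include <cstdint>

// Stand-in for the relevant fields of the params struct; not the real type,
// just enough to show the idea.
struct params_t {
    int32_t n_batch  = 2048; // logical batch size (-b)
    int32_t n_ubatch = 512;  // physical micro-batch size (-ub)
};

int main() {
    params_t params;

    // Assumption: we can tell whether the loaded model uses causal attention.
    const bool non_causal = false;

    if (non_causal) {
        // For non-causal models, batch size must be equal to ubatch size
        params.n_ubatch = params.n_batch;
    }
    // For a causal model, n_ubatch keeps its (much smaller) default, so the
    // attention scratch, which grows with n_ubatch * n_ctx, stays bounded.
    return 0;
}

As far as I can tell, with the code as it stands, passing -ub has no effect for embeddings, since the value is overwritten with n_batch right after argument parsing.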
