Great model but missing cache-aware feature

#6
by janvliet - opened

This is a great model that makes streaming and offline recognition with the same model a reality.
It currently lacks the cache-aware feature, though.
As a result, streaming with small chunks has high overhead, because the whole left context has to be recomputed for every chunk.
It would be nice if this were added in an updated model, as it would make streaming with this model more practical.
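
To illustrate the overhead, here is a rough sketch that counts how many input frames the encoder has to process over a stream, with and without caching. The function name and all numbers (stream length, chunk size, left-context size) are hypothetical and not taken from this model's configuration, and the cost model is simplified: it only counts frames and ignores the actual per-layer attention cost.

```python
def frames_encoded(total_frames, chunk, left_context, cache_aware):
    """Count encoder input frames processed over a whole stream.

    Simplified, hypothetical cost model: without caching, each chunk is
    re-encoded together with up to `left_context` previous frames; with
    caching, only the new frames of each chunk are encoded.
    """
    processed = 0
    for start in range(0, total_frames, chunk):
        new = min(chunk, total_frames - start)  # last chunk may be shorter
        if cache_aware:
            processed += new
        else:
            # Left context available so far, capped at the configured size.
            processed += min(start, left_context) + new
    return processed


# Hypothetical example: 1000-frame stream, 10-frame chunks, 100-frame left context.
without_cache = frames_encoded(1000, 10, 100, cache_aware=False)
with_cache = frames_encoded(1000, 10, 100, cache_aware=True)
print(f"overhead factor: {without_cache / with_cache:.2f}x")
```

In this toy setting, the smaller the chunk is relative to the left context, the larger the overhead factor becomes, which is why small-chunk streaming is the painful case.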

NVIDIA org

Hi @janvliet, thank you for the feedback!

Cache-aware decoding can be added later, in 2-3 months. Multilingual support is our priority right now.

BTW, what latency are you targeting for streaming inference?

A latency of 300 ms or less is usually preferred.
