Great model but missing cache-aware feature

by janvliet - opened 7 days ago

This is a great model that makes streaming and offline inference with the same model a reality.
Currently it is missing the cache-aware feature.
As a result, streaming with small chunks has high overhead, because the whole left context has to be recomputed for every chunk.
It would be nice if this were added in an updated model, which would make streaming with this model more practical.
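
To illustrate the overhead, here is a rough sketch that counts how many input frames the encoder processes over a whole stream, with and without caching. The function name `frames_processed` and the numbers (1000 frames, 10-frame chunks, 100-frame left context) are made up for illustration and are not measured on this model; it also assumes cost scales with the number of frames fed in, which ignores the details of attention cost.

```python
def frames_processed(total_frames, chunk, left_context, cache_aware):
    """Count encoder input frames processed over a whole stream.

    Hypothetical cost model: without caching, every chunk is fed in
    together with its (already seen) left context; with caching, only
    the new frames of each chunk are fed in.
    """
    processed = 0
    for start in range(0, total_frames, chunk):
        new = min(chunk, total_frames - start)
        if cache_aware:
            processed += new
        else:
            # Left context is limited by what has been seen so far.
            processed += min(start, left_context) + new
    return processed


# Illustrative numbers only (assumptions, not model parameters).
without_cache = frames_processed(1000, chunk=10, left_context=100, cache_aware=False)
with_cache = frames_processed(1000, chunk=10, left_context=100, cache_aware=True)
print(without_cache, with_cache, without_cache / with_cache)
```

Under these assumptions the ratio grows roughly as (left context + chunk) / chunk, so the smaller the chunk, the more work is repeated.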

aandrusenko

NVIDIA org 2 days ago

Hi @janvliet, thank you for the feedback!

Cache-aware decoding can be added later, in 2-3 months. Our priority right now is multilingual support.

BTW, what latency are you targeting for streaming inference?

janvliet

1 day ago

A latency of 300 ms or less is usually preferred.
