Questions about streaming with Parakeet and TDT merging methods

#63
by alexandreacff - opened

I’m currently trying to work with Parakeet in streaming mode, receiving microphone chunks and generating live transcriptions.

As a reference, I’m using the following code for streaming:
https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py

However, I’ve run into some questions:

  1. Why do the more conventional merging methods not work well for TDT? I tested them, but the performance dropped significantly.

  2. Is there already an implementation available for this use case (streaming with Parakeet using microphone chunks)?

Yes. Please provide a minimal Python script for streaming with Parakeet using microphone chunks.

NVIDIA org

@alexandreacff Please use the new streaming pipeline: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_streaming_infer_rnnt.py

The pipeline is significantly simplified: merging is removed, and the decoder state is preserved between chunks. We observe both better quality and better speed with the new approach.
You can find some details about the results in the description of the following PR: https://github.com/NVIDIA-NeMo/NeMo/pull/9106

Feel free to ask questions about the new approach.
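For a quick local experiment with microphone chunks, a minimal sketch could look like the following. This is a naive baseline that simply re-transcribes a growing buffer with the high-level `transcribe()` API, not the stateful streaming pipeline from the script above; it assumes the third-party `sounddevice` and `soundfile` packages, 16 kHz mono input, and uses `nvidia/parakeet-tdt-0.6b-v2` only as one example checkpoint.

```python
# Naive sketch: capture microphone chunks and re-transcribe the accumulated buffer.
# Not the stateful streaming pipeline; just a starting point for experimentation.
import numpy as np
import sounddevice as sd
import soundfile as sf
import nemo.collections.asr as nemo_asr

SAMPLE_RATE = 16000      # Parakeet models expect 16 kHz mono audio
CHUNK_SECONDS = 2.0      # size of each microphone chunk

# Swap in the checkpoint you actually use.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

buffer = np.zeros(0, dtype=np.float32)

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
    print("Listening... press Ctrl+C to stop.")
    try:
        while True:
            chunk, _ = stream.read(int(CHUNK_SECONDS * SAMPLE_RATE))
            # In a real app you would cap or slide this buffer instead of growing it forever.
            buffer = np.concatenate([buffer, chunk[:, 0]])
            # transcribe() works on audio files, so dump the buffer to a temporary wav.
            sf.write("/tmp/stream_buffer.wav", buffer, SAMPLE_RATE)
            result = asr_model.transcribe(["/tmp/stream_buffer.wav"])[0]
            # Depending on the NeMo version, the result is a plain string or a Hypothesis.
            print(result if isinstance(result, str) else result.text)
    except KeyboardInterrupt:
        pass
```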

NVIDIA org

Regarding the streaming implementation with a microphone:
I created a draft PR with a demo app based on Gradio; you can use the code as a starting point: https://github.com/NVIDIA-NeMo/NeMo/pull/14759
Please note that the PR is subject to change.
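Until that PR settles, the general shape of such a demo follows Gradio's standard streaming-audio pattern. Below is a sketch of that wiring only; `run_asr` is a hypothetical placeholder for whatever transcription routine you plug in (for example the buffered approach sketched above, or the stateful pipeline from the PR).

```python
# Sketch of the Gradio streaming-audio wiring; run_asr is a placeholder, not NeMo code.
import numpy as np
import gradio as gr

def run_asr(audio_f32, sample_rate):
    # Hypothetical stand-in: call your Parakeet transcription routine here.
    return f"received {len(audio_f32) / sample_rate:.1f} s of audio"

def stream_transcribe(accumulated, new_chunk):
    sr, data = new_chunk                      # Gradio delivers (sample_rate, np.ndarray) chunks
    data = data.astype(np.float32) / 32768.0  # int16 -> float32
    if data.ndim > 1:
        data = data.mean(axis=1)              # downmix to mono
    accumulated = data if accumulated is None else np.concatenate([accumulated, data])
    return accumulated, run_asr(accumulated, sr)

demo = gr.Interface(
    fn=stream_transcribe,
    inputs=["state", gr.Audio(sources=["microphone"], streaming=True)],
    outputs=["state", "text"],
    live=True,
)
demo.launch()
```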

Regarding the streaming implementation, is it possible to output word and segment timestamps like with the non-streaming usage of the model? I have found it difficult to implement this functionality.
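For context, the non-streaming timestamp usage referred to above looks roughly like the following sketch, assuming a recent NeMo version where `transcribe()` accepts `timestamps=True` (the exact return structure may differ between versions).

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# timestamps=True returns hypotheses carrying word- and segment-level timing information.
output = asr_model.transcribe(["audio.wav"], timestamps=True)

hyp = output[0]
print(hyp.text)
for word in hyp.timestamp["word"]:        # dicts with the word and its start/end times
    print(word)
for seg in hyp.timestamp["segment"]:      # dicts with segment text and start/end times
    print(f"{seg['start']}s - {seg['end']}s : {seg['segment']}")
```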
