Questions about streaming with Parakeet and TDT merging methods

#63
by alexandreacff - opened

I’m currently trying to work with Parakeet in streaming mode, receiving microphone chunks and generating live transcriptions.

As a reference, I’m using the following code for streaming:
https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py

However, I’ve run into some questions:

  1. Why do the more conventional merging methods not work well for TDT? I tested them, but the performance dropped significantly.

  2. Is there already an implementation available for this use case (streaming with Parakeet using microphone chunks)?

Yes. Please provide a minimal Python script for streaming with Parakeet using microphone chunks.

NVIDIA org

@alexandreacff Please use the new streaming pipeline: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_streaming_infer_rnnt.py

The pipeline is significantly simplified: merging is removed, and the decoder state is preserved between chunks. We observe both better quality and better speed with the new approach.
You can find some details about the results in the description of the following PR: https://github.com/NVIDIA-NeMo/NeMo/pull/9106

Feel free to ask questions about the new approach.
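For a quick local experiment with microphone chunks, a minimal sketch could look like the following. This is a naive baseline that simply re-transcribes a growing buffer with the high-level `transcribe()` API, not the stateful streaming pipeline from the script above; it assumes the third-party `sounddevice` and `soundfile` packages, 16 kHz mono input, and uses `nvidia/parakeet-tdt-0.6b-v2` only as one example checkpoint.

```python
# Naive sketch: capture microphone chunks and re-transcribe the accumulated buffer.
# Not the stateful streaming pipeline; just a starting point for experimentation.
import numpy as np
import sounddevice as sd
import soundfile as sf
import nemo.collections.asr as nemo_asr

SAMPLE_RATE = 16000      # Parakeet models expect 16 kHz mono audio
CHUNK_SECONDS = 2.0      # size of each microphone chunk

# Swap in the checkpoint you actually use.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

buffer = np.zeros(0, dtype=np.float32)

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
    print("Listening... press Ctrl+C to stop.")
    try:
        while True:
            chunk, _ = stream.read(int(CHUNK_SECONDS * SAMPLE_RATE))
            # In a real app you would cap or slide this buffer instead of growing it forever.
            buffer = np.concatenate([buffer, chunk[:, 0]])
            # transcribe() works on audio files, so dump the buffer to a temporary wav.
            sf.write("/tmp/stream_buffer.wav", buffer, SAMPLE_RATE)
            result = asr_model.transcribe(["/tmp/stream_buffer.wav"])[0]
            # Depending on the NeMo version, the result is a plain string or a Hypothesis.
            print(result if isinstance(result, str) else result.text)
    except KeyboardInterrupt:
        pass
```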

NVIDIA org

Regarding the streaming implementation with a microphone:
I created a draft PR with a demo app based on Gradio; you can use the code as a starting point: https://github.com/NVIDIA-NeMo/NeMo/pull/14759
Please note that the PR is subject to change.
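Until that PR settles, the general shape of such a demo follows Gradio's standard streaming-audio pattern. Below is a sketch of that wiring only; `run_asr` is a hypothetical placeholder for whatever transcription routine you plug in (for example the buffered approach sketched above, or the stateful pipeline from the PR).

```python
# Sketch of the Gradio streaming-audio wiring; run_asr is a placeholder, not NeMo code.
import numpy as np
import gradio as gr

def run_asr(audio_f32, sample_rate):
    # Hypothetical stand-in: call your Parakeet transcription routine here.
    return f"received {len(audio_f32) / sample_rate:.1f} s of audio"

def stream_transcribe(accumulated, new_chunk):
    sr, data = new_chunk                      # Gradio delivers (sample_rate, np.ndarray) chunks
    data = data.astype(np.float32) / 32768.0  # int16 -> float32
    if data.ndim > 1:
        data = data.mean(axis=1)              # downmix to mono
    accumulated = data if accumulated is None else np.concatenate([accumulated, data])
    return accumulated, run_asr(accumulated, sr)

demo = gr.Interface(
    fn=stream_transcribe,
    inputs=["state", gr.Audio(sources=["microphone"], streaming=True)],
    outputs=["state", "text"],
    live=True,
)
demo.launch()
```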

Regarding the streaming implementation, is it possible to output word and segment timestamps like with the non-streaming usage of the model? I have found it difficult to implement this functionality.
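For context, the non-streaming timestamp usage referred to above looks roughly like the following sketch, assuming a recent NeMo version where `transcribe()` accepts `timestamps=True` (the exact return structure may differ between versions).

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# timestamps=True returns hypotheses carrying word- and segment-level timing information.
output = asr_model.transcribe(["audio.wav"], timestamps=True)

hyp = output[0]
print(hyp.text)
for word in hyp.timestamp["word"]:        # dicts with the word and its start/end times
    print(word)
for seg in hyp.timestamp["segment"]:      # dicts with segment text and start/end times
    print(f"{seg['start']}s - {seg['end']}s : {seg['segment']}")
```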
