How to improve inference speed?

#74, opened by Anilosan15

I fine-tuned the Dia 1.6B TTS model for a different language.
When I deploy it on an H200 GPU, the average token generation speed is around 70–80 tokens/s, and on a B200 GPU it rises to 110–120 tokens/s. I'm using the bf16 version of the model.

However, during generation, it still feels quite slow. For example, generating 5 seconds of audio takes around 4 seconds, even though the reported speed is about 86 tokens per second.

What can I do to improve the inference or token generation speed?
Are there any recommended optimizations, such as quantization, batching, torch.compile settings, or kernel tweaks for Dia?
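For context, this is roughly how I measure it (a minimal sketch, assuming the official nari-labs/dia Python package; the checkpoint ID is a placeholder and `compute_dtype` / the exact `generate` signature may differ between versions):

```python
import time

from dia.model import Dia  # assumes the official nari-labs/dia package

# Placeholder checkpoint ID; I load my fine-tuned bf16 weights the same way.
model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="bfloat16")

text = "[S1] Example sentence in our target language."

start = time.perf_counter()
audio = model.generate(text)  # returns the waveform as a NumPy float array
elapsed = time.perf_counter() - start

audio_seconds = len(audio) / 44100  # DAC decodes at 44.1 kHz
print(f"{audio_seconds:.2f}s of audio in {elapsed:.2f}s "
      f"(RTF = {audio_seconds / elapsed:.2f}x)")
```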

Nari Labs org

What is your batch size? Did you turn on torch.compile?
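For reference, in the example scripts it is a flag on `generate` (sketch; the flag name may be exposed differently in third-party servers):

```python
# As in the reference example; flag names may vary across versions.
output = model.generate(text, use_torch_compile=True, verbose=True)
```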

We are not using batching.
Here are our current benchmarks on a B200 GPU:

torch.compile = False: 110–120 tokens/s → real-time factor = 1.28x

torch.compile = True: 200–210 tokens/s → real-time factor = 2.3x

When we enable torch.compile, we can clearly see some improvement in inference speed, but it’s still quite slow compared to what we expect from a B200 GPU.
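For what it's worth, these RTF numbers line up with the codec frame rate: Dia decodes through DAC at 44.1 kHz with, as far as we understand, a hop size of 512 samples, i.e. about 86.1 codec frames (token steps) per second of audio, so RTF ≈ tokens/s ÷ 86.1:

```python
FRAME_RATE = 44100 / 512  # ≈ 86.13 codec frames (token steps) per audio second

for toks_per_s in (110, 200):  # lower ends of our two benchmarks
    print(f"{toks_per_s} tok/s -> RTF ≈ {toks_per_s / FRAME_RATE:.2f}x")
# 110 tok/s -> RTF ≈ 1.28x
# 200 tok/s -> RTF ≈ 2.32x
```

This also explains why roughly 86 tokens/s corresponds to real-time generation in our first measurement.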

We are using this unofficial Dia-TTS server implementation:
https://github.com/devnen/Dia-TTS-Server

  1. What else can we do to improve inference or token generation speed?
  2. The output sample rate is set to 44100 Hz by default. Could reducing it (e.g., to 22050 Hz) help speed up generation, and can we change this setting? (See the sketch below.)
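On question 2, our understanding is that 44.1 kHz is fixed by the DAC codec, so the number of tokens per second of audio would not change; the most we could do is resample the finished waveform, which shrinks the output but should not speed up generation. A sketch using torchaudio (assumes the generated audio is a float NumPy array at 44.1 kHz):

```python
import torch
import torchaudio.functional as AF

def downsample(audio_np, orig_sr=44100, new_sr=22050):
    """Resample already-generated audio; this does not speed up token generation."""
    wav = torch.from_numpy(audio_np).float().unsqueeze(0)  # (1, num_samples)
    return AF.resample(wav, orig_freq=orig_sr, new_freq=new_sr).squeeze(0).numpy()
```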
