How to improve inference speed?

#74, opened by Anilosan15

I fine-tuned the Dia 1.6B TTS model for a different language.
When I deploy it on an H200 GPU, the average token generation speed is around 70–80 tokens/s, and on a B200 GPU it rises to 110–120 tokens/s. I'm using the bf16 version of the model.

However, during generation, it still feels quite slow. For example, generating 5 seconds of audio takes around 4 seconds, even though the reported speed is about 86 tokens per second.

What can I do to improve the inference or token generation speed?
Are there any recommended optimizations, such as quantization, batching, torch.compile settings, or kernel tweaks for Dia?
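For context, this is roughly how I measure it (a minimal sketch, assuming the official nari-labs/dia Python package; the checkpoint ID is a placeholder and `compute_dtype` / the exact `generate` signature may differ between versions):

```python
import time

from dia.model import Dia  # assumes the official nari-labs/dia package

# Placeholder checkpoint ID; I load my fine-tuned bf16 weights the same way.
model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="bfloat16")

text = "[S1] Example sentence in our target language."

start = time.perf_counter()
audio = model.generate(text)  # returns the waveform as a NumPy float array
elapsed = time.perf_counter() - start

audio_seconds = len(audio) / 44100  # DAC decodes at 44.1 kHz
print(f"{audio_seconds:.2f}s of audio in {elapsed:.2f}s "
      f"(RTF = {audio_seconds / elapsed:.2f}x)")
```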

Nari Labs org

What is your batch size? Did you turn on torch.compile?
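For reference, in the example scripts it is a flag on `generate` (sketch; the flag name may be exposed differently in third-party servers):

```python
# As in the reference example; flag names may vary across versions.
output = model.generate(text, use_torch_compile=True, verbose=True)
```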

We are not using batching.
Here are our current benchmarks on a B200 GPU:

torch.compile = False: 110–120 tokens/s → real-time factor = 1.28x

torch.compile = True: 200–210 tokens/s → real-time factor = 2.3x

When we enable torch.compile, we can clearly see some improvement in inference speed, but it’s still quite slow compared to what we expect from a B200 GPU.
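For what it's worth, these RTF numbers line up with the codec frame rate: Dia decodes through DAC at 44.1 kHz with, as far as we understand, a hop size of 512 samples, i.e. about 86.1 codec frames (token steps) per second of audio, so RTF ≈ tokens/s ÷ 86.1:

```python
FRAME_RATE = 44100 / 512  # ≈ 86.13 codec frames (token steps) per audio second

for toks_per_s in (110, 200):  # lower ends of our two benchmarks
    print(f"{toks_per_s} tok/s -> RTF ≈ {toks_per_s / FRAME_RATE:.2f}x")
# 110 tok/s -> RTF ≈ 1.28x
# 200 tok/s -> RTF ≈ 2.32x
```

This also explains why roughly 86 tokens/s corresponds to real-time generation in our first measurement.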

We are using this unofficial Dia-TTS server implementation:
https://github.com/devnen/Dia-TTS-Server

  1. What else can we do to improve inference or token generation speed?
  2. The output sample rate is set to 44100 Hz by default. Could reducing it (e.g., to 22050 Hz) help speed up generation, and can we change this setting? (See the sketch below.)
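On question 2, our understanding is that 44.1 kHz is fixed by the DAC codec, so the number of tokens per second of audio would not change; the most we could do is resample the finished waveform, which shrinks the output but should not speed up generation. A sketch using torchaudio (assumes the generated audio is a float NumPy array at 44.1 kHz):

```python
import torch
import torchaudio.functional as AF

def downsample(audio_np, orig_sr=44100, new_sr=22050):
    """Resample already-generated audio; this does not speed up token generation."""
    wav = torch.from_numpy(audio_np).float().unsqueeze(0)  # (1, num_samples)
    return AF.resample(wav, orig_freq=orig_sr, new_freq=new_sr).squeeze(0).numpy()
```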
