GLM-OCR very slow on Tesla T4 (~40s per image) even with GPU — is this expected?

#13
by 905saini - opened

Hi,

I’m testing GLM-OCR on Google Colab with a Tesla T4 (15GB VRAM).

Setup:

Model: zai-org/GLM-OCR
Image size: 1024×1024
max_new_tokens: 2048
GPU utilization: ~60%, VRAM: ~4.4 GB

However, inference time is still ~40 seconds per image.

Questions:

  1. Is ~40–50s on T4 expected for GLM-OCR?
  2. Any recommended settings for faster inference?
Z.ai org

@905saini Hi, thanks for sharing the detailed setup and metrics — that’s very helpful.

We haven’t tested GLM-OCR on a T4 GPU yet, so we don’t have an official latency reference for this configuration.

Could you let us know which inference framework you’re currently using?
For example: Transformers, vLLM, SGLang, or Ollama?

Z.ai org

If possible, could you also share the image you’re testing with? We can run it on our side to better understand the latency and give more specific feedback.


The T4 doesn't support bf16; are you using fp16? Falling back to fp32 on a T4 is a common cause of slow inference.
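
A small sketch of the dtype check (assuming PyTorch; the helper name is made up for illustration). The T4's compute capability is 7.5, below the 8.0 that native bf16 requires, so fp16 is the right half-precision choice there:

```python
import torch

def pick_dtype(device_capability=None):
    """Pick a half-precision dtype for the current GPU.

    Pre-Ampere cards like the T4 (compute capability 7.5) have no native
    bf16 support, so fp16 is the safer choice there; Ampere and newer
    (capability >= 8.0) handle bf16 natively.
    """
    if device_capability is None and torch.cuda.is_available():
        device_capability = torch.cuda.get_device_capability()
    if device_capability is not None and device_capability[0] >= 8:
        return torch.bfloat16
    return torch.float16
```

If you load the model with Transformers, you could pass the result via `torch_dtype=pick_dtype()` in `from_pretrained` (assuming that's the framework in use); requesting bf16 on a T4 would otherwise force slow emulation or an fp32 fallback.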
