GLM-OCR very slow on Tesla T4 (~40s per image) even with GPU — is this expected?
Hi,
I’m testing GLM-OCR on Google Colab with a Tesla T4 (15GB VRAM).
Setup:
Model: zai-org/GLM-OCR
Image size: 1024×1024
max_new_tokens: 2048
GPU utilization ~60%, VRAM ~4.4GB
However, inference time is still ~40 seconds per image.
Questions:
- Is ~40–50s on T4 expected for GLM-OCR?
- Any recommended settings for faster inference?
@905saini Hi, thanks for sharing the detailed setup and metrics — that’s very helpful.
We haven’t tested GLM-OCR on a T4 GPU yet, so we don’t have an official latency reference for this configuration.
Could you let us know which inference framework you’re currently using?
For example: Transformers, vLLM, SGLang, or Ollama?
If possible, please also share the image you’re testing with; we can run it on our side to better understand the latency and give more specific feedback.
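For reference, if the test ends up on vLLM, explicitly forcing fp16 is one setting worth trying on a T4, since that GPU has no native bf16. This is a sketch, not a verified GLM-OCR launch command; the `--max-model-len` value is an assumption:

```shell
# Sketch: serve the model in fp16 on a T4 (no native bf16 on that GPU).
# Context length of 4096 is an assumed value, not a GLM-OCR requirement.
vllm serve zai-org/GLM-OCR --dtype float16 --max-model-len 4096
```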
The T4 doesn't support bf16; are you using fp16?
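A minimal sketch of that dtype choice: bf16 needs Ampere (compute capability 8.x) or newer, while the T4 is SM 7.5, so it should be loaded in fp16. The loading call is left as a comment because the model class is an assumption, not verified against GLM-OCR:

```python
# Minimal sketch: pick a dtype the GPU natively supports.
# bf16 requires compute capability 8.x (Ampere) or newer;
# the T4 is SM 7.5, so fp16 is the fast path there.

def supported_dtype(compute_capability_major: int) -> str:
    """Return the dtype name to pass as torch_dtype when loading."""
    return "bfloat16" if compute_capability_major >= 8 else "float16"

print(supported_dtype(7))  # T4 (SM 7.5) -> float16

# Hypothetical loading call (model class is an assumption):
# import torch
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "zai-org/GLM-OCR",
#     torch_dtype=torch.float16,  # not bfloat16 on a T4
#     device_map="auto",
# )
```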