Add OlmOCRBench evaluation results

#43
by staghado - opened

This PR ensures your model shows up at https://huggingface.co/datasets/allenai/olmOCR-bench
The evaluation was done through the official SDK.

Z.ai org

@staghado Thanks for running the evaluation and submitting this PR! We appreciate you taking the time to benchmark the model.

However, the reported metrics look a bit unusual to us. We’re planning to rerun the evaluation on our side using the official SDK to double-check the results. We’ll follow up once we’ve reproduced and verified the numbers.

Thanks again for the effort!

Z.ai org

Also, could you confirm the inference setup you used? For example, did you run inference via the MaaS API, or through the SDK provided in our GitHub repo (https://github.com/zai-org/GLM-OCR)? Knowing the exact setup would help us reproduce the evaluation more accurately.

Z.ai org

It seems the evaluation was run using the ZAI API for inference. We’ll try reproducing the results with the same setup on our side. Thanks!

Thanks for looking into this! Here's what I did:

I used the ZAI Python SDK (zai-sdk==0.2.2) with the layout_parsing.create endpoint. The olmOCR-bench PDFs were pre-rendered to PNG at 200 DPI with a max side length of 1540px (aspect ratio preserved; native resolution kept if smaller). Each image was processed 3 times and test pass rates were averaged across repeats. I then ran the official olmocr.bench.benchmark evaluation script with the standard test JSONL files.
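The pre-rendering step described above (cap the longest side at 1540px, keep aspect ratio, leave smaller images untouched) can be sketched roughly as follows. This is an illustrative reconstruction, not the exact script used; the function name and the use of Pillow are my assumptions.

```python
from PIL import Image

MAX_SIDE = 1540  # cap on the longest side, per the setup described above

def resize_for_ocr(img: Image.Image, max_side: int = MAX_SIDE) -> Image.Image:
    """Downscale so the longest side is at most max_side, preserving aspect
    ratio. Images already within the limit keep their native resolution."""
    w, h = img.size
    longest = max(w, h)
    if longest <= max_side:
        return img  # native resolution kept if smaller
    scale = max_side / longest
    return img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
```

The actual gist linked below presumably also handles the 200 DPI PDF-to-PNG rasterization and the 3-repeat averaging; the snippet only covers the resize rule.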

For context, I had previously run GLM-OCR standalone using vLLM with just the "Text Recognition:" prompt (no layout detection), which scored 67.5% overall (excl. headers & footers). The per-category scores largely match between the two setups, except for tables (42.5% → 77.6%), which makes sense since the API includes layout detection that routes table regions to the "Table Recognition:" prompt. The other categories show only minor differences, which suggests the evaluation is consistent.

| Category | vLLM (w/o layout) | ZAI API (with layout) |
|---|---|---|
| arxiv_math | 80.4% | 80.7% |
| multi_column | 79.9% | 76.7% |
| old_scans_math | 74.9% | 68.3% |
| old_scans | 39.9% | 37.6% |
| long_tiny_text | 87.6% | 86.9% |
| table_tests | 42.5% | 77.6% |
| **Overall (excl. h&f)** | **67.5%** | **75.2%** |

The full extraction script is available as a gist.

Hope this helps reproduce!

