DocSim LoRA

A LoRA-adapted CLIP+DINOv2 similarity head trained to measure visual similarity between original document page scans and their OCR reconstructions. Released as part of the OmniDocBench Render-and-Compare project.

Architecture

  • Backbone A: CLIP ViT-B/32 (pretrained on LAION-2B via open_clip)
  • Backbone B: DINOv2 ViT-B/14
  • Projection head: two-layer MLP (hidden=512, output=256)
  • LoRA adapters: rank 16, alpha 32, dropout 0.05, applied to both the CLIP and DINOv2 backbones (see the sketch below)
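A minimal sketch of how these pieces could fit together, assuming the CLIP and DINOv2 image embeddings are concatenated before the projection head; the exact fusion layout and dimensions live in config.json.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DocSimHead(nn.Module):
    """Two-layer MLP projection head over fused CLIP + DINOv2 features (sketch)."""
    def __init__(self, clip_dim=512, dino_dim=768, hidden=512, out_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim + dino_dim, hidden),  # hidden=512
            nn.ReLU(),
            nn.Linear(hidden, out_dim),              # output=256
        )

    def forward(self, clip_feat, dino_feat):
        # clip_feat: (B, 512) from CLIP ViT-B/32; dino_feat: (B, 768) from DINOv2 ViT-B/14
        z = self.proj(torch.cat([clip_feat, dino_feat], dim=-1))
        return F.normalize(z, dim=-1)  # unit-norm 256-d page embedding
```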

Training

Trained on 20,280 triplets (anchor = original page scan, positive = its OCR reconstruction, negative = a mis-matched reconstruction) derived from the OmniDocBench Render-and-Compare dataset.

Setting                 Value
Epochs                  3
Batch size              16
Learning rate           1e-4
Train triplets          19,266
Val triplets            1,014
Best val accuracy       99.90%
Margin (triplet loss)   0.1
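For illustration, a hedged sketch of the triplet objective with margin 0.1 on the projected embeddings; cosine distance is an assumption here, as the card does not state the distance function used.

```python
import torch
import torch.nn.functional as F

def docsim_triplet_loss(anchor, positive, negative, margin=0.1):
    # anchor:   embedding of the original page scan
    # positive: embedding of its OCR reconstruction
    # negative: embedding of a mis-matched reconstruction
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    # Penalize cases where the positive is not at least `margin` closer than the negative.
    return F.relu(d_pos - d_neg + margin).mean()
```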

Files

File                 Description
lora_adapter_best/   LoRA adapter at best validation accuracy
lora_adapter_final/  LoRA adapter at end of training
head_state_best.pt   Projection head weights (best checkpoint)
head_state_final.pt  Projection head weights (final epoch)
config.json          Full architecture config

Use lora_adapter_best/ + head_state_best.pt for inference.
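A minimal scoring sketch building on the DocSimHead above. Here encode_clip and encode_dino are hypothetical helpers standing in for the LoRA-adapted backbone feature extractors; the repository scripts handle the actual adapter loading.

```python
import torch
import torch.nn.functional as F

head = DocSimHead()
head.load_state_dict(torch.load("head_state_best.pt", map_location="cpu"))
head.eval()

@torch.no_grad()
def docsim_score(original_img, reconstruction_img, encode_clip, encode_dino):
    # Project both pages into the shared 256-d space and compare them.
    z_orig = head(encode_clip(original_img), encode_dino(original_img))
    z_recon = head(encode_clip(reconstruction_img), encode_dino(reconstruction_img))
    return F.cosine_similarity(z_orig, z_recon).item()
```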

Usage

Download the checkpoints via the GT-free-ocr-metrics repository:

bash download_models.sh

Then run any DocSim-based method:

bash scripts/run_method.sh docsim_lora

License

Apache-2.0. The companion datasets (OmniDocBench Render-and-Compare) are CC-BY-NC-4.0.
