DocSim LoRA

A LoRA-adapted CLIP+DINOv2 similarity head trained to measure visual similarity between original document page scans and their OCR reconstructions. Released as part of the OmniDocBench Render-and-Compare project.

Architecture

  • Backbone A: CLIP ViT-B/32 (pretrained on LAION-2B via open_clip)
  • Backbone B: DINOv2 ViT-B/14
  • Projection head: two-layer MLP (hidden=512, output=256)
  • LoRA adapters: rank 16, alpha 32, dropout 0.05, applied to both the CLIP and DINOv2 backbones (see the sketch below)
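A minimal sketch of how these pieces could fit together, assuming the CLIP and DINOv2 image embeddings are concatenated before the projection head; the exact fusion layout and dimensions live in config.json.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DocSimHead(nn.Module):
    """Two-layer MLP projection head over fused CLIP + DINOv2 features (sketch)."""
    def __init__(self, clip_dim=512, dino_dim=768, hidden=512, out_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim + dino_dim, hidden),  # hidden=512
            nn.ReLU(),
            nn.Linear(hidden, out_dim),              # output=256
        )

    def forward(self, clip_feat, dino_feat):
        # clip_feat: (B, 512) from CLIP ViT-B/32; dino_feat: (B, 768) from DINOv2 ViT-B/14
        z = self.proj(torch.cat([clip_feat, dino_feat], dim=-1))
        return F.normalize(z, dim=-1)  # unit-norm 256-d page embedding
```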

Training

Trained on 20,280 triplets (anchor = original page scan, positive = its OCR reconstruction, negative = a mis-matched reconstruction) derived from the OmniDocBench Render-and-Compare dataset.

Setting                 Value
Epochs                  3
Batch size              16
Learning rate           1e-4
Train triplets          19,266
Val triplets            1,014
Best val accuracy       99.90%
Margin (triplet loss)   0.1
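For illustration, a hedged sketch of the triplet objective with margin 0.1 on the projected embeddings; cosine distance is an assumption here, as the card does not state the distance function used.

```python
import torch
import torch.nn.functional as F

def docsim_triplet_loss(anchor, positive, negative, margin=0.1):
    # anchor:   embedding of the original page scan
    # positive: embedding of its OCR reconstruction
    # negative: embedding of a mis-matched reconstruction
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)
    # Penalize cases where the positive is not at least `margin` closer than the negative.
    return F.relu(d_pos - d_neg + margin).mean()
```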

Files

File                 Description
lora_adapter_best/   LoRA adapter at best validation accuracy
lora_adapter_final/  LoRA adapter at end of training
head_state_best.pt   Projection head weights (best checkpoint)
head_state_final.pt  Projection head weights (final epoch)
config.json          Full architecture config

Use lora_adapter_best/ + head_state_best.pt for inference.
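A minimal scoring sketch building on the DocSimHead above. Here encode_clip and encode_dino are hypothetical helpers standing in for the LoRA-adapted backbone feature extractors; the repository scripts handle the actual adapter loading.

```python
import torch
import torch.nn.functional as F

head = DocSimHead()
head.load_state_dict(torch.load("head_state_best.pt", map_location="cpu"))
head.eval()

@torch.no_grad()
def docsim_score(original_img, reconstruction_img, encode_clip, encode_dino):
    # Project both pages into the shared 256-d space and compare them.
    z_orig = head(encode_clip(original_img), encode_dino(original_img))
    z_recon = head(encode_clip(reconstruction_img), encode_dino(reconstruction_img))
    return F.cosine_similarity(z_orig, z_recon).item()
```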

Usage

Download the checkpoints via the GT-free-ocr-metrics repository:

bash download_models.sh

Then run any DocSim-based method:

bash scripts/run_method.sh docsim_lora

License

Apache-2.0. The companion datasets (OmniDocBench Render-and-Compare) are CC-BY-NC-4.0.
