Image classification using fine-tuned CLIP for historical document sorting
Goal: solve the task of sorting archive page images (for their further content-based processing)
Scope: image processing, training and evaluation of the CLIP model, input file/directory processing, output of class (category) results for the top N predictions, and summarizing of predictions into a tabular format.
Versions
There are currently 4 versions of the model available for download; all of them have the same set of categories, but different base models and data annotations. The latest approved version, v1.1, is considered the default and can be found in the main branch of the HF hub ^1.
| Version | Base | Pages | PDFs | Description |
|---|---|---|---|---|
| v1.1 | ViT-B/16 | 15855 | 5730 | smallest |
| v1.2 | ViT-B/32 | 15855 | 5730 | small with higher granularity |
| v2.1 | ViT-L/14 | 15855 | 5730 | large |
| v2.2 | ViT-L/14@336 | 15855 | 5730 | large with highest resolution |
| Base model | Disk space |
|---|---|
| openai/clip-vit-base-patch16 | 992 Mb |
| openai/clip-vit-base-patch32 | 1008 Mb |
| openai/clip-vit-large-patch14 | 1.5 Gb |
| openai/clip-vit-large-patch14-336 | 1.5 Gb |
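For reference, fetching a checkpoint locally might look like the sketch below. It assumes the versions are published as revisions of the `ufal/clip-historical-page` hub repository (only `main`, holding v1.1, is confirmed above), so adjust `revision` to the actual branch or tag names.

```python
from huggingface_hub import snapshot_download

# Download the fine-tuned checkpoint from the HF hub.
# "main" holds the default v1.1; other versions are assumed to live in
# separate revisions (branches/tags) -- adjust the name if the repo differs.
local_dir = snapshot_download(
    repo_id="ufal/clip-historical-page",
    revision="main",
)
print(f"Model files downloaded to: {local_dir}")
```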
Model description
- Fine-tuned model repository: UFAL's clip-historical-page ^1
- Base model repository: OpenAI's clip-vit-base-patch16, clip-vit-base-patch32, clip-vit-large-patch14, clip-vit-large-patch14-336 ^2 ^13 ^14 ^15
The model was trained on a manually annotated dataset of historical documents, specifically images of pages from archival paper documents scanned into digital form.
The images contain various combinations of text, tables, drawings, and photos; the categories described below were formed based on those archival documents. Page examples can be found in the category_samples directory.
The key use case of the provided model and data processing pipeline is to classify an input PNG image (a page scanned from a paper PDF source) into one of the categories, each of which maps to a specific content-based processing pipeline.
In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten, plain printed, or tabular-structured text, as well as to mark the presence of printed or drawn graphic materials yet to be extracted from the page images.
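As a rough illustration, classifying a single page could look like the sketch below. It assumes the fine-tuned checkpoint loads with transformers' `CLIPModel`/`CLIPProcessor` and that the bare category labels (listed in the Categories section below) work as text prompts, which may differ from the exact prompt format used during fine-tuning.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "ufal/clip-historical-page"  # default checkpoint (main branch, v1.1)
# Category labels from the Categories section of this card
CATEGORIES = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
              "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

# Score one page image against all category prompts
image = Image.open("page.png").convert("RGB")
inputs = processor(text=CATEGORIES, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-to-text similarity scores
probs = outputs.logits_per_image.softmax(dim=-1)[0]
top = torch.topk(probs, k=3)  # top-N predictions
for idx, p in zip(top.indices.tolist(), top.values.tolist()):
    print(f"{CATEGORIES[idx]}: {p:.3f}")
```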
Data
The dataset is provided under a Public Domain license and consists of 15855 PNG images of pages from archival documents. The source image files and their annotations can be found in the LINDAT repository ^10.
- Training set of the model: 14267 images (90% of all data), with the category proportions tabulated below
- Evaluation set: 1583 images (10% of all data), with the same category proportions as below, demonstrated in model_EVAL.csv
Manual annotation was performed beforehand and took some time; the categories were formed from different sources of archival documents originating in the 1920-2020 time span.
The disproportion of categories in both the training data and the provided category_samples is NOT intentional, but rather a result of the nature of the source data.
In total, several thousand separate PDF files were selected and split into PNG pages: ~4k of the scanned documents were one page long, covering around a third of all data, while ~2k of them were much longer (dozens to hundreds of pages), covering the rest (more than 60% of all annotated data).
The specific content and language of the source data are irrelevant given the model's vision resolution; however, all of the data samples come from archaeological reports, which may affect drawing detection preferences, since the commonly depicted objects are ceramic pieces, arrowheads, and rocks, formerly drawn by hand and later illustrated with digital tools (examples can be found in category_samples/DRAW).
Categories
| Label | Description |
|---|---|
| DRAW | drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions |
| DRAW_L | drawings, etc., but presented within a table-like layout or including a legend formatted as a table |
| LINE_HW | handwritten text organized in a tabular or form-like structure |
| LINE_P | printed text organized in a tabular or form-like structure |
| LINE_T | machine-typed text organized in a tabular or form-like structure |
| PHOTO | photographs or photographic cutouts, potentially with text captions |
| PHOTO_L | photos presented within a table-like layout or accompanied by tabular annotations |
| TEXT | mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements |
| TEXT_HW | only handwritten text in paragraph or block form (non-tabular) |
| TEXT_P | only printed text in paragraph or block form (non-tabular) |
| TEXT_T | only machine-typed text in paragraph or block form (non-tabular) |
The categories were chosen to sort the pages by the following criteria:
- presence of graphical elements (drawings OR photos)
- type of text (handwritten OR printed OR typed OR mixed)
- presence of tabular layout / forms
The reason for such a distinction is that different types of pages require different processing pipelines, which are applied after classification.
Examples of pages sorted by category can be found in the category_samples directory, which is also available as a testing subset of the training data.
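The downstream routing itself is outside the scope of this repository, but a hypothetical dispatch based on the predicted label could look like the sketch below; every pipeline name here is a placeholder, not part of the provided code.

```python
# Hypothetical routing of a page to a downstream OCR/extraction pipeline
# based on the predicted category; all handler names are placeholders.
ROUTES = {
    "TEXT_HW": "handwritten_ocr",
    "TEXT_P": "printed_ocr",
    "TEXT_T": "typed_ocr",
    "TEXT": "mixed_text_ocr",
    "LINE_HW": "table_handwritten_ocr",
    "LINE_P": "table_printed_ocr",
    "LINE_T": "table_typed_ocr",
    "DRAW": "drawing_extraction",
    "DRAW_L": "drawing_extraction",
    "PHOTO": "photo_extraction",
    "PHOTO_L": "photo_extraction",
}

def route(category: str) -> str:
    """Return the name of the content-specific pipeline for a predicted category."""
    return ROUTES.get(category, "manual_review")

print(route("LINE_T"))  # -> "table_typed_ocr"
```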
Results
| Version | Evaluation set accuracy (Top-1) |
|---|---|
| v1.1 | 100.00% |
| v1.2 | 100.00% |
| v2.1 | 99.94% |
| v2.2 | 99.87% |
The confusion matrices provided above show the diagonal of matching gold and predicted categories, while their off-diagonal elements show inter-class errors. From those graphs you can judge what types of mistakes to expect from the model.
Image preprocessing and augmentation transforms applied to the training data:
- transforms.ColorJitter(brightness=0.5)
- transforms.ColorJitter(contrast=0.5)
- transforms.ColorJitter(saturation=0.5)
- transforms.ColorJitter(hue=0.5)
- transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
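A minimal sketch of how these transforms might be composed with torchvision is shown below; the ordering and the decision to chain them all in one `Compose` are assumptions, not a guaranteed reproduction of the original training setup.

```python
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# One possible composition of the augmentations listed above
# (order and application probabilities are assumptions).
train_augmentations = transforms.Compose([
    transforms.ColorJitter(brightness=0.5),
    transforms.ColorJitter(contrast=0.5),
    transforms.ColorJitter(saturation=0.5),
    transforms.ColorJitter(hue=0.5),
    transforms.Lambda(
        lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5))
    ),
    transforms.Lambda(
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2)))
    ),
])
```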
Training hyperparameters
- eval_strategy: "epoch"
- save_strategy: "epoch"
- learning_rate: 5e-5
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- num_train_epochs: 3
- warmup_ratio: 0.1
- logging_steps: 10
- load_best_model_at_end: True
- metric_for_best_model: "accuracy"
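For orientation, these values map onto transformers' `TrainingArguments` roughly as in the sketch below; the `output_dir` is a placeholder and the surrounding `Trainer` setup is omitted.

```python
from transformers import TrainingArguments

# Rough mapping of the listed hyperparameters onto TrainingArguments
# (older transformers versions name the first field `evaluation_strategy`).
training_args = TrainingArguments(
    output_dir="clip-historical-page-finetune",  # placeholder path
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```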
Contacts
For support, write to [email protected], responsible for this GitHub repository ^8.
Information about the authors of this project, including their names and ORCIDs, can be found in the CITATION.cff file.
Acknowledgements
- Developed by UFAL ^7
- Funded by ATRIUM ^4
- Shared by ATRIUM ^4 & UFAL ^7
- Model type: fine-tuned ViT with a 224x224 ^2 ^13 ^14 or 336x336 ^15 input resolution
© 2022 UFAL & ATRIUM