Image classification using fine-tuned CLIP - for historical document sorting

Goal: solve a task of archive page images sorting (for their further content-based processing)

Scope: image processing, training and evaluation of the CLIP model, input file/directory handling, output of the top-N predicted class 🪧 (category) results, and summarizing of the predictions into a tabular format.

Versions 🏁

There are currently 4 versions of the model available for download; all of them share the same set of categories but are built on different base models and data annotations. The latest approved version, v1.1, is considered the default and can be found in the main branch of the HF 😊 hub ^1 🔗

| Version | Base | Pages | PDFs | Description |
|---------|------|-------|------|-------------|
| v1.1 | ViT-B/16 | 15855 | 5730 | smallest |
| v1.2 | ViT-B/32 | 15855 | 5730 | small with higher granularity |
| v2.1 | ViT-L/14 | 15855 | 5730 | large |
| v2.2 | ViT-L/14@336 | 15855 | 5730 | large with highest resolution |
| Base model | Disk space |
|------------|------------|
| openai/clip-vit-base-patch16 | 992 MB |
| openai/clip-vit-base-patch32 | 1008 MB |
| openai/clip-vit-large-patch14 | 1.5 GB |
| openai/clip-vit-large-patch14-336 | 1.5 GB |
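
Only the default version lives in the main branch; a specific version can be loaded by passing a revision to the loaders. Below is a minimal sketch assuming the non-default versions are stored in branches named after the version (e.g. v2.1) and that the checkpoint loads as a standard CLIPModel; check the hub repository for the actual branch names.

```python
# Minimal sketch: load a specific model version from the HF hub.
# Assumptions: non-default versions live in branches named after the version
# (e.g. "v2.1"), and the checkpoint loads as a plain CLIPModel.
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "ufal/clip-historical-page"
REVISION = "v2.1"  # assumed branch name -- verify on the hub

model = CLIPModel.from_pretrained(MODEL_ID, revision=REVISION)
processor = CLIPProcessor.from_pretrained(MODEL_ID, revision=REVISION)
```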

Model description 📇

(Figure: architecture diagram)

🔲 Fine-tuned model repository: UFAL's clip-historical-page ^1 🔗

🔳 Base model repository: OpenAI's clip-vit-base-patch16, clip-vit-base-patch32, clip-vit-large-patch14, clip-vit-large-patch14-336 ^2 ^13 ^14 ^15 🔗

The model was trained on a manually ✍️ annotated dataset of historical documents, specifically images of pages from archival documents whose paper sources were scanned into digital form.

The images contain various combinations of text 📄, tables 📏, drawings 📈, and photos 🌄; the categories 🪧 described below were formed on the basis of those archival documents. Page examples can be found in the category_samples 📁 directory.

The key use case of the provided model and data processing pipeline is to classify an input PNG image (a page scanned from a paper source into a PDF) into one of the categories, each of which is handled by its own content-specific data processing pipeline.

In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten ✏️, plain printed 📄 text, or text structured in a tabular 📏 format, as well as to mark the presence of printed 🌄 or drawn 📈 graphic materials yet to be extracted from the page images.
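
A minimal inference sketch is shown below. It assumes the fine-tuned checkpoint works with the transformers zero-shot image-classification pipeline and that the plain category names are adequate text prompts; the prompt templates actually used during fine-tuning may differ.

```python
# Minimal sketch: classify one page image into the categories described below.
# Assumptions: the checkpoint is usable via the zero-shot-image-classification
# pipeline, and the raw category names work as text prompts.
from transformers import pipeline

CATEGORIES = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
              "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]

classifier = pipeline("zero-shot-image-classification", model="ufal/clip-historical-page")
scores = classifier("page.png", candidate_labels=CATEGORIES)  # "page.png" is a hypothetical input
for s in scores[:3]:                                          # top-3 predictions
    print(f"{s['label']}: {s['score']:.3f}")
```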

Data 📜

The dataset is provided under a Public Domain license and consists of 15855 PNG images of pages from archival documents. The source image files and their annotations can be found in the LINDAT repository ^10 🔗.

Training 💪 set of the model: 14267 images

90% of all data; the per-category 🪧 proportions are tabulated below

Evaluation 🏆 set: 1583 images

10% of all data; the same per-category 🪧 proportions as below, demonstrated in model_EVAL.csv 📎

Manual ✍️ annotation was performed beforehand and took some time ⌛; the categories 🪧 were formed from different sources of archival documents originating from the 1920-2020 time span.

The disproportion of the categories 🪧 in both the training data and the provided evaluation category_samples 📁 is NOT intentional, but rather a result of the nature of the source data.

In total, several thousand separate PDF files were selected and split into PNG pages; ~4k of the scanned documents were one page long, covering around a third of all data, while ~2k were much longer (dozens to hundreds of pages), covering the rest (more than 60% of all annotated data).

The specific content and language of the source data are irrelevant given the model's vision resolution. However, all of the data samples come from archaeological reports, which may somewhat affect the drawing detection preferences, since the commonly depicted objects are ceramic pieces, arrowheads, and rocks, formerly drawn by hand and later illustrated with digital tools (examples can be found in category_samples/DRAW 📁).

Categories 🪧

| Label | Description |
|-------|-------------|
| DRAW 📈 | drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions |
| DRAW_L 📈📏 | drawings, etc., but presented within a table-like layout or including a legend formatted as a table |
| LINE_HW ✏️📏 | handwritten text organized in a tabular or form-like structure |
| LINE_P 📏 | printed text organized in a tabular or form-like structure |
| LINE_T 📏 | machine-typed text organized in a tabular or form-like structure |
| PHOTO 🌄 | photographs or photographic cutouts, potentially with text captions |
| PHOTO_L 🌄📏 | photos presented within a table-like layout or accompanied by tabular annotations |
| TEXT 📰 | mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements |
| TEXT_HW ✏️📄 | only handwritten text in paragraph or block form (non-tabular) |
| TEXT_P 📄 | only printed text in paragraph or block form (non-tabular) |
| TEXT_T 📄 | only machine-typed text in paragraph or block form (non-tabular) |

The categories were chosen to sort the pages by the following criteria:

  • presence of graphical elements (drawings 📈 OR photos 🌄)
  • type of text 📄 (handwritten ✏️ OR printed OR typed OR mixed 📰)
  • presence of tabular layout / forms 📏

The reason for this distinction is that different processing pipelines are applied to different types of pages after classification; a sketch of such routing follows.
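
The sketch below illustrates that routing by mapping a predicted label to a downstream handler; the handler functions are hypothetical placeholders, not part of this repository.

```python
# Minimal sketch: route a predicted category to a content-specific pipeline.
# All handler functions below are hypothetical placeholders.
def run_typed_ocr(path): ...               # OCR for machine-typed text
def run_handwriting_recognition(path): ...
def run_table_extraction(path): ...
def run_graphics_extraction(path): ...

DISPATCH = {
    "TEXT_T": run_typed_ocr,
    "TEXT_HW": run_handwriting_recognition,
    "LINE_T": run_table_extraction,
    "PHOTO": run_graphics_extraction,
    # ... the remaining categories map to their own pipelines
}

def process_page(path: str, predicted_label: str) -> None:
    handler = DISPATCH.get(predicted_label)
    if handler is not None:
        handler(path)
```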

Examples of pages sorted by category 🪧 can be found in the category_samples 📁 directory, which is also available as a testing subset of the training data.

(Figure: dataset timeline)

Results 📊

| Version | Evaluation set accuracy (Top-1) |
|---------|---------------------------------|
| v1.1 | 100.00% 🏆 |
| v1.2 | 100.00% 🏆 |
| v2.1 | 99.94% 🏆 |
| v2.2 | 99.87% 🏆 |

(A Top-1 confusion matrix figure is provided for each version.)

The confusion matrices referenced above show matches between gold and predicted categories 🪧 on their diagonals, while the off-diagonal elements show inter-class errors. From these graphs you can judge what types of mistakes to expect from the model.
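
For reference, the accuracy and confusion matrix can be reproduced from gold/predicted label pairs along the following lines; the column names assumed for model_EVAL.csv are hypothetical, so check the actual file for its schema.

```python
# Minimal sketch: compute Top-1 accuracy and a confusion matrix from an
# evaluation table. The "gold" and "pred" column names are assumptions.
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("model_EVAL.csv")
gold, pred = df["gold"], df["pred"]   # hypothetical column names

print(f"Top-1 accuracy: {accuracy_score(gold, pred):.2%}")
labels = sorted(gold.unique())
cm = confusion_matrix(gold, pred, labels=labels)  # rows = gold, columns = predicted
print(cm)
```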

Image preprocessing steps 👀
  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
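
The sketch below assembles these steps into a single torchvision pipeline; the ordering and the Compose wrapper itself are assumptions, since only the individual transforms are listed above.

```python
# Minimal sketch: the augmentation steps listed above combined with torchvision.
# The ordering and the Compose wrapper are assumptions.
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

train_augmentations = transforms.Compose([
    transforms.ColorJitter(brightness=0.5),
    transforms.ColorJitter(contrast=0.5),
    transforms.ColorJitter(saturation=0.5),
    transforms.ColorJitter(hue=0.5),
    transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5))),
    transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2)))),
])
# Usage: augmented = train_augmentations(pil_image)  # expects a PIL image
```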
Training hyperparameters 👀
  • eval_strategy "epoch"
  • save_strategy "epoch"
  • learning_rate 5e-5
  • per_device_train_batch_size 8
  • per_device_eval_batch_size 8
  • num_train_epochs 3
  • warmup_ratio 0.1
  • logging_steps 10
  • load_best_model_at_end True
  • metric_for_best_model "accuracy"
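
These settings map directly onto Hugging Face TrainingArguments; a minimal sketch follows, where output_dir is a hypothetical value and older transformers versions spell eval_strategy as evaluation_strategy.

```python
# Minimal sketch: the listed hyperparameters expressed as TrainingArguments.
# output_dir is hypothetical; transformers < 4.41 uses `evaluation_strategy`
# instead of `eval_strategy`.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="clip-historical-page-finetune",  # hypothetical output directory
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```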

Contacts 📧

For support, write to: [email protected], responsible for this GitHub repository ^8 🔗

Information about the authors of this project, including their names and ORCIDs, can be found in the CITATION.cff 📎 file.

Acknowledgements 🙏

  • Developed by UFAL ^7 👥
  • Funded by ATRIUM ^4 💰
  • Shared by ATRIUM ^4 & UFAL ^7 🔗
  • Model type: fine-tuned ViT with a 224x224 ^2 🔗 or 336x336 ^13 ^14 🔗 input resolution

©️ 2022 UFAL & ATRIUM

