Image classification using fine-tuned CLIP - for historical document sorting

Goal: solve a task of archive page images sorting (for their further content-based processing)

Scope: image processing, training and evaluation of the CLIP model, input file/directory handling, output of the top-N predicted class 🪧 (category) results, and summarizing of the predictions into a tabular format.

Versions 🏁

There are currently 4 versions of the model available for download; all of them share the same set of categories but are built on different base models and data annotations. The latest approved version, v1.1, is considered the default and can be found in the main branch of the HF 😊 hub ^1 🔗

| Version | Base | Pages | PDFs | Description |
|---------|------|-------|------|-------------|
| v1.1 | ViT-B/16 | 15855 | 5730 | smallest |
| v1.2 | ViT-B/32 | 15855 | 5730 | small with higher granularity |
| v2.1 | ViT-L/14 | 15855 | 5730 | large |
| v2.2 | ViT-L/14@336 | 15855 | 5730 | large with highest resolution |
| Base model | Disk space |
|------------|------------|
| openai/clip-vit-base-patch16 | 992 MB |
| openai/clip-vit-base-patch32 | 1008 MB |
| openai/clip-vit-large-patch14 | 1.5 GB |
| openai/clip-vit-large-patch14-336 | 1.5 GB |
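
Only the default version lives in the main branch; a specific version can be loaded by passing a revision to the loaders. Below is a minimal sketch assuming the non-default versions are stored in branches named after the version (e.g. v2.1) and that the checkpoint loads as a standard CLIPModel; check the hub repository for the actual branch names.

```python
# Minimal sketch: load a specific model version from the HF hub.
# Assumptions: non-default versions live in branches named after the version
# (e.g. "v2.1"), and the checkpoint loads as a plain CLIPModel.
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "ufal/clip-historical-page"
REVISION = "v2.1"  # assumed branch name -- verify on the hub

model = CLIPModel.from_pretrained(MODEL_ID, revision=REVISION)
processor = CLIPProcessor.from_pretrained(MODEL_ID, revision=REVISION)
```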

Model description 📇

(Figure: architecture diagram)

🔲 Fine-tuned model repository: UFAL's clip-historical-page ^1 🔗

🔳 Base model repository: OpenAI's clip-vit-base-patch16, clip-vit-base-patch32, clip-vit-large-patch14, clip-vit-large-patch14-336 ^2 ^13 ^14 ^15 🔗

The model was trained on a manually ✍️ annotated dataset of historical documents, specifically images of pages from archival documents whose paper sources were scanned into digital form.

The images contain various combinations of text 📄, tables 📏, drawings 📈, and photos 🌄; the categories 🪧 described below were formed on the basis of those archival documents. Page examples can be found in the category_samples 📁 directory.

The key use case of the provided model and data processing pipeline is to classify an input PNG image (a page scanned from a paper source into a PDF) into one of the categories, each of which is handled by its own content-specific data processing pipeline.

In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten ✏️, plain printed 📄 text, or text structured in a tabular 📏 format, as well as to mark the presence of printed 🌄 or drawn 📈 graphic materials yet to be extracted from the page images.
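
A minimal inference sketch is shown below. It assumes the fine-tuned checkpoint works with the transformers zero-shot image-classification pipeline and that the plain category names are adequate text prompts; the prompt templates actually used during fine-tuning may differ.

```python
# Minimal sketch: classify one page image into the categories described below.
# Assumptions: the checkpoint is usable via the zero-shot-image-classification
# pipeline, and the raw category names work as text prompts.
from transformers import pipeline

CATEGORIES = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
              "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]

classifier = pipeline("zero-shot-image-classification", model="ufal/clip-historical-page")
scores = classifier("page.png", candidate_labels=CATEGORIES)  # "page.png" is a hypothetical input
for s in scores[:3]:                                          # top-3 predictions
    print(f"{s['label']}: {s['score']:.3f}")
```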

Data 📜

The dataset is provided under a Public Domain license and consists of 15855 PNG images of pages from archival documents. The source image files and their annotations can be found in the LINDAT repository ^10 🔗.

Training 💪 set of the model: 14267 images

90% of all data; the per-category 🪧 proportions are tabulated below

Evaluation 🏆 set: 1583 images

10% of all data; the same per-category 🪧 proportions as below, demonstrated in model_EVAL.csv 📎

Manual ✍️ annotation was performed beforehand and took some time ⌛; the categories 🪧 were formed from different sources of archival documents originating from the 1920-2020 time span.

The disproportion of the categories 🪧 in both the training data and the provided evaluation category_samples 📁 is NOT intentional, but rather a result of the nature of the source data.

In total, several thousand separate PDF files were selected and split into PNG pages; ~4k of the scanned documents were one page long, covering around a third of all data, while ~2k were much longer (dozens to hundreds of pages), covering the rest (more than 60% of all annotated data).

The specific content and language of the source data are irrelevant given the model's vision resolution. However, all of the data samples come from archaeological reports, which may somewhat affect the drawing detection preferences, since the commonly depicted objects are ceramic pieces, arrowheads, and rocks, formerly drawn by hand and later illustrated with digital tools (examples can be found in category_samples/DRAW 📁).

Categories 🪧

| Label | Description |
|-------|-------------|
| DRAW 📈 | drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions |
| DRAW_L 📈📏 | drawings, etc., but presented within a table-like layout or including a legend formatted as a table |
| LINE_HW ✏️📏 | handwritten text organized in a tabular or form-like structure |
| LINE_P 📏 | printed text organized in a tabular or form-like structure |
| LINE_T 📏 | machine-typed text organized in a tabular or form-like structure |
| PHOTO 🌄 | photographs or photographic cutouts, potentially with text captions |
| PHOTO_L 🌄📏 | photos presented within a table-like layout or accompanied by tabular annotations |
| TEXT 📰 | mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements |
| TEXT_HW ✏️📄 | only handwritten text in paragraph or block form (non-tabular) |
| TEXT_P 📄 | only printed text in paragraph or block form (non-tabular) |
| TEXT_T 📄 | only machine-typed text in paragraph or block form (non-tabular) |

The categories were chosen to sort the pages by the following criteria:

  • presence of graphical elements (drawings 📈 OR photos 🌄)
  • type of text 📄 (handwritten ✏️ OR printed OR typed OR mixed 📰)
  • presence of tabular layout / forms 📏

The reason for this distinction is that different processing pipelines are applied to different types of pages after classification; a sketch of such routing follows.
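
The sketch below illustrates that routing by mapping a predicted label to a downstream handler; the handler functions are hypothetical placeholders, not part of this repository.

```python
# Minimal sketch: route a predicted category to a content-specific pipeline.
# All handler functions below are hypothetical placeholders.
def run_typed_ocr(path): ...               # OCR for machine-typed text
def run_handwriting_recognition(path): ...
def run_table_extraction(path): ...
def run_graphics_extraction(path): ...

DISPATCH = {
    "TEXT_T": run_typed_ocr,
    "TEXT_HW": run_handwriting_recognition,
    "LINE_T": run_table_extraction,
    "PHOTO": run_graphics_extraction,
    # ... the remaining categories map to their own pipelines
}

def process_page(path: str, predicted_label: str) -> None:
    handler = DISPATCH.get(predicted_label)
    if handler is not None:
        handler(path)
```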

Examples of pages sorted by category 🪧 can be found in the category_samples 📁 directory, which is also available as a testing subset of the training data.

(Figure: dataset timeline)

Results 📊

| Version | Evaluation set accuracy (Top-1) |
|---------|---------------------------------|
| v1.1 | 100.00% 🏆 |
| v1.2 | 100.00% 🏆 |
| v2.1 | 99.94% 🏆 |
| v2.2 | 99.87% 🏆 |

(A Top-1 confusion matrix figure is provided for each version.)

The confusion matrices referenced above show matches between gold and predicted categories 🪧 on their diagonals, while the off-diagonal elements show inter-class errors. From these graphs you can judge what types of mistakes to expect from the model.
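
For reference, the accuracy and confusion matrix can be reproduced from gold/predicted label pairs along the following lines; the column names assumed for model_EVAL.csv are hypothetical, so check the actual file for its schema.

```python
# Minimal sketch: compute Top-1 accuracy and a confusion matrix from an
# evaluation table. The "gold" and "pred" column names are assumptions.
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("model_EVAL.csv")
gold, pred = df["gold"], df["pred"]   # hypothetical column names

print(f"Top-1 accuracy: {accuracy_score(gold, pred):.2%}")
labels = sorted(gold.unique())
cm = confusion_matrix(gold, pred, labels=labels)  # rows = gold, columns = predicted
print(cm)
```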

Image preprocessing steps 👀
  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
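
The sketch below assembles these steps into a single torchvision pipeline; the ordering and the Compose wrapper itself are assumptions, since only the individual transforms are listed above.

```python
# Minimal sketch: the augmentation steps listed above combined with torchvision.
# The ordering and the Compose wrapper are assumptions.
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

train_augmentations = transforms.Compose([
    transforms.ColorJitter(brightness=0.5),
    transforms.ColorJitter(contrast=0.5),
    transforms.ColorJitter(saturation=0.5),
    transforms.ColorJitter(hue=0.5),
    transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5))),
    transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2)))),
])
# Usage: augmented = train_augmentations(pil_image)  # expects a PIL image
```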
Training hyperparameters 👀
  • eval_strategy "epoch"
  • save_strategy "epoch"
  • learning_rate 5e-5
  • per_device_train_batch_size 8
  • per_device_eval_batch_size 8
  • num_train_epochs 3
  • warmup_ratio 0.1
  • logging_steps 10
  • load_best_model_at_end True
  • metric_for_best_model "accuracy"
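
These settings map directly onto Hugging Face TrainingArguments; a minimal sketch follows, where output_dir is a hypothetical value and older transformers versions spell eval_strategy as evaluation_strategy.

```python
# Minimal sketch: the listed hyperparameters expressed as TrainingArguments.
# output_dir is hypothetical; transformers < 4.41 uses `evaluation_strategy`
# instead of `eval_strategy`.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="clip-historical-page-finetune",  # hypothetical output directory
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```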

Contacts 📧

For support, write to: [email protected], responsible for this GitHub repository ^8 🔗

Information about the authors of this project, including their names and ORCIDs, can be found in the CITATION.cff 📎 file.

Acknowledgements 🙏

  • Developed by UFAL ^7 👥
  • Funded by ATRIUM ^4 💰
  • Shared by ATRIUM ^4 & UFAL ^7 🔗
  • Model type: fine-tuned ViT with a 224x224 ^2 🔗 or 336x336 ^13 ^14 🔗 input resolution

©️ 2022 UFAL & ATRIUM

