HAMMALE committed on
Commit
a6cfc87
·
verified ·
1 Parent(s): 51ad790

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +136 -0
  2. config.json +61 -0
  3. model.safetensors +3 -0
  4. preprocessor_config.json +23 -0
README.md ADDED
@@ -0,0 +1,136 @@
+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - vision
+ - image-classification
+ - document-classification
+ - knowledge-distillation
+ - vit
+ - rvl-cdip
+ - tiny-model
+ - distilled-model
+ datasets:
+ - rvl_cdip
+ metrics:
+ - accuracy
+ pipeline_tag: image-classification
+ widget:
+ - src: https://huggingface.co/datasets/rvl_cdip/resolve/main/sample_images/letter_0.jpg
+ example_title: Letter
+ - src: https://huggingface.co/datasets/rvl_cdip/resolve/main/sample_images/form_0.jpg
+ example_title: Form
+ ---
+
+ # ViT-Tiny Classifier for RVL-CDIP Document Classification (Distilled)
+
+ This model is a compressed Vision Transformer (ViT-Tiny) trained using knowledge distillation from DiT-Large on the RVL-CDIP dataset for document image classification.
+
+ ## Model Details
+
+ - **Student Model**: ViT-Tiny (Vision Transformer)
+ - **Teacher Model**: microsoft/dit-large-finetuned-rvlcdip
+ - **Training Method**: Knowledge Distillation
+ - **Parameters**: ~5.5M (~55x smaller than the teacher; a quick check follows this list)
+ - **Dataset**: RVL-CDIP (400k document images in 16 classes; 320k used for training)
+ - **Task**: Document Image Classification
+ - **Accuracy**: To be evaluated
+ - **Compression Ratio**: ~55x parameter reduction from the teacher model
+
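+ As a sanity check on the reported size, you can count the parameters directly once the model is downloaded (an illustrative check, not part of the original training workflow):
+
+ ```python
+ from transformers import AutoModelForImageClassification
+
+ model = AutoModelForImageClassification.from_pretrained("HAMMALE/vit-tiny-classifier-rvlcdip")
+ n_params = sum(p.numel() for p in model.parameters())
+ print(f"{n_params / 1e6:.2f}M parameters")  # expected: roughly 5.5M
+ ```
+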
+ ## Document Classes
+
+ The model classifies documents into 16 categories:
+
+ 1. **letter** - Personal or business correspondence
+ 2. **form** - Structured forms and applications
+ 3. **email** - Email communications
+ 4. **handwritten** - Handwritten documents
+ 5. **advertisement** - Marketing materials and ads
+ 6. **scientific_report** - Research reports and studies
+ 7. **scientific_publication** - Academic papers and journals
+ 8. **specification** - Technical specifications
+ 9. **file_folder** - File folders and organizational documents
+ 10. **news_article** - News articles and press releases
+ 11. **budget** - Financial budgets and planning documents
+ 12. **invoice** - Bills and invoices
+ 13. **presentation** - Presentation slides
+ 14. **questionnaire** - Surveys and questionnaires
+ 15. **resume** - CVs and resumes
+ 16. **memo** - Internal memos and notices
+
+ ## Usage
+
+ ```python
+ from transformers import AutoImageProcessor, AutoModelForImageClassification
+ from PIL import Image
+
+ # Load the processor and model
+ processor = AutoImageProcessor.from_pretrained("HAMMALE/vit-tiny-classifier-rvlcdip")
+ model = AutoModelForImageClassification.from_pretrained("HAMMALE/vit-tiny-classifier-rvlcdip")
+
+ # Load an image; RVL-CDIP scans are grayscale, so convert to the
+ # 3-channel RGB input the model expects
+ image = Image.open("path_to_your_document_image.jpg").convert("RGB")
+ inputs = processor(image, return_tensors="pt")
+
+ # Get predictions
+ outputs = model(**inputs)
+ predicted_class_id = outputs.logits.argmax(-1).item()
+
+ # Map the class id to a label (the id2label mapping from config.json)
+ predicted_class = model.config.id2label[predicted_class_id]
+ print("Predicted class:", predicted_class)
+ ```
+
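+ Equivalently, the high-level `pipeline` API wraps the preprocessing and label mapping in a single call (a convenience sketch; the output is the standard list of label/score dicts):
+
+ ```python
+ from transformers import pipeline
+
+ classifier = pipeline("image-classification", model="HAMMALE/vit-tiny-classifier-rvlcdip")
+ results = classifier("path_to_your_document_image.jpg")
+ print(results)  # e.g. [{'label': 'invoice', 'score': 0.93}, ...] -- top classes with scores
+ ```
+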
+ ## Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | Accuracy | To be evaluated |
+ | Parameters | ~5.5M |
+ | Model Size | ~22 MB |
+ | Input Size | 224x224 pixels |
+
+ ## Training Details
+
+ - **Student Architecture**: Vision Transformer (ViT-Tiny)
+ - **Teacher Model**: microsoft/dit-large-finetuned-rvlcdip
+ - **Distillation Method**: Knowledge Distillation (a generic loss sketch follows this list)
+ - **Input Resolution**: 224x224
+ - **Preprocessing**: Resize to 224x224, rescale to [0, 1], then normalize with mean 0.5 and std 0.5 per channel (see preprocessor_config.json)
+ - **Framework**: Transformers/PyTorch
+ - **Distillation Benefits**: ~55x fewer parameters than the teacher, with the goal of retaining its accuracy (evaluation pending)
+
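+ The repository does not ship the training script, so as illustration only, here is a minimal sketch of a standard soft-target distillation loss of the kind named above; the temperature `T` and mixing weight `alpha` are placeholders, not the values used to train this model:
+
+ ```python
+ import torch.nn.functional as F
+
+ def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
+     """Blend soft-target KL (teacher -> student) with hard-label cross-entropy."""
+     soft = F.kl_div(
+         F.log_softmax(student_logits / T, dim=-1),
+         F.softmax(teacher_logits / T, dim=-1),
+         reduction="batchmean",
+     ) * (T * T)  # T^2 keeps the soft-loss gradients on the same scale as CE
+     hard = F.cross_entropy(student_logits, labels)
+     return alpha * soft + (1.0 - alpha) * hard
+ ```
+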
+ ## Dataset
+
+ The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset contains:
+ - 400,000 grayscale document images
+ - 16 document categories (25,000 images per class)
+ - Images collected from the Truth Tobacco Industry Documents archive
+ - Standard splits: 320,000 train / 40,000 validation / 40,000 test
+
+ ## Citation
+
+ ```bibtex
+ @misc{hammale2025vit_tiny_rvlcdip_distilled,
+   title={ViT-Tiny Classifier for RVL-CDIP Document Classification (Distilled)},
+   author={Hammale, Mourad},
+   year={2025},
+   howpublished={\url{https://huggingface.co/HAMMALE/vit-tiny-classifier-rvlcdip}},
+   note={Knowledge distilled from microsoft/dit-large-finetuned-rvlcdip}
+ }
+ ```
+
+ ## Acknowledgments
+
+ This model was created by HAMMALE (Mourad) through knowledge distillation from the larger DiT-Large model (microsoft/dit-large-finetuned-rvlcdip), achieving roughly 55x compression while targeting competitive performance on document classification tasks.
+
+ ## License
+
+ This model is released under the Apache 2.0 license.
config.json ADDED
@@ -0,0 +1,61 @@
+ {
+   "architectures": [
+     "ViTForImageClassification"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "encoder_stride": 16,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.0,
+   "hidden_size": 192,
+   "id2label": {
+     "0": "letter",
+     "1": "form",
+     "2": "email",
+     "3": "handwritten",
+     "4": "advertisement",
+     "5": "scientific_report",
+     "6": "scientific_publication",
+     "7": "specification",
+     "8": "file_folder",
+     "9": "news_article",
+     "10": "budget",
+     "11": "invoice",
+     "12": "presentation",
+     "13": "questionnaire",
+     "14": "resume",
+     "15": "memo"
+   },
+   "image_size": 224,
+   "initializer_range": 0.02,
+   "intermediate_size": 768,
+   "label2id": {
+     "advertisement": 4,
+     "budget": 10,
+     "email": 2,
+     "file_folder": 8,
+     "form": 1,
+     "handwritten": 3,
+     "invoice": 11,
+     "letter": 0,
+     "memo": 15,
+     "news_article": 9,
+     "presentation": 12,
+     "questionnaire": 13,
+     "resume": 14,
+     "scientific_publication": 6,
+     "scientific_report": 5,
+     "specification": 7
+   },
+   "layer_norm_eps": 1e-12,
+   "model_type": "vit",
+   "num_attention_heads": 3,
+   "num_channels": 3,
+   "num_hidden_layers": 12,
+   "patch_size": 16,
+   "pooler_act": "tanh",
+   "pooler_output_size": 192,
+   "problem_type": "single_label_classification",
+   "qkv_bias": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.52.4"
+ }
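If you only need the label mapping, it can be read from this config without downloading the weights (a small convenience using the standard `AutoConfig` API):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("HAMMALE/vit-tiny-classifier-rvlcdip")
print(config.id2label[11])  # 'invoice'
```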
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:715e45c6eac8d55c30fa550cc387e5c8508e2beda741c90cf59371d2579a55b5
+ size 22132736
preprocessor_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "do_convert_rgb": null,
+   "do_normalize": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "image_mean": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "image_processor_type": "ViTImageProcessor",
+   "image_std": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "resample": 2,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "height": 224,
+     "width": 224
+   }
+ }
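For reference, these settings are equivalent to the following manual preprocessing (an illustrative re-implementation of what `ViTImageProcessor` does with this config; `resample: 2` is PIL's bilinear filter):

```python
import numpy as np
from PIL import Image

def preprocess(path):
    # Mirrors preprocessor_config.json: resize to 224x224 with bilinear
    # resampling, rescale by 1/255, normalize with mean=0.5, std=0.5.
    img = Image.open(path).convert("RGB").resize((224, 224), Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - 0.5) / 0.5
    return x.transpose(2, 0, 1)[None]  # shape (1, 3, 224, 224), channels-first
```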