---
language: en
license: apache-2.0
tags:
- vision
- image-classification
- document-classification
- knowledge-distillation
- vit
- rvl-cdip
- tiny-model
- distilled-model
datasets:
- rvl_cdip
metrics:
- accuracy
pipeline_tag: image-classification

---

# ViT-Tiny Classifier for RVL-CDIP Document Classification (Distilled)

This model is a compressed Vision Transformer (ViT-Tiny) trained via knowledge distillation from DiT-Large on the RVL-CDIP dataset for document image classification.
It was developed as part of a **research internship at the Laboratory of Complex Systems, Ecole Centrale Casablanca**.

## Model Details

- **Student Model**: ViT-Tiny (Vision Transformer)
- **Teacher Model**: microsoft/dit-large-finetuned-rvlcdip
- **Training Method**: Knowledge Distillation
- **Parameters**: ~5.5M (55x smaller than teacher)
- **Dataset**: RVL-CDIP (400k document images, 16 classes; 320k training split)
- **Task**: Document Image Classification
- **Accuracy**: 0.9210
- **Compression Ratio**: ~55x parameter reduction from teacher model
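
The quoted ~55x compression is consistent with the teacher's size, assuming the usual DiT-Large parameter count of roughly 304M (this figure comes from the DiT-Large architecture, not from this card):

$$\frac{304\,\text{M (teacher)}}{5.5\,\text{M (student)}} \approx 55\times$$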

## Document Classes

The model classifies documents into 16 categories:

1. **letter** - Personal or business correspondence
2. **form** - Structured forms and applications
3. **email** - Email communications
4. **handwritten** - Handwritten documents
5. **advertisement** - Marketing materials and ads
6. **scientific_report** - Research reports and studies
7. **scientific_publication** - Academic papers and journals
8. **specification** - Technical specifications
9. **file_folder** - File folders and organizational documents
10. **news_article** - News articles and press releases
11. **budget** - Financial budgets and planning documents
12. **invoice** - Bills and invoices
13. **presentation** - Presentation slides
14. **questionnaire** - Surveys and questionnaires
15. **resume** - CVs and resumes
16. **memo** - Internal memos and notices

## Usage

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

# Load model and processor
processor = AutoImageProcessor.from_pretrained("HAMMALE/vit-tiny-classifier-rvlcdip")
model = AutoModelForImageClassification.from_pretrained("HAMMALE/vit-tiny-classifier-rvlcdip")
model.eval()

# Load an image (RVL-CDIP scans are grayscale, so convert to RGB for the ViT processor)
image = Image.open("path_to_your_document_image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Get predictions without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predicted_class_id = outputs.logits.argmax(-1).item()

# Map the class id to its label using the model config
# (same 16 classes as listed above, in index order)
predicted_class = model.config.id2label[predicted_class_id]
print("Predicted class:", predicted_class)
```
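
To report a confidence score alongside the label, the logits can be passed through a softmax. The snippet below is a minimal stand-alone sketch in plain Python; the logit values are made up for illustration and are not outputs of this model:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the 16 RVL-CDIP classes (illustrative values only)
logits = [0.1, 0.3, 5.2, 0.0, -1.0, 0.2, 0.4, 0.1,
          -0.5, 0.6, 0.2, 2.1, 0.3, 0.1, 0.0, 0.7]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"class id {best} with confidence {probs[best]:.2%}")
```

With real model outputs the same transform is simply `outputs.logits.softmax(-1)`.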

## Performance

| Metric | Value |
|--------|-------|
| Accuracy | 0.9210 |
| Parameters | ~5.5M |
| Model Size | ~22 MB |
| Input Size | 224x224 pixels |

## Training Details

- **Student Architecture**: Vision Transformer (ViT-Tiny) 
- **Teacher Model**: microsoft/dit-large-finetuned-rvlcdip
- **Distillation Method**: Knowledge Distillation
- **Input Resolution**: 224x224
- **Preprocessing**: Standard ImageNet normalization
- **Framework**: Transformers/PyTorch
- **Distillation Benefits**: Maintains high accuracy with 55x fewer parameters
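
The card does not document the exact loss weighting, but logit-based knowledge distillation typically optimizes an objective of this form, where the temperature $T$ and mixing weight $\alpha$ are hyperparameters (not values taken from this card):

$$\mathcal{L} = \alpha \,\mathrm{CE}\big(y, \sigma(z_s)\big) + (1-\alpha)\, T^2 \,\mathrm{KL}\big(\sigma(z_t/T)\,\big\|\,\sigma(z_s/T)\big)$$

Here $z_s$ and $z_t$ are the student and teacher logits, $\sigma$ is the softmax, and the $T^2$ factor keeps gradient magnitudes comparable across temperatures.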

## Dataset

The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset contains:
- 400,000 grayscale document images
- 16 document categories
- Images collected from the Truth Tobacco Industry Documents archive
- Standard train/validation/test splits

## Citation

```bibtex
@misc{hammale2025vit_tiny_rvlcdip_distilled,
  title={ViT-Tiny Classifier for RVL-CDIP Document Classification (Distilled)},
  author={Hammale, Mourad},
  year={2025},
  howpublished={\url{https://huggingface.co/HAMMALE/vit-tiny-classifier-rvlcdip}},
  note={Knowledge distilled from microsoft/dit-large-finetuned-rvlcdip}
}
```

## Acknowledgments

This model was created by HAMMALE (Mourad) through knowledge distillation from the larger DiT-Large model (microsoft/dit-large-finetuned-rvlcdip), achieving significant compression while maintaining competitive performance for document classification tasks.

## License

This model is released under the Apache 2.0 license.