---
library_name: transformers
license: apache-2.0
base_model: google/vit-base-patch16-224-in21k
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: envisage
  results: []
---

# Model Card for envisage

This is the official model card for `envisage`, a Vision Transformer (ViT) model fine-tuned for image classification.

This model was fine-tuned from the `google/vit-base-patch16-224-in21k` base model on the `cifar10` dataset, which consists of 60,000 32x32 color images evenly split across 10 classes (50,000 training and 10,000 test images).

## Model Description

- **Base Model:** [`google/vit-base-patch16-224-in21k`](https://huggingface.co/google/vit-base-patch16-224-in21k)
- **Dataset:** [`cifar10`](https://huggingface.co/datasets/cifar10)
- **Task:** Image Classification
- **Framework:** PyTorch, Transformers
- **Classes (10):** `airplane`, `automobile`, `bird`, `cat`, `deer`, `dog`, `frog`, `horse`, `ship`, `truck` (see the label-mapping check below)

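If the fine-tuning script wrote these class names into the model config (as the standard `Trainer` image-classification examples do), they are exposed through `id2label` and can be checked without downloading the model weights. A minimal sketch, assuming the `louijiec/envisage` repository id used in the usage example below:

```python
from transformers import AutoConfig

# Fetch only the configuration to inspect the label mapping of the fine-tuned checkpoint.
config = AutoConfig.from_pretrained("louijiec/envisage")
print(config.id2label)
# Expected: {0: 'airplane', 1: 'automobile', ..., 9: 'truck'}
```
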
## How to Use

The easiest way to use this model for inference is with the `pipeline` API from the `transformers` library.

First, ensure you have the necessary libraries installed:

```bash
pip install transformers torch pillow
```

Then, you can use the following Python snippet to classify an image:

```python
from transformers import pipeline
from PIL import Image
import requests

# Load the classification pipeline with the fine-tuned model
pipe = pipeline("image-classification", model="louijiec/envisage")

# Load an image from a URL (e.g., a cat)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cat-tree.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# Get the predictions (the pipeline returns the top classes with their scores)
predictions = pipe(image)

print("Predictions:")
for p in predictions:
    print(f"- {p['label']}: {p['score']:.4f}")

# Expected output lists the top predicted classes and their confidence scores,
# with 'cat' likely having the highest score.
```

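If you prefer to call the model directly rather than going through `pipeline`, the sketch below uses the generic `AutoImageProcessor` and `AutoModelForImageClassification` classes. It assumes the checkpoint ships the usual ViT preprocessing config, which resizes inputs to the 224x224 resolution the base model expects (CIFAR-10 images are only 32x32, so the same upscaling happens at inference time):

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "louijiec/envisage"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)
model.eval()

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cat-tree.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# Resize, normalize, and batch the image exactly as the processor config specifies.
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = logits.argmax(-1).item()
print(model.config.id2label[predicted_id])
```
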

## Training Procedure

The model was trained in a Google Colab environment using the `transformers` `Trainer` API; the hyperparameters below are also shown as a `TrainingArguments` sketch after the list.

### Hyperparameters

- **Learning Rate:** 5e-5
- **Training Epochs:** 3
- **Batch Size:** 16 per device
- **Gradient Accumulation Steps:** 4 (effective batch size of 64)
- **Optimizer:** AdamW with a linear learning rate schedule
- **Warmup Ratio:** 0.1

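These settings map directly onto `TrainingArguments`. A minimal sketch assuming single-GPU training; `output_dir` and anything not listed above are illustrative placeholders rather than values from the actual run:

```python
from transformers import TrainingArguments

# Sketch of the run configuration described above. AdamW is the Trainer's default
# optimizer; lr_scheduler_type="linear" plus warmup_ratio gives the linear
# schedule with warmup mentioned in the hyperparameter list.
training_args = TrainingArguments(
    output_dir="envisage",            # placeholder
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,    # 16 x 4 = effective batch size of 64 on one device
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
)
```
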

### Evaluation

The model was evaluated on the `cifar10` test split, which contains 10,000 images; a typical way of wiring accuracy into the `Trainer` is sketched below.

- **Final Accuracy on Test Set:** [TODO: Add final accuracy from the `trainer.evaluate()` step here. For example: 0.965]

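The training notebook itself is not part of this card, so the following `compute_metrics` callback is an assumed but standard setup using the `evaluate` library, not a copy of the actual script:

```python
import numpy as np
import evaluate

# Standard accuracy callback for Trainer-based image classification.
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
```
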

## Intended Use & Limitations

This model is intended for educational purposes and as a demonstration of fine-tuning a Vision Transformer on a common benchmark dataset. It performs well on images similar to those in the `cifar10` dataset (small, low-resolution images of the 10 specified classes).

**Limitations:**

- The model will likely perform poorly on images that are significantly different from the `cifar10` data (e.g., high-resolution photos, medical images, or classes not seen during training).
- The training data may reflect biases present in the original `cifar10` dataset.