Anime Style Classifier - EfficientNet-B0
A fine-tuned EfficientNet-B0 model for classifying anime/visual novel images into 6 distinct art styles.
Model Description
- Model Architecture: EfficientNet-B0 (~5.3M parameters)
- Base Model: ImageNet pretrained weights
- Task: Multi-class image classification (6 styles)
- Input Resolution: 224x224 RGB
- Framework: PyTorch
- License: MIT
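For reference, the architecture can be instantiated from torchvision as shown below. This is a minimal sketch; the exact classifier-head replacement used during training is an assumption based on the description above.

```python
import torch
from torch import nn
from torchvision import models

# ImageNet-pretrained EfficientNet-B0 backbone (~5.3M parameters).
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Swap the 1000-class ImageNet head for a 6-class style head (assumed head layout).
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 6)

# The model expects 224x224 RGB input.
logits = model(torch.randn(1, 3, 224, 224))  # shape: (1, 6)
```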
Performance
Test Set Results (Holdout)
- Accuracy: 100.0%
- Macro F1-Score: 1.000
- Validation Accuracy: 98.18%
Perfect classification across all 120 holdout images (20 per class). Note: with n=120 the 95% Wilson confidence interval for this result is approximately 96.90%–100.00%, so the perfect score should be interpreted cautiously alongside validation metrics. Taking both validation and holdout into account, a realistic estimate of the model's true accuracy is likely in the mid-to-high 90s (≈96–98%); still very strong and, for most applications, likely fit for purpose.
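For transparency, the Wilson intervals quoted in this card can be reproduced with a few lines of Python (a standalone sketch, not part of the evaluation code):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_interval(120, 120))  # holdout:    ~ (0.9690, 1.0000)
print(wilson_interval(162, 165))  # validation: ~ (0.9479, 0.9938)
```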
Per-Class Performance
| Style | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| dark | 1.000 | 1.000 | 1.000 | 20 |
| flat | 1.000 | 1.000 | 1.000 | 20 |
| modern | 1.000 | 1.000 | 1.000 | 20 |
| moe | 1.000 | 1.000 | 1.000 | 20 |
| painterly | 1.000 | 1.000 | 1.000 | 20 |
| retro | 1.000 | 1.000 | 1.000 | 20 |
Style Definitions
- dark: Low-key lighting, chiaroscuro, desaturated palette, high contrast shadows, moody atmosphere
- flat: Minimalist flat colors, vector illustration, solid color blocks, no gradients or shading
- modern: Clean digital rendering, smooth gradients, glossy finish, contemporary anime aesthetic
- moe: Soft pastel colors, rounded features, cute/adorable character focus, gentle shading
- painterly: Watercolor or gouache appearance, visible brush strokes, paper texture, artistic feel
- retro: 80s/90s anime aesthetic, vintage color palette, classic cel animation style
Training Details
Dataset
- Training Images: 933 (scene-level split)
- Validation Images: 165
- Holdout Images: 120
- Total Scenes: 203, each rendered in all 6 styles (perfectly balanced)
- Images per Style: 183 train/val + 20 holdout = 203 each
- Source Resolution: 1920x1088
- Training Resolution: 224x224
Data Split Strategy: Scene-level 90/10 split to prevent data leakage. All 6 style variants of each scene are kept together in either the training set or the holdout set.
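A scene-level split can be implemented by grouping on a scene identifier rather than sampling individual images. The sketch below illustrates the idea; the filename-based scene id is a hypothetical layout, not the actual dataset structure.

```python
import random

def scene_level_split(image_paths, scene_id_of, holdout_frac=0.10, seed=42):
    """Split images so that all style variants of a scene land in the same set."""
    scenes = sorted({scene_id_of(p) for p in image_paths})
    rng = random.Random(seed)
    rng.shuffle(scenes)
    n_holdout = round(len(scenes) * holdout_frac)
    holdout_scenes = set(scenes[:n_holdout])
    train = [p for p in image_paths if scene_id_of(p) not in holdout_scenes]
    holdout = [p for p in image_paths if scene_id_of(p) in holdout_scenes]
    return train, holdout

# Example (hypothetical layout): paths like "data/scene_017/moe.png" -> scene id "scene_017"
# train, holdout = scene_level_split(paths, lambda p: p.split("/")[-2])
```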
Data Generation: Synthetic images generated via ComfyUI with Flux diffusion model, validated by Gemma-12B vision-language model. Only scenes with 6/6 style agreement (all variants correctly classified) were included.
Training Regime
Architecture: EfficientNet-B0
Pretrained: ImageNet weights
Optimizer: AdamW
Learning Rate: 0.001
Weight Decay: 1e-05
Batch Size: 16
Epochs: 30 (early stopping at ~12-15 epochs typical)
Scheduler: CosineAnnealingLR
Loss: CrossEntropyLoss
Early Stopping: 10 epochs patience (val accuracy)
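The hyperparameters above correspond to a fairly standard fine-tuning loop. The following is a condensed sketch, not the actual training script; the model, data loaders, and `evaluate` helper are passed in as placeholders.

```python
import torch
from torch import nn, optim

def train(model, train_loader, val_loader, evaluate, epochs=30, patience=10):
    """Fine-tune with AdamW + cosine annealing and early stopping on val accuracy."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    best_acc, bad_epochs = 0.0, 0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:      # batch size 16 in this setup
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()

        val_acc = evaluate(model, val_loader)    # fraction of correct predictions
        if val_acc > best_acc:
            best_acc, bad_epochs = val_acc, 0
            torch.save(model.state_dict(), "best.pth")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:           # stop after 10 stagnant epochs
                break
    return best_acc
```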
Data Augmentation (Training Only)
- Resize to 256x256
- Random crop to 224x224
- Random horizontal flip (p=0.5)
- Color jitter (brightness=0.1, saturation=0.1, hue=0.05)
- ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
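These augmentations map directly onto standard torchvision transforms. The sketch below uses the parameters listed above; the evaluation-time pipeline shown alongside it (resize + center crop) is an assumption rather than a documented detail.

```python
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.1, saturation=0.1, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

# Assumed deterministic counterpart for validation/holdout evaluation.
eval_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
```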
Hardware
- GPU: NVIDIA GPU (CUDA)
- Training Time: ~15 minutes (with early stopping)
Usage
Installation
```bash
pip install torch torchvision pillow
```
Inference
This repository includes a small command-line inference script, inference.py, which auto-detects a .safetensors checkpoint (preferred) or a PyTorch .pth checkpoint. Because inference.py already contains the full, tested loading and preprocessing logic, this README keeps only the minimal usage notes below and a short programmatic example that delegates to the script's functions.
Install (optional: include safetensors for safer loading):
```bash
pip install torch torchvision pillow safetensors
```
CLI usage (example):
```bash
python inference.py --model model.safetensors --config config.json examples/retro_1.png
```
This prints a ranked list of predictions followed by a top-prediction summary.
Programmatic usage (calls the same functions used by the CLI):
```python
# Minimal programmatic example using functions from inference.py
from inference import load_model, classify_image

model, config = load_model('model.safetensors', 'config.json')
results = classify_image(model, config, 'examples/retro_1.png')
print(results[:3])  # top-3 predictions as (style, confidence)
```
For the full implementation and additional options (e.g., --top-k), see inference.py in the repository.
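If you prefer not to depend on inference.py, the pipeline can also be reproduced directly with torchvision and safetensors. The following is a sketch under assumptions: the classifier-head layout, state-dict key names, and the "class_names" key in config.json are guesses and should be checked against the shipped files.

```python
import json
import torch
from PIL import Image
from safetensors.torch import load_file
from torchvision import models, transforms

# Rebuild the EfficientNet-B0 architecture with a 6-class head, then load the weights.
model = models.efficientnet_b0()
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, 6)
model.load_state_dict(load_file("model.safetensors"))
model.eval()

# Class names assumed to live in config.json under "class_names" (hypothetical key);
# fall back to the label order used in this card.
with open("config.json") as f:
    config = json.load(f)
class_names = config.get("class_names", ["dark", "flat", "modern", "moe", "painterly", "retro"])

# 224x224 RGB input with ImageNet normalization, as described above.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("examples/retro_1.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = torch.softmax(model(image), dim=1)[0]

values, indices = probs.topk(3)
for p, idx in zip(values.tolist(), indices.tolist()):
    print(f"{class_names[idx]}: {p:.3f}")
```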
Limitations
- Input Resolution: Model processes images at 224x224, which may lose fine texture details from high-resolution sources (1920x1088+)
- Domain: Trained on synthetically generated anime/visual novel images. May not generalize perfectly to all anime art styles, manga, or hand-drawn artwork
- Style Ambiguity: Some real-world images may blend multiple styles (e.g., painterly with modern digital techniques)
- Validation Bias: Ground truth labels come from the Gemma-12B vision-language model, so the classifier may inherit some of its biases
- Small-sample caution: The internal validation set achieved 98.18% (162/165). The 95% Wilson confidence interval for this is approximately 94.79%–99.38%. Because the holdout set is relatively small (20 images per class, 120 total), perfect classification on that set is possible by chance and should be reported with its confidence interval (see above).
- Decision rule note: The model uses the standard softmax + argmax decision rule by default (choose the class with the highest predicted probability). No abstain threshold is applied in the shipped inference.py; if you later want an abstain/human-review mode, adding a --min-conf option is straightforward (see the sketch below).
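If you do add such a mode, a thin wrapper over the softmax output is enough. A minimal sketch (not part of the shipped script; the threshold value is arbitrary):

```python
import torch

def predict_with_abstain(model, image_tensor, class_names, min_conf=0.6):
    """Return (style, confidence), or (None, confidence) to route to human review."""
    with torch.no_grad():
        probs = torch.softmax(model(image_tensor.unsqueeze(0)), dim=1)[0]
    conf, idx = probs.max(dim=0)
    if conf.item() < min_conf:
        return None, conf.item()
    return class_names[idx.item()], conf.item()
```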
Model Selection
This model was selected from a hyperparameter sweep of 144+ configurations across 6 architectures:
- ResNet-18
- MobileNetV3-Large
- MobileNetV3-Small
- EfficientNet-B0 (winner)
- EfficientNetV2-S
- ViT-B/16
EfficientNet-B0 achieved perfect 100% holdout accuracy with:
- Excellent efficiency (~5.3M parameters)
- Fast inference
- Strong generalization (98.18% val → 100% holdout)
Citation
```bibtex
@software{anime_style_classifier_2025,
  author = {Your Name},
  title = {Anime Style Classifier},
  year = {2025},
  url = {https://huggingface.co/Mitchins/anime-style-classifier-efficientnet-b0}
}
```
Acknowledgments
- Base Model: EfficientNet-B0 from torchvision (ImageNet pretrained)
- Synthetic Data Generation: ComfyUI + Flux diffusion model
- Data Validation: Gemma-12B vision-language model
- Framework: PyTorch, torchvision
Contact
For questions or feedback, please open an issue on the GitHub repository.
Published validation preview
Below is a compact thumbnail-based preview of a subset of the published validation images. Thumbnails link to the full-size images under validation/images/.
Note about thumbnails and deployment
Thumbnails are provided only for the convenience of the model README/gallery and are optional. When you publish the model to Hugging Face you may omit validation/thumbs/ (it's listed in .gitignore); the important artifacts for deployment are model.safetensors and config.json (and inference.py if you want a runnable CLI). Typical deployments will only fetch the model weights and config; README/gallery images are not required for inference and can be excluded from the model archive to keep downloads small.