Anime Style Classifier - EfficientNet-B0
A fine-tuned EfficientNet-B0 model for classifying anime/visual novel images into 6 distinct art styles.
Model Description
- Model Architecture: EfficientNet-B0 (~5.3M parameters)
- Base Model: ImageNet pretrained weights
- Task: Multi-class image classification (6 styles)
- Input Resolution: 224x224 RGB
- Framework: PyTorch
- License: MIT
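For reference, the architecture can be instantiated from torchvision as shown below. This is a minimal sketch; the exact classifier-head replacement used during training is an assumption based on the description above.

```python
import torch
from torch import nn
from torchvision import models

# ImageNet-pretrained EfficientNet-B0 backbone (~5.3M parameters).
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Swap the 1000-class ImageNet head for a 6-class style head (assumed head layout).
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 6)

# The model expects 224x224 RGB input.
logits = model(torch.randn(1, 3, 224, 224))  # shape: (1, 6)
```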
Performance
Test Set Results (Holdout)
- Accuracy: 100.0%
- Macro F1-Score: 1.000
- Validation Accuracy: 98.18%
Perfect classification across all 120 holdout images (20 per class). Note: with n=120 the 95% Wilson confidence interval for this result is approximately 96.90%–100.00%, so the perfect score should be interpreted cautiously alongside validation metrics. Taking both validation and holdout into account, a realistic estimate of the model's true accuracy is likely in the mid-to-high 90s (≈96–98%); still very strong and, for most applications, likely fit for purpose.
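For transparency, the Wilson intervals quoted in this card can be reproduced with a few lines of Python (a standalone sketch, not part of the evaluation code):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_interval(120, 120))  # holdout:    ~ (0.9690, 1.0000)
print(wilson_interval(162, 165))  # validation: ~ (0.9479, 0.9938)
```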
Per-Class Performance
| Style | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| dark | 1.000 | 1.000 | 1.000 | 20 |
| flat | 1.000 | 1.000 | 1.000 | 20 |
| modern | 1.000 | 1.000 | 1.000 | 20 |
| moe | 1.000 | 1.000 | 1.000 | 20 |
| painterly | 1.000 | 1.000 | 1.000 | 20 |
| retro | 1.000 | 1.000 | 1.000 | 20 |
Style Definitions
- dark: Low-key lighting, chiaroscuro, desaturated palette, high contrast shadows, moody atmosphere
- flat: Minimalist flat colors, vector illustration, solid color blocks, no gradients or shading
- modern: Clean digital rendering, smooth gradients, glossy finish, contemporary anime aesthetic
- moe: Soft pastel colors, rounded features, cute/adorable character focus, gentle shading
- painterly: Watercolor or gouache appearance, visible brush strokes, paper texture, artistic feel
- retro: 80s/90s anime aesthetic, vintage color palette, classic cel animation style
Training Details
Dataset
- Training Images: 933 (scene-level split)
- Validation Images: 165
- Holdout Images: 120
- Total Scenes: 203, each rendered in all 6 styles (perfectly balanced)
- Images per Style: 183 train/val + 20 holdout = 203 each
- Source Resolution: 1920x1088
- Training Resolution: 224x224
Data Split Strategy: Scene-level 90/10 split to prevent data leakage. All 6 style variants of each scene are kept together in either the training set or the holdout set.
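A scene-level split can be implemented by grouping on a scene identifier rather than sampling individual images. The sketch below illustrates the idea; the filename-based scene id is a hypothetical layout, not the actual dataset structure.

```python
import random

def scene_level_split(image_paths, scene_id_of, holdout_frac=0.10, seed=42):
    """Split images so that all style variants of a scene land in the same set."""
    scenes = sorted({scene_id_of(p) for p in image_paths})
    rng = random.Random(seed)
    rng.shuffle(scenes)
    n_holdout = round(len(scenes) * holdout_frac)
    holdout_scenes = set(scenes[:n_holdout])
    train = [p for p in image_paths if scene_id_of(p) not in holdout_scenes]
    holdout = [p for p in image_paths if scene_id_of(p) in holdout_scenes]
    return train, holdout

# Example (hypothetical layout): paths like "data/scene_017/moe.png" -> scene id "scene_017"
# train, holdout = scene_level_split(paths, lambda p: p.split("/")[-2])
```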
Data Generation: Synthetic images generated via ComfyUI with Flux diffusion model, validated by Gemma-12B vision-language model. Only scenes with 6/6 style agreement (all variants correctly classified) were included.
Training Regime
Architecture: EfficientNet-B0
Pretrained: ImageNet weights
Optimizer: AdamW
Learning Rate: 0.001
Weight Decay: 1e-05
Batch Size: 16
Epochs: 30 (early stopping at ~12-15 epochs typical)
Scheduler: CosineAnnealingLR
Loss: CrossEntropyLoss
Early Stopping: 10 epochs patience (val accuracy)
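The hyperparameters above correspond to a fairly standard fine-tuning loop. The following is a condensed sketch, not the actual training script; the model, data loaders, and `evaluate` helper are passed in as placeholders.

```python
import torch
from torch import nn, optim

def train(model, train_loader, val_loader, evaluate, epochs=30, patience=10):
    """Fine-tune with AdamW + cosine annealing and early stopping on val accuracy."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    best_acc, bad_epochs = 0.0, 0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:      # batch size 16 in this setup
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()

        val_acc = evaluate(model, val_loader)    # fraction of correct predictions
        if val_acc > best_acc:
            best_acc, bad_epochs = val_acc, 0
            torch.save(model.state_dict(), "best.pth")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:           # stop after 10 stagnant epochs
                break
    return best_acc
```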
Data Augmentation (Training Only)
- Resize to 256x256
- Random crop to 224x224
- Random horizontal flip (p=0.5)
- Color jitter (brightness=0.1, saturation=0.1, hue=0.05)
- ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
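These augmentations map directly onto standard torchvision transforms. The sketch below uses the parameters listed above; the evaluation-time pipeline shown alongside it (resize + center crop) is an assumption rather than a documented detail.

```python
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.1, saturation=0.1, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

# Assumed deterministic counterpart for validation/holdout evaluation.
eval_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
```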
Hardware
- GPU: NVIDIA GPU (CUDA)
- Training Time: ~15 minutes (with early stopping)
Usage
Installation
```bash
pip install torch torchvision pillow
```
Inference
This repository includes a small command-line inference script, inference.py, which auto-detects a .safetensors checkpoint (preferred) or a PyTorch .pth checkpoint. Because inference.py already contains the full, tested loading and preprocessing logic, this README keeps only the minimal usage notes below and a short programmatic example that delegates to the script's functions.
Install (optional: include safetensors for safer loading):
```bash
pip install torch torchvision pillow safetensors
```
CLI usage (example):
```bash
python inference.py --model model.safetensors --config config.json examples/retro_1.png
```
This prints a ranked list of predictions followed by a top-prediction summary.
Programmatic usage (calls the same functions used by the CLI):
```python
# Minimal programmatic example using functions from inference.py
from inference import load_model, classify_image

model, config = load_model('model.safetensors', 'config.json')
results = classify_image(model, config, 'examples/retro_1.png')
print(results[:3])  # top-3 predictions as (style, confidence)
```
For the full implementation and additional options (e.g., --top-k), see inference.py in the repository.
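If you prefer not to depend on inference.py, the pipeline can also be reproduced directly with torchvision and safetensors. The following is a sketch under assumptions: the classifier-head layout, state-dict key names, and the "class_names" key in config.json are guesses and should be checked against the shipped files.

```python
import json
import torch
from PIL import Image
from safetensors.torch import load_file
from torchvision import models, transforms

# Rebuild the EfficientNet-B0 architecture with a 6-class head, then load the weights.
model = models.efficientnet_b0()
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, 6)
model.load_state_dict(load_file("model.safetensors"))
model.eval()

# Class names assumed to live in config.json under "class_names" (hypothetical key);
# fall back to the label order used in this card.
with open("config.json") as f:
    config = json.load(f)
class_names = config.get("class_names", ["dark", "flat", "modern", "moe", "painterly", "retro"])

# 224x224 RGB input with ImageNet normalization, as described above.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("examples/retro_1.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = torch.softmax(model(image), dim=1)[0]

values, indices = probs.topk(3)
for p, idx in zip(values.tolist(), indices.tolist()):
    print(f"{class_names[idx]}: {p:.3f}")
```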
Limitations
- Input Resolution: Model processes images at 224x224, which may lose fine texture details from high-resolution sources (1920x1088+)
- Domain: Trained on synthetically generated anime/visual novel images. May not generalize perfectly to all anime art styles, manga, or hand-drawn artwork
- Style Ambiguity: Some real-world images may blend multiple styles (e.g., painterly with modern digital techniques)
- Validation Bias: Ground truth labels come from the Gemma-12B vision-language model, so the classifier may inherit some of its biases
- Small-sample caution: The internal validation set achieved 98.18% (162/165). The 95% Wilson confidence interval for this is approximately 94.79%–99.38%. Because the holdout set is relatively small (20 images per class, 120 total), perfect classification on that set is possible by chance and should be reported with its confidence interval (see above).
- Decision rule note: The model uses the standard softmax + argmax decision rule by default (choose the class with the highest predicted probability). No abstain threshold is applied in the shipped inference.py; if you later want an abstain/human-review mode, adding a --min-conf option is straightforward (see the sketch below).
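If you do add such a mode, a thin wrapper over the softmax output is enough. A minimal sketch (not part of the shipped script; the threshold value is arbitrary):

```python
import torch

def predict_with_abstain(model, image_tensor, class_names, min_conf=0.6):
    """Return (style, confidence), or (None, confidence) to route to human review."""
    with torch.no_grad():
        probs = torch.softmax(model(image_tensor.unsqueeze(0)), dim=1)[0]
    conf, idx = probs.max(dim=0)
    if conf.item() < min_conf:
        return None, conf.item()
    return class_names[idx.item()], conf.item()
```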
Model Selection
This model was selected from a hyperparameter sweep of 144+ configurations across 6 architectures:
- ResNet-18
- MobileNetV3-Large
- MobileNetV3-Small
- EfficientNet-B0 (winner)
- EfficientNetV2-S
- ViT-B/16
EfficientNet-B0 achieved perfect 100% holdout accuracy with:
- Excellent efficiency (~5.3M parameters)
- Fast inference
- Strong generalization (98.18% val → 100% holdout)
Citation
```bibtex
@software{anime_style_classifier_2025,
  author = {Your Name},
  title = {Anime Style Classifier},
  year = {2025},
  url = {https://huggingface.co/Mitchins/anime-style-classifier-efficientnet-b0}
}
```
Acknowledgments
- Base Model: EfficientNet-B0 from torchvision (ImageNet pretrained)
- Synthetic Data Generation: ComfyUI + Flux diffusion model
- Data Validation: Gemma-12B vision-language model
- Framework: PyTorch, torchvision
Contact
For questions or feedback, please open an issue on the GitHub repository.
Published validation preview
Below is a compact thumbnail-based preview of a subset of the published validation images. Thumbnails link to the full-size images under validation/images/.
Note about thumbnails and deployment
Thumbnails are provided only for the convenience of the model README/gallery and are optional. When you publish the model to Hugging Face you may omit validation/thumbs/ (it's listed in .gitignore); the important artifacts for deployment are model.safetensors and config.json (and inference.py if you want a runnable CLI). Typical deployments will only fetch the model weights and config; README/gallery images are not required for inference and can be excluded from the model archive to keep downloads small.