---
license: apache-2.0
tags:
- vision
- image-classification
- clip
- knowledge-distillation
- semi-supervised-learning
- imagenet
datasets:
- imagenet-1k
library_name: pytorch
pipeline_tag: image-classification
---

# DHO: Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization

[![arXiv](https://img.shields.io/badge/arXiv-2505.07675v1-b31b1b.svg)](https://arxiv.org/abs/2505.07675v1)

This repository contains pretrained checkpoints for **DHO (Dual-Head Optimization)**, a simple yet effective approach for semi-supervised knowledge distillation from Vision-Language Models.

## Model Description

DHO introduces a dual-head optimization strategy that enables efficient knowledge transfer from large Vision-Language Models (e.g., CLIP) to smaller student models using minimal labeled data. The method achieves state-of-the-art performance on ImageNet semi-supervised learning benchmarks with only 1% and 10% labeled data.

**Paper:** [Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization](https://arxiv.org/abs/2505.07675)

**Authors:** Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang

## Key Features

- ✨ **Dual-head optimization** strategy for semi-supervised distillation
- 🏆 **State-of-the-art** performance on ImageNet with 1% and 10% labeled data
- 🔄 Efficient transfer from VLMs (e.g., CLIP) to smaller student models
- 🧩 Simple, scalable, and easy to integrate into existing pipelines

## Available Checkpoints

| Checkpoint Name | Student Model | Teacher Model | Labeled Data | Top-1 Acc. | Parameters |
|:----------------|:--------------|:-----------------|:-------------|:-----------|:-----------|
| `vit_b_1.pt`    | ViT-B/16      | ViT-H/14 (DFN5B) | 1%           | 81.6%      | 86M        |
| `vit_b_10.pt`   | ViT-B/16      | ViT-H/14 (DFN5B) | 10%          | 82.8%      | 86M        |
| `vit_l_1.pt`    | ViT-L/14      | ViT-H/14 (DFN5B) | 1%           | 84.6%      | 304M       |
| `vit_l_10.pt`   | ViT-L/14      | ViT-H/14 (DFN5B) | 10%          | 85.9%      | 304M       |

## Usage

### Loading a Checkpoint

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip
from huggingface_hub import hf_hub_download

# Define the DHO StudentModel architecture with dual heads
class StudentModel(nn.Module):
    def __init__(self, num_classes=1000, model_name='ViT-B-16'):
        super().__init__()
        # Load CLIP backbone
        clip_model, _ = clip.load(model_name, device='cpu')
        self.backbone = clip_model.float().visual

        # Feature dimensions per architecture
        in_features = {
            'RN50': 1024,
            'ViT-B-16': 512,
            'ViT-L-14': 768,
            'ViT-L-14-336px': 768,
        }[model_name]

        # Dual-head architecture
        self.ce_head = nn.Linear(in_features, num_classes)  # CE branch
        self.kd_head = nn.Linear(in_features, num_classes)  # KD branch

    def forward(self, x):
        features = self.backbone(x)
        ce_out = self.ce_head(features)
        kd_out = self.kd_head(F.normalize(features, dim=1)) * 100
        return ce_out, kd_out

# Download and load checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint_path = hf_hub_download(repo_id="erjui/dho", filename="vit_b_10.pt")
checkpoint = torch.load(checkpoint_path, map_location=device)

# Initialize model
model = StudentModel(num_classes=1000, model_name='ViT-B-16').to(device)

# Handle DDP-wrapped state_dict
state_dict = checkpoint['model_state_dict']
state_dict = {k.replace('module.', ''): v for k, v in state_dict.items()}
model.load_state_dict(state_dict)

# Get optimal inference parameters
alpha = checkpoint['alpha']  # Weight for CE head
beta = checkpoint['beta']    # Temperature for KD head
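# At inference time the two heads are combined with these saved parameters as
#   p = alpha * softmax(ce_logits) + (1 - alpha) * softmax(kd_logits / beta)
# (see the inference example below)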
model.eval()

# Inference example
from PIL import Image
import torchvision.transforms as transforms

# CLIP preprocessing
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711))
])

image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    ce_logits, kd_logits = model(image)

    # Combine predictions using saved parameters
    probs_ce = F.softmax(ce_logits, dim=1)
    probs_kd = F.softmax(kd_logits / beta, dim=1)
    probs = alpha * probs_ce + (1 - alpha) * probs_kd
    predicted_class = probs.argmax(dim=1)

print(f"Predicted class: {predicted_class.item()}")
```

**Important Notes:**

- DHO checkpoints contain: `model_state_dict`, `epoch`, `acc`, `alpha`, `beta`
- The model has a **dual-head architecture** (CE head + KD head)
- Use the saved `alpha` and `beta` parameters for optimal inference
- For ViT-L checkpoints, change `model_name='ViT-L-14'` and use image size 224 (or 336 for ViT-L-14-336px)

### Training Your Own Model

To train your own DHO model, please visit the [official GitHub repository](https://github.com/erjui/DHO) for detailed instructions and training scripts.

**Example training command:**

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=29500 train_imgnet_semi.py \
    --teacher_model "apple/DFN5B-CLIP-ViT-H-14-378" \
    --student_model "ViT-B-16" \
    --lr 5e-5 \
    --train_epoch 32 \
    --batch_size 256 \
    --percent 10.0 \
    | tee ./logs/imagenet/imgnet_lowshot.log
```

## Model Architecture

The DHO student model consists of:

- **Backbone:** CLIP Vision Transformer (ViT-B/16 or ViT-L/14)
- **Two parallel heads:**
  - **CE Head:** Optimized with cross-entropy loss on labeled data
  - **KD Head:** Optimized with knowledge distillation loss from teacher predictions

During inference, predictions from both heads are combined using the learned weighting parameters (alpha, beta).

## Performance

### ImageNet Semi-supervised Learning

| Student  | Teacher  | Labeled Data | Top-1 Accuracy |
|:---------|:---------|:-------------|:---------------|
| ViT-B/16 | ViT-H/14 | 1%           | **81.6%**      |
| ViT-B/16 | ViT-H/14 | 10%          | **82.8%**      |
| ViT-L/14 | ViT-H/14 | 1%           | **84.6%**      |
| ViT-L/14 | ViT-H/14 | 10%          | **85.9%**      |

These results establish new state-of-the-art benchmarks for semi-supervised learning on ImageNet-1K.

## Citation

If you use these models in your research, please cite:

```bibtex
@article{kang2025simple,
  title={Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization},
  author={Kang, Seongjae and Lee, Dong Bok and Jang, Hyungjoon and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2505.07675},
  year={2025}
}
```

## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

## Acknowledgments

We appreciate the open-source implementations from:

- [Tip-Adapter](https://github.com/gaopengcuhk/Tip-Adapter)
- [CLIP](https://github.com/openai/CLIP)
- [OpenCLIP](https://github.com/mlfoundations/open_clip)

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/erjui/DHO) or contact the authors.