---
license: apache-2.0
tags:
- vision
- image-classification
- clip
- knowledge-distillation
- semi-supervised-learning
- imagenet
datasets:
- imagenet-1k
library_name: pytorch
pipeline_tag: image-classification
---
# DHO: Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization
This repository contains pretrained checkpoints for **DHO (Dual-Head Optimization)**, a simple yet effective approach for semi-supervised knowledge distillation from Vision-Language Models.
## Model Description
DHO introduces a dual-head optimization strategy that enables efficient knowledge transfer from large Vision-Language Models (e.g., CLIP) to smaller student models using minimal labeled data.
The method achieves state-of-the-art performance on ImageNet semi-supervised learning benchmarks with only 1% and 10% labeled data.
**Paper:** [Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization](https://arxiv.org/abs/2505.07675)
**Authors:** Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang
## Key Features
- ✨ **Dual-head optimization** strategy for semi-supervised distillation
- 🏆 **State-of-the-art** performance on ImageNet with 1% and 10% labeled data
- 🔄 Efficient transfer from VLMs (e.g., CLIP) to smaller student models
- 🧩 Simple, scalable, and easy to integrate into existing pipelines
## Available Checkpoints
| Checkpoint Name | Student Model | Teacher Model | Labeled Data | Top-1 Acc. | Parameters |
|:----------------|:--------------|:--------------|:-------------|:-----------|:-----------|
| `vit_b_1.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 1% | 81.6% | 86M |
| `vit_b_10.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 10% | 82.8% | 86M |
| `vit_l_1.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 1% | 84.6% | 304M |
| `vit_l_10.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 10% | 85.9% | 304M |
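
To pick a checkpoint programmatically, the table above can be mirrored in a small lookup. This is only an illustrative sketch: `CheckpointInfo`, `CHECKPOINTS`, and `select_checkpoint` are hypothetical names that are not part of the DHO repository, and the `model_name` strings are assumed to match the ones used in the Usage section.

```python
from typing import NamedTuple


class CheckpointInfo(NamedTuple):
    filename: str
    model_name: str       # value passed to StudentModel(model_name=...)
    labeled_percent: int  # share of ImageNet labels used for training
    top1_acc: float       # reported ImageNet top-1 accuracy (%)


# Values copied from the checkpoint table in this card.
CHECKPOINTS = [
    CheckpointInfo("vit_b_1.pt", "ViT-B-16", 1, 81.6),
    CheckpointInfo("vit_b_10.pt", "ViT-B-16", 10, 82.8),
    CheckpointInfo("vit_l_1.pt", "ViT-L-14", 1, 84.6),
    CheckpointInfo("vit_l_10.pt", "ViT-L-14", 10, 85.9),
]


def select_checkpoint(model_name: str, labeled_percent: int) -> CheckpointInfo:
    """Return the table row matching a student model and label budget."""
    for info in CHECKPOINTS:
        if info.model_name == model_name and info.labeled_percent == labeled_percent:
            return info
    raise KeyError(f"no checkpoint for {model_name!r} at {labeled_percent}% labels")


info = select_checkpoint("ViT-L-14", 10)
print(info.filename)  # vit_l_10.pt
```

The returned `filename` can then be passed to `hf_hub_download` and `model_name` to the student model constructor.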
## Usage
### Loading a Checkpoint
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip
import torchvision.transforms as transforms
from huggingface_hub import hf_hub_download
from PIL import Image


# Define the DHO StudentModel architecture with dual heads
class StudentModel(nn.Module):
    def __init__(self, num_classes=1000, model_name='ViT-B-16'):
        super().__init__()
        # Load CLIP backbone.
        # NOTE: OpenAI's `clip` package lists these models with slashes
        # (e.g. 'ViT-B/16'); adapt `model_name` if `clip.load` rejects it.
        clip_model, _ = clip.load(model_name, device='cpu')
        self.backbone = clip_model.float().visual

        # Feature dimensions per architecture
        in_features = {
            'RN50': 1024,
            'ViT-B-16': 512,
            'ViT-L-14': 768,
            'ViT-L-14-336px': 768
        }[model_name]

        # Dual-head architecture
        self.ce_head = nn.Linear(in_features, num_classes)  # CE branch
        self.kd_head = nn.Linear(in_features, num_classes)  # KD branch

    def forward(self, x):
        features = self.backbone(x)
        ce_out = self.ce_head(features)
        # KD head operates on L2-normalized features, scaled by 100
        kd_out = self.kd_head(F.normalize(features, dim=1)) * 100
        return ce_out, kd_out


# Download and load checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint_path = hf_hub_download(repo_id="erjui/dho", filename="vit_b_10.pt")
checkpoint = torch.load(checkpoint_path, map_location=device)

# Initialize model
model = StudentModel(num_classes=1000, model_name='ViT-B-16').to(device)

# Handle DDP wrapped state_dict
state_dict = checkpoint['model_state_dict']
state_dict = {k.replace('module.', ''): v for k, v in state_dict.items()}
model.load_state_dict(state_dict)

# Get optimal inference parameters
alpha = checkpoint['alpha']  # Weight for CE head
beta = checkpoint['beta']    # Temperature for KD head
model.eval()

# Inference example
# CLIP preprocessing
preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711))
])
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    ce_logits, kd_logits = model(image)
    # Combine predictions using saved parameters
    probs_ce = F.softmax(ce_logits, dim=1)
    probs_kd = F.softmax(kd_logits / beta, dim=1)
    probs = alpha * probs_ce + (1 - alpha) * probs_kd

predicted_class = probs.argmax(dim=1)
print(f"Predicted class: {predicted_class.item()}")
```
**Important Notes:**
- DHO checkpoints contain: `model_state_dict`, `epoch`, `acc`, `alpha`, `beta`
- The model has a **dual-head architecture** (CE head + KD head)
- Use the saved `alpha` and `beta` parameters for optimal inference
- For ViT-L checkpoints, download `vit_l_1.pt` or `vit_l_10.pt`, set `model_name='ViT-L-14'`, and use image size 224 (or 336 for ViT-L-14-336px)
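
The state-dict handling in the example removes every occurrence of `module.` from each key. A slightly stricter variant, sketched below, strips only a leading `module.` prefix (the one `DistributedDataParallel` adds) and checks that the checkpoint fields listed above are present. The helper names `strip_ddp_prefix` and `check_checkpoint` are illustrative and not part of the DHO code; the toy dictionary stands in for a real checkpoint.

```python
EXPECTED_KEYS = ("model_state_dict", "epoch", "acc", "alpha", "beta")


def strip_ddp_prefix(state_dict, prefix="module."):
    """Drop a leading DDP prefix from each key, leaving other keys untouched."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def check_checkpoint(checkpoint):
    """Raise KeyError if any field listed in this card is missing."""
    missing = [key for key in EXPECTED_KEYS if key not in checkpoint]
    if missing:
        raise KeyError(f"checkpoint is missing fields: {missing}")
    return checkpoint


# Toy stand-in for a loaded checkpoint (real values are tensors and floats).
toy = {
    "model_state_dict": {"module.ce_head.weight": 1, "kd_head.bias": 2},
    "epoch": 0,
    "acc": 0.0,
    "alpha": 0.5,
    "beta": 1.0,
}
clean = strip_ddp_prefix(check_checkpoint(toy)["model_state_dict"])
print(sorted(clean))  # ['ce_head.weight', 'kd_head.bias']
```

With a real checkpoint, `strip_ddp_prefix(checkpoint['model_state_dict'])` would replace the dictionary comprehension before `model.load_state_dict(...)`.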
### Training Your Own Model
To train your own DHO model, please visit the [official GitHub repository](https://github.com/yourusername/DHO) for detailed instructions and training scripts.
**Example training command:**
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=29500 train_imgnet_semi.py \
--teacher_model "apple/DFN5B-CLIP-ViT-H-14-378" \
--student_model "ViT-B-16" \
--lr 5e-5 \
--train_epoch 32 \
--batch_size 256 \
--percent 10.0 \
| tee ./logs/imagenet/imgnet_lowshot.log
```
## Model Architecture
The DHO student model consists of:
- **Backbone:** CLIP Vision Transformer (ViT-B/16 or ViT-L/14)
- **Two parallel heads:**
  - **CE Head:** Optimized with cross-entropy loss on labeled data
  - **KD Head:** Optimized with knowledge distillation loss from teacher predictions
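
The split between the two heads can be sketched as a per-sample loss. This is not the authors' training code: it assumes an unweighted sum of the two terms, uses KL(teacher || student) as the distillation loss, applies distillation to every sample, and skips the cross-entropy term when no label is available. Loss weights and temperatures are not specified in this card, and all function names are hypothetical. It is written without PyTorch so the arithmetic stays visible.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Supervised loss for the CE head on one labeled sample."""
    return -math.log(softmax(logits)[label])


def kl_divergence(teacher_probs, student_logits):
    """Distillation loss for the KD head: KL(teacher || student)."""
    student_probs = softmax(student_logits)
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )


def dual_head_loss(ce_logits, kd_logits, teacher_probs, label=None):
    """Sum both head losses; unlabeled samples (label=None) skip the CE term."""
    loss = kl_divergence(teacher_probs, kd_logits)
    if label is not None:
        loss += cross_entropy(ce_logits, label)
    return loss


teacher = [0.7, 0.2, 0.1]  # made-up teacher probabilities for three classes
labeled = dual_head_loss([2.0, 0.5, 0.1], [1.5, 0.3, 0.2], teacher, label=0)
unlabeled = dual_head_loss([2.0, 0.5, 0.1], [1.5, 0.3, 0.2], teacher)
print(f"labeled: {labeled:.4f}, unlabeled: {unlabeled:.4f}")
```

Because each head receives its own loss term, the supervised and distillation signals only interact through the shared backbone.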
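During inference, the predictions of the two heads are combined using the `alpha` and `beta` values stored in the checkpoint: `alpha` weights the CE head's probabilities against the KD head's, and `beta` is the temperature applied to the KD head's logits.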
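
The combination rule from the usage example, `alpha * softmax(ce) + (1 - alpha) * softmax(kd / beta)`, can be worked through on a single toy sample. The sketch below uses plain Python; `combine_heads` is an illustrative name, and the `alpha` and `beta` values are made up (real ones come from the checkpoint).

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combine_heads(ce_logits, kd_logits, alpha, beta):
    """Mix CE-head and temperature-scaled KD-head probabilities per class."""
    probs_ce = softmax(ce_logits)
    probs_kd = softmax(kd_logits, temperature=beta)
    return [alpha * c + (1.0 - alpha) * k for c, k in zip(probs_ce, probs_kd)]


# Toy three-class logits; alpha and beta are placeholder values.
probs = combine_heads([2.0, 1.0, 0.1], [1.0, 3.0, 0.2], alpha=0.5, beta=2.0)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

Since both inputs are probability distributions and the weights sum to one, the mixed output is itself a distribution; `alpha=1` recovers the CE head alone and `alpha=0` the KD head alone.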
## Performance
### ImageNet Semi-supervised Learning
| Student | Teacher | Labeled Data | Top-1 Accuracy |
|:--------|:--------|:-------------|:---------------|
| ViT-B/16 | ViT-H/14 | 1% | **81.6%** |
| ViT-B/16 | ViT-H/14 | 10% | **82.8%** |
| ViT-L/14 | ViT-H/14 | 1% | **84.6%** |
| ViT-L/14 | ViT-H/14 | 10% | **85.9%** |
These results set a new state of the art for semi-supervised learning on ImageNet-1K with 1% and 10% labeled data.
## Citation
If you use these models in your research, please cite:
```bibtex
@article{kang2025simple,
title={Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization},
author={Kang, Seongjae and Lee, Dong Bok and Jang, Hyungjoon and Hwang, Sung Ju},
journal={arXiv preprint arXiv:2505.07675},
year={2025}
}
```
## License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## Acknowledgments
We appreciate the open-source implementations from:
- [Tip-Adapter](https://github.com/gaopengcuhk/Tip-Adapter)
- [CLIP](https://github.com/openai/CLIP)
- [OpenCLIP](https://github.com/mlfoundations/open_clip)
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/DHO) or contact the authors.