---
license: apache-2.0
tags:
- vision
- image-classification
- clip
- knowledge-distillation
- semi-supervised-learning
- imagenet
datasets:
- imagenet-1k
library_name: pytorch
pipeline_tag: image-classification
---

# DHO: Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization

[![arXiv](https://img.shields.io/badge/arXiv-2505.07675v1-b31b1b.svg)](https://arxiv.org/abs/2505.07675v1)

This repository contains pretrained checkpoints for **DHO (Dual-Head Optimization)**, a simple yet effective approach for semi-supervised knowledge distillation from Vision-Language Models.

## Model Description

DHO introduces a dual-head optimization strategy that enables efficient knowledge transfer from large Vision-Language Models (e.g., CLIP) to smaller student models using minimal labeled data.
The method achieves state-of-the-art performance on ImageNet semi-supervised learning benchmarks with only 1% and 10% labeled data.

**Paper:** [Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization](https://arxiv.org/abs/2505.07675)

**Authors:** Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang

## Key Features

- **Dual-head optimization** strategy for semi-supervised distillation
- 🏆 **State-of-the-art** performance on ImageNet with 1% and 10% labeled data
- 🔄 Efficient transfer from VLMs (e.g., CLIP) to smaller student models
- 🧩 Simple, scalable, and easy to integrate into existing pipelines

## Available Checkpoints

| Checkpoint Name | Student Model | Teacher Model | Labeled Data | Top-1 Acc. | Parameters |
|:----------------|:--------------|:--------------|:-------------|:-----------|:-----------|
| `vit_b_1.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 1% | 81.6% | 86M |
| `vit_b_10.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 10% | 82.8% | 86M |
| `vit_l_1.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 1% | 84.6% | 304M |
| `vit_l_10.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 10% | 85.9% | 304M |

## Usage

### Loading a Checkpoint

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip
from huggingface_hub import hf_hub_download

# Define the DHO StudentModel architecture with dual heads
class StudentModel(nn.Module):
    def __init__(self, num_classes=1000, model_name='ViT-B-16'):
        super().__init__()
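        # Load CLIP backbone. OpenAI's `clip` package names its models
        # 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px', so translate the
        # hyphenated names used in this card before calling clip.load
        clip_name = {
            'RN50': 'RN50',
            'ViT-B-16': 'ViT-B/16',
            'ViT-L-14': 'ViT-L/14',
            'ViT-L-14-336px': 'ViT-L/14@336px'
        }[model_name]
        clip_model, _ = clip.load(clip_name, device='cpu')
        self.backbone = clip_model.float().visual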
        
        # Feature dimensions per architecture
        in_features = {
            'RN50': 1024,
            'ViT-B-16': 512,
            'ViT-L-14': 768,
            'ViT-L-14-336px': 768
        }[model_name]
        
        # Dual-head architecture
        self.ce_head = nn.Linear(in_features, num_classes)  # CE branch
        self.kd_head = nn.Linear(in_features, num_classes)  # KD branch
    
    def forward(self, x):
        features = self.backbone(x)
        ce_out = self.ce_head(features)
        kd_out = self.kd_head(F.normalize(features, dim=1)) * 100
        return ce_out, kd_out

# Download and load checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint_path = hf_hub_download(repo_id="erjui/dho", filename="vit_b_10.pt")
checkpoint = torch.load(checkpoint_path, map_location=device)

# Initialize model
model = StudentModel(num_classes=1000, model_name='ViT-B-16').to(device)

# Handle DDP wrapped state_dict
state_dict = checkpoint['model_state_dict']
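# removeprefix (Python 3.9+) strips only a leading 'module.' and leaves the rest of each key intact
state_dict = {k.removeprefix('module.'): v for k, v in state_dict.items()}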
model.load_state_dict(state_dict)

# Get optimal inference parameters
alpha = checkpoint['alpha']  # Weight for CE head
beta = checkpoint['beta']    # Temperature for KD head
model.eval()

# Inference example
from PIL import Image
import torchvision.transforms as transforms

# CLIP preprocessing (CLIP resizes with bicubic interpolation)
preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                        std=(0.26862954, 0.26130258, 0.27577711))
])

# Convert to RGB so grayscale or RGBA files match the 3-channel normalization
image = preprocess(Image.open("path/to/image.jpg").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    ce_logits, kd_logits = model(image)
    
    # Combine predictions using saved parameters
    probs_ce = F.softmax(ce_logits, dim=1)
    probs_kd = F.softmax(kd_logits / beta, dim=1)
    probs = alpha * probs_ce + (1 - alpha) * probs_kd
    
    predicted_class = probs.argmax(dim=1)
    print(f"Predicted class: {predicted_class.item()}")
```

**Important Notes:**
- DHO checkpoints contain: `model_state_dict`, `epoch`, `acc`, `alpha`, `beta`
- The model has a **dual-head architecture** (CE head + KD head)
- Use the saved `alpha` and `beta` parameters for optimal inference
- For ViT-L checkpoints, download `vit_l_1.pt` or `vit_l_10.pt`, set `model_name='ViT-L-14'`, and keep the 224-pixel input size (use 336 only with `ViT-L-14-336px`)
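
The checkpoint table and the notes above can be folded into a single lookup, which avoids pairing a ViT-L file with a ViT-B backbone by mistake. This is a minimal sketch: `CHECKPOINTS` and `checkpoint_config` are hypothetical names that are not part of the DHO code, and the 224-pixel input size is taken from the example above.

```python
# Hypothetical helper (not part of the DHO repository): maps a student model
# and labeled-data percentage to the file and settings listed in this card.
CHECKPOINTS = {
    ("ViT-B/16", 1): {"filename": "vit_b_1.pt", "model_name": "ViT-B-16"},
    ("ViT-B/16", 10): {"filename": "vit_b_10.pt", "model_name": "ViT-B-16"},
    ("ViT-L/14", 1): {"filename": "vit_l_1.pt", "model_name": "ViT-L-14"},
    ("ViT-L/14", 10): {"filename": "vit_l_10.pt", "model_name": "ViT-L-14"},
}


def checkpoint_config(student: str, percent: int) -> dict:
    """Return filename, model_name and image_size for one released checkpoint."""
    try:
        entry = CHECKPOINTS[(student, percent)]
    except KeyError:
        raise ValueError(f"no checkpoint for {student} with {percent}% labels") from None
    # All four released checkpoints use 224-pixel inputs in the example above
    return {**entry, "image_size": 224}


cfg = checkpoint_config("ViT-L/14", 10)
print(cfg["filename"], cfg["model_name"], cfg["image_size"])
```

The returned `filename` can be passed to `hf_hub_download` and `model_name` to `StudentModel`, as in the loading example.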

### Training Your Own Model

To train your own DHO model, please visit the [official GitHub repository](https://github.com/yourusername/DHO) for detailed instructions and training scripts.

**Example training command:**
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=29500 train_imgnet_semi.py \
    --teacher_model "apple/DFN5B-CLIP-ViT-H-14-378" \
    --student_model "ViT-B-16" \
    --lr 5e-5 \
    --train_epoch 32 \
    --batch_size 256 \
    --percent 10.0 \
    | tee ./logs/imagenet/imgnet_lowshot.log
```

## Model Architecture

The DHO student model consists of:
- **Backbone:** CLIP Vision Transformer (ViT-B/16 or ViT-L/14)
- **Two parallel heads:**
  - **CE Head:** Optimized with cross-entropy loss on labeled data
  - **KD Head:** Optimized with knowledge distillation loss from teacher predictions
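
The two training objectives listed above could be combined roughly as follows. This is a sketch under stated assumptions, not the authors' training code: the function name `dual_head_loss`, the KL-divergence form of the distillation term, the mixing weight `lam`, and the use of a boolean `labeled_mask` are all illustrative choices. See the paper and repository for the actual objective.

```python
import torch
import torch.nn.functional as F


def dual_head_loss(ce_logits, kd_logits, teacher_probs, labels, labeled_mask, lam=0.5):
    """Assumed form: lam * CE on labeled images + (1 - lam) * KL to the teacher."""
    # Cross-entropy is computed only on the labeled part of the batch
    if labeled_mask.any():
        ce = F.cross_entropy(ce_logits[labeled_mask], labels[labeled_mask])
    else:
        ce = ce_logits.new_zeros(())
    # KL(teacher || KD head) is computed on every image, labeled or not
    kd = F.kl_div(F.log_softmax(kd_logits, dim=1), teacher_probs, reduction="batchmean")
    return lam * ce + (1.0 - lam) * kd


# Toy batch: 4 images, 3 classes, only the first two images carry labels
torch.manual_seed(0)
ce_logits, kd_logits = torch.randn(4, 3), torch.randn(4, 3)
teacher_probs = F.softmax(torch.randn(4, 3), dim=1)
labels = torch.tensor([0, 2, -1, -1])  # -1 marks unlabeled images (ignored via the mask)
labeled_mask = torch.tensor([True, True, False, False])
loss = dual_head_loss(ce_logits, kd_logits, teacher_probs, labels, labeled_mask)
```

Keeping the two losses on separate heads means the gradients of the supervised and distillation terms meet only in the shared backbone.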

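During inference, the softmax outputs of the two heads are combined using the `alpha` and `beta` values saved in the checkpoint: `alpha` weights the CE head against the KD head, and `beta` is the temperature applied to the KD head's logits.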
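
To make the arithmetic of this combination concrete, here is a dependency-free sketch that mirrors the formula in the usage example. The helper names `softmax` and `combine_heads` are illustrative, and the logits, `alpha`, and `beta` below are made-up toy values, not numbers stored in any released checkpoint.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combine_heads(ce_logits, kd_logits, alpha, beta):
    """probs = alpha * softmax(ce) + (1 - alpha) * softmax(kd / beta)."""
    probs_ce = softmax(ce_logits)
    probs_kd = softmax(kd_logits, temperature=beta)
    return [alpha * p + (1 - alpha) * q for p, q in zip(probs_ce, probs_kd)]


# Toy three-class example: the CE head prefers class 0, the KD head class 1
probs = combine_heads([2.0, 1.0, 0.1], [20.0, 60.0, 10.0], alpha=0.5, beta=20.0)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

Because both terms are probability distributions and the weights sum to one, the combined output is itself a valid distribution. Setting `alpha=1` recovers the CE head alone and `alpha=0` the KD head alone.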

## Performance

### ImageNet Semi-supervised Learning

| Student | Teacher | Labeled Data | Top-1 Accuracy |
|:--------|:--------|:-------------|:---------------|
| ViT-B/16 | ViT-H/14 | 1% | **81.6%** |
| ViT-B/16 | ViT-H/14 | 10% | **82.8%** |
| ViT-L/14 | ViT-H/14 | 1% | **84.6%** |
| ViT-L/14 | ViT-H/14 | 10% | **85.9%** |

The paper reports these results as state of the art for semi-supervised learning on ImageNet-1K with 1% and 10% labeled data.

## Citation

If you use these models in your research, please cite:

```bibtex
@article{kang2025simple,
  title={Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization},
  author={Kang, Seongjae and Lee, Dong Bok and Jang, Hyungjoon and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2505.07675},
  year={2025}
}
```

## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

## Acknowledgments

We appreciate the open-source implementations from:
- [Tip-Adapter](https://github.com/gaopengcuhk/Tip-Adapter)
- [CLIP](https://github.com/openai/CLIP)
- [OpenCLIP](https://github.com/mlfoundations/open_clip)

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/DHO) or contact the authors.