|
|
--- |
|
|
library_name: pytorch |
|
|
license: mit |
|
|
pipeline_tag: image-feature-extraction |
|
|
tags: |
|
|
- computer-vision |
|
|
- image-generation |
|
|
- tokenizer |
|
|
- autoencoder |
|
|
- denoising |
|
|
- visual-tokenizer |
|
|
- imagenet |
|
|
- generative-modeling |
|
|
--- |
|
|
|
|
|
# DeTok: Latent Denoising Makes Good Visual Tokenizers |
|
|
|
|
|
[arXiv](https://arxiv.org/abs/2507.15856)
|
|
[GitHub](https://github.com/Jiawei-Yang/DeTok)
|
|
|
|
|
## Model Description |
|
|
|
|
|
**l-DeTok** (Latent Denoising Tokenizer) is a simple approach that incorporates a denoising objective into visual tokenizer training. We observe that many modern generative models share a common training paradigm of reconstructing clean signals from corrupted inputs, and we explore whether aligning tokenizer training with this principle benefits downstream generation tasks.
|
|
|
|
|
### Approach |
|
|
|
|
|
We note that modern generative models often involve **reconstructing clean signals from corrupted inputs** (a form of denoising). This work investigates whether training tokenizers to reconstruct clean images from corrupted latent embeddings might produce representations that are better suited for downstream generative modeling. |
|
|
|
|
|
### Architecture |
|
|
|
|
|
- **Encoder-Decoder Architecture**: Based on Vision Transformers (ViT) |
|
|
- **Denoising Strategies** (see the sketch after this list):
|
|
- **Interpolative Latent Noise**: Corrupts latent embeddings through noise interpolation |
|
|
- **Random Masking**: Masks random subsets of image patches during training |
|
|
- **Training Losses**: Same as conventional image tokenizers (reconstruction, KL, perceptual, and adversarial losses)
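
The following is a minimal sketch of how these two corruption strategies could be applied to a batch of patch/latent embeddings during training. It is illustrative only: the function names and tensor shapes are assumptions, not code from the DeTok repository, and it uses the noise strength γ = 3.0 and maximum mask ratio M = 0.7 listed under Training Details.

```python
import torch

def interpolative_noise(z: torch.Tensor, gamma: float = 3.0) -> torch.Tensor:
    """Corrupt latents by interpolating toward Gaussian noise: z' = (1 - tau) * z + tau * eps."""
    tau = torch.rand(z.shape[0], 1, 1, device=z.device)  # one tau ~ U(0, 1) per sample
    eps = torch.randn_like(z) * gamma                     # Gaussian noise scaled by the strength gamma
    return (1.0 - tau) * z + tau * eps

def random_masking(z: torch.Tensor, max_ratio: float = 0.7) -> torch.Tensor:
    """Zero out a random subset of patch tokens; the ratio is drawn as max(0, U(-0.1, M))."""
    batch, num_tokens, _ = z.shape
    ratio = max(0.0, float(torch.empty(1).uniform_(-0.1, max_ratio)))
    num_masked = int(num_tokens * ratio)
    order = torch.rand(batch, num_tokens, device=z.device).argsort(dim=1)
    mask = torch.ones(batch, num_tokens, 1, device=z.device)
    mask.scatter_(1, order[:, :num_masked].unsqueeze(-1), 0.0)
    return z * mask

# Example: 4 images, 256 patch tokens each, 768-dimensional embeddings.
latents = torch.randn(4, 256, 768)
corrupted = random_masking(interpolative_noise(latents))
print(corrupted.shape)  # torch.Size([4, 256, 768])
```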
|
|
|
|
|
## Model Variants |
|
|
|
|
|
| Model | Type | Parameters | Description | |
|
|
|-------|------|------------|-------------| |
|
|
| **DeTok-BB** | Tokenizer | 172M | Base tokenizer with denoising training | |
|
|
| **DeTok-BB-decoder_ft** | Tokenizer | 172M | Base tokenizer with additional decoder fine-tuning | |
|
|
|
|
|
## Results |
|
|
|
|
|
We evaluate our approach across six generative models on ImageNet 256×256 and observe consistent improvements: |
|
|
|
|
|
### With MAR Models (FID-50k with CFG) |
|
|
| Model | FID-50K | Inception Score | Parameters | |
|
|
|-------|---------|-----------------|------------| |
|
|
| MAR-Base + MAR-VAE | 2.31 | 281.7 | 208M | |
|
|
| MAR-Base + DeTok-BB | **1.61** | **289.7** | 208M | |
|
|
| MAR-Base + DeTok-BB-decoder_ft | **1.55** | **291.0** | 208M | |
|
|
| MAR-Large + MAR-VAE | 1.78 | 296.0 | 479M | |
|
|
| MAR-Huge + MAR-VAE | 1.55 | 303.7 | 943M | |
|
|
| MAR-Large + DeTok-BB | **1.43** | **303.5** | 479M | |
|
|
| MAR-Large + DeTok-BB-decoder_ft | **1.32** | **304.1** | 479M | |
|
|
|
|
|
### Observations |
|
|
- **MAR-B**: FID improves from 2.31 (MAR-VAE) to 1.55 (DeTok-BB-decoder_ft)


- **MAR-L**: FID improves from 1.78 (MAR-VAE) to 1.32 (DeTok-BB-decoder_ft)
|
|
- The approach works across both non-autoregressive (DiT, SiT, LightningDiT) and autoregressive models (MAR, RasterAR, RandomAR) |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
To use DeTok for extracting latent embeddings from images, you need to: |
|
|
|
|
|
1. **Clone the official DeTok repository**: |
|
|
```bash |
|
|
git clone https://github.com/Jiawei-Yang/DeTok.git |
|
|
cd DeTok |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
2. **Download the pre-trained tokenizer weights**: |
|
|
You can download the `DeTok-BB-decoder_ft` checkpoint (recommended) from [here](https://huggingface.co/jjiaweiyang/l-DeTok/resolve/main/detok-BB-gamm3.0-m0.7-decoder_tuned.pth) and place it in your working directory (e.g., `detok-BB-gamm3.0-m0.7-decoder_tuned.pth`). |
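
Alternatively, the checkpoint can be fetched programmatically. This is an optional convenience using the `huggingface_hub` package (`pip install huggingface_hub`); the repository and file names are taken from the link above.

```python
from huggingface_hub import hf_hub_download

# Download the decoder-fine-tuned DeTok-BB checkpoint to the local Hugging Face cache.
ckpt_path = hf_hub_download(
    repo_id="jjiaweiyang/l-DeTok",
    filename="detok-BB-gamm3.0-m0.7-decoder_tuned.pth",
)
print(ckpt_path)  # local path to the downloaded weights
```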
|
|
|
|
|
### Extract latent embeddings |
|
|
|
|
|
Here's a sample Python snippet for feature extraction with the `DeTok_BB` tokenizer. It is illustrative: check the official repository for the exact constructor arguments and checkpoint format.
|
|
|
|
|
```python |
|
|
import torch |
|
|
from PIL import Image |
|
|
from torchvision import transforms
|
|
from models.detok import DeTok_BB # Import from the cloned DeTok repository |
|
|
|
|
|
# --- Configuration (illustrative values; match the DeTok-BB config in the official repository) ---
|
|
model_params = { |
|
|
"img_size": 256, |
|
|
"patch_size": 16, |
|
|
"in_chans": 3, |
|
|
"embed_dim": 768, |
|
|
"depths": [2, 2, 8, 2], |
|
|
"num_heads": [3, 6, 12, 24], |
|
|
} |
|
|
tokenizer_weights_path = "detok-BB-gamm3.0-m0.7-decoder_tuned.pth" # Path to your downloaded weights |
|
|
|
|
|
# 1. Initialize and load the tokenizer |
|
|
tokenizer = DeTok_BB(**model_params).eval() |
|
|
if torch.cuda.is_available(): |
|
|
tokenizer = tokenizer.cuda() |
|
|
|
|
|
# Load checkpoint state_dict |
|
|
checkpoint = torch.load(tokenizer_weights_path, map_location='cpu') |
|
|
tokenizer.load_state_dict(checkpoint.get('model', checkpoint))  # weights may sit under a 'model' key or at the top level
|
|
|
|
|
# 2. Prepare your image |
|
|
transform = transforms.Compose([ |
|
|
transforms.Resize(model_params["img_size"]), |
|
|
transforms.CenterCrop(model_params["img_size"]), |
|
|
transforms.ToTensor(), |
|
|
transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]) |
|
|
]) |
|
|
|
|
|
# Replace 'path/to/your/image.jpg' with your actual image file |
|
|
image = Image.new('RGB', (model_params["img_size"], model_params["img_size"]), color = 'red') # Example dummy image |
|
|
# image = Image.open("path/to/your/image.jpg").convert("RGB") |
|
|
|
|
|
pixel_values = transform(image).unsqueeze(0) # Add batch dimension |
|
|
|
|
|
if torch.cuda.is_available(): |
|
|
pixel_values = pixel_values.cuda() |
|
|
|
|
|
# 3. Extract latent embeddings |
|
|
with torch.no_grad(): |
|
|
latent_embeddings = tokenizer.encode(pixel_values) |
|
|
|
|
|
print(f"Shape of latent embeddings: {latent_embeddings.shape}") |
|
|
# Expected output for a 256x256 input image with 16x16 patches is (1, 256, 768), |
|
|
# representing 256 image patches with 768-dimensional embeddings. |
|
|
``` |
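
If a downstream model expects a spatial feature map rather than a token sequence, the patch tokens can be rearranged into a grid. The snippet below continues from the example above and assumes the (1, 256, 768) output shape.

```python
# Rearrange (batch, tokens, dim) into (batch, dim, height, width).
b, n, d = latent_embeddings.shape
grid = int(n ** 0.5)  # 16 for 256 patch tokens
latent_map = latent_embeddings.transpose(1, 2).reshape(b, d, grid, grid)
print(latent_map.shape)  # torch.Size([1, 768, 16, 16])
```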
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Tokenizer Training |
|
|
- **Dataset**: ImageNet train set |
|
|
- **Resolution**: 256×256 |
|
|
- **Batch Size**: 1024 (global) |
|
|
- **Epochs**: 200 (base) + 100 (decoder fine-tuning) |
|
|
- **Denoising Parameters**: |
|
|
  - Gamma (noise strength): 3.0


  - Maximum mask ratio (M): 0.7


  - Per-sample masking ratio: sampled from max(0, U(-0.1, M))
|
|
|
|
|
### Key Training Components |
|
|
- **Interpolative Noise**: x' = (1 - τ)·x + τ·ε, where τ ~ U(0, 1) and ε is Gaussian noise scaled by the strength γ
|
|
- **Random Masking**: Variable masking ratios during training |
|
|
- **Multi-component Loss**: MSE + KL + Perceptual + Adversarial losses (sketched below)
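
As a rough illustration of how such a multi-component objective is typically assembled, the sketch below combines the four terms with placeholder weights. The weighting coefficients, the perceptual distance, and the adversarial term are assumptions for illustration, not the exact values or implementations used for DeTok.

```python
import torch
import torch.nn.functional as F

def tokenizer_loss(recon, target, post_mean, post_logvar, perceptual_fn, disc_logits_fake,
                   w_kl=1e-6, w_percep=1.0, w_adv=0.1):
    """Sketch of a conventional tokenizer objective: MSE + KL + perceptual + adversarial."""
    mse = F.mse_loss(recon, target)
    # KL divergence of a diagonal Gaussian posterior from a standard normal prior.
    kl = -0.5 * torch.mean(1 + post_logvar - post_mean.pow(2) - post_logvar.exp())
    percep = perceptual_fn(recon, target)    # e.g. an LPIPS-style perceptual distance
    adv = -disc_logits_fake.mean()           # generator term of a hinge/non-saturating GAN loss
    return mse + w_kl * kl + w_percep * percep + w_adv * adv

# Dummy usage with random tensors and an L1 stand-in for the perceptual distance:
x = torch.rand(2, 3, 256, 256); x_hat = torch.rand_like(x)
mu, logvar = torch.zeros(2, 256, 16), torch.zeros(2, 256, 16)
loss = tokenizer_loss(x_hat, x, mu, logvar, lambda a, b: (a - b).abs().mean(), torch.randn(2, 1))
print(loss)
```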
|
|
|
|
|
## Technical Details |
|
|
|
|
|
### Denoising Methodology |
|
|
|
|
|
1. **Interpolative Latent Noise**: We use interpolative noise rather than additive noise, which allows for heavier corruption when the noise level τ is high |
|
|
2. **Masking as Deconstruction**: We explore random masking as another form of latent deconstruction, inspired by masked autoencoders |
|
|
3. **Downstream Alignment**: The denoising objective is designed to align with how modern generative models operate |
|
|
|
|
|
### Potential Benefits |
|
|
|
|
|
- **Task Alignment**: The training objective is designed to match downstream generative model objectives |
|
|
- **Simplicity**: The approach works without requiring large-scale pretrained visual encoders or semantic distillation |
|
|
- **Generality**: We observe improvements across different types of generative models |
|
|
- **Robustness**: The learned representations appear to remain useful even under corruption |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Datasets |
|
|
- **Training**: ImageNet train set |
|
|
- **Evaluation**: ImageNet validation set (50k images) |
|
|
|
|
|
### Metrics |
|
|
- **FID-50k**: Fréchet Inception Distance on 50,000 generated samples (see the example after this list)


- **Inception Score**: Standard generative model evaluation metric


- **Precision & Recall**: Precision and recall of generated samples measured against the ImageNet validation set
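
For reference, FID can be computed with any standard implementation. The snippet below uses the `torchmetrics` package as one readily available option; this is not the evaluation code used for the numbers above, and exact values can differ slightly between FID implementations.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Small-scale illustration; the reported FID-50k uses 50,000 generated samples.
fid = FrechetInceptionDistance(feature=64)  # feature=2048 is the standard setting

real_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute())
```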
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@article{yang2025detok, |
|
|
title={Latent Denoising Makes Good Visual Tokenizers}, |
|
|
author={Jiawei Yang and Tianhong Li and Lijie Fan and Yonglong Tian and Yue Wang}, |
|
|
journal={arXiv preprint arXiv:2507.15856}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This project is licensed under the MIT License. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
This work builds upon many excellent open-source projects. We are particularly grateful to: |
|
|
- [MAR](https://github.com/LTH14/mar) for masked autoregressive modeling |
|
|
- [DiT](https://github.com/facebookresearch/DiT) for diffusion transformers |
|
|
- [MAE](https://github.com/facebookresearch/mae) for masked autoencoder insights |
|
|
- [1d-tokenizer](https://github.com/bytedance/1d-tokenizer) for tokenizer implementations |
|
|
- The broader research community for foundational work in generative modeling |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or issues, please open a GitHub issue at the [official repository](https://github.com/Jiawei-Yang/DeTok). |