---
library_name: pytorch
license: mit
pipeline_tag: image-feature-extraction
tags:
- computer-vision
- image-generation
- tokenizer
- autoencoder
- denoising
- visual-tokenizer
- imagenet
- generative-modeling
---
# DeTok: Latent Denoising Makes Good Visual Tokenizers
[![arXiv](https://img.shields.io/badge/arXiv-2507.15856-b31b1b.svg)](https://arxiv.org/abs/2507.15856) 
[![GitHub](https://img.shields.io/badge/GitHub-DeTok-blue)](https://github.com/Jiawei-Yang/DeTok) 
## Model Description
**l-DeTok** (Latent Denoising Tokenizer) is a simple approach for training visual tokenizers by incorporating denoising objectives during tokenizer training. We observe that many modern generative models share a common training paradigm of reconstructing clean signals from corrupted inputs, and explore whether aligning tokenizer training with this principle might be beneficial for downstream generation tasks.
### Approach
Modern generative models typically involve **reconstructing clean signals from corrupted inputs** (a form of denoising). This work investigates whether training tokenizers to reconstruct clean images from corrupted latent embeddings produces representations that are better suited for downstream generative modeling.
### Architecture
- **Encoder-Decoder Architecture**: based on Vision Transformers (ViT)
- **Denoising Strategies** (see the sketch below):
  - **Interpolative Latent Noise**: corrupts latent embeddings through noise interpolation
  - **Random Masking**: masks random subsets of image patches during training
- **Training Losses**: same as conventional image tokenizers
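For intuition, here is a minimal PyTorch sketch of the two corruption strategies applied to a batch of latent patch tokens. This is an illustration only, not the repository's implementation: the function names are made up, and the exact form of the noise term ε(γ) and of the masking (zeroing vs. dropping tokens) is our assumption.

```python
import torch

def interpolative_noise(z: torch.Tensor, gamma: float = 3.0) -> torch.Tensor:
    """Corrupt latents via interpolation: z' = (1 - tau) * z + tau * eps.

    tau ~ U(0, 1) is sampled per image; eps is Gaussian noise scaled by gamma
    (our reading of the eps(gamma) term; the repository may define it differently).
    """
    tau = torch.rand(z.size(0), 1, 1, device=z.device)  # one tau per image
    eps = gamma * torch.randn_like(z)                    # noise with strength gamma
    return (1.0 - tau) * z + tau * eps

def random_mask(z: torch.Tensor, mask_ratio: float = 0.7) -> torch.Tensor:
    """Zero out a random subset of patch tokens (masking as latent deconstruction)."""
    batch, num_tokens, _ = z.shape
    keep = torch.rand(batch, num_tokens, device=z.device) >= mask_ratio
    return z * keep.unsqueeze(-1)

# Example: a batch of 4 images, 256 patch tokens, 768-dim latents
z = torch.randn(4, 256, 768)
z_corrupted = random_mask(interpolative_noise(z), mask_ratio=0.7)
print(z_corrupted.shape)  # torch.Size([4, 256, 768])
```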
## Model Variants
| Model | Type | Parameters | Description |
|-------|------|------------|-------------|
| **DeTok-BB** | Tokenizer | 172M | Base tokenizer with denoising training |
| **DeTok-BB-decoder_ft** | Tokenizer | 172M | Base tokenizer with additional decoder fine-tuning |
## Results
We evaluate our approach across six generative models on ImageNet 256×256 and observe consistent improvements:
### With MAR Models (FID-50k with CFG)
| Model | FID-50K | Inception Score | Parameters |
|-------|---------|-----------------|------------|
| MAR-Base + MAR-VAE | 2.31 | 281.7 | 208M |
| MAR-Base + DeTok-BB | **1.61** | **289.7** | 208M |
| MAR-Base + DeTok-BB-decoder_ft | **1.55** | **291.0** | 208M |
| MAR-Large + MAR-VAE | 1.78 | 296.0 | 479M |
| MAR-Huge + MAR-VAE | 1.55 | 303.7 | 943M |
| MAR-Large + DeTok-BB | **1.43** | **303.5** | 479M |
| MAR-Large + DeTok-BB-decoder_ft | **1.32** | **304.1** | 479M |
### Observations
- **MAR-B**: FID improves from 2.31 (MAR-VAE) → 1.55 (DeTok-BB-decoder_ft)
- **MAR-L**: FID improves from 1.78 (MAR-VAE) → 1.32 (DeTok-BB-decoder_ft)
- The approach improves both non-autoregressive models (DiT, SiT, LightningDiT) and autoregressive models (MAR, RasterAR, RandomAR)
## Usage
### Installation
To use DeTok for extracting latent embeddings from images, you need to:
1. **Clone the official DeTok repository**:
```bash
git clone https://github.com/Jiawei-Yang/DeTok.git
cd DeTok
pip install -r requirements.txt
```
2. **Download the pre-trained tokenizer weights**:
You can download the `DeTok-BB-decoder_ft` checkpoint (recommended) from [here](https://huggingface.co/jjiaweiyang/l-DeTok/resolve/main/detok-BB-gamm3.0-m0.7-decoder_tuned.pth) and place it in your working directory (e.g., `detok-BB-gamm3.0-m0.7-decoder_tuned.pth`).
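Alternatively, the checkpoint can be fetched programmatically with the `huggingface_hub` client (a convenience sketch; the manual download above works just as well):

```python
from huggingface_hub import hf_hub_download

# Download the decoder-fine-tuned checkpoint from this model repository.
# The file is cached locally (under ~/.cache/huggingface by default) and the
# local path is returned.
ckpt_path = hf_hub_download(
    repo_id="jjiaweiyang/l-DeTok",
    filename="detok-BB-gamm3.0-m0.7-decoder_tuned.pth",
)
print(ckpt_path)
```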
### Extract latent embeddings
Here's a sample Python code snippet for feature extraction using the `DeTok_BB` tokenizer:
```python
import torch
from PIL import Image
from torchvision import transforms

from models.detok import DeTok_BB  # import from the cloned DeTok repository

# --- Configuration (matching the DeTok-BB-decoder_ft architecture from the paper) ---
model_params = {
    "img_size": 256,
    "patch_size": 16,
    "in_chans": 3,
    "embed_dim": 768,
    "depths": [2, 2, 8, 2],
    "num_heads": [3, 6, 12, 24],
}
tokenizer_weights_path = "detok-BB-gamm3.0-m0.7-decoder_tuned.pth"  # path to the downloaded weights

# 1. Initialize the tokenizer and load the pre-trained weights
tokenizer = DeTok_BB(**model_params).eval()
if torch.cuda.is_available():
    tokenizer = tokenizer.cuda()

checkpoint = torch.load(tokenizer_weights_path, map_location="cpu")
tokenizer.load_state_dict(checkpoint["model"])

# 2. Prepare an input image
transform = transforms.Compose([
    transforms.Resize(model_params["img_size"]),
    transforms.CenterCrop(model_params["img_size"]),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Replace the dummy image below with your own file:
# image = Image.open("path/to/your/image.jpg").convert("RGB")
image = Image.new("RGB", (model_params["img_size"], model_params["img_size"]), color="red")

pixel_values = transform(image).unsqueeze(0)  # add a batch dimension
if torch.cuda.is_available():
    pixel_values = pixel_values.cuda()

# 3. Extract latent embeddings
with torch.no_grad():
    latent_embeddings = tokenizer.encode(pixel_values)

print(f"Shape of latent embeddings: {latent_embeddings.shape}")
# For a 256x256 input with 16x16 patches this is expected to be (1, 256, 768):
# 256 image patches, each with a 768-dimensional embedding.
```
## Training Details
### Tokenizer Training
- **Dataset**: ImageNet train set
- **Resolution**: 256×256
- **Batch Size**: 1024 (global)
- **Epochs**: 200 (base) + 100 (decoder fine-tuning)
- **Denoising Parameters**:
  - Gamma (noise strength): 3.0
  - Maximum mask ratio M: 0.7
  - Per-image mask ratio: sampled from max(0, U(-0.1, M)), as sketched below
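For concreteness, the per-image mask ratio rule above can be sampled as follows (a small sketch of max(0, U(-0.1, M)) with M = 0.7; the names are ours, not from the repository):

```python
import torch

M = 0.7  # maximum mask ratio used for DeTok-BB

def sample_mask_ratio(batch_size: int) -> torch.Tensor:
    """Draw a per-image mask ratio from max(0, U(-0.1, M)).

    Sampling slightly below zero and clamping at zero means some images in
    each batch are trained with no masking at all.
    """
    u = torch.empty(batch_size).uniform_(-0.1, M)
    return u.clamp_min(0.0)

print(sample_mask_ratio(8))  # e.g. tensor([0.0000, 0.4312, 0.6671, ...])
```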
### Key Training Components
- **Interpolative Noise**: x' = (1-τ)x + τε(γ), where τ ~ U(0,1)
- **Random Masking**: Variable masking ratios during training
- **Multi-component Loss**: MSE + KL + Perceptual + Adversarial losses
## Technical Details
### Denoising Methodology
1. **Interpolative Latent Noise**: We use interpolative noise rather than additive noise, which allows for heavier corruption when the noise level τ is high
2. **Masking as Deconstruction**: We explore random masking as another form of latent deconstruction, inspired by masked autoencoders
3. **Downstream Alignment**: The denoising objective is designed to align with how modern generative models operate
### Potential Benefits
- **Task Alignment**: The training objective is designed to match downstream generative model objectives
- **Simplicity**: The approach works without requiring large-scale pretrained visual encoders or semantic distillation
- **Generality**: We observe improvements across different types of generative models
- **Robustness**: The learned representations appear to remain useful even under corruption
## Evaluation
### Datasets
- **Training**: ImageNet train set
- **Evaluation**: ImageNet validation set (50k images)
### Metrics
- **FID-50k**: Fréchet Inception Distance on 50,000 generated samples
- **Inception Score**: Standard generative model evaluation metric
- **Precision & Recall**: computed against ImageNet validation reference statistics
## Citation
```bibtex
@article{yang2025detok,
title={Latent Denoising Makes Good Visual Tokenizers},
author={Jiawei Yang and Tianhong Li and Lijie Fan and Yonglong Tian and Yue Wang},
journal={arXiv preprint arXiv:2507.15856},
year={2025}
}
```
## License
This project is licensed under the MIT License.
## Acknowledgments
This work builds upon many excellent open-source projects. We are particularly grateful to:
- [MAR](https://github.com/LTH14/mar) for masked autoregressive modeling
- [DiT](https://github.com/facebookresearch/DiT) for diffusion transformers
- [MAE](https://github.com/facebookresearch/mae) for masked autoencoder insights
- [1d-tokenizer](https://github.com/bytedance/1d-tokenizer) for tokenizer implementations
- The broader research community for foundational work in generative modeling
## Contact
For questions or issues, please open a GitHub issue at the [official repository](https://github.com/Jiawei-Yang/DeTok).