|
|
--- |
|
|
library_name: pytorch |
|
|
license: mit |
|
|
pipeline_tag: image-feature-extraction |
|
|
tags: |
|
|
- computer-vision |
|
|
- image-generation |
|
|
- tokenizer |
|
|
- autoencoder |
|
|
- denoising |
|
|
- visual-tokenizer |
|
|
- imagenet |
|
|
- generative-modeling |
|
|
--- |
|
|
|
|
|
# DeTok: Latent Denoising Makes Good Visual Tokenizers |
|
|
|
|
|
[arXiv](https://arxiv.org/abs/2507.15856)
|
|
[GitHub](https://github.com/Jiawei-Yang/DeTok)
|
|
|
|
|
## Model Description |
|
|
|
|
|
**l-DeTok** (Latent Denoising Tokenizer) is a simple approach that incorporates a denoising objective into visual tokenizer training. We observe that many modern generative models share a common training paradigm of reconstructing clean signals from corrupted inputs, and we explore whether aligning tokenizer training with this principle benefits downstream generation tasks.
|
|
|
|
|
### Approach |
|
|
|
|
|
We note that modern generative models often involve **reconstructing clean signals from corrupted inputs** (a form of denoising). This work investigates whether training tokenizers to reconstruct clean images from corrupted latent embeddings might produce representations that are better suited for downstream generative modeling. |
|
|
|
|
|
### Architecture |
|
|
|
|
|
- **Encoder-Decoder Architecture**: Based on Vision Transformers (ViT) |
|
|
- **Denoising Strategies** (see the sketch after this list):
|
|
- **Interpolative Latent Noise**: Corrupts latent embeddings through noise interpolation |
|
|
- **Random Masking**: Masks random subsets of image patches during training |
|
|
- **Training Losses**: Same as conventional image tokenizers (reconstruction, KL, perceptual, and adversarial losses)
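
The following is a minimal sketch of how these two corruption strategies could be applied to a batch of patch/latent embeddings during training. It is illustrative only: the function names and tensor shapes are assumptions, not code from the DeTok repository, and it uses the noise strength γ = 3.0 and maximum mask ratio M = 0.7 listed under Training Details.

```python
import torch

def interpolative_noise(z: torch.Tensor, gamma: float = 3.0) -> torch.Tensor:
    """Corrupt latents by interpolating toward Gaussian noise: z' = (1 - tau) * z + tau * eps."""
    tau = torch.rand(z.shape[0], 1, 1, device=z.device)  # one tau ~ U(0, 1) per sample
    eps = torch.randn_like(z) * gamma                     # Gaussian noise scaled by the strength gamma
    return (1.0 - tau) * z + tau * eps

def random_masking(z: torch.Tensor, max_ratio: float = 0.7) -> torch.Tensor:
    """Zero out a random subset of patch tokens; the ratio is drawn as max(0, U(-0.1, M))."""
    batch, num_tokens, _ = z.shape
    ratio = max(0.0, float(torch.empty(1).uniform_(-0.1, max_ratio)))
    num_masked = int(num_tokens * ratio)
    order = torch.rand(batch, num_tokens, device=z.device).argsort(dim=1)
    mask = torch.ones(batch, num_tokens, 1, device=z.device)
    mask.scatter_(1, order[:, :num_masked].unsqueeze(-1), 0.0)
    return z * mask

# Example: 4 images, 256 patch tokens each, 768-dimensional embeddings.
latents = torch.randn(4, 256, 768)
corrupted = random_masking(interpolative_noise(latents))
print(corrupted.shape)  # torch.Size([4, 256, 768])
```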
|
|
|
|
|
## Model Variants |
|
|
|
|
|
| Model | Type | Parameters | Description | |
|
|
|-------|------|------------|-------------| |
|
|
| **DeTok-BB** | Tokenizer | 172M | Base tokenizer with denoising training | |
|
|
| **DeTok-BB-decoder_ft** | Tokenizer | 172M | Base tokenizer with additional decoder fine-tuning | |
|
|
|
|
|
## Results |
|
|
|
|
|
We evaluate our approach across six generative models on ImageNet 256×256 and observe consistent improvements: |
|
|
|
|
|
### With MAR Models (FID-50k with CFG) |
|
|
| Model | FID-50K | Inception Score | Parameters | |
|
|
|-------|---------|-----------------|------------| |
|
|
| MAR-Base + MAR-VAE | 2.31 | 281.7 | 208M | |
|
|
| MAR-Base + DeTok-BB | **1.61** | **289.7** | 208M | |
|
|
| MAR-Base + DeTok-BB-decoder_ft | **1.55** | **291.0** | 208M | |
|
|
| MAR-Large + MAR-VAE | 1.78 | 296.0 | 479M | |
|
|
| MAR-Huge + MAR-VAE | 1.55 | 303.7 | 943M | |
|
|
| MAR-Large + DeTok-BB | **1.43** | **303.5** | 479M | |
|
|
| MAR-Large + DeTok-BB-decoder_ft | **1.32** | **304.1** | 479M | |
|
|
|
|
|
### Observations |
|
|
- **MAR-B**: FID improves from 2.31 (MAR-VAE) to 1.55 (DeTok-BB-decoder_ft)


- **MAR-L**: FID improves from 1.78 (MAR-VAE) to 1.32 (DeTok-BB-decoder_ft)
|
|
- The approach works across both non-autoregressive (DiT, SiT, LightningDiT) and autoregressive models (MAR, RasterAR, RandomAR) |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
To use DeTok for extracting latent embeddings from images, you need to: |
|
|
|
|
|
1. **Clone the official DeTok repository**: |
|
|
```bash |
|
|
git clone https://github.com/Jiawei-Yang/DeTok.git |
|
|
cd DeTok |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
2. **Download the pre-trained tokenizer weights**: |
|
|
You can download the `DeTok-BB-decoder_ft` checkpoint (recommended) from [here](https://huggingface.co/jjiaweiyang/l-DeTok/resolve/main/detok-BB-gamm3.0-m0.7-decoder_tuned.pth) and place it in your working directory (e.g., `detok-BB-gamm3.0-m0.7-decoder_tuned.pth`). |
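
Alternatively, the checkpoint can be fetched programmatically. This is an optional convenience using the `huggingface_hub` package (`pip install huggingface_hub`); the repository and file names are taken from the link above.

```python
from huggingface_hub import hf_hub_download

# Download the decoder-fine-tuned DeTok-BB checkpoint to the local Hugging Face cache.
ckpt_path = hf_hub_download(
    repo_id="jjiaweiyang/l-DeTok",
    filename="detok-BB-gamm3.0-m0.7-decoder_tuned.pth",
)
print(ckpt_path)  # local path to the downloaded weights
```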
|
|
|
|
|
### Extract latent embeddings |
|
|
|
|
|
Here's a sample Python snippet for feature extraction with the `DeTok_BB` tokenizer. It is illustrative: check the official repository for the exact constructor arguments and checkpoint format.
|
|
|
|
|
```python |
|
|
import torch |
|
|
from PIL import Image |
|
|
from torchvision import transforms
|
|
from models.detok import DeTok_BB # Import from the cloned DeTok repository |
|
|
|
|
|
# --- Configuration (illustrative values; match the DeTok-BB config in the official repository) ---
|
|
model_params = { |
|
|
"img_size": 256, |
|
|
"patch_size": 16, |
|
|
"in_chans": 3, |
|
|
"embed_dim": 768, |
|
|
"depths": [2, 2, 8, 2], |
|
|
"num_heads": [3, 6, 12, 24], |
|
|
} |
|
|
tokenizer_weights_path = "detok-BB-gamm3.0-m0.7-decoder_tuned.pth" # Path to your downloaded weights |
|
|
|
|
|
# 1. Initialize and load the tokenizer |
|
|
tokenizer = DeTok_BB(**model_params).eval() |
|
|
if torch.cuda.is_available(): |
|
|
tokenizer = tokenizer.cuda() |
|
|
|
|
|
# Load checkpoint state_dict |
|
|
checkpoint = torch.load(tokenizer_weights_path, map_location='cpu') |
|
|
tokenizer.load_state_dict(checkpoint.get('model', checkpoint))  # weights may sit under a 'model' key or at the top level
|
|
|
|
|
# 2. Prepare your image |
|
|
transform = transforms.Compose([ |
|
|
transforms.Resize(model_params["img_size"]), |
|
|
transforms.CenterCrop(model_params["img_size"]), |
|
|
transforms.ToTensor(), |
|
|
transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]) |
|
|
]) |
|
|
|
|
|
# Replace 'path/to/your/image.jpg' with your actual image file |
|
|
image = Image.new('RGB', (model_params["img_size"], model_params["img_size"]), color = 'red') # Example dummy image |
|
|
# image = Image.open("path/to/your/image.jpg").convert("RGB") |
|
|
|
|
|
pixel_values = transform(image).unsqueeze(0) # Add batch dimension |
|
|
|
|
|
if torch.cuda.is_available(): |
|
|
pixel_values = pixel_values.cuda() |
|
|
|
|
|
# 3. Extract latent embeddings |
|
|
with torch.no_grad(): |
|
|
latent_embeddings = tokenizer.encode(pixel_values) |
|
|
|
|
|
print(f"Shape of latent embeddings: {latent_embeddings.shape}") |
|
|
# Expected output for a 256x256 input image with 16x16 patches is (1, 256, 768), |
|
|
# representing 256 image patches with 768-dimensional embeddings. |
|
|
``` |
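
If a downstream model expects a spatial feature map rather than a token sequence, the patch tokens can be rearranged into a grid. The snippet below continues from the example above and assumes the (1, 256, 768) output shape.

```python
# Rearrange (batch, tokens, dim) into (batch, dim, height, width).
b, n, d = latent_embeddings.shape
grid = int(n ** 0.5)  # 16 for 256 patch tokens
latent_map = latent_embeddings.transpose(1, 2).reshape(b, d, grid, grid)
print(latent_map.shape)  # torch.Size([1, 768, 16, 16])
```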
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Tokenizer Training |
|
|
- **Dataset**: ImageNet train set |
|
|
- **Resolution**: 256×256 |
|
|
- **Batch Size**: 1024 (global) |
|
|
- **Epochs**: 200 (base) + 100 (decoder fine-tuning) |
|
|
- **Denoising Parameters**: |
|
|
  - Gamma (noise strength): 3.0


  - Maximum mask ratio (M): 0.7


  - Per-sample masking ratio: sampled from max(0, U(-0.1, M))
|
|
|
|
|
### Key Training Components |
|
|
- **Interpolative Noise**: x' = (1 - τ)·x + τ·ε, where τ ~ U(0, 1) and ε is Gaussian noise scaled by the strength γ
|
|
- **Random Masking**: Variable masking ratios during training |
|
|
- **Multi-component Loss**: MSE + KL + Perceptual + Adversarial losses (sketched below)
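
As a rough illustration of how such a multi-component objective is typically assembled, the sketch below combines the four terms with placeholder weights. The weighting coefficients, the perceptual distance, and the adversarial term are assumptions for illustration, not the exact values or implementations used for DeTok.

```python
import torch
import torch.nn.functional as F

def tokenizer_loss(recon, target, post_mean, post_logvar, perceptual_fn, disc_logits_fake,
                   w_kl=1e-6, w_percep=1.0, w_adv=0.1):
    """Sketch of a conventional tokenizer objective: MSE + KL + perceptual + adversarial."""
    mse = F.mse_loss(recon, target)
    # KL divergence of a diagonal Gaussian posterior from a standard normal prior.
    kl = -0.5 * torch.mean(1 + post_logvar - post_mean.pow(2) - post_logvar.exp())
    percep = perceptual_fn(recon, target)    # e.g. an LPIPS-style perceptual distance
    adv = -disc_logits_fake.mean()           # generator term of a hinge/non-saturating GAN loss
    return mse + w_kl * kl + w_percep * percep + w_adv * adv

# Dummy usage with random tensors and an L1 stand-in for the perceptual distance:
x = torch.rand(2, 3, 256, 256); x_hat = torch.rand_like(x)
mu, logvar = torch.zeros(2, 256, 16), torch.zeros(2, 256, 16)
loss = tokenizer_loss(x_hat, x, mu, logvar, lambda a, b: (a - b).abs().mean(), torch.randn(2, 1))
print(loss)
```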
|
|
|
|
|
## Technical Details |
|
|
|
|
|
### Denoising Methodology |
|
|
|
|
|
1. **Interpolative Latent Noise**: We use interpolative noise rather than additive noise, which allows for heavier corruption when the noise level τ is high |
|
|
2. **Masking as Deconstruction**: We explore random masking as another form of latent deconstruction, inspired by masked autoencoders |
|
|
3. **Downstream Alignment**: The denoising objective is designed to align with how modern generative models operate |
|
|
|
|
|
### Potential Benefits |
|
|
|
|
|
- **Task Alignment**: The training objective is designed to match downstream generative model objectives |
|
|
- **Simplicity**: The approach works without requiring large-scale pretrained visual encoders or semantic distillation |
|
|
- **Generality**: We observe improvements across different types of generative models |
|
|
- **Robustness**: The learned representations appear to remain useful even under corruption |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Datasets |
|
|
- **Training**: ImageNet train set |
|
|
- **Evaluation**: ImageNet validation set (50k images) |
|
|
|
|
|
### Metrics |
|
|
- **FID-50k**: Fréchet Inception Distance on 50,000 generated samples (see the example after this list)


- **Inception Score**: Standard generative model evaluation metric


- **Precision & Recall**: Precision and recall of generated samples measured against the ImageNet validation set
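
For reference, FID can be computed with any standard implementation. The snippet below uses the `torchmetrics` package as one readily available option; this is not the evaluation code used for the numbers above, and exact values can differ slightly between FID implementations.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Small-scale illustration; the reported FID-50k uses 50,000 generated samples.
fid = FrechetInceptionDistance(feature=64)  # feature=2048 is the standard setting

real_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute())
```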
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@article{yang2025detok, |
|
|
title={Latent Denoising Makes Good Visual Tokenizers}, |
|
|
author={Jiawei Yang and Tianhong Li and Lijie Fan and Yonglong Tian and Yue Wang}, |
|
|
journal={arXiv preprint arXiv:2507.15856}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This project is licensed under the MIT License. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
This work builds upon many excellent open-source projects. We are particularly grateful to: |
|
|
- [MAR](https://github.com/LTH14/mar) for masked autoregressive modeling |
|
|
- [DiT](https://github.com/facebookresearch/DiT) for diffusion transformers |
|
|
- [MAE](https://github.com/facebookresearch/mae) for masked autoencoder insights |
|
|
- [1d-tokenizer](https://github.com/bytedance/1d-tokenizer) for tokenizer implementations |
|
|
- The broader research community for foundational work in generative modeling |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or issues, please open a GitHub issue at the [official repository](https://github.com/Jiawei-Yang/DeTok). |