DeTok: Latent Denoising Makes Good Visual Tokenizers
Model Description
l-DeTok (Latent Denoising Tokenizer) is a simple approach that incorporates denoising objectives into visual tokenizer training. We observe that many modern generative models share a common training paradigm of reconstructing clean signals from corrupted inputs, and explore whether aligning tokenizer training with this principle benefits downstream generation tasks.
Approach
Modern generative models, whether diffusion-based or autoregressive, are trained to reconstruct clean signals from corrupted inputs (a form of denoising). This work investigates whether training the tokenizer itself to reconstruct clean images from corrupted latent embeddings produces representations that are better suited for downstream generative modeling.
Architecture
- Encoder-Decoder Architecture: Based on Vision Transformers (ViT)
- Denoising Strategies:
  - Interpolative Latent Noise: Corrupts latent embeddings by interpolating them with sampled noise
  - Random Masking: Masks random subsets of image patches during training
- Training Losses: The same losses as conventional image tokenizers (MSE, KL, perceptual, and adversarial; see Key Training Components below)
Model Variants
| Model | Type | Parameters | Description |
|---|---|---|---|
| DeTok-BB | Tokenizer | 172M | Base tokenizer with denoising training |
| DeTok-BB-decoder_ft | Tokenizer | 172M | Base tokenizer with additional decoder fine-tuning |
Results
We evaluate our approach across six generative models on ImageNet 256×256 and observe consistent improvements:
With MAR Models (FID-50k with CFG)
| Model | FID-50k | Inception Score | Parameters |
|---|---|---|---|
| MAR-Base + MAR-VAE | 2.31 | 281.7 | 208M |
| MAR-Base + DeTok-BB | 1.61 | 289.7 | 208M |
| MAR-Base + DeTok-BB-decoder_ft | 1.55 | 291.0 | 208M |
| MAR-Large + MAR-VAE | 1.78 | 296.0 | 479M |
| MAR-Huge + MAR-VAE | 1.55 | 303.7 | 943M |
| MAR-Large + DeTok-BB | 1.43 | 303.5 | 479M |
| MAR-Large + DeTok-BB-decoder_ft | 1.32 | 304.1 | 479M |
Observations
- MAR-B: FID improves from 2.31 (MAR-VAE) to 1.55 (DeTok-BB-decoder_ft)
- MAR-L: FID improves from 1.78 (MAR-VAE) to 1.32 (DeTok-BB-decoder_ft)
- The approach works across both non-autoregressive models (DiT, SiT, LightningDiT) and autoregressive models (MAR, RasterAR, RandomAR)
Usage
Installation
```bash
git clone https://github.com/Jiawei-Yang/DeTok.git
cd DeTok
conda create -n detok python=3.10 -y && conda activate detok
pip install -r requirements.txt
```
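Once installed, released checkpoints can be inspected with plain PyTorch. The snippet below is only a sketch: the checkpoint filename is a placeholder, and the unwrapping step is an assumption about how the weights might be stored; consult the repository for the actual loading code.

```python
import torch

# Placeholder filename -- substitute the actual released checkpoint file.
state = torch.load("detok_BB.pth", map_location="cpu")

# Assumption: some releases nest weights under a key such as "model".
if isinstance(state, dict) and "model" in state:
    state = state["model"]

# A quick look at what the checkpoint contains.
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))
```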
Training Details
Tokenizer Training
- Dataset: ImageNet train set
- Resolution: 256×256
- Batch Size: 1024 (global)
- Epochs: 200 (base) + 100 (decoder fine-tuning)
- Denoising Parameters:
  - Gamma γ (noise strength): 3.0
  - Maximum mask ratio M: 0.7
  - Masking ratio per iteration: sampled as max(0, U(-0.1, M)); because the uniform range dips below zero, about 12.5% of iterations apply no masking at all (see the sketch below)
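A minimal sketch of that sampling rule (the function name is ours, not the repository's):

```python
import random

def sample_mask_ratio(max_ratio: float = 0.7, low: float = -0.1) -> float:
    """Sample a per-iteration masking ratio as max(0, U(low, max_ratio)).

    The uniform range dips below zero, so a fraction
    (0 - low) / (max_ratio - low) = 0.125 of iterations use no masking.
    """
    return max(0.0, random.uniform(low, max_ratio))
```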
Key Training Components
- Interpolative Noise: x' = (1 - τ)·x + τ·ε, where τ ~ U(0, 1) and ε is noise with strength set by γ (sketched in code after this list)
- Random Masking: Variable masking ratios during training
- Multi-component Loss: MSE + KL + Perceptual + Adversarial losses
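Putting these components together, a training-time corruption step might look like the following sketch (PyTorch; the tensor shapes, the zeroing-based masking, and the reading of γ as the noise standard deviation are our assumptions, not the released code):

```python
import torch

def corrupt_latents(z: torch.Tensor, gamma: float = 3.0, mask_ratio: float = 0.7):
    """Corrupt latent tokens z of shape (B, N, D) for denoising training.

    Interpolative noise: z' = (1 - tau) * z + tau * eps, tau ~ U(0, 1),
    where eps is Gaussian noise scaled by gamma (our reading of the
    'noise strength' parameter).
    """
    B, N, D = z.shape
    tau = torch.rand(B, 1, 1, device=z.device)   # per-sample noise level
    eps = gamma * torch.randn_like(z)            # scaled Gaussian noise
    z_noised = (1.0 - tau) * z + tau * eps       # interpolative corruption

    # Random masking: drop a subset of tokens (MAE-style). Here we simply
    # zero them out; the actual implementation may instead remove tokens
    # and re-insert learned mask tokens.
    keep = torch.rand(B, N, 1, device=z.device) > mask_ratio
    return z_noised * keep

# Example: corrupt a batch of 256 latent tokens with 16 channels each.
z = torch.randn(4, 256, 16)
z_corrupted = corrupt_latents(z)
```

The decoder is then trained to reconstruct the clean image from `z_corrupted`, using the multi-component loss listed above.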
Technical Details
Denoising Methodology
- Interpolative Latent Noise: We use interpolative rather than additive noise; as τ grows, interpolation attenuates the clean signal itself, so the corruption can be made much heavier than addition alone allows (see the numeric check after this list)
- Masking as Deconstruction: We explore random masking as another form of latent deconstruction, inspired by masked autoencoders
- Downstream Alignment: The denoising objective is designed to align with how modern generative models operate
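A quick way to see the first point: at τ = 1 the interpolated latent (1 - τ)x + τε contains no signal at all, whereas additive corruption x + σε always keeps x underneath. A small numeric illustration (ours, not from the paper):

```python
import torch

torch.manual_seed(0)
x = torch.randn(10_000)
eps = torch.randn(10_000)

tau = 0.99    # interpolation weight (near-total corruption)
sigma = 0.99  # additive noise scale, for comparison

interp = (1 - tau) * x + tau * eps  # signal almost entirely gone
additive = x + sigma * eps          # clean signal still present underneath

def corr(a, b):
    return torch.corrcoef(torch.stack([a, b]))[0, 1].item()

print(f"corr(x, interpolative) = {corr(x, interp):.3f}")    # ~0.01
print(f"corr(x, additive)      = {corr(x, additive):.3f}")  # ~0.71
```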
Potential Benefits
- Task Alignment: The training objective is designed to match downstream generative model objectives
- Simplicity: The approach works without requiring large-scale pretrained visual encoders or semantic distillation
- Generality: We observe improvements across different types of generative models
- Robustness: The learned representations appear to remain useful even under corruption
Evaluation
Datasets
- Training: ImageNet train set
- Evaluation: ImageNet validation set (50k images)
Metrics
- FID-50k: Fréchet Inception Distance on 50,000 generated samples (see the sketch after this list)
- Inception Score: Standard generative model evaluation metric
- Precision & Recall: Using ImageNet validation precision-recall data
Citation
```bibtex
@article{yang2025detok,
  title={Latent Denoising Makes Good Visual Tokenizers},
  author={Jiawei Yang and Tianhong Li and Lijie Fan and Yonglong Tian and Yue Wang},
  journal={arXiv preprint arXiv:2507.15856},
  year={2025}
}
```
License
This project is licensed under the MIT License.
Acknowledgments
This work builds upon many excellent open-source projects. We are particularly grateful to:
- MAR for masked autoregressive modeling
- DiT for diffusion transformers
- MAE for masked autoencoder insights
- 1d-tokenizer for tokenizer implementations
- The broader research community for foundational work in generative modeling
Contact
For questions or issues, please open a GitHub issue at the official repository.