DeTok: Latent Denoising Makes Good Visual Tokenizers
Model Description
l-DeTok (Latent Denoising Tokenizer) is a simple approach that incorporates denoising objectives into visual tokenizer training. We observe that many modern generative models share a common training paradigm of reconstructing clean signals from corrupted inputs, and explore whether aligning tokenizer training with this principle benefits downstream generation tasks.
Approach
Modern generative models, whether diffusion-based or autoregressive, are trained to reconstruct clean signals from corrupted inputs (a form of denoising). This work investigates whether training the tokenizer itself to reconstruct clean images from corrupted latent embeddings produces representations that are better suited for downstream generative modeling.
Architecture
- Encoder-Decoder Architecture: Based on Vision Transformers (ViT)
- Denoising Strategies:
  - Interpolative Latent Noise: Corrupts latent embeddings by interpolating them with sampled noise
  - Random Masking: Masks random subsets of image patches during training
- Training Losses: The same losses as conventional image tokenizers (MSE, KL, perceptual, and adversarial; see Key Training Components below)
Model Variants
| Model | Type | Parameters | Description |
|---|---|---|---|
| DeTok-BB | Tokenizer | 172M | Base tokenizer with denoising training |
| DeTok-BB-decoder_ft | Tokenizer | 172M | Base tokenizer with additional decoder fine-tuning |
Results
We evaluate our approach across six generative models on ImageNet 256×256 and observe consistent improvements:
With MAR Models (FID-50k with CFG)
| Model | FID-50k | Inception Score | Parameters |
|---|---|---|---|
| MAR-Base + MAR-VAE | 2.31 | 281.7 | 208M |
| MAR-Base + DeTok-BB | 1.61 | 289.7 | 208M |
| MAR-Base + DeTok-BB-decoder_ft | 1.55 | 291.0 | 208M |
| MAR-Large + MAR-VAE | 1.78 | 296.0 | 479M |
| MAR-Huge + MAR-VAE | 1.55 | 303.7 | 943M |
| MAR-Large + DeTok-BB | 1.43 | 303.5 | 479M |
| MAR-Large + DeTok-BB-decoder_ft | 1.32 | 304.1 | 479M |
Observations
- MAR-B: FID improves from 2.31 (MAR-VAE) to 1.55 (DeTok-BB-decoder_ft)
- MAR-L: FID improves from 1.78 (MAR-VAE) to 1.32 (DeTok-BB-decoder_ft)
- The approach works across both non-autoregressive models (DiT, SiT, LightningDiT) and autoregressive models (MAR, RasterAR, RandomAR)
Usage
Installation
```bash
git clone https://github.com/Jiawei-Yang/DeTok.git
cd DeTok
conda create -n detok python=3.10 -y && conda activate detok
pip install -r requirements.txt
```
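Once installed, released checkpoints can be inspected with plain PyTorch. The snippet below is only a sketch: the checkpoint filename is a placeholder, and the unwrapping step is an assumption about how the weights might be stored; consult the repository for the actual loading code.

```python
import torch

# Placeholder filename -- substitute the actual released checkpoint file.
state = torch.load("detok_BB.pth", map_location="cpu")

# Assumption: some releases nest weights under a key such as "model".
if isinstance(state, dict) and "model" in state:
    state = state["model"]

# A quick look at what the checkpoint contains.
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))
```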
Training Details
Tokenizer Training
- Dataset: ImageNet train set
- Resolution: 256×256
- Batch Size: 1024 (global)
- Epochs: 200 (base) + 100 (decoder fine-tuning)
- Denoising Parameters:
  - Gamma γ (noise strength): 3.0
  - Maximum mask ratio M: 0.7
  - Masking ratio per iteration: sampled as max(0, U(-0.1, M)); because the uniform range dips below zero, about 12.5% of iterations apply no masking at all (see the sketch below)
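A minimal sketch of that sampling rule (the function name is ours, not the repository's):

```python
import random

def sample_mask_ratio(max_ratio: float = 0.7, low: float = -0.1) -> float:
    """Sample a per-iteration masking ratio as max(0, U(low, max_ratio)).

    The uniform range dips below zero, so a fraction
    (0 - low) / (max_ratio - low) = 0.125 of iterations use no masking.
    """
    return max(0.0, random.uniform(low, max_ratio))
```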
Key Training Components
- Interpolative Noise: x' = (1 - τ)·x + τ·ε, where τ ~ U(0, 1) and ε is noise with strength set by γ (sketched in code after this list)
- Random Masking: Variable masking ratios during training
- Multi-component Loss: MSE + KL + Perceptual + Adversarial losses
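Putting these components together, a training-time corruption step might look like the following sketch (PyTorch; the tensor shapes, the zeroing-based masking, and the reading of γ as the noise standard deviation are our assumptions, not the released code):

```python
import torch

def corrupt_latents(z: torch.Tensor, gamma: float = 3.0, mask_ratio: float = 0.7):
    """Corrupt latent tokens z of shape (B, N, D) for denoising training.

    Interpolative noise: z' = (1 - tau) * z + tau * eps, tau ~ U(0, 1),
    where eps is Gaussian noise scaled by gamma (our reading of the
    'noise strength' parameter).
    """
    B, N, D = z.shape
    tau = torch.rand(B, 1, 1, device=z.device)   # per-sample noise level
    eps = gamma * torch.randn_like(z)            # scaled Gaussian noise
    z_noised = (1.0 - tau) * z + tau * eps       # interpolative corruption

    # Random masking: drop a subset of tokens (MAE-style). Here we simply
    # zero them out; the actual implementation may instead remove tokens
    # and re-insert learned mask tokens.
    keep = torch.rand(B, N, 1, device=z.device) > mask_ratio
    return z_noised * keep

# Example: corrupt a batch of 256 latent tokens with 16 channels each.
z = torch.randn(4, 256, 16)
z_corrupted = corrupt_latents(z)
```

The decoder is then trained to reconstruct the clean image from `z_corrupted`, using the multi-component loss listed above.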
Technical Details
Denoising Methodology
- Interpolative Latent Noise: We use interpolative rather than additive noise; as τ grows, interpolation attenuates the clean signal itself, so the corruption can be made much heavier than addition alone allows (see the numeric check after this list)
- Masking as Deconstruction: We explore random masking as another form of latent deconstruction, inspired by masked autoencoders
- Downstream Alignment: The denoising objective is designed to align with how modern generative models operate
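A quick way to see the first point: at τ = 1 the interpolated latent (1 - τ)x + τε contains no signal at all, whereas additive corruption x + σε always keeps x underneath. A small numeric illustration (ours, not from the paper):

```python
import torch

torch.manual_seed(0)
x = torch.randn(10_000)
eps = torch.randn(10_000)

tau = 0.99    # interpolation weight (near-total corruption)
sigma = 0.99  # additive noise scale, for comparison

interp = (1 - tau) * x + tau * eps  # signal almost entirely gone
additive = x + sigma * eps          # clean signal still present underneath

def corr(a, b):
    return torch.corrcoef(torch.stack([a, b]))[0, 1].item()

print(f"corr(x, interpolative) = {corr(x, interp):.3f}")    # ~0.01
print(f"corr(x, additive)      = {corr(x, additive):.3f}")  # ~0.71
```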
Potential Benefits
- Task Alignment: The training objective is designed to match downstream generative model objectives
- Simplicity: The approach works without requiring large-scale pretrained visual encoders or semantic distillation
- Generality: We observe improvements across different types of generative models
- Robustness: The learned representations appear to remain useful even under corruption
Evaluation
Datasets
- Training: ImageNet train set
- Evaluation: ImageNet validation set (50k images)
Metrics
- FID-50k: Fréchet Inception Distance on 50,000 generated samples (see the sketch after this list)
- Inception Score: Standard generative model evaluation metric
- Precision & Recall: Using ImageNet validation precision-recall data
Citation
```bibtex
@article{yang2025detok,
  title={Latent Denoising Makes Good Visual Tokenizers},
  author={Jiawei Yang and Tianhong Li and Lijie Fan and Yonglong Tian and Yue Wang},
  journal={arXiv preprint arXiv:2507.15856},
  year={2025}
}
```
License
This project is licensed under the MIT License.
Acknowledgments
This work builds upon many excellent open-source projects. We are particularly grateful to:
- MAR for masked autoregressive modeling
- DiT for diffusion transformers
- MAE for masked autoencoder insights
- 1d-tokenizer for tokenizer implementations
- The broader research community for foundational work in generative modeling
Contact
For questions or issues, please open a GitHub issue at the official repository.