DeTok: Latent Denoising Makes Good Visual Tokenizers


Model Description

l-DeTok (Latent Denoising Tokenizer) is a simple approach that incorporates denoising objectives into visual tokenizer training. We observe that many modern generative models share a common training paradigm of reconstructing clean signals from corrupted inputs, and explore whether aligning tokenizer training with this principle benefits downstream generation tasks.

Approach

Modern generative models are typically trained to reconstruct clean signals from corrupted inputs, a form of denoising. This work investigates whether training the tokenizer itself to reconstruct clean images from corrupted latent embeddings yields representations better suited to downstream generative modeling.

Architecture

  • Encoder-Decoder Architecture: Based on Vision Transformers (ViT)
  • Denoising Strategies (sketched in code below):
    • Interpolative Latent Noise: Corrupts latent embeddings through noise interpolation
    • Random Masking: Masks random subsets of image patches during training
  • Training Losses: Same as conventional image tokenizers.
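
A minimal PyTorch sketch of the two corruption strategies is given below. This is an illustration reconstructed from the description above, not the repository's code; the tensor layout and the mask-token handling are assumptions.

```python
import torch

def interpolative_noise(z: torch.Tensor, gamma: float = 3.0) -> torch.Tensor:
    """Corrupt latents by interpolating toward Gaussian noise of strength gamma."""
    # z: latent embeddings of shape (B, N, D) -- batch, tokens, channels (assumed layout)
    tau = torch.rand(z.size(0), 1, 1, device=z.device)  # tau ~ U(0, 1), one per sample
    eps = torch.randn_like(z) * gamma                   # noise scaled by strength gamma
    return (1.0 - tau) * z + tau * eps                  # x' = (1 - tau) * x + tau * eps

def random_masking(z: torch.Tensor, mask_ratio: float, mask_token: torch.Tensor) -> torch.Tensor:
    """Replace a random subset of patch embeddings with a learned mask token."""
    B, N, D = z.shape
    masked = torch.rand(B, N, device=z.device) < mask_ratio  # True where a token is dropped
    return torch.where(masked.unsqueeze(-1), mask_token.expand(B, N, D), z)
```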

Model Variants

| Model | Type | Parameters | Description |
|---|---|---|---|
| DeTok-BB | Tokenizer | 172M | Base tokenizer with denoising training |
| DeTok-BB-decoder_ft | Tokenizer | 172M | Base tokenizer with additional decoder fine-tuning |

Results

We evaluate our approach across six generative models on ImageNet 256×256 and observe consistent improvements:

With MAR Models (FID-50k with CFG)

| Model | FID-50K | Inception Score | Parameters |
|---|---|---|---|
| MAR-Base + MAR-VAE | 2.31 | 281.7 | 208M |
| MAR-Base + DeTok-BB | 1.61 | 289.7 | 208M |
| MAR-Base + DeTok-BB-decoder_ft | 1.55 | 291.0 | 208M |
| MAR-Large + MAR-VAE | 1.78 | 296.0 | 479M |
| MAR-Huge + MAR-VAE | 1.55 | 303.7 | 943M |
| MAR-Large + DeTok-BB | 1.43 | 303.5 | 479M |
| MAR-Large + DeTok-BB-decoder_ft | 1.32 | 304.1 | 479M |

Observations

  • MAR-B: FID improves from 2.31 (MAR-VAE) → 1.55 (DeTok-BB-decoder_ft)
  • MAR-L: FID improves from 1.78 (MAR-VAE) → 1.32 (DeTok-BB-decoder_ft)
  • The approach works across both non-autoregressive (DiT, SiT, LightningDiT) and autoregressive models (MAR, RasterAR, RandomAR)

Usage

Installation

```bash
git clone https://github.com/Jiawei-Yang/DeTok.git
cd DeTok
conda create -n detok python=3.10 -y && conda activate detok
pip install -r requirements.txt
```
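
After installation, reconstruction usage should look roughly like the sketch below. The module path, class name, checkpoint filename, and `encode`/`decode` method names are placeholders, not the repository's actual API; check the DeTok repo for the real entry points.

```python
import torch
from PIL import Image
from torchvision import transforms

from detok import DeTok  # hypothetical import; see the repository for the real module

tokenizer = DeTok.from_checkpoint("detok_bb.pth")  # hypothetical loader and filename
tokenizer.eval()

img = Image.open("example.jpg").convert("RGB").resize((256, 256))
x = transforms.ToTensor()(img).unsqueeze(0)  # (1, 3, 256, 256), values in [0, 1]

with torch.no_grad():
    latents = tokenizer.encode(x)      # image -> latent embeddings
    recon = tokenizer.decode(latents)  # latents -> reconstructed image
```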

Training Details

Tokenizer Training

  • Dataset: ImageNet train set
  • Resolution: 256×256
  • Batch Size: 1024 (global)
  • Epochs: 200 (base) + 100 (decoder fine-tuning)
  • Denoising Parameters:
    • Noise strength γ (gamma): 3.0
    • Maximum mask ratio M: 0.7
    • Per-iteration mask ratio: sampled as max(0, U(-0.1, M)), so masking is occasionally disabled entirely
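
A minimal sketch of that sampling rule, using the listed M = 0.7:

```python
import random

M = 0.7  # maximum mask ratio, as listed above

def sample_mask_ratio(m_max: float = M) -> float:
    """Draw max(0, U(-0.1, m_max)); the negative tail clamps to 0 (no masking)."""
    return max(0.0, random.uniform(-0.1, m_max))
```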

Key Training Components

  • Interpolative Noise: x' = (1−τ)x + τε, where τ ~ U(0,1) and ε is Gaussian noise scaled by the strength γ
  • Random Masking: Variable masking ratios during training
  • Multi-component Loss: MSE + KL + Perceptual + Adversarial losses
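
A sketch of how these four terms might be combined is shown below, assuming common choices (an LPIPS network for the perceptual term, a discriminator for the adversarial term) and placeholder weights; the repository defines the actual modules and coefficients.

```python
import torch
import torch.nn.functional as F

def tokenizer_loss(recon, target, mu, logvar, lpips_fn, disc_fn,
                   w_kl=1e-6, w_perc=1.0, w_adv=0.1):  # placeholder weights
    """Combine MSE + KL + perceptual + adversarial losses for tokenizer training."""
    mse = F.mse_loss(recon, target)
    # KL of the diagonal-Gaussian posterior N(mu, exp(logvar)) against N(0, I)
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
    perceptual = lpips_fn(recon, target).mean()  # e.g. an LPIPS similarity network
    adversarial = -disc_fn(recon).mean()         # generator term: raise discriminator score
    return mse + w_kl * kl + w_perc * perceptual + w_adv * adversarial
```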

Technical Details

Denoising Methodology

  1. Interpolative Latent Noise: We use interpolative rather than additive noise; as τ approaches 1 the corrupted latent approaches pure noise, allowing heavier corruption than additive schemes, which always retain the clean signal at full strength
  2. Masking as Deconstruction: We explore random masking as another form of latent deconstruction, inspired by masked autoencoders
  3. Downstream Alignment: The denoising objective is designed to align with how modern generative models operate

Potential Benefits

  • Task Alignment: The training objective is designed to match downstream generative model objectives
  • Simplicity: The approach works without requiring large-scale pretrained visual encoders or semantic distillation
  • Generality: We observe improvements across different types of generative models
  • Robustness: The learned representations appear to remain useful even under corruption

Evaluation

Datasets

  • Training: ImageNet train set
  • Evaluation: ImageNet validation set (50k images)

Metrics

  • FID-50k: Fréchet Inception Distance computed on 50,000 generated samples (a computation sketch follows this list)
  • Inception Score: standard generative-model evaluation metric
  • Precision & Recall: computed using ImageNet validation images as the reference set
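
As one concrete option, FID can be computed with torchmetrics (requires `torchmetrics` and `torch-fidelity`); the random tensors below stand in for real validation images and generated samples:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Stand-ins for ImageNet validation images and generated samples:
# uint8 tensors of shape (B, 3, 256, 256) with values in [0, 255].
real_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)

fid.update(real_images, real=True)   # accumulate Inception features of real images
fid.update(fake_images, real=False)  # accumulate features of generated images
print("FID:", fid.compute().item())
```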

Citation

```bibtex
@article{yang2025detok,
  title={Latent Denoising Makes Good Visual Tokenizers},
  author={Jiawei Yang and Tianhong Li and Lijie Fan and Yonglong Tian and Yue Wang},
  journal={arXiv preprint arXiv:2507.15856},
  year={2025}
}
```

License

This project is licensed under the MIT License.

Acknowledgments

This work builds upon many excellent open-source projects. We are particularly grateful to:

  • MAR for masked autoregressive modeling
  • DiT for diffusion transformers
  • MAE for masked autoencoder insights
  • 1d-tokenizer for tokenizer implementations
  • The broader research community for foundational work in generative modeling

Contact

For questions or issues, please open a GitHub issue at the official repository.
