|
|
--- |
|
|
license: cc-by-nc-4.0 |
|
|
base_model: |
|
|
- stabilityai/stable-diffusion-3-medium-diffusers |
|
|
pipeline_tag: image-to-image |
|
|
tags: |
|
|
- image-generation |
|
|
- image-to-image |
|
|
- virtual-try-on |
|
|
- virtual-try-off |
|
|
- diffusion |
|
|
- dit |
|
|
- stable-diffusion-3 |
|
|
- multimodal |
|
|
- fashion |
|
|
- pytorch |
|
|
language: en |
|
|
datasets: |
|
|
- dresscode |
|
|
- viton-hd |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
<h1 align="center">TEMU-VTOFF</h1> |
|
|
<h3 align="center">Text-Enhanced MUlti-category Virtual Try-Off</h3> |
|
|
</div> |
|
|
|
|
|
<div align="center"> |
|
|
<picture> |
|
|
<source srcset="/davidelobba/TEMU-VTOFF/resolve/main/teaser.png" media="(prefers-color-scheme: dark)"> |
|
|
<img src="/davidelobba/TEMU-VTOFF/resolve/main/teaser.png" width="75%" alt="TEMU-VTOFF Teaser"> |
|
|
|
|
</picture> |
|
|
</div> |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals** |
|
|
[Davide Lobba](https://scholar.google.com/citations?user=WEMoLPEAAAAJ&hl=en&oi=ao)<sup>1,2,\*</sup>, [Fulvio Sanguigni](https://scholar.google.com/citations?user=tSpzMUEAAAAJ&hl=en)<sup>2,3,\*</sup>, [Bin Ren](https://scholar.google.com/citations?user=Md9maLYAAAAJ&hl=en)<sup>1,2</sup>, [Marcella Cornia](https://scholar.google.com/citations?user=DzgmSJEAAAAJ&hl=en)<sup>3</sup>, [Rita Cucchiara](https://scholar.google.com/citations?user=OM3sZEoAAAAJ&hl=en)<sup>3</sup>, [Nicu Sebe](https://scholar.google.com/citations?user=stFCYOAAAAAJ&hl=en)<sup>1</sup> |
|
|
<sup>1</sup>University of Trento, <sup>2</sup>University of Pisa, <sup>3</sup>University of Modena and Reggio Emilia |
|
|
<sup>*</sup> Equal contribution |
|
|
</div> |
|
|
|
|
|
<div align="center"> |
|
|
<a href="https://arxiv.org/abs/2505.21062" style="margin: 0 2px;"> |
|
|
<img src="https://img.shields.io/badge/Paper-Arxiv_2505.21062-darkred.svg" alt="Paper"> |
|
|
</a> |
|
|
<a href="https://temu-vtoff-page.github.io/" style="margin: 0 2px;"> |
|
|
<img src='https://img.shields.io/badge/Webpage-Project-silver?style=flat&logo=&logoColor=orange' alt='Project Webpage'> |
|
|
</a> |
|
|
<a href="https://github.com/davidelobba/TEMU-VTOFF" style="margin: 0 2px;"> |
|
|
<img src="https://img.shields.io/badge/GitHub-Repo-blue.svg?logo=github" alt="GitHub Repository"> |
|
|
</a> |
|
|
<!-- The Hugging Face model badge will be automatically displayed on the model page --> |
|
|
</div> |
|
|
|
|
|
## 💡 Model Description
|
|
|
|
|
**TEMU-VTOFF** is a novel dual-DiT (Diffusion Transformer) architecture designed for the Virtual Try-Off task: generating in-shop, product-style images of garments starting from photos of the people wearing them. By combining a pretrained feature extractor with a text-enhanced generation module, our method can handle occlusions, multiple garment categories, and ambiguous appearances. It further refines generation fidelity via a feature alignment module based on DINOv2.
|
|
|
|
|
This model is based on `stabilityai/stable-diffusion-3-medium-diffusers`. The uploaded weights correspond to the fine-tuned feature extractor and the VTOFF DiT module.
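The weights in this repository can be fetched with `huggingface_hub`. The sketch below is illustrative (the helper name is an assumption); the actual try-off inference pipeline is provided in the GitHub repository linked above:

```python
# Hedged usage sketch: this helper only downloads the fine-tuned weights from
# the Hub; the full TEMU-VTOFF pipeline code lives in the GitHub repository.
# The function name is illustrative, not part of the official API.
def download_temu_vtoff_weights(repo_id: str = "davidelobba/TEMU-VTOFF") -> str:
    """Download the fine-tuned feature extractor and VTOFF DiT weights,
    returning the local snapshot directory."""
    from huggingface_hub import snapshot_download  # deferred import
    return snapshot_download(repo_id=repo_id)
```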
|
|
|
|
|
## ✨ Key Features
|
|
Our contributions can be summarized as follows:
|
|
- **🎯 Multi-Category Try-Off**. We present a unified framework capable of handling multiple garment types (upper-body, lower-body, and full-body clothes) without requiring category-specific pipelines.
|
|
- **🔗 Multimodal Hybrid Attention**. We introduce a novel attention mechanism that integrates garment textual descriptions into the generative process by linking them with person-specific features. This helps the model synthesize occluded or ambiguous garment regions more accurately.
|
|
- **⚡ Garment Aligner Module**. We design a lightweight aligner that conditions generation on clean garment images, replacing conventional denoising objectives. This improves alignment consistency across the dataset and preserves fine-grained visual details.
|
|
- **📊 Extensive Experiments**. Experiments on the Dress Code and VITON-HD datasets demonstrate that TEMU-VTOFF outperforms prior methods in both generated-image quality and alignment with the target garment, highlighting its strong generalization capabilities.
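For intuition, the multimodal hybrid attention above can be pictured as a cross-attention whose context concatenates text tokens and person-feature tokens, so the textual description can guide synthesis of occluded garment regions. The toy module below is an illustrative sketch with hypothetical dimensions, not the released implementation:

```python
# Illustrative sketch of multimodal hybrid attention (NOT the official
# TEMU-VTOFF code): garment latent tokens attend jointly to text tokens and
# person-image features. All dimensions are hypothetical.
import torch
import torch.nn as nn


class HybridAttention(nn.Module):
    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim)

    def forward(self, garment_tokens, text_tokens, person_tokens):
        # Queries come from the garment latents being denoised.
        q = self.to_q(garment_tokens)
        # Keys/values mix the textual description with person features,
        # letting text disambiguate occluded or ambiguous regions.
        context = torch.cat([text_tokens, person_tokens], dim=1)
        kv = self.to_kv(context)
        out, _ = self.attn(q, kv, kv)
        return out


block = HybridAttention()
garment = torch.randn(2, 16, 64)  # (batch, garment tokens, dim)
text = torch.randn(2, 8, 64)      # (batch, text tokens, dim)
person = torch.randn(2, 32, 64)   # (batch, person-feature tokens, dim)
out = block(garment, text, person)
print(out.shape)  # torch.Size([2, 16, 64])
```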
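Similarly, the feature-alignment idea behind the aligner module can be sketched as pulling DINOv2-style features of the generated garment toward those of the clean reference garment. This is a minimal illustrative objective, assuming patch-token features; it is not the paper's exact formulation:

```python
# Illustrative alignment objective (not the paper's exact aligner): cosine
# distance between DINOv2-style patch features of the generated garment and
# the clean reference garment. Feature dimensions are hypothetical.
import torch
import torch.nn.functional as F


def alignment_loss(gen_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
    # gen_feats, ref_feats: (batch, tokens, feat_dim) patch-token features.
    gen = F.normalize(gen_feats, dim=-1)
    ref = F.normalize(ref_feats, dim=-1)
    # 1 - cosine similarity, averaged over tokens and batch.
    return (1.0 - (gen * ref).sum(dim=-1)).mean()


feats = torch.randn(2, 16, 384)
loss_same = alignment_loss(feats, feats)                    # identical features
loss_diff = alignment_loss(feats, torch.randn(2, 16, 384))  # unrelated features
print(float(loss_same))  # ≈ 0.0
```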