arxiv:2503.01645

DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models

Published on Mar 3

Authors:

Abstract

DesignDiffusion is a diffusion-based framework that synthesizes design images from textual descriptions, achieving state-of-the-art performance with enhanced character embeddings and self-play Direct Preference Optimization.

AI-generated summary

In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works in a related task of visual text generation often focus on generating text within given specific regions, which limits the creativity of generation models, resulting in style or color inconsistencies between textual and visual elements if applied to design image generation. To address this issue, we propose an end-to-end, one-stage diffusion-based framework that avoids intricate components like position and layout modeling. Specifically, the proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enhance the input prompt, along with a character localization loss for enhanced supervision during text generation. Furthermore, we employ a self-play Direct Preference Optimization fine-tuning strategy to improve the quality and accuracy of the synthesized visual text. Extensive experiments demonstrate that DesignDiffusion achieves state-of-the-art performance in design image generation.


Community

BigCoco

Jun 26

Hi! I'm reading your paper and am really interested in your work.
However, I found a possible issue in your paper; I'm not sure whether it's a typo or my misunderstanding.

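In Algorithm 1, the DPO update formula is not the same as the one shown in formula (6) (the sigmoid part).
Is it $\sigma(-\beta T \omega(\lambda_t))$ (Algorithm 1), or
$\sigma\big(-\beta T \omega(\lambda_t)\,(\|\epsilon^w - \epsilon_\theta(x^w_t, t)\|_2^2 - \|\epsilon^w - \epsilon_{\mathrm{ref}}(x^w_t, t)\|_2^2 - \|\epsilon^l - \epsilon_\theta(x^l_t, t)\|_2^2 + \|\epsilon^l - \epsilon_{\mathrm{ref}}(x^l_t, t)\|_2^2)\big)$ (6)?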
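
For reference, here is a minimal numerical sketch of the sigmoid argument in the form quoted from formula (6). It is not the authors' code. It assumes the first norm is squared like the other three, treats ω(λ_t) as a plain scalar weight, and assumes the loss is −log σ(·) as in the usual diffusion DPO objective. The function and parameter names are hypothetical.

```python
import math

def sq_l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dpo_sigmoid_arg(eps_w, eps_l, pred_w, ref_w, pred_l, ref_l,
                    beta, T, omega=1.0):
    """Sigmoid argument in the form quoted from formula (6).

    eps_w / eps_l : noise added to the preferred / dispreferred sample
    pred_*        : noise predicted by the model being trained
    ref_*         : noise predicted by the frozen reference model
    omega         : stands in for omega(lambda_t), assumed scalar here
    """
    win_gap = sq_l2(eps_w, pred_w) - sq_l2(eps_w, ref_w)
    lose_gap = sq_l2(eps_l, pred_l) - sq_l2(eps_l, ref_l)
    return -beta * T * omega * (win_gap - lose_gap)

def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z)) = softplus(-z)."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_loss(eps_w, eps_l, pred_w, ref_w, pred_l, ref_l, beta, T, omega=1.0):
    """Assumed loss: -log sigmoid of the argument above."""
    z = dpo_sigmoid_arg(eps_w, eps_l, pred_w, ref_w, pred_l, ref_l,
                        beta, T, omega)
    return neg_log_sigmoid(z)

# Toy vectors: the trained model matches the winner's noise better than
# the reference does, and equals the reference on the loser.
eps_w, eps_l = [1.0, 0.0], [0.0, 1.0]
ref_w, ref_l = [0.5, 0.0], [0.0, 0.5]
arg_same = dpo_sigmoid_arg(eps_w, eps_l, ref_w, ref_w, ref_l, ref_l,
                           beta=2.0, T=1)
arg_better = dpo_sigmoid_arg(eps_w, eps_l, [1.0, 0.0], ref_w, ref_l, ref_l,
                             beta=2.0, T=1)
```

In this form the argument depends on the trained model's predictions through the four error terms. The shorter expression σ(−βT ω(λ_t)) contains no such term, so read literally it would be constant with respect to the model parameters.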

