arxiv:2503.12652

UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Published on Mar 16

Abstract

AI-generated summary: UniVG is a unified diffusion model supporting diverse image generation tasks using a single set of weights, demonstrating competitive performance across various applications.

Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.
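As a rough illustration of the two ideas the abstract names, multi-task data mixing and treating multi-modal inputs as one unified condition, the following is a minimal sketch and not the authors' implementation. The task names are taken from the abstract, but the mixing weights, the function names `sample_task_schedule` and `build_condition`, and the dictionary layout are all assumptions made for illustration; the abstract does not state the actual ratios or conditioning format.

```python
import random
from collections import Counter

# Hypothetical mixing weights; the real ratios are not given in the abstract.
TASK_WEIGHTS = {
    "t2i": 0.5,
    "inpainting": 0.1,
    "instruction_editing": 0.2,
    "depth_estimation": 0.1,
    "referring_segmentation": 0.1,
}


def sample_task_schedule(weights, num_steps, seed=0):
    """Draw one task name per training step, proportionally to its weight."""
    rng = random.Random(seed)  # local RNG so the schedule is reproducible
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=num_steps)


def build_condition(task, prompt, image=None, mask=None):
    """Pack whichever inputs a task provides into one record.

    Inputs a task does not use stay None, so every task shares one
    interface (an assumed layout, not the paper's).
    """
    return {"task": task, "prompt": prompt, "image": image, "mask": mask}


schedule = sample_task_schedule(TASK_WEIGHTS, num_steps=1000)
counts = Counter(schedule)  # how often each task was drawn
example = build_condition("t2i", "a red bicycle leaning on a wall")
```

Sampling the task per step from a fixed distribution is only one simple way to mix tasks under a single set of weights; the paper's empirical study of data mixing may use a different scheme.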

Community

A young man wearing black wide-leg pants, a black jacket, layered silver chains, and rectangular sunglasses stands confidently at a gas station under a clear blue sky. The sunlight flares behind him, creating a dramatic lens glow. The camera is positioned low and close, looking slightly up toward him, capturing his assertive stance and expression. Suddenly, the camera begins a rapid and smooth zoom-out, while the man remains completely still. The scene expands quickly through the gas station, over the surrounding desert, climbing high above the landscape. The zoom continues beyond the atmosphere. In the final view, we see a single, photorealistic Earth slowly rotating in the darkness of space, with no duplicated elements or overlays. A soft glowing dot marks the man’s original position on the planet.


