arxiv:2509.24514

Instruction-Guided Multi-Object Image Editing with Quantity and Layout Consistency

Published on Sep 29

Abstract

AI-generated summary

The QL-Adapter framework addresses object-count and spatial-layout challenges in instruction-driven image editing by fusing layout priors with CLIP image features and injecting image features into the text branch, achieving state-of-the-art performance.

Instruction-driven image editing with standard CLIP text encoders often fails in complex scenes with many objects. We present QL-Adapter, a framework for multi-object editing that tackles two challenges: enforcing object counts and spatial layouts, and accommodating diverse categories. QL-Adapter consists of two core modules: the Image-Layout Fusion Module (ILFM) and the Cross-Modal Augmentation Module (CMAM). ILFM fuses layout priors with ViT patch tokens from the CLIP image encoder to strengthen spatial-structure understanding. CMAM injects image features into the text branch to enrich textual embeddings and improve instruction following. We further build QL-Dataset, a benchmark spanning broad category, layout, and count variations, and define the task of quantity- and layout-consistent image editing (QL-Edit). Extensive experiments show that QL-Adapter achieves state-of-the-art performance on QL-Edit and significantly outperforms existing models.
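
For intuition, here is a minimal PyTorch sketch of the two modules as the abstract describes them. Only the module names (ILFM, CMAM) and their stated roles come from the paper; every shape, layer, and fusion choice below (cross-attention over box embeddings, a zero-initialized gate, mean pooling) is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the QL-Adapter modules described in the abstract.
# All architectural details here are assumptions; the paper's code is not shown on this page.
import torch
import torch.nn as nn

class ImageLayoutFusionModule(nn.Module):
    """ILFM (assumed design): fuse layout priors with ViT patch tokens
    from the CLIP image encoder via cross-attention."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.layout_proj = nn.Linear(4, dim)  # assumed: layout boxes given as (x, y, w, h)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens, layout_boxes):
        # patch_tokens: (B, N, dim) from the CLIP ViT; layout_boxes: (B, K, 4)
        layout_tokens = self.layout_proj(layout_boxes)            # (B, K, dim)
        fused, _ = self.cross_attn(patch_tokens, layout_tokens, layout_tokens)
        return self.norm(patch_tokens + fused)                    # residual fusion

class CrossModalAugmentationModule(nn.Module):
    """CMAM (assumed design): inject pooled image features into the
    text-branch embeddings to improve instruction following."""
    def __init__(self, dim=768):
        super().__init__()
        self.img_proj = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))  # assumed: zero-init gate for stable injection

    def forward(self, text_embeds, image_feats):
        # text_embeds: (B, T, dim); image_feats: (B, N, dim)
        pooled = image_feats.mean(dim=1, keepdim=True)            # (B, 1, dim)
        return text_embeds + self.gate * self.img_proj(pooled)

# Smoke test with dummy tensors.
ilfm, cmam = ImageLayoutFusionModule(), CrossModalAugmentationModule()
patches = torch.randn(2, 196, 768)   # 14x14 ViT patch grid (assumed)
boxes = torch.rand(2, 5, 4)          # 5 layout boxes per image (assumed)
text = torch.randn(2, 77, 768)       # CLIP-style 77-token text sequence
fused = ilfm(patches, boxes)
augmented = cmam(text, fused)
print(fused.shape, augmented.shape)  # torch.Size([2, 196, 768]) torch.Size([2, 77, 768])
```

In a full pipeline these outputs would condition the editing backbone; the smoke test above only verifies that the fusion and injection preserve token shapes.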
