Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions
Abstract
A text-to-image model trained on long structured captions, combined with the DimFusion fusion mechanism and the TaBR evaluation protocol, achieves state-of-the-art prompt alignment and improved controllability.
Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual output. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing results toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at https://huggingface.co/briaai/FIBO.
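The TaBR protocol described above can be sketched as a simple image-to-caption-to-image loop, where text is the only information channel between the original and the reconstruction. This is a minimal illustrative sketch, not the paper's implementation: the captioner, generator, and similarity metric are hypothetical stand-ins passed in as callables.

```python
# Minimal sketch of the Text-as-a-Bottleneck Reconstruction (TaBR) idea:
# caption a real image, regenerate an image from that caption alone, and
# score how similar the reconstruction is to the original. Higher average
# similarity indicates a more expressive caption format and a more
# controllable generator. All components here are assumed placeholders.
from typing import Callable, List


def tabr_score(
    images: List[object],
    caption_fn: Callable[[object], str],                # e.g. a VLM captioner
    generate_fn: Callable[[str], object],               # text-to-image model under test
    similarity_fn: Callable[[object, object], float],   # e.g. a perceptual metric
) -> float:
    """Average image -> caption -> image reconstruction similarity."""
    scores = []
    for image in images:
        caption = caption_fn(image)            # text is the only bottleneck
        reconstruction = generate_fn(caption)  # regenerate from the caption alone
        scores.append(similarity_fn(image, reconstruction))
    return sum(scores) / len(scores)
```

With identity stand-ins for the captioner and generator, the score is trivially perfect; in practice the loss of detail across the text bottleneck is exactly what the protocol measures.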
Community
Generating an image from 1,000 words.
Very excited to release Fibo, the first-ever open-source model trained exclusively on long, structured captions.
Fibo sets a new standard for controllability and disentanglement in image generation.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation (2025)
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing (2025)
- Image Generation Based on Image Style Extraction (2025)
- Structured Information for Improving Spatial Relationships in Text-to-Image Generation (2025)
- Learning an Image Editing Model without Image Editing Pairs (2025)
- MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation (2025)
- Chimera: Compositional Image Generation using Part-based Concepting (2025)
Models citing this paper: 1
Datasets citing this paper: 0