Papers
arXiv:2511.06876

Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions

Published on Nov 10
· Submitted by Ron Mokady on Nov 11
· briaai BRIA AI
Authors:

Abstract

A text-to-image model trained on long structured captions, combined with the DimFusion fusion mechanism and the TaBR evaluation protocol, achieves state-of-the-art prompt alignment and improved controllability.

AI-generated summary

Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet, most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual outputs. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at https://huggingface.co/briaai/FIBO.
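The TaBR protocol described above can be illustrated with a toy sketch: a real image is captioned, the caption alone is fed to a generator, and the reconstruction's similarity to the original measures how much visual information survived the text bottleneck. The captioner, generator, attribute names, and similarity metric below are invented stand-ins for illustration, not the paper's actual models or schema.

```python
def caption(image: dict, schema: tuple) -> dict:
    """Toy captioner: keeps only the attributes the caption schema covers."""
    return {k: image[k] for k in schema if k in image}

def generate(structured_caption: dict, default: str = "unspecified") -> dict:
    """Toy generator: rebuilds an 'image' from the caption, filling
    unmentioned attributes arbitrarily (as real models tend to do)."""
    all_attrs = ("subject", "style", "lighting", "camera_angle")
    return {k: structured_caption.get(k, default) for k in all_attrs}

def similarity(real: dict, reconstructed: dict) -> float:
    """Fraction of attributes preserved through the caption bottleneck."""
    matches = sum(real.get(k) == reconstructed.get(k) for k in real)
    return matches / len(real)

real_image = {"subject": "red fox", "style": "watercolor",
              "lighting": "golden hour", "camera_angle": "low angle"}

# A sparse caption schema preserves little; a long structured one preserves all.
short_score = similarity(real_image, generate(caption(real_image, ("subject",))))
long_score = similarity(real_image, generate(caption(real_image,
    ("subject", "style", "lighting", "camera_angle"))))
print(short_score, long_score)  # → 0.25 1.0
```

In this toy setup, the long structured caption reconstructs the image exactly, while the short caption loses three of four attributes, mirroring how TaBR separates caption expressiveness from generator quality.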

Community

Paper author Paper submitter

Generating an image from 1,000 words.

Very excited to release Fibo 😃, the first-ever open-source model trained exclusively on long, structured captions.

Fibo sets a new standard for controllability and disentanglement in image generation.
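To make the "long structured caption" idea concrete, here is a minimal sketch of what such an annotation might look like when every sample shares one fixed attribute set; the attribute names and serialization are assumptions for illustration, not FIBO's actual schema.

```python
# Hypothetical structured caption: the same fixed attribute set is
# annotated for every training sample, enabling disentangled control.
structured_caption = {
    "subject": "an elderly fisherman mending a net",
    "setting": "a wooden pier at dawn",
    "style": "oil painting",
    "lighting": "soft diffuse morning light",
    "color_palette": "muted blues and warm ochres",
    "camera": "medium shot, eye level",
    "mood": "calm and contemplative",
}

def to_prompt(caption: dict) -> str:
    """Serialize the structured caption into one long text prompt."""
    return "; ".join(f"{k}: {v}" for k, v in caption.items())

prompt = to_prompt(structured_caption)
print(prompt)
```

Because each attribute occupies a fixed slot, editing one field (say, swapping the lighting) changes only that visual factor while the rest of the prompt stays identical.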



Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 11

Collections including this paper 2