Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions
Abstract
A text-to-image model trained on long structured captions, combined with the DimFusion fusion mechanism and the TaBR evaluation protocol, achieves state-of-the-art prompt alignment and improved controllability.
Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual output. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing results toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at https://huggingface.co/briaai/FIBO.
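The TaBR protocol described above can be sketched as a simple image-to-caption-to-image loop, where text is the only information channel between the original and the reconstruction. This is a minimal illustrative sketch, not the paper's implementation: the captioner, generator, and similarity metric are hypothetical stand-ins passed in as callables.

```python
# Minimal sketch of the Text-as-a-Bottleneck Reconstruction (TaBR) idea:
# caption a real image, regenerate an image from that caption alone, and
# score how similar the reconstruction is to the original. Higher average
# similarity indicates a more expressive caption format and a more
# controllable generator. All components here are assumed placeholders.
from typing import Callable, List


def tabr_score(
    images: List[object],
    caption_fn: Callable[[object], str],                # e.g. a VLM captioner
    generate_fn: Callable[[str], object],               # text-to-image model under test
    similarity_fn: Callable[[object, object], float],   # e.g. a perceptual metric
) -> float:
    """Average image -> caption -> image reconstruction similarity."""
    scores = []
    for image in images:
        caption = caption_fn(image)            # text is the only bottleneck
        reconstruction = generate_fn(caption)  # regenerate from the caption alone
        scores.append(similarity_fn(image, reconstruction))
    return sum(scores) / len(scores)
```

With identity stand-ins for the captioner and generator, the score is trivially perfect; in practice the loss of detail across the text bottleneck is exactly what the protocol measures.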
Community
Generating an image from 1,000 words.
Very excited to release Fibo, the first-ever open-source model trained exclusively on long, structured captions.
Fibo sets a new standard for controllability and disentanglement in image generation.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation (2025)
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing (2025)
- Image Generation Based on Image Style Extraction (2025)
- Structured Information for Improving Spatial Relationships in Text-to-Image Generation (2025)
- Learning an Image Editing Model without Image Editing Pairs (2025)
- MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation (2025)
- Chimera: Compositional Image Generation using Part-based Concepting (2025)
Models citing this paper: 1
Datasets citing this paper: 0