arxiv:2506.04209

Language-Image Alignment with Fixed Text Encoders

Published on Jun 4
Submitted by JingfengY on Jun 6

Abstract

Learning Language-Image alignment with a Fixed Text encoder (LIFT) from pre-trained large language models effectively guides visual representation learning, outperforming joint-training methods like CLIP on tasks involving compositional understanding and long captions.

AI-generated summary

Currently, the dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, as in CLIP and its variants. In this work, we question whether such costly joint training is necessary. In particular, we investigate whether a pre-trained, fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much-simplified framework, LIFT, is highly effective: it outperforms CLIP in most scenarios involving compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step toward systematically exploring how text embeddings from LLMs can guide visual learning, and suggests an alternative design choice for learning language-aligned visual representations.

Community

Paper author Paper submitter
edited Jun 5

Some qualitative comparisons between LIFT and CLIP! The first line shows the caption or option selected by LIFT, and the second line shows the one selected by CLIP. In every case, LIFT selects the correct one, while CLIP does not. We observe that LIFT compensates for CLIP’s shortcomings in tasks involving compositional information (e.g., spatial locations, object-attribute associations, object-object relations).

[Figures: qualitative comparison examples between LIFT and CLIP]
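To make the comparison concrete, below is a minimal sketch (not taken from the paper's code) of how any dual-tower model, whether LIFT or CLIP, "selects" a caption from a set of candidates: embed the image and every candidate, then pick the candidate with the highest cosine similarity. The function name and the assumption that embeddings are already computed are ours, for illustration only.

```python
# Illustrative sketch: caption selection with a dual-tower model.
# LIFT and CLIP score candidates the same way; they differ only in how
# the two embedding spaces were aligned during training.
import torch
import torch.nn.functional as F

def select_caption(image_emb: torch.Tensor, caption_embs: torch.Tensor) -> int:
    """Return the index of the candidate caption most similar to the image.

    image_emb:    (d,)   embedding of the query image
    caption_embs: (k, d) embeddings of the k candidate captions
    """
    image_emb = F.normalize(image_emb, dim=-1)
    caption_embs = F.normalize(caption_embs, dim=-1)
    scores = caption_embs @ image_emb          # (k,) cosine similarities
    return int(scores.argmax())
```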

Paper author Paper submitter
edited Jun 5

The pipeline of LIFT adopts a dual-tower architecture similar to CLIP's. LIFT uses an LLM-based text encoder to pre-compute the embedding of each text sample. During training, we update only the image encoder and the projection head, aligning image embeddings with the pre-computed text embeddings by optimizing an alignment objective (see the sketch after the figure below).

[Figure: LIFT pipeline]
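Below is a minimal sketch of what one LIFT-style training step could look like under this description. It assumes a simple cosine-similarity alignment objective purely for illustration (the paper's actual objective may differ), and `LIFTImageTower` / `lift_step` are hypothetical names. Only the image encoder and projection head receive gradients; the text embeddings arrive pre-computed and frozen.

```python
# Sketch of a LIFT-style training step (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LIFTImageTower(nn.Module):
    def __init__(self, image_encoder: nn.Module, vis_dim: int, text_dim: int):
        super().__init__()
        self.image_encoder = image_encoder        # e.g. a ViT trained from scratch
        self.proj = nn.Linear(vis_dim, text_dim)  # maps into the LLM embedding space

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(self.image_encoder(images))

def lift_step(model, optimizer, images, text_embs):
    """images: (B, 3, H, W); text_embs: (B, d) pre-computed, frozen LLM embeddings."""
    img_embs = F.normalize(model(images), dim=-1)
    txt_embs = F.normalize(text_embs, dim=-1)     # no gradient flows to the text side
    loss = 1.0 - (img_embs * txt_embs).sum(dim=-1).mean()  # maximize cosine similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the text embeddings are fixed, they can be computed once offline and reused for every epoch, which is where the computational savings over joint training come from.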


Good work, but none of the important prior related work has been discussed, including:

  1. Zhang, Le, Qian Yang, and Aishwarya Agrawal. "Assessing and Learning Alignment of Unimodal Vision and Language Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
  2. Ruthardt, Jona, et al. "Do Better Language Models Have Crisper Vision?" arXiv preprint arXiv:2410.07173, 2024.
  3. Zhai, Xiaohua, et al. "LiT: Zero-Shot Transfer with Locked-Image Text Tuning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Paper author

Thanks so much for your attention and for raising this thoughtful point! We really appreciate you highlighting these related works. Given the rapidly growing body of work on language-image alignment and some time constraints during writing, we unfortunately didn't include a discussion of these papers in our initial version. We will definitely consider including them in a later version to provide readers with a more comprehensive view of the field.

One key distinction between these works and ours lies in the focus and setup of our study. As reflected in our title, we aim to investigate whether the original, fixed text embeddings from large language models (LLMs) can directly benefit language-image alignment. In contrast, all three works you mention introduce some form of post-processing on top of the original LLM text embeddings (e.g., alignment or projection layers). This makes it difficult to attribute the observed performance gains specifically to the LLMs themselves, since those gains may stem from the added components or fine-tuning. We also train the image encoder entirely from scratch rather than building on a pre-trained model, ensuring that the strengths or limitations of a base model do not confound our analysis.

Again, thanks for your insightful comment! We’ll definitely expand our related work section to include these contributions in the next version. Wishing you a great weekend!
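To make the point in the reply above concrete, here is a hedged sketch of the "fixed text encoder" side: captions are embedded once with a frozen, off-the-shelf LLM and cached, with no projection layers or fine-tuning applied to the text side. The pooling choice (mask-aware mean over last hidden states) and the example model name are placeholders, not necessarily the paper's actual configuration.

```python
# Sketch: pre-computing text embeddings with a frozen LLM (illustrative only).
import torch
from transformers import AutoModel, AutoTokenizer

@torch.no_grad()
def precompute_text_embeddings(captions, model_name, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:               # decoder-only LLMs often lack a pad token
        tokenizer.pad_token = tokenizer.eos_token
    llm = AutoModel.from_pretrained(model_name).to(device).eval()  # frozen: never updated

    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt").to(device)
    hidden = llm(**batch).last_hidden_state       # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1)  # (B, T, 1)
    embs = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # mean over real tokens
    return embs.cpu()                             # cache once; reuse for every epoch

# Usage (any pre-trained LLM name works; "gpt2" is only a stand-in):
# embs = precompute_text_embeddings(["a red cube on a blue sphere"], "gpt2")
```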

