arxiv:2506.04209

Language-Image Alignment with Fixed Text Encoders

Published on Jun 4
Submitted by JingfengY on Jun 6

Abstract

Learning Language-Image alignment with a Fixed Text encoder (LIFT) from pre-trained large language models effectively guides visual representation learning, outperforming joint-training methods like CLIP on tasks involving compositional understanding and long captions.

AI-generated summary

Currently, the dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, as in CLIP and its variants. In this work, we question whether such costly joint training is necessary. In particular, we investigate whether a pre-trained, fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much-simplified framework, LIFT, is highly effective: it outperforms CLIP in most scenarios involving compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step toward systematically exploring how text embeddings from LLMs can guide visual learning, and suggests an alternative design choice for learning language-aligned visual representations.

Community

Paper author Paper submitter
edited Jun 5

Some qualitative comparisons between LIFT and CLIP! The first line shows the caption or option selected by LIFT, and the second line shows the one selected by CLIP. In every case, LIFT selects the correct one, while CLIP does not. We observe that LIFT compensates for CLIP’s shortcomings in tasks involving compositional information (e.g., spatial locations, object-attribute associations, object-object relations).

[Figures: qualitative comparison examples between LIFT and CLIP]
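To make the comparison concrete, below is a minimal sketch (not taken from the paper's code) of how any dual-tower model, whether LIFT or CLIP, "selects" a caption from a set of candidates: embed the image and every candidate, then pick the candidate with the highest cosine similarity. The function name and the assumption that embeddings are already computed are ours, for illustration only.

```python
# Illustrative sketch: caption selection with a dual-tower model.
# LIFT and CLIP score candidates the same way; they differ only in how
# the two embedding spaces were aligned during training.
import torch
import torch.nn.functional as F

def select_caption(image_emb: torch.Tensor, caption_embs: torch.Tensor) -> int:
    """Return the index of the candidate caption most similar to the image.

    image_emb:    (d,)   embedding of the query image
    caption_embs: (k, d) embeddings of the k candidate captions
    """
    image_emb = F.normalize(image_emb, dim=-1)
    caption_embs = F.normalize(caption_embs, dim=-1)
    scores = caption_embs @ image_emb          # (k,) cosine similarities
    return int(scores.argmax())
```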

Paper author Paper submitter
edited Jun 5

The pipeline of LIFT adopts a dual-tower architecture similar to CLIP's. LIFT uses an LLM-based text encoder to pre-compute the embedding of each text sample. During training, we update only the image encoder and the projection head, aligning image embeddings with the pre-computed text embeddings by optimizing an alignment objective (see the sketch after the figure below).

[Figure: LIFT pipeline]
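Below is a minimal sketch of what one LIFT-style training step could look like under this description. It assumes a simple cosine-similarity alignment objective purely for illustration (the paper's actual objective may differ), and `LIFTImageTower` / `lift_step` are hypothetical names. Only the image encoder and projection head receive gradients; the text embeddings arrive pre-computed and frozen.

```python
# Sketch of a LIFT-style training step (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LIFTImageTower(nn.Module):
    def __init__(self, image_encoder: nn.Module, vis_dim: int, text_dim: int):
        super().__init__()
        self.image_encoder = image_encoder        # e.g. a ViT trained from scratch
        self.proj = nn.Linear(vis_dim, text_dim)  # maps into the LLM embedding space

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(self.image_encoder(images))

def lift_step(model, optimizer, images, text_embs):
    """images: (B, 3, H, W); text_embs: (B, d) pre-computed, frozen LLM embeddings."""
    img_embs = F.normalize(model(images), dim=-1)
    txt_embs = F.normalize(text_embs, dim=-1)     # no gradient flows to the text side
    loss = 1.0 - (img_embs * txt_embs).sum(dim=-1).mean()  # maximize cosine similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the text embeddings are fixed, they can be computed once offline and reused for every epoch, which is where the computational savings over joint training come from.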


Good work, but none of the important prior related work has been discussed, including:

  1. Zhang, Le, Qian Yang, and Aishwarya Agrawal. "Assessing and Learning Alignment of Unimodal Vision and Language Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
  2. Ruthardt, Jona, et al. "Do Better Language Models Have Crisper Vision?" arXiv preprint arXiv:2410.07173, 2024.
  3. Zhai, Xiaohua, et al. "LiT: Zero-Shot Transfer with Locked-Image Text Tuning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Paper author

Thanks so much for your attention and for raising this thoughtful point! We really appreciate you highlighting these related works. Given the rapidly growing body of work on language-image alignment and some time constraints during writing, we unfortunately didn't include a discussion of these papers in our initial version. We will definitely consider including them in a later version to provide readers with a more comprehensive view of the field.

One key distinction between these works and ours lies in the focus and setup of our study. As reflected in our title, we aim to investigate whether the original, fixed text embeddings from large language models (LLMs) can directly benefit language-image alignment. In contrast, all three works you mention introduce some form of post-processing on top of the original LLM text embeddings (e.g., alignment or projection layers). This makes it difficult to attribute the observed performance gains specifically to the LLMs themselves, since those gains may stem from the added components or fine-tuning. We also train the image encoder entirely from scratch rather than building on a pre-trained model, ensuring that the strengths or limitations of a base model do not confound our analysis.

Again, thanks for your insightful comment! We’ll definitely expand our related work section to include these contributions in the next version. Wishing you a great weekend!
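To make the point in the reply above concrete, here is a hedged sketch of the "fixed text encoder" side: captions are embedded once with a frozen, off-the-shelf LLM and cached, with no projection layers or fine-tuning applied to the text side. The pooling choice (mask-aware mean over last hidden states) and the example model name are placeholders, not necessarily the paper's actual configuration.

```python
# Sketch: pre-computing text embeddings with a frozen LLM (illustrative only).
import torch
from transformers import AutoModel, AutoTokenizer

@torch.no_grad()
def precompute_text_embeddings(captions, model_name, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:               # decoder-only LLMs often lack a pad token
        tokenizer.pad_token = tokenizer.eos_token
    llm = AutoModel.from_pretrained(model_name).to(device).eval()  # frozen: never updated

    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt").to(device)
    hidden = llm(**batch).last_hidden_state       # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1)  # (B, T, 1)
    embs = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # mean over real tokens
    return embs.cpu()                             # cache once; reuse for every epoch

# Usage (any pre-trained LLM name works; "gpt2" is only a stand-in):
# embs = precompute_text_embeddings(["a red cube on a blue sphere"], "gpt2")
```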

