CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning
Abstract
A contrastive-autoregressive fine-tuning framework, CAFe, enhances large vision-language models for both multimodal retrieval and generation, improving retrieval precision and generation coherence.
The rapid advancement of large vision-language models (LVLMs) has driven significant progress in multimodal tasks, enabling models to interpret, reason, and generate outputs across both visual and textual domains. While excelling in generative tasks, existing LVLMs often face limitations in tasks requiring high-fidelity representation learning, such as generating image or text embeddings for retrieval. Recent work has proposed fine-tuning LVLMs for representation learning, but the fine-tuned models often lose their generative capabilities as a result of the representation-learning training paradigm. To address this trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks, achieving state-of-the-art results on both multimodal retrieval and multimodal generative benchmarks, including object hallucination (OH) mitigation. CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.
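The abstract does not spell out the exact training objective, so the sketch below is only an illustration of the general idea it describes: combining a contrastive (InfoNCE-style) term over pooled image and text embeddings with the standard autoregressive next-token loss from the same model. The function name, the `alpha` weighting, the temperature value, and the pooled-embedding interface are all assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_autoregressive_loss(img_emb, txt_emb, lm_logits, target_ids,
                                     temperature=0.07, alpha=1.0,
                                     ignore_index=-100):
    """Hypothetical combined objective in the spirit of CAFe.

    img_emb:    (B, D) pooled image embeddings from the LVLM
    txt_emb:    (B, D) pooled text embeddings from the LVLM
    lm_logits:  (B, T, V) next-token logits from the same forward pass
    target_ids: (B, T) token ids for the autoregressive target
    alpha:      assumed weight balancing the two terms
    """
    # Contrastive (InfoNCE) term: matched image-text pairs on the diagonal.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / temperature                      # (B, B) similarities
    labels = torch.arange(img.size(0), device=img.device)
    contrastive = (F.cross_entropy(sim, labels) +
                   F.cross_entropy(sim.t(), labels)) / 2   # symmetric i2t / t2i

    # Autoregressive language-modeling term: shifted next-token prediction.
    lm = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
        ignore_index=ignore_index,
    )

    return lm + alpha * contrastive
```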
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation (2025)
- Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models (2025)
- UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings (2025)
- Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better (2025)
- Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM (2025)
- Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration (2025)
- FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens (2025)