Lychee-rerank-mm
Lychee-rerank-mm is the latest generalist multimodal reranking model built on the Qwen2.5-VL-Instruct foundation model. It is designed for reranking tasks in image-text multimodal retrieval scenarios.
Lychee-rerank-mm is developed by the NLP Team of Harbin Institute of Technology, Shenzhen, and the 7B-parameter version is released as open source.
Lychee-rerank-mm:
- Model Type: Multimodal Reranking
- Language Support: en
- Param Size: 7B
- Model Precision: BF16
For more details, please refer to our paper.
Model List
| Model Type | Models | Size | Instruction Aware |
|---|---|---|---|
| Multimodal Reranking | lychee-rerank-mm | 8.29B | Yes |
Note:
- "Instruction Aware" notes whether the reranking model supports customizing the input instruction according to different tasks.
- Like most models, for most downstream tasks, using instructions typically yields an improvement over not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios.
Model Usage
📌 Tips: We recommend that developers customize the instruction according to their specific scenarios.
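An instruction is just a one-sentence description of the retrieval scenario, passed as the `<Instruct>` field in the code below. The first string here is the web-search instruction used in the usage example; the second is a hypothetical, domain-specific variant shown only for illustration:

```python
# The first instruction appears in the usage example below; the second is a
# hypothetical domain-specific variant, not a prompt shipped with the model.
web_search_task = 'Given a web search query, retrieve relevant passages that answer the query'
product_qa_task = 'Given a question about a product, retrieve product images and descriptions that answer it'
```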
Transformers Usage
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


def format_content(text, image, prefix='Query:'):
    """Build a chat-message content list from optional text and an optional local image path."""
    content = []
    if not text and not image:
        content = [{'type': 'text', 'text': prefix}]
        return content
    content.append({'type': 'text', 'text': prefix})
    if image:
        content.append({'type': 'image', 'image': 'file://' + image})
    if text:
        content.append({'type': 'text', 'text': text})
    return content


def format_instruction(instruction, query_text, query_image_path, doc_text, doc_image_path):
    """Assemble the system/user messages for one (query, document) pair."""
    inputs = []
    inputs.append({
        "role": "system",
        "content": [{
            "type": "text",
            "text": "Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\"."
        }]
    })
    contents = []
    contents.append({
        "type": "text",
        "text": '<Instruct>: ' + instruction
    })
    query_content = format_content(query_text, query_image_path, prefix='<Query>:')
    contents.extend(query_content)
    doc_content = format_content(doc_text, doc_image_path, prefix='\n<Document>:')
    contents.extend(doc_content)
    inputs.append({
        "role": "user",
        "content": contents
    })
    return inputs


def process_inputs(pairs):
    texts = [tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    ) for messages in pairs]
    # Extract the image/video inputs referenced in the messages.
    try:
        image_inputs, video_inputs = process_vision_info(pairs)
    except Exception as e:
        print(f'Failed to load image ({e}); consider removing it from the dataset')
        image_inputs, video_inputs = None, None
    inputs = tokenizer(
        text=texts,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        truncation=False,
        max_length=3200
    )
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs


@torch.no_grad()
def compute_logits(inputs, **kwargs):
    # Relevance score = P("yes") from the softmax over the "yes"/"no" logits at the last position.
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores


model_name_or_path = "vec-ai/lychee-rerank-mm"

min_pixels = 4 * 28 * 28
max_pixels = 1280 * 28 * 28
tokenizer = AutoProcessor.from_pretrained(model_name_or_path, padding_side='left', min_pixels=min_pixels, max_pixels=max_pixels, trust_remote_code=True)
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2").cuda().eval()

token_false_id = tokenizer.tokenizer.get_vocab()["no"]
token_true_id = tokenizer.tokenizer.get_vocab()["yes"]

task = 'Given a web search query, retrieve relevant passages that answer the query'

# Example 1: text-only queries and documents.
query_texts = [
    "What is the capital of China?",
    "Explain gravity",
]
query_images = [
    None,
    None,
]
doc_texts = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
doc_images = [
    None,
    None,
]

pairs = [format_instruction(task, query_text, query_image, doc_text, doc_image) for query_text, query_image, doc_text, doc_image in zip(query_texts, query_images, doc_texts, doc_images)]

# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)
print("scores: ", scores)

# Example 2: a multimodal query (text + image) against multimodal documents.
query_text = "What breed is the cat in the image?"
query_image = "./images/Siamese_cat1.jpg"
doc_texts = [
    "The Siamese cat is one of the first distinctly recognised breeds of Asian cat. It derives from the Wichianmat landrace. The Siamese cat is one of several varieties of cats native to Thailand (known as Siam before 1939). The original Siamese became one of the most popular breeds in Europe and North America in the 19th century. Siamese cats have a distinctive colourpoint coat, resulting from a temperature-sensitive type of albinism.",
    "The Asian or Asian group, is a cat breed similar to the European Burmese, but comes in a range of different coat colours and patterns. Long-haired Asians of all varieties are called Tiffanies. Asians are grouped in section 5 (Burmese) by the Governing Council of the Cat Fancy (GCCF).",
]
doc_images = [
    "./images/Siamese_cat2.jpg",
    "./images/Asian_cat.jpg",
]

pairs = [format_instruction(task, query_text, query_image, doc_text, doc_image) for doc_text, doc_image in zip(doc_texts, doc_images)]
inputs = process_inputs(pairs)
scores = compute_logits(inputs)
print("scores: ", scores)
```
Evaluation
| Model | Param | ALL (40) | T→T (14) | I→I (1) | T→I (4) | T→VD (5) | I→T (5) | T→IT (2) | IT→T (4) | IT→I (2) | IT→IT (3) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GME-2B | 2.21B | 52.54 | 49.59 | 30.75 | 48.46 | 66.39 | 52.62 | 77.02 | 39.88 | 36.70 | 66.89 |
| Qwen3-Reranker | 4.02B | -- | 60.49 | -- | -- | -- | -- | -- | -- | -- | -- |
| Jina-rerank-m0 | 2.21B | 54.36 | 55.36 | 27.50 | 59.46 | 73.13 | 55.43 | 74.95 | 27.82 | 37.65 | 51.54 |
| MonoQwen2-VL-v0.1 | 2.21B | 44.20 | 48.89 | 12.59 | 58.73 | 71.29 | 19.62 | 76.46 | 14.35 | 31.75 | 35.83 |
| lychee-rerank-mm-3B | 3.75B | 61.40 | 59.22 | 29.76 | 58.85 | 72.38 | 63.06 | 81.96 | 48.81 | 43.97 | 79.08 |
| lychee-rerank-mm-7B | 8.29B | 63.85 | 61.08 | 32.83 | 61.18 | 72.94 | 66.61 | 84.55 | 53.29 | 47.39 | 82.19 |
In the table, T, I, VD, and IT denote text, image, visual-document, and combined image-text inputs, respectively, and the numbers in parentheses give the number of tasks in each group (40 in total). For more details, please refer to our paper.
Citation
If you find our work helpful, feel free to cite us:
```bibtex
@misc{dai2025supervisedfinetuningcontrastivelearning,
      title={Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking},
      author={Ziqi Dai and Xin Zhang and Mingxin Li and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Meishan Zhang and Wenjie Li and Min Zhang},
      year={2025},
      eprint={2510.14824},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.14824},
}
```