Lychee-rerank-mm
Lychee-rerank-mm is the latest generalist multimodal reranking model built on the Qwen2.5-VL-Instruct foundation model. It is designed for reranking tasks in image-text multimodal retrieval scenarios.
Lychee-rerank-mm is developed by the NLP Team of Harbin Institute of Technology, Shenzhen, and the 7B-parameter version is released as open source.
Lychee-rerank-mm:
- Model Type: Multimodal Reranking
- Language Support: en
- Param Size: 7B
- Model Precision: BF16
For more details, please refer to our paper.
Model List
| Model Type | Models | Size | Instruction Aware |
|---|---|---|---|
| Multimodal Reranking | lychee-rerank-mm | 8.29B | Yes |
Note:
- "Instruction Aware" notes whether the reranking model supports customizing the input instruction according to different tasks.
- Like most models, for most downstream tasks, using instructions typically yields an improvement over not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios.
Model Usage
📌 Tips: We recommend that developers customize the instruction according to their specific scenarios.
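An instruction is just a one-sentence description of the retrieval scenario, passed as the `<Instruct>` field in the code below. The first string here is the web-search instruction used in the usage example; the second is a hypothetical, domain-specific variant shown only for illustration:

```python
# The first instruction appears in the usage example below; the second is a
# hypothetical domain-specific variant, not a prompt shipped with the model.
web_search_task = 'Given a web search query, retrieve relevant passages that answer the query'
product_qa_task = 'Given a question about a product, retrieve product images and descriptions that answer it'
```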
Transformers Usage
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


def format_content(text, image, prefix='Query:'):
    """Build a chat-message content list from optional text and an optional local image path."""
    content = []
    if not text and not image:
        content = [{'type': 'text', 'text': prefix}]
        return content
    content.append({'type': 'text', 'text': prefix})
    if image:
        content.append({'type': 'image', 'image': 'file://' + image})
    if text:
        content.append({'type': 'text', 'text': text})
    return content


def format_instruction(instruction, query_text, query_image_path, doc_text, doc_image_path):
    """Assemble the system/user messages for one (query, document) pair."""
    inputs = []
    inputs.append({
        "role": "system",
        "content": [{
            "type": "text",
            "text": "Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\"."
        }]
    })
    contents = []
    contents.append({
        "type": "text",
        "text": '<Instruct>: ' + instruction
    })
    query_content = format_content(query_text, query_image_path, prefix='<Query>:')
    contents.extend(query_content)
    doc_content = format_content(doc_text, doc_image_path, prefix='\n<Document>:')
    contents.extend(doc_content)
    inputs.append({
        "role": "user",
        "content": contents
    })
    return inputs


def process_inputs(pairs):
    texts = [tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    ) for messages in pairs]
    # Extract the image/video inputs referenced in the messages.
    try:
        image_inputs, video_inputs = process_vision_info(pairs)
    except Exception as e:
        print(f'Failed to load image ({e}); consider removing it from the dataset')
        image_inputs, video_inputs = None, None
    inputs = tokenizer(
        text=texts,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        truncation=False,
        max_length=3200
    )
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs


@torch.no_grad()
def compute_logits(inputs, **kwargs):
    # Relevance score = P("yes") from the softmax over the "yes"/"no" logits at the last position.
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores


model_name_or_path = "vec-ai/lychee-rerank-mm"

min_pixels = 4 * 28 * 28
max_pixels = 1280 * 28 * 28
tokenizer = AutoProcessor.from_pretrained(model_name_or_path, padding_side='left', min_pixels=min_pixels, max_pixels=max_pixels, trust_remote_code=True)
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2").cuda().eval()

token_false_id = tokenizer.tokenizer.get_vocab()["no"]
token_true_id = tokenizer.tokenizer.get_vocab()["yes"]

task = 'Given a web search query, retrieve relevant passages that answer the query'

# Example 1: text-only queries and documents.
query_texts = [
    "What is the capital of China?",
    "Explain gravity",
]
query_images = [
    None,
    None,
]
doc_texts = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
doc_images = [
    None,
    None,
]

pairs = [format_instruction(task, query_text, query_image, doc_text, doc_image) for query_text, query_image, doc_text, doc_image in zip(query_texts, query_images, doc_texts, doc_images)]

# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)
print("scores: ", scores)

# Example 2: a multimodal query (text + image) against multimodal documents.
query_text = "What breed is the cat in the image?"
query_image = "./images/Siamese_cat1.jpg"
doc_texts = [
    "The Siamese cat is one of the first distinctly recognised breeds of Asian cat. It derives from the Wichianmat landrace. The Siamese cat is one of several varieties of cats native to Thailand (known as Siam before 1939). The original Siamese became one of the most popular breeds in Europe and North America in the 19th century. Siamese cats have a distinctive colourpoint coat, resulting from a temperature-sensitive type of albinism.",
    "The Asian or Asian group, is a cat breed similar to the European Burmese, but comes in a range of different coat colours and patterns. Long-haired Asians of all varieties are called Tiffanies. Asians are grouped in section 5 (Burmese) by the Governing Council of the Cat Fancy (GCCF).",
]
doc_images = [
    "./images/Siamese_cat2.jpg",
    "./images/Asian_cat.jpg",
]

pairs = [format_instruction(task, query_text, query_image, doc_text, doc_image) for doc_text, doc_image in zip(doc_texts, doc_images)]
inputs = process_inputs(pairs)
scores = compute_logits(inputs)
print("scores: ", scores)
```
Evaluation
| Model | Param | ALL (40) | T→T (14) | I→I (1) | T→I (4) | T→VD (5) | I→T (5) | T→IT (2) | IT→T (4) | IT→I (2) | IT→IT (3) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GME-2B | 2.21B | 52.54 | 49.59 | 30.75 | 48.46 | 66.39 | 52.62 | 77.02 | 39.88 | 36.70 | 66.89 |
| Qwen3-Reranker | 4.02B | -- | 60.49 | -- | -- | -- | -- | -- | -- | -- | -- |
| Jina-rerank-m0 | 2.21B | 54.36 | 55.36 | 27.50 | 59.46 | 73.13 | 55.43 | 74.95 | 27.82 | 37.65 | 51.54 |
| MonoQwen2-VL-v0.1 | 2.21B | 44.20 | 48.89 | 12.59 | 58.73 | 71.29 | 19.62 | 76.46 | 14.35 | 31.75 | 35.83 |
| lychee-rerank-mm-3B | 3.75B | 61.40 | 59.22 | 29.76 | 58.85 | 72.38 | 63.06 | 81.96 | 48.81 | 43.97 | 79.08 |
| lychee-rerank-mm-7B | 8.29B | 63.85 | 61.08 | 32.83 | 61.18 | 72.94 | 66.61 | 84.55 | 53.29 | 47.39 | 82.19 |
In the table, T, I, VD, and IT denote text, image, visual-document, and combined image-text inputs, respectively, and the numbers in parentheses give the number of tasks in each group (40 in total). For more details, please refer to our paper.
Citation
If you find our work helpful, feel free to cite us:
```bibtex
@misc{dai2025supervisedfinetuningcontrastivelearning,
      title={Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking},
      author={Ziqi Dai and Xin Zhang and Mingxin Li and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Meishan Zhang and Wenjie Li and Min Zhang},
      year={2025},
      eprint={2510.14824},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.14824},
}
```