arXiv:2511.01617

Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

Published on Nov 3
· Submitted by Mohamed Eltahir on Nov 4

Abstract

Vote-in-Context (ViC) is a training-free framework that leverages Vision-Language Models (VLMs) for zero-shot reranking and fusion in cross-modal video retrieval, achieving state-of-the-art performance.

AI-generated summary

In the retrieval domain, fusing candidates from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
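
The S-Grid serialization is easy to picture: sample a few frames per video and tile them into one image that the VLM can inspect alongside the other candidates. The sketch below shows how such a frame grid might be assembled; the sampling strategy, grid layout, and function names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch: tile sampled frames of one video into a single grid image,
# in the spirit of the S-Grid serialization described above. All parameters
# (grid width, tile size) are illustrative assumptions.
import math
from PIL import Image

def build_frame_grid(frames, cols=3, tile_size=(224, 224)):
    """Tile a list of PIL frames into one grid image, row-major order."""
    rows = math.ceil(len(frames) / cols)
    grid = Image.new("RGB", (cols * tile_size[0], rows * tile_size[1]))
    for i, frame in enumerate(frames):
        row, col = divmod(i, cols)
        grid.paste(frame.resize(tile_size), (col * tile_size[0], row * tile_size[1]))
    return grid

# Usage: frames = [Image.open(p) for p in sampled_frame_paths]
#        grid = build_frame_grid(frames)   # one image per video candidate
```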

Community

Paper author and submitter (Mohamed Eltahir):

🤔 What we explored:

  • Can we pass a large set of videos to a VLM and have it reason through them efficiently? Maybe represent each video as a special image?
  • If yes, can we teach a VLM to rank video candidates directly using their content (frames/subtitles)?
  • Oh wait, can we use this same idea to fuse multiple ranked lists from different retrievers spanning different modalities?
  • Hmmm, if we already use video content for fusion, can we also inject retriever metadata (the rank and multiplicity of each video) implicitly into the VLM's input? In other words, can we encode retrievers' "votes" in "context" and let the VLM decide? (A sketch of this serialization follows this list.)
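
Here is a rough sketch of what encoding "votes" in "context" could look like: candidates from several ranked lists are serialized in rank order with duplicates kept, so each video's rank and multiplicity are implicit in the prompt. The prompt wording and helper names below are hypothetical, not taken from the paper.

```python
# Hypothetical sketch: serialize retriever "votes" into a list-wise prompt.
# Candidates are interleaved by rank position across retrievers and
# duplicates are deliberately kept, so rank and multiplicity stay visible.

def serialize_votes(ranked_lists, query, top_k=10):
    """ranked_lists: one best-first list of video ids per retriever."""
    ordered = []
    for rank in range(top_k):            # walk rank positions first...
        for lst in ranked_lists:         # ...then retrievers
            if rank < len(lst):
                ordered.append(lst[rank])
    lines = [f"Candidate {i + 1}: video {vid}" for i, vid in enumerate(ordered)]
    return (
        f"Query: {query}\n"
        + "\n".join(lines)
        + "\nRank the candidate videos by relevance to the query."
    )

# In the multimodal prompt, each candidate entry would be paired with its
# S-Grid image and, optionally, its subtitles.
```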

🔥 Answers:

  • Up to +40 Recall@1 over the previous SoTA on well-known video retrieval benchmarks (MSR-VTT, DiDeMo,...) and near-saturation on ActivityNet and VATEX, jumping from 58% / 80% to 96% / 97.5% R@1 respectively, all in zero-shot settings!
  • Representing videos as grids (augmented with subtitles) preserves temporal signals, enabling VLMs to reason across many videos simultaneously.
  • Turns out current video retrievers are highly complementary. Classical fusion methods like RRF, CombSUM, and CombMNZ are already strong when applied across multiple retrievers (minimal sketches of these baselines follow this list)! On the other hand, ViC approaches their performance level on its own, using a single retriever + VLM reranker. Using multiple retrievers with a reranker is a different story!
  • Even better? We found that keeping duplicates in the VLM input leads to better results. Turns out the VLM actually uses the frequency of appearance (i.e. votes) as a signal! 👀
  • You don't need a 1T model for this! Even 8B VLMs show strong zero-shot results, and it’s not just Qwen this time 😉. ViC generalizes across multiple VLMs.
  • All this while we’re still bottlenecked by the VLM's effective context window. You may want to have a look at the table at the end of the paper. 👀
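
For readers unfamiliar with the classical fusion baselines mentioned above, here are minimal sketches of RRF and CombSUM as commonly defined; the RRF constant and any score normalization vary between implementations, so treat these as illustrative rather than the paper's exact setup.

```python
# Common definitions of the fusion baselines referenced above (illustrative).
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(v) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for lst in ranked_lists:                       # best-first lists of video ids
        for rank, vid in enumerate(lst, start=1):
            scores[vid] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def comb_sum(score_dicts):
    """CombSUM: sum each candidate's (pre-normalized) scores across retrievers."""
    scores = defaultdict(float)
    for d in score_dicts:                          # {video id: score} per retriever
        for vid, s in d.items():
            scores[vid] += s
    return sorted(scores, key=scores.get, reverse=True)
```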

ViC is simple, effective, and general. It utilizes candidates' ranks, consensus and content to rerank the candidates. We believe it unlocks a new way to think about reranking and fusion across modalities and retrieval pipelines.
