arxiv:2507.17402

HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning

Published on Jul 23
· Submitted by JunLi2005 on Jul 25
Authors:

Li Jun, et al.
Abstract

AI-generated summary

HLFormer uses a hyperbolic modeling framework with Lorentz and Euclidean attention blocks to improve video-text retrieval by addressing hierarchical and partial relevance issues.

Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of matching untrimmed videos with text queries describing only partial content. Existing methods suffer from geometric distortion in Euclidean space that sometimes misrepresents the intrinsic hierarchical structure of videos and overlooks certain hierarchical semantics, ultimately leading to suboptimal temporal modeling. To address this issue, we propose the first hyperbolic modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block and Euclidean Attention Block to encode video embeddings in hybrid spaces, using the Mean-Guided Adaptive Interaction Module to dynamically fuse features. Additionally, we introduce a Partial Order Preservation Loss to enforce "text < video" hierarchy through Lorentzian cone constraints. This approach further enhances cross-modal matching by reinforcing partial relevance between video content and text queries. Extensive experiments show that HLFormer outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICCV25-HLFormer.
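
The "text < video" constraint can be made concrete with a Lorentzian entailment-cone hinge: the video embedding sits at the apex of a cone on the hyperboloid, and the paired text embedding is penalized for falling outside that cone. The sketch below is a minimal PyTorch illustration under assumed conventions (time coordinate stored last, half-aperture constant k, and angle formulas in the style of Lorentzian entailment cones such as MERU); it is not the paper's released implementation.

```python
import torch

def lorentz_inner(x, y):
    # Lorentzian inner product <x, y>_L = <x_s, y_s> - x_t * y_t,
    # with the time coordinate stored last.
    return (x[..., :-1] * y[..., :-1]).sum(-1) - x[..., -1] * y[..., -1]

def lift_to_hyperboloid(x_space, c=1.0):
    # Lift Euclidean features onto the hyperboloid of curvature -c:
    # x_t = sqrt(1/c + ||x_s||^2), so that <x, x>_L = -1/c.
    x_time = torch.sqrt(1.0 / c + x_space.pow(2).sum(-1, keepdim=True))
    return torch.cat([x_space, x_time], dim=-1)

def partial_order_loss(video, text, c=1.0, k=0.1):
    """Hinge loss pushing each text embedding inside the Lorentzian
    entailment cone rooted at its paired video embedding ("text < video").
    The constant k and the angle formulas are assumptions borrowed from
    the common entailment-cone recipe, not the paper's exact definition."""
    v_space, v_time = video[..., :-1], video[..., -1]
    t_time = text[..., -1]
    ip = lorentz_inner(video, text)
    # Half-aperture of the cone at the video point (narrower far from origin).
    aperture = torch.asin(torch.clamp(
        2.0 * k / (c ** 0.5 * v_space.norm(dim=-1)), max=1.0 - 1e-6))
    # Exterior angle between the cone axis and the geodesic to the text point.
    num = t_time + v_time * c * ip
    den = v_space.norm(dim=-1) * torch.sqrt(
        torch.clamp((c * ip).pow(2) - 1.0, min=1e-6))
    exterior = torch.acos(torch.clamp(num / den, -1.0 + 1e-6, 1.0 - 1e-6))
    # Zero loss once the text lies inside the cone (exterior <= aperture).
    return torch.relu(exterior - aperture).mean()

# Toy usage: 4 video/text pairs with 8-dimensional spatial parts.
video = lift_to_hyperboloid(torch.randn(4, 8))
text = lift_to_hyperboloid(0.1 * torch.randn(4, 8))
print(partial_order_loss(video, text))
```

Because the points are parameterized by their spatial parts and lifted onto the hyperboloid, the constraint stays differentiable end to end.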

Community

Paper author · Paper submitter

We propose HLFormer, the first hyperbolic modeling framework for PRVR, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. HLFormer's design is tailored to two core demands of PRVR: (i) temporal modeling that extracts key moment features, and (ii) learning robust cross-modal representations. For (i), we inject an intra-video hierarchy prior into temporal modeling through multi-scale Lorentz attention, which works alongside Euclidean attention to strengthen the activation of discriminative, query-relevant moment features. For (ii), we introduce the Partial Order Preservation Loss $L_{pop}$, which imposes a fine-grained "text < video" semantic entailment constraint in hyperbolic space and thereby models the inter-video hierarchy prior between videos and texts.
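
To make the hybrid encoding concrete, below is a hypothetical sketch of one block: a standard Euclidean multi-head attention branch, a Lorentz branch that scores frames by negative squared Lorentzian distance, and a sigmoid gate conditioned on the mean-pooled video feature standing in for the Mean-Guided Adaptive Interaction Module. All module names and the gating form here are our assumptions; see the released code at the GitHub link above for the actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAttentionBlock(nn.Module):
    """Sketch of a hybrid block: a Euclidean self-attention branch and a
    Lorentz branch scoring frames by negative squared Lorentzian distance,
    fused by a gate conditioned on the mean-pooled video feature (our
    stand-in for the Mean-Guided Adaptive Interaction Module)."""

    def __init__(self, dim, heads=4, c=1.0):
        super().__init__()
        self.c = c
        self.euclid = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def lorentz_attn(self, x):
        # Lift queries/keys onto the hyperboloid and use the negative squared
        # Lorentzian distance, -||q - k||_L^2 = 2/c + 2<q, k>_L, as similarity.
        q, k = self.q_proj(x), self.k_proj(x)
        qt = torch.sqrt(1.0 / self.c + q.pow(2).sum(-1, keepdim=True))
        kt = torch.sqrt(1.0 / self.c + k.pow(2).sum(-1, keepdim=True))
        lorentz_ip = q @ k.transpose(-2, -1) - qt @ kt.transpose(-2, -1)
        attn = F.softmax(2.0 / self.c + 2.0 * lorentz_ip, dim=-1)
        return attn @ self.v_proj(x)

    def forward(self, x):  # x: (batch, frames, dim)
        e_out, _ = self.euclid(x, x, x)
        l_out = self.lorentz_attn(x)
        g = self.gate(x.mean(dim=1, keepdim=True))  # mean-guided gate
        return g * l_out + (1.0 - g) * e_out

# Toy usage: a batch of 2 videos, 16 frames, 128-dim features.
block = HybridAttentionBlock(dim=128)
print(block(torch.randn(2, 16, 128)).shape)  # -> torch.Size([2, 16, 128])
```

Gating on the mean-pooled feature is one simple way to let clip-level context decide how much hyperbolic versus Euclidean structure each frame receives; the module in the repository may differ.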


Models, datasets, Spaces, and collections citing this paper: none yet.