Text-guided Fine-Grained Video Anomaly Detection
Abstract
The Text-guided Fine-Grained Video Anomaly Detection framework uses a Large Vision-Language Model to generate detailed anomaly heatmaps and textual descriptions, achieving state-of-the-art performance in video anomaly detection.
Video Anomaly Detection (VAD) aims to identify anomalous events within video segments and is of critical importance in scenarios such as surveillance and industrial process monitoring. Existing approaches remain largely semi-automated, requiring human assessment of detected anomalies, and traditional VAD methods offer only a coarse binary output: normal or anomalous. We propose Text-guided Fine-Grained Video Anomaly Detection (T-VAD), a framework built upon a Large Vision-Language Model (LVLM). T-VAD introduces an Anomaly Heatmap Decoder (AHD) that performs pixel-wise visual-textual feature alignment to generate fine-grained anomaly heatmaps. Furthermore, we design a Region-aware Anomaly Encoder (RAE) that transforms the heatmaps into learnable textual embeddings, guiding the LVLM to accurately identify and localize anomalous events in videos. This significantly enhances both the granularity and interactivity of anomaly detection. The proposed method achieves state-of-the-art performance, demonstrating 94.8% Area Under the Curve (AUC, specifically micro-AUC) and 67.8%/76.7% anomaly-heatmap accuracy (RBDC/TBDC) on the UBnormal dataset, and producing textual descriptions that were subjectively verified as preferable on the ShanghaiTech-based dataset (BLEU-4: 62.67 for targets, 88.84 for trajectories; Yes/No accuracy: 97.67%) and on the UBnormal dataset (BLEU-4: 50.32 for targets, 78.10 for trajectories; Yes/No accuracy: 89.73%).
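The abstract names two components whose mechanics a short sketch can make concrete. The PyTorch code below is a minimal, illustrative reading of them: pixel-wise alignment scores each spatial visual feature against an anomaly text embedding to produce a heatmap (an AHD-style step), and a small cross-attention pooler turns heatmap-weighted features into a fixed set of tokens that an LVLM could consume as embeddings (an RAE-style step). All tensor shapes, the function and class names (anomaly_heatmap, RegionAwareEncoder), and the attention pooling are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn.functional as F

def anomaly_heatmap(visual_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Toy pixel-wise visual-textual alignment: cosine similarity between each
    spatial visual feature and an anomaly text embedding, rescaled to [0, 1]."""
    # visual_feats: [H, W, D], text_emb: [D]
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = torch.einsum("hwd,d->hw", v, t)  # per-pixel similarity in [-1, 1]
    return (sim + 1.0) / 2.0               # heatmap in [0, 1]

class RegionAwareEncoder(torch.nn.Module):
    """Toy RAE stand-in: cross-attention pooling of heatmap-weighted visual
    features into a fixed number of learnable tokens for the LVLM."""
    def __init__(self, dim: int, num_tokens: int = 4):
        super().__init__()
        self.queries = torch.nn.Parameter(torch.randn(num_tokens, dim))
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, visual_feats: torch.Tensor, heatmap: torch.Tensor) -> torch.Tensor:
        # Weight each pixel's feature by its anomaly score, then let the
        # learnable queries attend over the weighted spatial features.
        weighted = visual_feats * heatmap.unsqueeze(-1)  # [H, W, D]
        kv = weighted.flatten(0, 1)                      # [H*W, D]
        attn = torch.softmax(self.queries @ kv.T / kv.shape[-1] ** 0.5, dim=-1)
        return self.proj(attn @ kv)                      # [num_tokens, D]

# Usage with random stand-in features (real features would come from the LVLM's
# visual backbone and the anomaly description's text encoder).
H, W, D = 16, 16, 256
feats, text = torch.randn(H, W, D), torch.randn(D)
hm = anomaly_heatmap(feats, text)
tokens = RegionAwareEncoder(D)(feats, hm)
print(hm.shape, tokens.shape)  # torch.Size([16, 16]) torch.Size([4, 256])
```

The design intuition sketched here follows the abstract: the heatmap grounds the anomaly spatially, and the pooled tokens let that grounding steer the LVLM's textual output.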