VPP-LLaVA Model Card

Model Details

Model Type: VPP-LLaVA is an enhanced multimodal model built upon the LLaVA architecture. It is designed to improve visual grounding capabilities by incorporating Visual Position Prompts (VPP) into the original LLaVA model. LLaVA itself is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.
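A minimal inference sketch is shown below. It assumes the checkpoint is used with the VPP-LLaVA/LLaVA codebase and that the original LLaVA-v1.5 loading utilities (`load_pretrained_model`, `process_images`, `tokenizer_image_token`) still apply; the prompt format, image path, and dtype handling are illustrative, so please check the VPP-LLaVA repository for the exact inference pipeline.

```python
# Hedged inference sketch. Assumes VPP-LLaVA keeps the original LLaVA-v1.5
# loading and preprocessing utilities; consult the VPP-LLaVA repository for
# the actual entry points and the expected conversation template.
import torch
from PIL import Image
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN

model_path = "wayneicloud/VPP-LLaVA-7b"  # Hugging Face repo id from this card
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path)
)

image = Image.open("example.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config).to(model.device, dtype=model.dtype)

# Grounding-style query: ask for the bounding box of a referred object.
# A real pipeline would wrap this in the appropriate LLaVA conversation template.
prompt = DEFAULT_IMAGE_TOKEN + "\nPlease provide the bounding box of the red cup on the table."
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())
```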

Model Date: VPP-LLaVA-7b was developed and trained in February 2025, building on the LLaVA-v1.5-7B base model.

Paper or Resources for More Information: the paper "Visual Position Prompt for MLLM based Visual Grounding" (https://arxiv.org/abs/2503.15426) and the VPP-LLaVA GitHub repository.

License

The original LLaVA model is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved. The enhancements and modifications for VPP-LLaVA are intended for research use only and follow the same licensing principles.

Where to Send Questions or Comments about the Model

For questions or comments about VPP-LLaVA, please refer to the GitHub repository: VPP-LLaVA

Intended Use

Primary Intended Uses: The primary use of VPP-LLaVA is for research on large multimodal models, particularly focusing on improving visual grounding and spatial reasoning capabilities. It aims to enhance the performance of LLaVA in tasks that require precise alignment of spatial information within images.

Primary Intended Users: The primary intended users of VPP-LLaVA are researchers and hobbyists in the fields of computer vision, natural language processing, machine learning, and artificial intelligence, who are interested in exploring advanced multimodal models and improving visual grounding performance.

Training Dataset

The training dataset for VPP-LLaVA is the VPP-SFT dataset, which is available on Hugging Face: VPP-SFT. It contains about 0.6M high-quality visual grounding samples, designed to train the model efficiently for improved visual grounding. Please refer to the VPP-LLaVA repository for more details.
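If VPP-SFT is published as a standard Hugging Face dataset repository, it can be pulled with the `datasets` library as sketched below; the repository id `wayneicloud/VPP-SFT` and the record layout are assumptions, so check the dataset card for the actual values.

```python
# Hypothetical loading sketch for VPP-SFT. The repo id, split name, and
# record fields are assumptions; see the VPP-SFT dataset card on Hugging Face.
from datasets import load_dataset

ds = load_dataset("wayneicloud/VPP-SFT", split="train")  # assumed repo id
print(len(ds))   # expected to be on the order of ~0.6M grounding samples
print(ds[0])     # inspect one instruction-following grounding record
```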

Evaluation Dataset

The evaluation dataset for VPP-LLaVA includes the following benchmarks:

  • RefCOCO
  • RefCOCO+
  • RefCOCOg
  • ReferIt
  • GSEval-BBox
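For these referring-expression benchmarks, grounding accuracy is conventionally reported as Acc@0.5: the fraction of predicted boxes whose IoU with the ground-truth box exceeds 0.5. The sketch below shows that generic metric; it is not the exact evaluation script used for VPP-LLaVA.

```python
# Generic Acc@0.5 metric for box-level visual grounding (RefCOCO-style).
# Boxes are (x1, y1, x2, y2); a prediction is correct if IoU > 0.5.
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds the threshold."""
    hits = sum(box_iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return hits / max(len(ground_truths), 1)
```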

Model Enhancements

VPP-LLaVA introduces Visual Position Prompts (VPP) to the original LLaVA architecture to enhance visual grounding capabilities. The enhancements are based on the research presented in the paper Visual Position Prompt for MLLM based Visual Grounding. The VPP mechanism includes:

  • Global VPP: Provides a global position reference by overlaying learnable, axis-like embeddings onto the input image.
  • Local VPP: Focuses on fine-grained localization by incorporating position-aware queries that suggest probable object locations.

These enhancements enable VPP-LLaVA to achieve state-of-the-art performance in visual grounding tasks, even when trained on a relatively smaller dataset compared to other models.
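To make the global-VPP idea concrete, the toy sketch below overlays a learnable, axis-like position map on the input image before encoding. The tensor shapes, module names, and blending scheme are illustrative assumptions, not the actual VPP-LLaVA implementation.

```python
# Toy sketch of a global Visual Position Prompt: a learnable, axis-like
# position map blended onto the input image. Shapes, names, and the blending
# factor are assumptions, not the actual VPP-LLaVA code.
import torch
import torch.nn as nn

class GlobalVPP(nn.Module):
    def __init__(self, channels: int = 3, size: int = 336):
        super().__init__()
        # Learnable prompt with the image's spatial layout, initialized from
        # normalized x/y coordinate "axes" so it encodes absolute position.
        xs = torch.linspace(-1, 1, size).view(1, 1, size).expand(1, size, size)
        ys = torch.linspace(-1, 1, size).view(1, size, 1).expand(1, size, size)
        init = torch.cat([xs, ys, torch.zeros(1, size, size)], dim=0)[:channels]
        self.prompt = nn.Parameter(init.clone())
        self.scale = nn.Parameter(torch.tensor(0.1))  # how strongly the prompt is blended in

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, C, H, W); the overlay gives the encoder a global position reference.
        return images + self.scale * self.prompt.unsqueeze(0)

vpp = GlobalVPP()
print(vpp(torch.randn(2, 3, 336, 336)).shape)  # torch.Size([2, 3, 336, 336])
```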

Zero-Shot Performance on an Unseen Dataset (GSEval)

VPP-LLaVA demonstrates strong zero-shot performance on the unseen GSEval benchmark, particularly in challenging part-object and multi-object scenarios. This capability matters for real-world applications where the model encounters previously unseen objects or complex scenes; its ability to generalize and accurately ground visual references in these cases highlights the model's robustness and adaptability.


Model Size

7.13B parameters (Safetensors checkpoint, BF16 weights).