# Spatial-LLaVA-7B Model Card

GitHub Repo

🤗 Hugging Face Space Demo

## 🤖 Model details

**Model type:**

This fine-tuned LLaVA model is trained from liuhaotian/llava-pretrain-vicuna-7b-v1.3 to improve the spatial-relation reasoning of large multimodal models.

LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.
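As a starting sketch, the checkpoint can be loaded with the official LLaVA codebase (https://github.com/haotian-liu/LLaVA), following the Python API pattern in its README; this assumes the rogerxi/Spatial-LLaVA-7B repo holds a standard LLaVA-format checkpoint.

```python
# Minimal loading sketch using the official LLaVA codebase.
# Assumes "rogerxi/Spatial-LLaVA-7B" is a standard LLaVA-format checkpoint.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "rogerxi/Spatial-LLaVA-7B"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
```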

## 🎯 Intended use

**Primary intended uses:** The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence. A hedged inference sketch follows below.
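For quick experimentation, here is an inference sketch following the `eval_model` pattern from the LLaVA README; the image URL and the spatial query are illustrative placeholders, not from this card.

```python
# Sketch: one-off spatial-relation query via the LLaVA codebase's helper.
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "rogerxi/Spatial-LLaVA-7B"  # assumed LLaVA-format checkpoint
prompt = "Is the chair to the left or to the right of the table?"  # example query
image_file = "https://llava-vl.github.io/static/web/view.jpg"  # placeholder image

args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)  # prints the model's answer to stdout
```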

## 📚 Training dataset

Instruction-following training data: rogerxi/LLaVA-Spatial-Instruct-850K
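To inspect a few records, a small sketch with the Hugging Face `datasets` library; that the repo exposes a loadable `train` split is an assumption about how the dataset is laid out.

```python
# Sketch: stream a couple of records from the instruction-tuning data.
# Assumes the repo exposes a "train" split loadable by `datasets`.
from datasets import load_dataset

ds = load_dataset("rogerxi/LLaVA-Spatial-Instruct-850K",
                  split="train", streaming=True)
for example in ds.take(2):
    print(example)
```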

## 📊 Evaluation

Results on a collection of 10 benchmarks:

| Model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7b | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 58.3 | 31.1 |
| Spatial-LLaVA-7b | 79.7 | 62.7 | 48.7 | 68.7 | 58.5 | 87.2 | 1472.7 | 67.8 | 60.7 | 31.6 |

**Spatial-Relation-Eval** (built on SpatialRGPT-Bench):

### Qualitative Spatial Relations

Accuracy (%) per relation category:

| Model | Below/Above | Left/Right | Big/Small | Tall/Short | Wide/Thin | Behind/Front | Avg |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7b | 53.91 | 53.49 | 45.36 | 40.00 | 50.00 | 51.04 | 48.97 |
| LLaVA-1.5-13b | 54.28 | 52.32 | 45.36 | 48.57 | 49.02 | 47.92 | 49.67 |
| Spatial-LLaVA-7b | 56.32 | 66.28 | 60.82 | 48.57 | 49.02 | 52.08 | 55.12 |

### Quantitative Spatial Relations

| Model | Direct Dist (m / ratio) | Horizontal Dist (m / ratio) | Vertical Dist (m / ratio) | Width (m / ratio) | Height (m / ratio) | Direction (° / ratio) |
|---|---|---|---|---|---|---|
| LLaVA-1.5-7b | 12.90 / 1.06 | 10.68 / 2.03 | 20.79 / 0.94 | 24.19 / 0.50 | 14.29 / 5.27 | 10.23 / 58.33 |
| LLaVA-1.5-13b | 13.71 / 0.93 | 10.68 / 3.56 | 16.83 / 0.85 | 15.32 / 0.57 | 17.67 / 5.80 | 14.77 / 54.29 |
| Spatial-LLaVA-7b | 24.19 / 0.57 | 14.56 / 0.62 | 41.58 / 0.42 | 22.58 / 1.12 | 18.25 / 2.92 | 20.45 / 56.47 |

πŸ™ Acknowledgements

We thank Haotian Liu et al. for the LLaVA pretraining scripts, weights, and the LLaVA-v1.5 mixture dataset; the teams behind CLEVR, TextCaps, VisualMRC, and VQAv2 (via "HuggingFaceM4/the_cauldron"); remyxai for OpenSpaces; An-Chieh Cheng et al. for SpatialRGPT-Bench and its data pipeline; Google for OpenImages; and Hugging Face for their datasets infrastructure.
