Spatial-LLaVA-7B Model Card
Model details
Model type:
This fine-tuned LLaVA model is trained from liuhaotian/llava-pretrain-vicuna-7b-v1.3 to improve the spatial-relation reasoning of large multimodal models.
LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.
Intended use
Primary intended uses: The primary use of LLaVA is research on large multimodal models and chatbots.
Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
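As a LLaVA-1.5-style model, it expects the Vicuna v1 conversation format at inference time. The sketch below builds such a single-turn prompt; the template is assumed from the upstream LLaVA codebase and is not specified by this card.

```python
# Build a LLaVA-1.5-style prompt (Vicuna v1 conversation template).
# Assumption: Spatial-LLaVA-7B keeps the upstream template, where the
# literal "<image>" token marks where image features are spliced in.
SYSTEM = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)

def build_prompt(question: str) -> str:
    """Return a single-turn prompt for an image + question pair."""
    return f"{SYSTEM} USER: <image>\n{question} ASSISTANT:"

prompt = build_prompt("Is the mug to the left or right of the laptop?")
print(prompt)
```

The model's answer is generated as the continuation after `ASSISTANT:`.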
Training dataset
Instruction following training: rogerxi/LLaVA-Spatial-Instruct-850K
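LLaVA instruction-tuning data is typically stored as a JSON list of records, each with an id, an image path, and alternating human/gpt conversation turns. A minimal sketch of one record, assuming LLaVA-Spatial-Instruct-850K follows the standard LLaVA-1.5 layout (the field values below are invented for illustration):

```python
import json

# One instruction-following record in the standard LLaVA JSON layout.
# Assumption: the dataset keeps LLaVA-1.5's schema; this sample is invented.
record = {
    "id": "example-0000",
    "image": "images/example-0000.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhich object is taller, the lamp or the chair?"},
        {"from": "gpt", "value": "The lamp is taller than the chair."},
    ],
}

# The training file is a JSON list of such records.
print(json.dumps([record], indent=2))
```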
Evaluation
A collection of 10 benchmarks:
| Model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-cn | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7b | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 58.3 | 31.1 |
| Spatial-LLaVA-7b | 79.7 | 62.7 | 48.7 | 68.7 | 58.5 | 87.2 | 1472.7 | 67.8 | 60.7 | 31.6 |
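From the table above, the per-benchmark change of Spatial-LLaVA-7b over LLaVA-1.5-7b can be computed directly (note that MME is an aggregate score, not a percentage):

```python
# Per-benchmark scores taken from the evaluation table above.
llava_15_7b = {
    "VQAv2": 78.5, "GQA": 62.0, "VizWiz": 50.0, "SQA": 66.8, "TextVQA": 58.2,
    "POPE": 85.9, "MME": 1510.7, "MM-Bench": 64.3, "MM-Bench-cn": 58.3, "MM-Vet": 31.1,
}
spatial_llava_7b = {
    "VQAv2": 79.7, "GQA": 62.7, "VizWiz": 48.7, "SQA": 68.7, "TextVQA": 58.5,
    "POPE": 87.2, "MME": 1472.7, "MM-Bench": 67.8, "MM-Bench-cn": 60.7, "MM-Vet": 31.6,
}
deltas = {k: round(spatial_llava_7b[k] - llava_15_7b[k], 1) for k in llava_15_7b}
print(deltas)
```

Spatial fine-tuning improves 8 of the 10 benchmarks; VizWiz and MME regress slightly.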
Spatial-Relation-Eval (built on SpatialRGPT-Bench):
Qualitative Spatial Relations
| Model | Below/Above | Left/Right | Big/Small | Tall/Short | Wide/Thin | Behind/Front | Avg |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7b | 53.91 | 53.49 | 45.36 | 40.00 | 50.00 | 51.04 | 48.97 |
| LLaVA-1.5-13b | 54.28 | 52.32 | 45.36 | 48.57 | 49.02 | 47.92 | 49.67 |
| Spatial-LLaVA-7b | 56.32 | 66.28 | 60.82 | 48.57 | 49.02 | 52.08 | 55.12 |
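The same comparison on the qualitative relations shows where the gains concentrate:

```python
# Per-relation accuracies taken from the qualitative table above.
relations = ["Below/Above", "Left/Right", "Big/Small", "Tall/Short", "Wide/Thin", "Behind/Front"]
llava_15_7b = [53.91, 53.49, 45.36, 40.00, 50.00, 51.04]
spatial_llava_7b = [56.32, 66.28, 60.82, 48.57, 49.02, 52.08]

deltas = {r: round(s - b, 2) for r, b, s in zip(relations, llava_15_7b, spatial_llava_7b)}
best = max(deltas, key=deltas.get)
print(deltas, best)
```

Left/Right and Big/Small see double-digit gains, while Wide/Thin dips marginally.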
Quantitative Spatial Relations
| Model | Direct Dist (m / ratio) | Horizontal Dist (m / ratio) | Vertical Dist (m / ratio) | Width (m / ratio) | Height (m / ratio) | Direction (Β° / ratio) |
|---|---|---|---|---|---|---|
| LLaVA-1.5-7b | 12.90 / 1.06 | 10.68 / 2.03 | 20.79 / 0.94 | 24.19 / 0.50 | 14.29 / 5.27 | 10.23 / 58.33 |
| LLaVA-1.5-13b | 13.71 / 0.93 | 10.68 / 3.56 | 16.83 / 0.85 | 15.32 / 0.57 | 17.67 / 5.8 | 14.77 / 54.29 |
| Spatial-LLaVA-7b | 24.19 / 0.57 | 14.56 / 0.62 | 41.58 / 0.42 | 22.58 / 1.12 | 18.25 / 2.92 | 20.45 / 56.47 |
Acknowledgements
We thank Liu Haotian et al. for the LLaVA pretraining scripts, weights, and LLaVA-v1.5 mixture dataset; the teams behind CLEVR, TextCaps, VisualMRC, and VQAv2 (via "HuggingFaceM4/the_cauldron"); remyxai for OpenSpaces; Anjie Cheng et al. for Spatial-Bench and the data pipeline; Google for OpenImages; and Hugging Face for their datasets infrastructure.