|
|
--- |
|
|
base_model: |
|
|
- OpenGVLab/InternVL3-8B |
|
|
license: apache-2.0 |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
**EN** | [中文](README_CN.md) |
|
|
|
|
|
# SenseNova-SI: Scaling Spatial Intelligence with Multimodal Foundation Models |
|
|
|
|
|
<a href="https://github.com/OpenSenseNova/SenseNova-SI" target="_blank"> |
|
|
<img alt="Code" src="https://img.shields.io/badge/SenseNova_SI-Code-100000?style=flat-square&logo=github&logoColor=white" height="20" /> |
|
|
</a> |
|
|
<a href="https://arxiv.org/abs/2511.13719" target="_blank"> |
|
|
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-SenseNova_SI-red?logo=arxiv" height="20" /> |
|
|
</a> |
|
|
<a href="https://github.com/EvolvingLMMs-Lab/EASI" target="_blank"> |
|
|
<img alt="Code" src="https://img.shields.io/badge/EASI-Code-100000?style=flat-square&logo=github&logoColor=white" height="20" /> |
|
|
</a> |
|
|
<a href="https://huggingface.co/spaces/lmms-lab-si/EASI-Leaderboard" target="_blank"> |
|
|
<img alt="Leaderboard" src="https://img.shields.io/badge/%F0%9F%A4%97%20_EASI-Leaderboard-ffc107?color=ffc107&logoColor=white" height="20" /> |
|
|
</a> |
|
|
|
|
|
🔥 Please check out our newly released [**SenseNova-SI-1.1-InternVL3-2B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-2B) and
[**SenseNova-SI-1.1-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B).
|
|
|
|
|
⏳***The current model will be deprecated in due course.*** |
|
|
|
|
|
## Overview |
|
|
Despite remarkable progress, leading multimodal models still exhibit notable deficiencies in spatial intelligence:
the ability to make metric estimates, understand spatial relationships, handle viewpoint changes, and integrate information across complex scenes.
We take a scaling perspective: we construct and curate a large-scale, comprehensive collection of spatial intelligence data and, through continued training on powerful multimodal foundation models, cultivate multi-faceted spatial understanding in the SenseNova-SI family of models.
|
|
*In the future, SenseNova-SI will be integrated with larger-scale in-house models.* |
|
|
|
|
|
## Release Information |
|
|
Currently, we build SenseNova-SI upon popular open-source foundation models to maximize compatibility with existing research pipelines.
In this release, we present
[**SenseNova-SI-InternVL3-2B**](https://huggingface.co/sensenova/SenseNova-SI-InternVL3-2B) and
[**SenseNova-SI-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-InternVL3-8B),
which achieve state-of-the-art performance among open-source models of comparable size across four recent spatial intelligence benchmarks:
**VSI**, **MMSI**, **MindCube**, and **ViewSpatial**.
|
|
|
|
|
<table> |
|
|
<thead> |
|
|
<tr> |
|
|
<th>Model</th> |
|
|
<th>VSI</th> |
|
|
<th>MMSI</th> |
|
|
<th>MindCube-Tiny</th> |
|
|
<th>ViewSpatial</th> |
|
|
</tr> |
|
|
</thead> |
|
|
<tbody> |
|
|
<tr style="background:#F2F0EF;font-weight:700;text-align:center;"> |
|
|
<td colspan="5"><em>Open-source Models (~2B)</em></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>InternVL3-2B</td><td>32.98</td><td>26.50</td><td>37.50</td><td>32.56</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Qwen3-VL-2B-Instruct</td><td>50.36</td><td>28.90</td><td>34.52</td><td>36.97</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>MindCube-3B-RawQA-SFT</td><td>17.24</td><td>1.70</td><td>51.73</td><td>24.14</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>MindCube-3B-Aug-CGMap-FFR-Out-SFT</td><td>29.60</td><td>29.10</td><td>41.06</td><td>30.90</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>MindCube-3B-Plain-CGMap-FFR-Out-SFT</td><td>29.93</td><td>30.40</td><td>39.90</td><td>31.20</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>SpatialLadder-3B</td><td>44.86</td><td>27.40</td><td>43.46</td><td>39.85</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>SpatialMLLM-4B</td><td>45.98</td><td>26.10</td><td>33.46</td><td>34.66</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>SenseNova-SI-InternVL3-2B</strong></td> |
|
|
<td><strong>58.47</strong></td> |
|
|
<td><strong>35.50</strong></td> |
|
|
<td><strong>71.35</strong></td> |
|
|
<td><strong>40.62</strong></td> |
|
|
</tr> |
|
|
<tr style="background:#F2F0EF;font-weight:700;text-align:center;"> |
|
|
<td colspan="5"><em>Open-source Models (~8B)</em></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>InternVL3-8B</td><td>42.14</td><td>28.00</td><td>41.54</td><td>38.66</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Qwen3-VL-8B-Instruct</td><td>57.90</td><td>31.10</td><td>29.42</td><td>42.20</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>BAGEL-7B</td><td>30.90</td><td>33.10</td><td>34.71</td><td>41.32</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>SpaceR-7B</td><td>36.29</td><td>27.40</td><td>37.98</td><td>35.85</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>ViLaSR-7B</td><td>44.63</td><td>30.20</td><td>35.10</td><td>35.71</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>SenseNova-SI-InternVL3-8B</strong></td> |
|
|
<td><strong>62.80</strong></td> |
|
|
<td><strong>37.90</strong></td> |
|
|
<td><strong>89.33</strong></td> |
|
|
<td><strong>53.92</strong></td> |
|
|
</tr> |
|
|
<tr style="background:#F2F0EF;color:#6b7280;font-weight:600;text-align:center;"> |
|
|
<td colspan="5"><em>Proprietary Models</em></td> |
|
|
</tr> |
|
|
<tr style="color:#6b7280;"> |
|
|
<td>Gemini-2.5-pro-2025-06</td><td>53.57</td><td>38.00</td><td>57.60</td><td>46.06</td> |
|
|
</tr> |
|
|
<tr style="color:#6b7280;"> |
|
|
<td>Grok-4-2025-07-09</td><td>47.92</td><td>37.80</td><td>63.56</td><td>43.23</td> |
|
|
</tr> |
|
|
<tr style="color:#6b7280;"> |
|
|
<td>GPT-5-2025-08-07</td><td>55.03</td><td>41.80</td><td>56.30</td><td>45.59</td> |
|
|
</tr> |
|
|
</tbody> |
|
|
</table> |
|
|
|
|
|
## What's Next? |
|
|
We will release the accompanying technical report shortly. Please stay tuned! |
|
|
|
|
|
## 🛠️ QuickStart |
|
|
|
|
|
### Installation |
|
|
|
|
|
We recommend using [uv](https://docs.astral.sh/uv/) to manage the environment. |
|
|
|
|
|
> uv installation guide: <https://docs.astral.sh/uv/getting-started/installation/#installing-uv> |
|
|
|
|
|
```bash |
|
|
git clone [email protected]:OpenSenseNova/SenseNova-SI.git |
|
|
cd SenseNova-SI/ |
|
|
uv sync --extra cu124  # or one of [cu118|cu121|cu124|cu126|cu128|cu129], depending on your CUDA version
|
|
source .venv/bin/activate |
|
|
``` |
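After activating the environment, you can optionally run a quick sanity check (a minimal sketch, assuming `torch` and `transformers` are installed by the sync) to confirm the GPU is visible:

```python
# Minimal environment sanity check; assumes `uv sync` installed torch and transformers.
import torch
import transformers

print(f"torch {torch.__version__}, transformers {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
```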
|
|
|
|
|
### How to Use |
|
|
|
|
|
Here's an example demonstrating how to use the SenseNova-SI model for multi-image visual question answering with the `transformers` library. |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from PIL import Image |
|
|
from transformers import AutoModel, AutoProcessor |
|
|
|
|
|
model_path = "sensenova/SenseNova-SI-1.1-InternVL3-8B"  # or "sensenova/SenseNova-SI-InternVL3-8B" for the checkpoint in this repository
|
|
|
|
|
# Load processor and model |
|
|
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True) |
|
|
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True).eval() |
|
|
|
|
|
# Example: Pos-Obj-Obj subset of MMSI-Bench (images taken from the `examples/` directory of the GitHub repo)
# Ensure 'examples/Q1_1.png' and 'examples/Q1_2.png' exist locally, or substitute your own images.
try:
    image1 = Image.open("./examples/Q1_1.png").convert("RGB")
    image2 = Image.open("./examples/Q1_2.png").convert("RGB")
except FileNotFoundError:
    print("Example images not found. Please ensure 'examples/Q1_1.png' and 'examples/Q1_2.png' are available, or provide your own images.")
    # Fallback placeholders so the demo still runs
    image1 = Image.new("RGB", (500, 500), color="red")
    image2 = Image.new("RGB", (500, 500), color="blue")
|
|
|
|
|
question = (
    "<image><image>\n"
    "You are standing in front of the dice pattern and observing it. Where is the desk lamp approximately located relative to you?\n"
    "Options: A: 90 degrees counterclockwise, B: 90 degrees clockwise, C: 135 degrees counterclockwise, D: 135 degrees clockwise"
)
|
|
|
|
|
# Prepare inputs |
|
|
inputs = processor(text=question, images=[image1, image2], return_tensors="pt").to(model.device) |
|
|
|
|
|
# Generate response |
|
|
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100)

response = processor.batch_decode(output, skip_special_tokens=True)[0]
|
|
|
|
|
print(f"Question: {question}") |
|
|
print(f"Answer: {response}") |
|
|
``` |
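If your environment cannot load an `AutoProcessor` for this checkpoint, an alternative is the native InternVL3-style `chat()` interface exposed by the remote code. The following is a minimal sketch, assuming the model inherits InternVL3's `model.chat()` API and its standard 448×448 ImageNet-normalized inputs; the dynamic tiling used in the InternVL3 reference code is omitted for brevity:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

model_path = "sensenova/SenseNova-SI-1.1-InternVL3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
).eval()

# One 448x448 tile per image with ImageNet normalization (no dynamic tiling for brevity).
transform = T.Compose([
    T.Resize((448, 448), interpolation=InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
images = [Image.open(p).convert("RGB") for p in ("./examples/Q1_1.png", "./examples/Q1_2.png")]
pixel_values = torch.stack([transform(img) for img in images]).to(torch.bfloat16).to(model.device)
num_patches_list = [1, 1]  # one tile per image

question = (
    "Image-1: <image>\nImage-2: <image>\n"
    "You are standing in front of the dice pattern and observing it. "
    "Where is the desk lamp approximately located relative to you?\n"
    "Options: A: 90 degrees counterclockwise, B: 90 degrees clockwise, "
    "C: 135 degrees counterclockwise, D: 135 degrees clockwise"
)

generation_config = dict(max_new_tokens=100, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config,
                      num_patches_list=num_patches_list)
print(f"Answer: {response}")
```

Here, `chat()` returns only the generated answer text, so no extra decoding step is needed.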
|
|
|
|
|
## 🖊️ Citation |
|
|
|
|
|
```bib |
|
|
@article{sensenova-si, |
|
|
title = {Scaling Spatial Intelligence with Multimodal Foundation Models}, |
|
|
author = {Cai, Zhongang and Wang, Ruisi and Gu, Chenyang and Pu, Fanyi and Xu, Junxiang and Wang, Yubo and Yin, Wanqi and Yang, Zhitao and Wei, Chen and Sun, Qingping and Zhou, Tongxi and Li, Jiaqi and Pang, Hui En and Qian, Oscar and Wei, Yukun and Lin, Zhiqian and Shi, Xuanke and Deng, Kewang and Han, Xiaoyang and Chen, Zukai and Fan, Xiangyu and Deng, Hanming and Lu, Lewei and Pan, Liang and Li, Bo and Liu, Ziwei and Wang, Quan and Lin, Dahua and Yang, Lei}, |
|
|
journal = {arXiv preprint arXiv:2511.13719}, |
|
|
year = {2025} |
|
|
} |
|
|
``` |