DeepGlint-AI
/

ViCToR-LLaVA-SigLIP2-Qwen2.5-7b

	---
	license: apache-2.0
	inference: false
	pipeline_tag: image-text-to-text
	datasets:
	- liuhaotian/LLaVA-Pretrain
	- lmms-lab/LLaVA-ReCap-CC12M
	- lmms-lab/LLaVA-NeXT-Data
	---

	<br>
	<br>

	# ViCToR Model Card

	## Model details

	Paper or resources for more information:
	https://github.com/deepglint/Victor


	Where to send questions or comments about the model:
	https://github.com/deepglint/Victor/issues


	## Results
	| Benchmark | ViCToR-7B | LLaVA-1.5-13B | LLaVA-NeXT-8B | Ross |
	| ---------------- | --------- | ------------- | ------------- | ---- |
	| MMStar | 54.3 | 34.3 | 43.9 | 53.9 |
	| RealWorldQA | 65.6 | 55.3 | 58.4 | 58.7 |
	| MMBench<sup>cn,val</sup> | 79.0 | 67.8 | – | – |
	| OCRBench | 556 | 337 | 531 | 553 |
	| POPE | 88.4 | 88.4 | 87.1 | 88.1 |
	| MMMU | 48.9 | 37.0 | 43.1 | 49.0 |
	| AI2D | 79.5 | 61.1 | 72.8 | 79.5 |
	| MME | 2071 | 1781 | 1908 | 1854 |
	| SEED<sup>f</sup> | 75.7 | 68.2 | 72.5 | 73.6 |
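
To make the comparison easier to read, the following is a minimal sketch that computes, for each benchmark, the margin between ViCToR-7B and the strongest of the other listed models. The scores are copied from the table above. The dictionary layout and the helper name `margin_over_best_baseline` are illustrative and not part of the ViCToR codebase. The sketch assumes that a higher score is better on every benchmark, and it writes two row labels as `MMMU` and `AI2D` on the assumption that those are the benchmarks the rows refer to.

```python
# Scores copied from the Results table; None marks cells the table leaves empty.
RESULTS = {
    "MMStar":      {"ViCToR-7B": 54.3, "LLaVA-1.5-13B": 34.3, "LLaVA-NeXT-8B": 43.9, "Ross": 53.9},
    "RealWorldQA": {"ViCToR-7B": 65.6, "LLaVA-1.5-13B": 55.3, "LLaVA-NeXT-8B": 58.4, "Ross": 58.7},
    "MMBench-cn":  {"ViCToR-7B": 79.0, "LLaVA-1.5-13B": 67.8, "LLaVA-NeXT-8B": None, "Ross": None},
    "OCRBench":    {"ViCToR-7B": 556,  "LLaVA-1.5-13B": 337,  "LLaVA-NeXT-8B": 531,  "Ross": 553},
    "POPE":        {"ViCToR-7B": 88.4, "LLaVA-1.5-13B": 88.4, "LLaVA-NeXT-8B": 87.1, "Ross": 88.1},
    "MMMU":        {"ViCToR-7B": 48.9, "LLaVA-1.5-13B": 37.0, "LLaVA-NeXT-8B": 43.1, "Ross": 49.0},
    "AI2D":        {"ViCToR-7B": 79.5, "LLaVA-1.5-13B": 61.1, "LLaVA-NeXT-8B": 72.8, "Ross": 79.5},
    "MME":         {"ViCToR-7B": 2071, "LLaVA-1.5-13B": 1781, "LLaVA-NeXT-8B": 1908, "Ross": 1854},
    "SEED":        {"ViCToR-7B": 75.7, "LLaVA-1.5-13B": 68.2, "LLaVA-NeXT-8B": 72.5, "Ross": 73.6},
}


def margin_over_best_baseline(scores, model="ViCToR-7B"):
    """Return the model's score minus the best reported score among the other models.

    Assumes higher is better. Missing entries (None) are skipped.
    """
    baselines = [v for name, v in scores.items() if name != model and v is not None]
    # Round to one decimal to hide floating-point noise from the subtraction.
    return round(scores[model] - max(baselines), 1)


for benchmark, scores in RESULTS.items():
    print(f"{benchmark:12s} {margin_over_best_baseline(scores):+.1f}")
```

A positive margin means ViCToR-7B scores above every other listed model on that benchmark, zero means it ties the best one, and a negative margin means at least one listed model scores higher.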

	## Citation
	```
	@inproceedings{Xie2024ViCToRIV,
	title={ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs},
	author={Yin Xie and Kaicheng Yang and Peirou Liang and Xiang An and Yongle Zhao and Yumeng Wang and Ziyong Feng and Roy Miles and Ismail Elezi and Jiankang Deng},
	year={2024},
	url={https://api.semanticscholar.org/CorpusID:273482504}
	}
	```