---
license: apache-2.0
inference: false
pipeline_tag: image-text-to-text
datasets:
- liuhaotian/LLaVA-Pretrain
- lmms-lab/LLaVA-ReCap-CC12M
- lmms-lab/LLaVA-NeXT-Data
---

<br>
<br>

# ViCToR Model Card

## Model details

**Paper or resources for more information:**
https://github.com/deepglint/Victor

**Where to send questions or comments about the model:**
https://github.com/deepglint/Victor/issues

## Results

| Benchmark                 | ViCToR-7B | LLaVA-1.5-13B | LLaVA-NeXT-8B | Ross |
| ------------------------- | --------- | ------------- | ------------- | ---- |
| MMStar                    | **54.3**  | 34.3          | 43.9          | 53.9 |
| RealWorldQA               | **65.6**  | 55.3          | 58.4          | 58.7 |
| MMBench<sup>cn,val</sup>  | **79.0**  | 67.8          | –             | –    |
| OCRBench                  | 556       | 337           | 531           | 553  |
| POPE                      | 88.4      | 88.4          | 87.1          | 88.1 |
| MMMU                      | 48.9      | 37.0          | 43.1          | 49.0 |
| AI2D                      | 79.5      | 61.1          | 72.8          | 79.5 |
| MME                       | 2071      | 1781          | 1908          | 1854 |
| SEED<sup>f</sup>          | **75.7**  | 68.2          | 72.5          | 73.6 |

## Citation

```
@inproceedings{Xie2024ViCToRIV,
  title={ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs},
  author={Yin Xie and Kaicheng Yang and Peirou Liang and Xiang An and Yongle Zhao and Yumeng Wang and Ziyong Feng and Roy Miles and Ismail Elezi and Jiankang Deng},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:273482504}
}
```