---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- liuhaotian/LLaVA-Instruct-150K
- FreedomIntelligence/ALLaVA-4V-Chinese
- shareAI/ShareGPT-Chinese-English-90k
language:
- zh
- en
pipeline_tag: visual-question-answering
---
<br>
<br>

# Model Card for 360VL
<p align="center">
 <img src="https://github.com/360CVGroup/360VL/blob/master/qh360_vl/360vl.PNG?raw=true" width=100%/>
</p>

**360VL** is developed based on the Llama 3 language model and is the industry's first open-source large multi-modal model built on **Llama3-70B** [[🤗Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)]. Beyond adopting Llama 3 as its language model, 360VL introduces a globally aware multi-branch projector architecture, which gives the model stronger image understanding capabilities.

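The exact projector lives in the released remote code and is not reproduced in this card; the sketch below only illustrates the general idea of a globally aware multi-branch design, where one branch keeps per-patch detail while a second branch summarizes the whole image and is fused back in. All names and dimensions here (`GlobalAwareProjector`, `vision_dim`, `llm_dim`) are illustrative assumptions, not the shipped implementation:

```python
import torch
import torch.nn as nn

class GlobalAwareProjector(nn.Module):
    """Illustrative two-branch projector (not the released 360VL code).

    A local branch projects each vision-tower patch token into the LLM
    embedding space; a global branch mean-pools the patch tokens into a
    single image summary, which is broadcast back onto every local token.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 8192):
        super().__init__()
        self.local_proj = nn.Linear(vision_dim, llm_dim)   # per-patch detail
        self.global_proj = nn.Linear(vision_dim, llm_dim)  # whole-image context
        self.fuse = nn.Sequential(nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vision_dim) from the vision tower
        local = self.local_proj(vision_tokens)
        global_ctx = self.global_proj(vision_tokens.mean(dim=1, keepdim=True))
        return self.fuse(local + global_ctx)  # broadcast-add the global summary


# Smoke test with ViT-like shapes: 48x48 = 2304 patches at 672x672, patch size 14.
tokens = torch.randn(1, 2304, 1024)
print(GlobalAwareProjector()(tokens).shape)  # torch.Size([1, 2304, 8192])
```
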
## Model Zoo

360VL has released the following versions.

Model | Download
|---|---
360VL-8B | [🤗 Hugging Face](https://huggingface.co/qihoo360/360VL-8B)
360VL-70B | [🤗 Hugging Face](https://huggingface.co/qihoo360/360VL-70B)

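Both checkpoints can be fetched ahead of time with the Hugging Face Hub client; a minimal sketch (pinning a revision or choosing a cache directory is left to you):

```python
from huggingface_hub import snapshot_download

# Download the 8B checkpoint; swap in "qihoo360/360VL-70B" for the larger model.
local_dir = snapshot_download("qihoo360/360VL-8B")
print(local_dir)  # this path can be passed to from_pretrained(...)
```
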
## Features

360VL offers the following features:

- Multi-round text-image conversations: 360VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.

- Bilingual text support: 360VL supports conversations in both English and Chinese, including text recognition in images.

- Strong image comprehension: 360VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.

- Fine-grained image resolution: 360VL supports image understanding at a higher resolution of 672×672.

## Performance

| Model | Checkpoints | MMB<sub>T</sub> | MMB<sub>D</sub> | MMB-CN<sub>T</sub> | MMB-CN<sub>D</sub> | MMMU<sub>V</sub> | MMMU<sub>T</sub> | MME |
|:--------------------|:------------:|:----:|:------:|:------:|:-------:|:-------:|:-------:|:-------:|
| QWen-VL-Chat | [🤗LINK](https://huggingface.co/Qwen/Qwen-VL-Chat) | 61.8 | 60.6 | 56.3 | 56.7 | 37.0 | 32.9 | 1860 |
| mPLUG-Owl2 | [🤖LINK](https://www.modelscope.cn/models/iic/mPLUG-Owl2/summary) | 66.0 | 66.5 | 60.3 | 59.5 | 34.7 | 32.1 | 1786.4 |
| CogVLM | [🤗LINK](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf) | 65.8 | 63.7 | 55.9 | 53.8 | 37.3 | 30.1 | 1736.6 |
| Monkey-Chat | [🤗LINK](https://huggingface.co/echo840/Monkey-Chat) | 72.4 | 71.0 | 67.5 | 65.8 | 40.7 | - | 1887.4 |
| MM1-7B-Chat | [LINK](https://ar5iv.labs.arxiv.org/html/2403.09611) | - | 72.3 | - | - | 37.0 | 35.6 | 1858.2 |
| IDEFICS2-8B | [🤗LINK](https://huggingface.co/HuggingFaceM4/idefics2-8b) | 75.7 | 75.3 | 68.6 | 67.3 | 43.0 | 37.7 | 1847.6 |
| Honeybee | [LINK](https://github.com/kakaobrain/honeybee) | 74.3 | 74.3 | - | - | 36.2 | - | 1950 |
| SVIT-v1.5-13B | [🤗LINK](https://huggingface.co/Isaachhe/svit-v1.5-13b-full) | 69.1 | - | 63.1 | - | 38.0 | 33.3 | 1889 |
| LLaVA-v1.5-13B | [🤗LINK](https://huggingface.co/liuhaotian/llava-v1.5-13b) | 69.2 | 69.2 | 65.0 | 63.6 | 36.4 | 33.6 | 1826.7 |
| LLaVA-v1.6-13B | [🤗LINK](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b) | 70.0 | 70.7 | 68.5 | 64.3 | 36.2 | - | 1901 |
| YI-VL-34B | [🤗LINK](https://huggingface.co/01-ai/Yi-VL-34B) | 72.4 | 71.1 | 70.7 | 71.4 | 45.1 | 41.6 | 2050.2 |
| **360VL-8B** | [🤗LINK](https://huggingface.co/qihoo360/360VL-8B) | 75.3 | 73.7 | 71.1 | 68.6 | 39.7 | 37.1 | 1899.1 |
| **360VL-70B** | [🤗LINK](https://huggingface.co/qihoo360/360VL-70B) | 78.1 | 80.4 | 76.9 | 77.7 | 50.8 | 44.3 | 1983.2 |

## Quick Start 🤗

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qh360_vl-70B"

# The 360VL classes ship as remote code, so trust_remote_code=True is required.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
tokenizer.pad_token = tokenizer.eos_token

image = Image.open("docs/008.jpg").convert('RGB')
query = "Who is this cartoon character?"
# Llama 3 marks the end of a turn with <|eot_id|>.
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

output_ids = model.generate(
    input_ids,
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens, skipping the prompt.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```

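The same flow should serve the smaller released checkpoint; a hedged variant for single-GPU experimentation, assuming the hub repo id from the Model Zoo table and letting `device_map="auto"` place layers:

```python
# Same API as above, with the 8B checkpoint from the Model Zoo table.
checkpoint = "qihoo360/360VL-8B"
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,
    device_map="auto",        # let accelerate spread layers across available devices
    trust_remote_code=True,
).eval()
```
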
**Model type:**
360VL-70B is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data.
It is an auto-regressive language model based on the transformer architecture.
Base LLM: [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)

**Model date:**
360VL-70B was trained in May 2024.

## License

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
The content of this project itself is licensed under the Apache License 2.0.

**Where to send questions or comments about the model:**
https://github.com/360CVGroup/360VL

## Related Projects

This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!

- [Meta Llama 3](https://github.com/meta-llama/llama3)
- [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
- [Honeybee: Locality-enhanced Projector for Multimodal LLM](https://github.com/kakaobrain/honeybee)