---
base_model:
- MAmmoTH-VL/MAmmoTH-VL-8B
datasets:
- TIGER-Lab/VisualWebInstruct
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
---
|
|
|
# Introduction

MAmmoTH-VL2 is the model obtained by training MAmmoTH-VL-8B on the VisualWebInstruct dataset.
|
|
|
# Links

[Github](https://github.com/TIGER-AI-Lab/VisualWebInstruct) | [Paper](https://arxiv.org/abs/2503.10582) | [Website](https://tiger-ai-lab.github.io/VisualWebInstruct/) | [Demo](https://huggingface.co/spaces/TIGER-Lab/MAmmoTH-VL2)
|
|
|
|
|
# Example Usage

## Requirements

```python
llava==1.7.0.dev0  # pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
torch==2.5.1
```
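
Before running the example below, you can optionally confirm that the LLaVA-NeXT fork and a CUDA-enabled PyTorch build are importable. This is a minimal sanity-check sketch, not part of the official setup:

```python
# Optional sanity check (illustrative sketch, not part of the official setup):
# confirm the LLaVA-NeXT fork and a CUDA-enabled PyTorch build are available.
import torch
import llava

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("llava package located at:", llava.__file__)
```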
|
To perform inference with MAmmoTH-VL2, you can use the following code snippet:
|
```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images
from llava.constants import DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

from PIL import Image
import requests
import copy
import torch

# Load MAmmoTH-VL2 model
pretrained = "TIGER-Lab/MAmmoTH-VL2"
model_name = "llava_qwen"
device = "cuda:3"  # Specify a single GPU
device_map = {"": device}

# Load model
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,
    None,
    model_name,
    device_map=device_map,
    multimodal=True
)
model.eval()
model = model.to(device)

# Load image
image_url = "https://raw.githubusercontent.com/jymmmmm/VISUALWEBINSTRUCT/main/image.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
images = [image]
image_sizes = [[image.size[0], image.size[1]]]

# Prepare prompt
prompt = "In the picture shown below, prove ΔWXY and ΔZWY are similar. Please conclude your answer as Answer: xxx at the end if possible."

# Set up conversation template
conv_template = "qwen_2_5"
conv = copy.deepcopy(conv_templates[conv_template])

# Add question with image
question = DEFAULT_IMAGE_TOKEN + "\n" + prompt
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

# Prepare model inputs
inputs = tokenizer(
    prompt_question,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)
input_ids = inputs.input_ids.to(device)
attention_mask = inputs.attention_mask.to(device)

# Process image
image_tensor = process_images(images, image_processor, model.config)
if isinstance(image_tensor, list):
    image_tensor = [img.to(dtype=torch.float16, device=device) for img in image_tensor]
else:
    image_tensor = image_tensor.to(dtype=torch.float16, device=device)

# Generate response
with torch.no_grad():
    outputs = model.generate(
        input_ids,
        attention_mask=attention_mask,
        images=image_tensor,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0,
        max_new_tokens=512,
    )

# Decode response
input_token_len = input_ids.shape[1]
response = tokenizer.batch_decode(outputs[:, input_token_len:], skip_special_tokens=True)[0]
print("Response:", response)
```
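
The prompt above asks the model to conclude with a line of the form `Answer: xxx`. If you want to pull that final answer out of `response` programmatically, a small helper like the following can be used. This is an illustrative sketch; the function name and regex are our own, not part of the LLaVA API:

```python
import re

def extract_final_answer(response):
    # Hypothetical helper: the prompt asks the model to conclude with "Answer: xxx",
    # so take the text after the last "Answer:" marker; return None if it is absent.
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None

print("Final answer:", extract_final_answer(response))
```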
|
|
|
|
|
|
|
|
|
|
|
# Citation

```
@article{visualwebinstruct,
  title={VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search},
  author={Jia, Yiming and Li, Jiachen and Yue, Xiang and Li, Bo and Nie, Ping and Zou, Kai and Chen, Wenhu},
  journal={arXiv preprint arXiv:2503.10582},
  year={2025}
}
```