---
base_model:
- MAmmoTH-VL/MAmmoTH-VL-8B
datasets:
- TIGER-Lab/VisualWebInstruct
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
---
|
|
|
# Introduction

MAmmoTH-VL2 is the model obtained by training MAmmoTH-VL-8B on the VisualWebInstruct dataset.
|
|
|
# Links

[Github](https://github.com/TIGER-AI-Lab/VisualWebInstruct) | [Paper](https://arxiv.org/abs/2503.10582) | [Website](https://tiger-ai-lab.github.io/VisualWebInstruct/) | [Demo](https://huggingface.co/spaces/TIGER-Lab/MAmmoTH-VL2)
|
|
|
|
|
# Example Usage

## Requirements

```python
llava==1.7.0.dev0  # pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
torch==2.5.1
```
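
Before running the example below, you can optionally confirm that the LLaVA-NeXT fork and a CUDA-enabled PyTorch build are importable. This is a minimal sanity-check sketch, not part of the official setup:

```python
# Optional sanity check (illustrative sketch, not part of the official setup):
# confirm the LLaVA-NeXT fork and a CUDA-enabled PyTorch build are available.
import torch
import llava

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("llava package located at:", llava.__file__)
```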
|
To perform inference with MAmmoTH-VL2, you can use the following code snippet:
|
```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images
from llava.constants import DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

from PIL import Image
import requests
import copy
import torch

# Load MAmmoTH-VL2 model
pretrained = "TIGER-Lab/MAmmoTH-VL2"
model_name = "llava_qwen"
device = "cuda:3"  # Specify a single GPU
device_map = {"": device}

# Load model
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,
    None,
    model_name,
    device_map=device_map,
    multimodal=True
)
model.eval()
model = model.to(device)

# Load image
image_url = "https://raw.githubusercontent.com/jymmmmm/VISUALWEBINSTRUCT/main/image.png"
image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
images = [image]
image_sizes = [[image.size[0], image.size[1]]]

# Prepare prompt
prompt = "In the picture shown below, prove ΔWXY and ΔZWY are similar. Please conclude your answer as Answer: xxx at the end if possible."

# Set up conversation template
conv_template = "qwen_2_5"
conv = copy.deepcopy(conv_templates[conv_template])

# Add question with image
question = DEFAULT_IMAGE_TOKEN + "\n" + prompt
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

# Prepare model inputs
inputs = tokenizer(
    prompt_question,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)
input_ids = inputs.input_ids.to(device)
attention_mask = inputs.attention_mask.to(device)

# Process image
image_tensor = process_images(images, image_processor, model.config)
if isinstance(image_tensor, list):
    image_tensor = [img.to(dtype=torch.float16, device=device) for img in image_tensor]
else:
    image_tensor = image_tensor.to(dtype=torch.float16, device=device)

# Generate response
with torch.no_grad():
    outputs = model.generate(
        input_ids,
        attention_mask=attention_mask,
        images=image_tensor,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0,
        max_new_tokens=512,
    )

# Decode response
input_token_len = input_ids.shape[1]
response = tokenizer.batch_decode(outputs[:, input_token_len:], skip_special_tokens=True)[0]
print("Response:", response)
```
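
The prompt above asks the model to conclude with a line of the form `Answer: xxx`. If you want to pull that final answer out of `response` programmatically, a small helper like the following can be used. This is an illustrative sketch; the function name and regex are our own, not part of the LLaVA API:

```python
import re

def extract_final_answer(response):
    # Hypothetical helper: the prompt asks the model to conclude with "Answer: xxx",
    # so take the text after the last "Answer:" marker; return None if it is absent.
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None

print("Final answer:", extract_final_answer(response))
```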
|
|
|
|
|
|
|
|
|
|
|
# Citation

```
@article{visualwebinstruct,
  title={VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search},
  author={Jia, Yiming and Li, Jiachen and Yue, Xiang and Li, Bo and Nie, Ping and Zou, Kai and Chen, Wenhu},
  journal={arXiv preprint arXiv:2503.10582},
  year={2025}
}
```