---
license: cc-by-4.0
---

# Math2Visual: Visual Language Generation Model

This is the official model for generating **structured visual language** representations from math word problems, as proposed in:

**[ACL 2025 Findings Paper: Math2Visual](https://aclanthology.org/2025.findings-acl.586/)**

**[Project Video](https://youtu.be/jdPYVoHEPtk)**

**[Annotated Visual Language and Visual Dataset](https://huggingface.co/datasets/junling24/Math2Visual-Generating_Pedagogically_Meaningful_Visuals_for_Math_Word_Problems)**

**[GitHub Codebase](https://github.com/eth-lre/math2visual/tree/main)**

---

## Model Summary

This model takes a math word problem (MWP) and its equation (formula) as input and outputs a **visual language** string, which is then used to generate pedagogically meaningful visuals. The output follows a fixed, teacher-informed structure that describes the key mathematical relationships between entities, containers, and operations.
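
Because the formula is part of the model input, it can be worth checking that it is arithmetically consistent before building a prompt. The following is a minimal sketch, assuming purely numeric formulas of the form `<expression> = <expression>` such as `9 + 7 = 16`; `formula_is_consistent` is a hypothetical helper and is not part of the Math2Visual codebase.

```python
import ast
import operator

# Binary operators allowed in a formula; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node):
    """Evaluate a parsed arithmetic expression without calling eval()."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression in formula")


def formula_is_consistent(formula, tol=1e-9):
    """Return True if both sides of '<expr> = <expr>' evaluate to the same number."""
    left, sep, right = formula.partition("=")
    if not sep:
        raise ValueError("formula must contain '='")
    left_value = _evaluate(ast.parse(left.strip(), mode="eval").body)
    right_value = _evaluate(ast.parse(right.strip(), mode="eval").body)
    return abs(left_value - right_value) <= tol


print(formula_is_consistent("9 + 7 = 16"))  # True
print(formula_is_consistent("9 + 7 = 15"))  # False
```

Formulas that contain variable names or other non-numeric tokens raise `ValueError` in this sketch, so it only suits fully numeric equations.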

The model is built by fine-tuning `meta-llama/Llama-3.1-8B` with LoRA using [PEFT](https://github.com/huggingface/peft) and uses 4-bit quantization (BitsAndBytes). The code for generating visuals from the visual language is available in our **[GitHub repository](https://github.com/eth-lre/math2visual/tree/main)**.
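
As a rough illustration of what 4-bit loading saves, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a round figure of 8 billion parameters and ignores layers kept in higher precision, quantization constants, activations, and the KV cache, so the real footprint will be higher; `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


N_PARAMS = 8e9  # assumed round figure for an 8B-parameter model

for label, bits in [("bf16", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions, the 4-bit weights take about a quarter of the bf16 size.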

---

## Example Use

### Install dependencies

```bash
pip install torch==2.5.1+cu121 torchvision==0.20.1+cu121 torchaudio==2.5.1+cu121 \
    bitsandbytes==0.45.0 inflect==7.3.1 lxml==5.3.0 ipython==8.25.0 python-dotenv==1.0.1 \
    git+https://github.com/huggingface/transformers.git@5fa35344755d8d9c29610b57d175efd03776ae9e \
    git+https://github.com/huggingface/peft.git@aa3f41f7529ed078e9225b2fc1edbb8c71f58f99
```

If pip cannot find the `+cu121` CUDA wheels, add `-f https://download.pytorch.org/whl/torch_stable.html` to the command.

---

### Run Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Load model: 4-bit NF4 quantized base model plus the LoRA adapter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model_id = "meta-llama/Llama-3.1-8B"
adapter_dir = "junling24/Math2Visual-Visual_Language_Generation"

base = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
model = PeftModel.from_pretrained(base, adapter_dir)
model.eval()
model.config.use_cache = True

# The tokenizer comes from the base model; padding is applied on the left for generation
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
    trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Prompt
def create_prompt(mwp, formula=None):
    return (
        '''You are an expert at converting math story problem into a structured 'visual language'...'''
        f"Question: {mwp}\n"
        f"Formula: {formula}\n"
        "Answer in visual language:"
    )

mwp = "Janet has nine oranges, and Sharon has seven oranges. How many oranges do Janet and Sharon have together?"
formula = "9 + 7 = 16"
prompt = create_prompt(mwp, formula)

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=2048,
    padding="max_length"
).to(device)

# Sample a completion without tracking gradients
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.7,
        repetition_penalty=1.15
    )

# Decode and drop the echoed prompt so only the generated visual language remains
visual_language = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):].strip()
print("Generated Visual Language:\n", visual_language)
```
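
The example above recovers the generated text by slicing the decoded string at `len(prompt)`, which relies on decoding reproducing the prompt character for character. An alternative is to drop the prompt at the token level before decoding, since `generate` on a causal language model returns the input ids followed by the new tokens. The sketch below shows the idea on a plain list of toy ids; `strip_prompt_tokens` is a hypothetical helper, not part of the Math2Visual codebase.

```python
def strip_prompt_tokens(output_ids, prompt_len):
    """Return only the ids that follow the first `prompt_len` positions."""
    if prompt_len < 0 or prompt_len > len(output_ids):
        raise ValueError("prompt_len must be between 0 and len(output_ids)")
    return output_ids[prompt_len:]


# Toy ids standing in for `outputs[0]`:
# two left-padding ids, three prompt ids, then two newly generated ids.
toy_output = [0, 0, 11, 12, 13, 21, 22]
new_ids = strip_prompt_tokens(toy_output, prompt_len=5)
print(new_ids)  # [21, 22]
```

With the objects from the inference example, the same helper could be applied as `tokenizer.decode(strip_prompt_tokens(outputs[0], inputs["input_ids"].shape[1]), skip_special_tokens=True).strip()`, where the prompt length is the padded input length.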

---

## Citation

```bibtex
@inproceedings{wang2025math2visual,
  title={Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models},
  author={Wang, Junling and Rutkiewicz, Anna and Wang, April Yi and Sachan, Mrinmaya},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025},
  url={https://arxiv.org/abs/2506.03735}
}
```

---

## Contact

For research inquiries, please contact Junling Wang: wangjun [at] ethz [dot] ch