# 📐 Math2Visual: Visual Language Generation Model

This is the official model for generating **structured visual language** representations from math word problems, as proposed in:

📄 **[ACL 2025 Findings Paper: Math2Visual](https://arxiv.org/abs/2506.03735)**  

🎥 **[Project Video](https://youtu.be/jdPYVoHEPtk)**  

📘 **[Annotated Visual Language and Visual Dataset](https://huggingface.co/datasets/junling24/Math2Visual-Generating_Pedagogically_Meaningful_Visuals_for_Math_Word_Problems)**  

💻 **[GitHub Codebase](https://github.com/eth-lre/math2visual/tree/main)**

---

## ✨ Model Summary

This model takes a math word problem (MWP) and its equation (formula) as input and outputs a **visual language** string, which is then used to generate pedagogically meaningful visuals. The output follows a fixed, teacher-informed structure that describes the key mathematical relationships between entities, containers, and operations.
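Because the model conditions on both the problem text and its formula, it can help to sanity-check the formula before building a prompt. The sketch below is a hypothetical pre-processing helper (`check_formula` is not part of the released code). It assumes formulas of the simple form `<expression> = <number>` that use only `+`, `-`, `*`, `/`, and parentheses, as in `9 + 7 = 16`.

```python
import ast
import operator

# Hypothetical helper, not part of the Math2Visual codebase.
# Maps supported AST operator nodes to their arithmetic functions.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _evaluate(node):
    """Evaluate a parsed expression made of numbers, + - * /, and unary minus."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and _is_number(node.value):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported element in formula")


def check_formula(formula, tol=1e-9):
    """Return True if the left side of '<expr> = <number>' equals the right side."""
    left, sep, right = formula.partition("=")
    if not sep:
        raise ValueError("formula must contain '='")
    lhs = _evaluate(ast.parse(left.strip(), mode="eval"))
    rhs = _evaluate(ast.parse(right.strip(), mode="eval"))
    return abs(lhs - rhs) <= tol


print(check_formula("9 + 7 = 16"))  # True
print(check_formula("9 + 7 = 15"))  # False
```

Parsing with `ast` instead of calling `eval` keeps the check restricted to plain arithmetic, so a malformed or unexpected formula raises an error rather than executing arbitrary code.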

It is built by fine-tuning `meta-llama/Llama-3.1-8B` with LoRA using [PEFT](https://github.com/huggingface/peft), with 4-bit quantization (BitsAndBytes) to reduce memory use. The code for generating visuals from the visual language is available in our **[GitHub repository](https://github.com/eth-lre/math2visual/tree/main)**.
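As a rough back-of-the-envelope illustration of why 4-bit quantization matters for a model of this size, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count is a round-number assumption for an "8B" model. The estimate ignores quantization constants, activations, the KV cache, and the LoRA adapter, so actual usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8e9  # assumption: round figure for an "8B" model

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```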


---

## 🧠 Example Use

### 🔧 Install dependencies

```bash
pip install torch==2.5.1+cu121 torchvision==0.20.1+cu121 torchaudio==2.5.1+cu121 \
  bitsandbytes==0.45.0 inflect==7.3.1 lxml==5.3.0 ipython==8.25.0 python-dotenv==1.0.1 \
  git+https://github.com/huggingface/transformers.git@5fa35344755d8d9c29610b57d175efd03776ae9e \
  git+https://github.com/huggingface/peft.git@aa3f41f7529ed078e9225b2fc1edbb8c71f58f99
```

💡 Use `-f https://download.pytorch.org/whl/torch_stable.html` for CUDA wheels if needed.

---

### 🚀 Run Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# 4-bit NF4 quantization settings used to load the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model_id = "meta-llama/Llama-3.1-8B"
adapter_dir = "junling24/Math2Visual-Visual_Language_Generation"

# Load the quantized base model, then attach the LoRA adapter
base = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
model = PeftModel.from_pretrained(base, adapter_dir)
model.eval()
model.config.use_cache = True

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
    trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token

# device_map="auto" has already placed the model, so reuse its device for the
# inputs rather than moving the quantized model again
device = next(model.parameters()).device

# Prompt (the instruction text is abbreviated here with '...')
def create_prompt(mwp, formula=None):
    return (
        '''You are an expert at converting math story problem into a structured 'visual language'...'''
        f"Question: {mwp}\n"
        f"Formula: {formula}\n"
        "Answer in visual language:"
    )

mwp = "Janet has nine oranges, and Sharon has seven oranges. How many oranges do Janet and Sharon have together?"
formula = "9 + 7 = 16"
prompt = create_prompt(mwp, formula)

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048, padding="max_length").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.7,
        repetition_penalty=1.15
    )

# Decode only the newly generated tokens; slicing the decoded text by
# len(prompt) can misalign if tokenization does not round-trip exactly
prompt_len = inputs["input_ids"].shape[1]
visual_language = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
print("Generated Visual Language:\n", visual_language)
```

---

## 📄 Citation

```bibtex
@inproceedings{wang2025math2visual,
  title={Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models},
  author={Wang, Junling and Rutkiewicz, Anna and Wang, April Yi and Sachan, Mrinmaya},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025},
  url={https://arxiv.org/abs/2506.03735}
}
```

---

## 📬 License & Contact

This work is licensed under a
[Creative Commons Attribution-ShareAlike 4.0 International License][cc-by-sa].

For research inquiries, please contact:
📧 Junling Wang: wangjun [at] ethz [dot] ch

[cc-by-sa]: http://creativecommons.org/licenses/by-sa/4.0/