---
language:
- en
base_model:
- Salesforce/blip-image-captioning-base
pipeline_tag: image-to-text
tags:
- blip
- icon-description
- image-captioning
license: mit
library_name: transformers
---
# 🧠 BLIP – UI Elements Captioning

This model is a fine-tuned version of [`Salesforce/blip-image-captioning-base`](https://huggingface.co/Salesforce/blip-image-captioning-base), adapted for **captioning UI elements** from macOS application screenshots.

It is part of the **Screen2AX** research project focused on improving accessibility using vision-based deep learning.

---

## 🎯 Use Case

The model takes an image of a **UI icon or element** and generates a **natural language description** (e.g., `"Settings icon"`, `"Play button"`, `"Search field"`).

This helps build assistive technologies such as screen readers by providing textual labels for unlabeled visual components.
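
For quick experiments, the model can also be driven through the high-level `transformers` pipeline API instead of the explicit processor/model calls shown below. A minimal sketch (the image path is a placeholder):

```python
from transformers import pipeline

# Wrap the fine-tuned checkpoint in an image-to-text pipeline
captioner = pipeline("image-to-text", model="MacPaw/blip-icon-captioning")

# Accepts a file path, URL, or PIL image
result = captioner("path/to/ui_icon.png")
print(result[0]["generated_text"])
# Example: "Settings icon"
```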

---

## πŸ— Model Architecture

- Base model: [`Salesforce/blip-image-captioning-base`](https://huggingface.co/Salesforce/blip-image-captioning-base)
- Architecture: **BLIP** (Bootstrapping Language-Image Pre-training)
- Task: `image-to-text`

---

## πŸ–Ό Example

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained("MacPaw/blip-icon-captioning")
model = BlipForConditionalGeneration.from_pretrained("MacPaw/blip-icon-captioning")

# Load the icon and convert it to RGB, as expected by the BLIP processor
image = Image.open("path/to/ui_icon.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
output = model.generate(**inputs)
caption = processor.decode(output[0], skip_special_tokens=True)

print(caption)
# Example: "Settings icon"
```
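
When labeling many elements at once (e.g., every icon cropped from a single screenshot), batching the inputs through the processor is usually faster than captioning them one by one. A minimal sketch, assuming a hypothetical `path/to/icons` directory of PNG crops:

```python
from pathlib import Path

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("MacPaw/blip-icon-captioning")
model = BlipForConditionalGeneration.from_pretrained("MacPaw/blip-icon-captioning")

# Collect all icon crops from a directory (placeholder path) and batch them
icon_paths = sorted(Path("path/to/icons").glob("*.png"))
images = [Image.open(p).convert("RGB") for p in icon_paths]

inputs = processor(images=images, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
captions = processor.batch_decode(outputs, skip_special_tokens=True)

for path, caption in zip(icon_paths, captions):
    print(f"{path.name}: {caption}")
```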

---

## πŸ“œ License

This model is released under the **MIT License**.

---

## πŸ”— Related Projects

- [Screen2AX Project](https://github.com/MacPaw/Screen2AX)
- [Screen2AX HuggingFace Collection](https://huggingface.co/collections/MacPaw/screen2ax-687dfe564d50f163020378b8)

---

## ✍️ Citation

If you use this model in your research, please cite the Screen2AX paper:

```bibtex
@misc{muryn2025screen2axvisionbasedapproachautomatic,
      title={Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation}, 
      author={Viktor Muryn and Marta Sumyk and Mariya Hirna and Sofiya Garkot and Maksym Shamrai},
      year={2025},
      eprint={2507.16704},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.16704}, 
}
```

---

## 🌐 MacPaw Research

Learn more at [https://research.macpaw.com](https://research.macpaw.com)