---
language:
- en
base_model:
- Salesforce/blip-image-captioning-base
pipeline_tag: image-to-text
tags:
- blip
- icon-description
- image-captioning
license: mit
library_name: transformers
---
# 🧠 BLIP – UI Elements Captioning
This model is a fine-tuned version of [`Salesforce/blip-image-captioning-base`](https://huggingface.co/Salesforce/blip-image-captioning-base), adapted for **captioning UI elements** from macOS application screenshots.
It is part of the **Screen2AX** research project focused on improving accessibility using vision-based deep learning.
---
## 🎯 Use Case
The model takes an image of a **UI icon or element** and generates a **natural language description** (e.g., `"Settings icon"`, `"Play button"`, `"Search field"`).
This helps build assistive technologies such as screen readers by providing textual labels for unlabeled visual components.
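As a rough illustration of that labeling step, the sketch below fills in missing labels for a list of detected elements. `UIElement`, `fill_missing_labels`, and `caption_fn` are hypothetical names introduced here, not part of the model or of Screen2AX; `caption_fn` stands in for whatever function crops the element and runs the captioning model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class UIElement:
    """Hypothetical container for one detected UI element."""
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    label: Optional[str] = None      # existing accessibility label, if any


def fill_missing_labels(
    elements: List[UIElement],
    caption_fn: Callable[[UIElement], str],
) -> int:
    """Assign a generated caption to every element without a label.

    `caption_fn` is an assumption: any callable that returns a caption
    for an element, for example a wrapper around the BLIP model.
    Returns the number of elements that received a new label.
    """
    labeled = 0
    for element in elements:
        if not element.label:  # covers both None and empty strings
            element.label = caption_fn(element)
            labeled += 1
    return labeled
```

Elements that already carry a label are left untouched, so existing accessibility metadata is not overwritten by generated text.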
---
## Model Architecture
- Base model: [`Salesforce/blip-image-captioning-base`](https://huggingface.co/Salesforce/blip-image-captioning-base)
- Architecture: **BLIP** (Bootstrapping Language-Image Pre-training)
- Task: `image-to-text`
---
## 🖼️ Example
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the fine-tuned processor and model from the Hugging Face Hub
processor = BlipProcessor.from_pretrained("MacPaw/blip-icon-captioning")
model = BlipForConditionalGeneration.from_pretrained("MacPaw/blip-icon-captioning")

# Open a cropped UI element; convert to RGB so RGBA or palette PNGs are handled
image = Image.open("path/to/ui_icon.png").convert("RGB")

# Preprocess the image and generate a caption
inputs = processor(images=image, return_tensors="pt")
output = model.generate(**inputs)
caption = processor.decode(output[0], skip_special_tokens=True)
print(caption)
# Example: "Settings icon"
```
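A screenshot usually contains many elements, so captioning them in batches can be more convenient than one call per image. The following is a minimal sketch, assuming `processor` and `model` are loaded as in the example above and `images` is a list of RGB `PIL.Image` objects. The helper names `chunks` and `caption_batch`, and the default `batch_size` and `max_new_tokens` values, are illustrative choices rather than settings recommended by the authors.

```python
from typing import Any, Iterator, List, Sequence


def chunks(items: Sequence[Any], size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of `items` with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def caption_batch(
    images: Sequence[Any],
    processor: Any,
    model: Any,
    batch_size: int = 8,       # illustrative default, tune for your hardware
    max_new_tokens: int = 20,  # illustrative cap on caption length
) -> List[str]:
    """Caption a list of images in fixed-size batches.

    `processor` and `model` are expected to behave like the BlipProcessor
    and BlipForConditionalGeneration objects from the example above.
    """
    captions: List[str] = []
    for group in chunks(images, batch_size):
        # The processor accepts a list of images and stacks them into one tensor
        inputs = processor(images=list(group), return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # batch_decode returns one string per generated sequence
        captions.extend(processor.batch_decode(output, skip_special_tokens=True))
    return captions
```

Passing the processor and model in as arguments keeps the helper independent of how they were loaded, which also makes it easy to exercise with stand-in objects.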
---
## License
This model is released under the **MIT License**.
---
## Related Projects
- [Screen2AX Project](https://github.com/MacPaw/Screen2AX)
- [Screen2AX HuggingFace Collection](https://huggingface.co/collections/MacPaw/screen2ax-687dfe564d50f163020378b8)
---
## ✍️ Citation
If you use this model in your research, please cite the Screen2AX paper:
```bibtex
@misc{muryn2025screen2axvisionbasedapproachautomatic,
      title={Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation}, 
      author={Viktor Muryn and Marta Sumyk and Mariya Hirna and Sofiya Garkot and Maksym Shamrai},
      year={2025},
      eprint={2507.16704},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.16704}, 
}
```
---
## MacPaw Research
Learn more at [https://research.macpaw.com](https://research.macpaw.com)
