🧠 BLIP - UI Elements Captioning

This model is a fine-tuned version of Salesforce/blip-image-captioning-base, adapted for captioning UI elements from macOS application screenshots.

It is part of the Screen2AX research project, which focuses on improving macOS accessibility using vision-based deep learning.


🎯 Use Case

The model takes an image of a UI icon or element and generates a natural language description (e.g., "Settings icon", "Play button", "Search field").

This helps build assistive technologies such as screen readers by providing textual labels for unlabeled visual components.
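
For example, a whole set of cropped UI elements can be captioned in one batch. The sketch below is illustrative rather than part of the original card: the ui_crops/ folder and file pattern are hypothetical stand-ins for crops produced by an upstream element detector; only the model name comes from this card.

from pathlib import Path

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("MacPaw/blip-icon-captioning")
model = BlipForConditionalGeneration.from_pretrained("MacPaw/blip-icon-captioning")

# Hypothetical folder of cropped UI elements (e.g. from a detector); adjust the path.
paths = sorted(Path("ui_crops").glob("*.png"))
crops = [Image.open(p).convert("RGB") for p in paths]

# The processor stacks a list of images into a single batch of pixel values.
inputs = processor(images=crops, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
captions = processor.batch_decode(outputs, skip_special_tokens=True)

for p, caption in zip(paths, captions):
    print(f"{p.name}: {caption}")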


πŸ— Model Architecture

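
As a quick check of the above (not part of the original card), the parameter count can be read from the loaded model:

from transformers import BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained("MacPaw/blip-icon-captioning")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # expected to be roughly 247M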

🖼 Example

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the fine-tuned processor and model from the Hugging Face Hub.
processor = BlipProcessor.from_pretrained("MacPaw/blip-icon-captioning")
model = BlipForConditionalGeneration.from_pretrained("MacPaw/blip-icon-captioning")

# Open a cropped UI element (icon, button, field, ...) and preprocess it.
image = Image.open("path/to/ui_icon.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Generate and decode the caption.
output = model.generate(**inputs)
caption = processor.decode(output[0], skip_special_tokens=True)

print(caption)
# Example output: "Settings icon"
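
By default, generate() decodes greedily. The settings below are not prescribed by this card, but as one option, beam search with a cap on new tokens can make captions a little more stable:

output = model.generate(**inputs, num_beams=3, max_new_tokens=20)
caption = processor.decode(output[0], skip_special_tokens=True)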

📜 License

This model is released under the MIT License.


🔗 Related Projects

Screen2AX paper: https://arxiv.org/abs/2507.16704

✍️ Citation

If you use this model in your research, please cite the Screen2AX paper:

@misc{muryn2025screen2axvisionbasedapproachautomatic,
      title={Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation}, 
      author={Viktor Muryn and Marta Sumyk and Mariya Hirna and Sofiya Garkot and Maksym Shamrai},
      year={2025},
      eprint={2507.16704},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.16704}, 
}

🌐 MacPaw Research

Learn more at https://research.macpaw.com
