Persian Image Captioning (PIC) Model

Intended Use

  • Primary Use Cases: Generating detailed Persian captions for images, particularly in contexts requiring cultural and linguistic accuracy. It serves as the captioning component of the PTIR text-image retrieval framework (a retrieval sketch follows this list), enabling applications in medical imaging, cultural heritage, and other domain-specific scenarios.
  • Out-of-Scope Uses: Not intended for non-Persian languages, for real-time use without further optimization, or for tasks beyond image captioning such as object detection or image generation.
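
Because the card positions this model as the captioning stage of PTIR, a minimal retrieval sketch may help clarify the integration. Everything past the captioning step is an assumption: the multilingual sentence encoder and plain cosine similarity below merely stand in for the scalable vector search described in the paper.

# Hedged sketch: caption-based text-image retrieval.
# The embedding model and cosine search are illustrative, not from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed encoder

# Captions produced offline by this model, one per indexed image.
image_captions = [
    "یک گربه روی مبل",   # "a cat on a sofa"
    "دو کودک در پارک",   # "two children in a park"
]
caption_vecs = embedder.encode(image_captions, normalize_embeddings=True)

def search(query, top_k=1):
    # Normalized vectors make the dot product a cosine similarity.
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = caption_vecs @ query_vec
    return np.argsort(-scores)[:top_k]  # indices of best-matching images

print(search("گربه"))  # "cat" -> should rank the first caption highest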

Training Data

The model was trained on a custom dataset of approximately 1.2 million Persian image-caption pairs. This dataset was aggregated from diverse sources, with captions generated using advanced Vision-Language Models and refined for cultural and linguistic accuracy. Captions include detailed descriptions of object counts, shapes, colors, environmental contexts, age groups, and animal breeds.

Evaluation was performed on the COCO-PIC validation dataset, available on the Hugging Face Hub, which pairs images from the COCO dataset with Persian captions.
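
A loading sketch for that split is below; note that the repository id is a hypothetical placeholder, since the card does not state the exact id — check the Hugging Face Hub for the real one.

# Hypothetical sketch: loading the COCO-PIC validation split.
# "rasoulasadianub/COCO-PIC" is a PLACEHOLDER id, not confirmed by this card.
from datasets import load_dataset

coco_pic = load_dataset("rasoulasadianub/COCO-PIC", split="validation")
print(coco_pic[0])  # expected: an image plus its Persian caption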

Evaluation

  • Metrics: Evaluated with BLEU, ROUGE, and CIDEr for caption quality, and Hit@K for retrieval integration (a computation sketch follows this list).
  • Results: Outperforms baselines in caption quality, with notably more detailed descriptions. In retrieval, PTIR (built on this model) reaches Hit@1 of 22% and Hit@200 of 80%.
  • Comparisons: More accurate and efficient than Persian baselines and CLIP-based models.
  • Dataset: Tested on subsets of the training data and on the COCO-PIC validation set.
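
Hit@K counts a query as a hit when the ground-truth image appears among the top K retrieved results. A minimal computation sketch (names and toy data are illustrative, not from the paper's code):

import numpy as np

def hit_at_k(ranked_ids, true_id, k):
    # 1 if the ground-truth image is among the top-k retrieved ids, else 0.
    return int(true_id in ranked_ids[:k])

# Toy rankings for three queries with ground-truth image ids 0, 1, 2.
rankings = [np.array([0, 5, 9]), np.array([3, 1, 7]), np.array([4, 6, 2])]
truths = [0, 1, 2]
for k in (1, 3):
    score = np.mean([hit_at_k(r, t, k) for r, t in zip(rankings, truths)])
    print(f"Hit@{k}: {score:.2f}")  # Hit@1: 0.33, Hit@3: 1.00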

Usage

To use the model, install the required libraries:

pip install transformers torch datasets arabic-reshaper python-bidi pillow matplotlib

Load and generate captions in Python:

import torch
from transformers import VisionEncoderDecoderModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import arabic_reshaper
from bidi.algorithm import get_display
import matplotlib.pyplot as plt

# Load the pretrained captioning model and its paired tokenizer/processor.
model_name = "rasoulasadianub/Persian-Image-Captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # reuse EOS as the padding token
image_processor = AutoImageProcessor.from_pretrained(model_name)

# Run on GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # disable dropout for deterministic inference

def generate_caption(image_path):
    # Preprocess the image into the tensor format the encoder expects.
    image = Image.open(image_path).convert('RGB')
    pixel_values = image_processor(image, return_tensors="pt").pixel_values.to(device)
    # Autoregressively decode a caption without tracking gradients.
    with torch.no_grad():
        output_ids = model.generate(pixel_values)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def visualize_caption(image_path, caption):
    image = Image.open(image_path).convert('RGB')
    # Persian is right-to-left: reshape the glyphs and reorder them for display.
    reshaped_caption = arabic_reshaper.reshape(caption)
    bidi_text = get_display(reshaped_caption)
    plt.imshow(image)
    plt.axis("off")
    plt.title(bidi_text)
    plt.show()

# Example
image_path = "path/to/your/image.jpg"
caption = generate_caption(image_path)
visualize_caption(image_path, caption)
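
generate() above uses the model's default decoding settings. For longer or higher-quality captions you can pass explicit generation arguments; the values below are illustrative, not tuned by the authors:

# Hedged sketch: explicit decoding parameters (values are illustrative).
image = Image.open(image_path).convert('RGB')
pixel_values = image_processor(image, return_tensors="pt").pixel_values.to(device)
with torch.no_grad():
    output_ids = model.generate(
        pixel_values,
        max_length=64,        # cap caption length in tokens
        num_beams=4,          # beam search instead of greedy decoding
        early_stopping=True,  # stop once all beams emit EOS
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))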

Limitations and Biases

  • Limitations: Optimized primarily for Persian; performance may degrade on content outside Persian linguistic and cultural contexts or on highly specialized imagery (e.g., abstract art). Quality depends on the training dataset, which may not cover all cultural nuances.
  • Biases: Potential biases inherited from source datasets (e.g., COCO-derived data), including underrepresentation of certain demographics or regions. Captions were refined for cultural accuracy, but users should evaluate fairness in their specific applications.

Citation

If you use this model, please cite the original paper:

@inproceedings{asadian2025pic,
  author    = {Asadian, Rasoul and Akhavanpour, Alireza},
  title     = {Persian Text-Image Retrieval: A Framework Based on Image Captioning and Scalable Vector Search},
  booktitle = {IEEE CSICC},
  year      = {2025},
  doi       = {10.1109/CSICC65765.2025.10967407},
  url       = {https://ieeexplore.ieee.org/document/10967407}
}

Additional Information

  • Format: Safetensors
  • Model size: 0.2B params
  • Tensor type: F32