ClipCap: CLIP Prefix for Image Captioning
Paper: arXiv:2111.09734
This is an implementation of the ClipCap model — a captioning system that connects CLIP vision features to a GPT-2 language model via a learnable prefix.
The provided checkpoint (coco_prefix_best_200k.pt) was trained on 203,914 samples from the Conceptual Captions dataset using prefix tuning.
To use this model, define the ClipCapModel architecture as described in main.py and load the checkpoint into your model instance (a sketch is shown below). You will also need to compute the CLIP embedding of each image to use as input.
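For reference, the following is a minimal sketch of an MLP-based ClipCap architecture in PyTorch. The prefix length, CLIP embedding dimension, and mapping-network layout here are illustrative assumptions; the exact definitions used for this checkpoint are the ones in main.py.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class ClipCapModel(nn.Module):
    """CLIP image embedding -> learnable GPT-2 prefix -> caption logits."""

    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
        gpt_dim = self.gpt.config.n_embd  # 768 for base GPT-2
        # MLP mapping network (sizes are assumptions; ClipCap also describes
        # a transformer mapper variant).
        self.clip_project = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_length // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_length // 2, gpt_dim * prefix_length),
        )

    def forward(self, tokens, clip_embed, mask=None):
        # Map the CLIP embedding to `prefix_length` pseudo-token embeddings.
        prefix = self.clip_project(clip_embed).view(
            -1, self.prefix_length, self.gpt.config.n_embd
        )
        # Prepend the prefix to the embedded caption tokens and run GPT-2.
        token_embeds = self.gpt.transformer.wte(tokens)
        inputs_embeds = torch.cat((prefix, token_embeds), dim=1)
        return self.gpt(inputs_embeds=inputs_embeds, attention_mask=mask)
```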
Refer to the original ClipCap repository for preprocessing and full inference pipeline details.
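As an illustration only, here is a hedged end-to-end sketch covering checkpoint loading, CLIP image encoding, and greedy decoding. The image path, the ViT-B/32 CLIP backbone, and the prefix length are assumptions; match them to the settings in main.py and the original repository.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from transformers import GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. CLIP image embedding (ViT-B/32 is an assumption; use the encoder the
#    checkpoint was actually trained with).
clip_model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embed = clip_model.encode_image(image).float()

# 2. Load the ClipCap checkpoint (strict=False in case the file stores only
#    the mapping-network weights).
model = ClipCapModel(prefix_length=10).to(device)
state = torch.load("coco_prefix_best_200k.pt", map_location=device)
model.load_state_dict(state, strict=False)
model.eval()

# 3. Greedy decoding conditioned on the projected prefix.
tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
with torch.no_grad():
    generated = model.clip_project(clip_embed).view(1, model.prefix_length, -1)
    tokens = []
    for _ in range(30):
        logits = model.gpt(inputs_embeds=generated).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        tokens.append(next_token.item())
        generated = torch.cat(
            (generated, model.gpt.transformer.wte(next_token)), dim=1
        )

print(tokenizer.decode(tokens))
```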
Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv preprint arXiv:2111.09734.
Base model: openai-community/gpt2