ClipCap: CLIP Prefix for Image Captioning
Paper: arXiv:2111.09734
This is an implementation of the ClipCap model — a captioning system that connects CLIP vision features to a GPT-2 language model via a learnable prefix.
The provided checkpoint (coco_prefix_best_200k.pt) was trained on 203,914 samples from the Conceptual Captions dataset using prefix tuning.
To use this model, define the ClipCapModel architecture as described in main.py and load the checkpoint into your model instance (a sketch is shown below). You will also need to compute the CLIP embedding of each image to use as input.
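For reference, the following is a minimal sketch of an MLP-based ClipCap architecture in PyTorch. The prefix length, CLIP embedding dimension, and mapping-network layout here are illustrative assumptions; the exact definitions used for this checkpoint are the ones in main.py.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class ClipCapModel(nn.Module):
    """CLIP image embedding -> learnable GPT-2 prefix -> caption logits."""

    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
        gpt_dim = self.gpt.config.n_embd  # 768 for base GPT-2
        # MLP mapping network (sizes are assumptions; ClipCap also describes
        # a transformer mapper variant).
        self.clip_project = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_length // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_length // 2, gpt_dim * prefix_length),
        )

    def forward(self, tokens, clip_embed, mask=None):
        # Map the CLIP embedding to `prefix_length` pseudo-token embeddings.
        prefix = self.clip_project(clip_embed).view(
            -1, self.prefix_length, self.gpt.config.n_embd
        )
        # Prepend the prefix to the embedded caption tokens and run GPT-2.
        token_embeds = self.gpt.transformer.wte(tokens)
        inputs_embeds = torch.cat((prefix, token_embeds), dim=1)
        return self.gpt(inputs_embeds=inputs_embeds, attention_mask=mask)
```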
Refer to the original ClipCap repository for preprocessing and full inference pipeline details.
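As an illustration only, here is a hedged end-to-end sketch covering checkpoint loading, CLIP image encoding, and greedy decoding. The image path, the ViT-B/32 CLIP backbone, and the prefix length are assumptions; match them to the settings in main.py and the original repository.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from transformers import GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. CLIP image embedding (ViT-B/32 is an assumption; use the encoder the
#    checkpoint was actually trained with).
clip_model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embed = clip_model.encode_image(image).float()

# 2. Load the ClipCap checkpoint (strict=False in case the file stores only
#    the mapping-network weights).
model = ClipCapModel(prefix_length=10).to(device)
state = torch.load("coco_prefix_best_200k.pt", map_location=device)
model.load_state_dict(state, strict=False)
model.eval()

# 3. Greedy decoding conditioned on the projected prefix.
tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
with torch.no_grad():
    generated = model.clip_project(clip_embed).view(1, model.prefix_length, -1)
    tokens = []
    for _ in range(30):
        logits = model.gpt(inputs_embeds=generated).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        tokens.append(next_token.item())
        generated = torch.cat(
            (generated, model.gpt.transformer.wte(next_token)), dim=1
        )

print(tokenizer.decode(tokens))
```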
Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv preprint arXiv:2111.09734.
Base model: openai-community/gpt2