batch inference
Hello, I'd like to ask: how can I run inference in batches?
Since ViltProcessor can't encode texts longer than 40 tokens, I cut each encoding off at 40 tokens (if I don't, ViltForImageAndTextRetrieval does not work!).
But some of the processed texts are shorter than 40 tokens (there is no padding), so I can't stack them into a single batch.
Is there any way to solve this problem? Thanks!
# Truncate the text to 40 tokens, moving the final token ([SEP]) into the last kept slot
import torch
encoding = processor(image, text, return_tensors="pt")
for key in ("input_ids", "token_type_ids", "attention_mask"):
    encoding[key][0, 39] = encoding[key][0, -1]
    encoding[key] = encoding[key][:, :40]
# Append the example to the current batch
cur_batch_data = {x: torch.concat([y, encoding[x]]) for x, y in cur_batch_data.items()}
If this problem can't be solved, I'll have to evaluate ViLT on the mAP metric with a batch size of 1. To be honest, this is very, very slow. Can anyone help me?
You can simply use BertTokenizerFast and ViltImageProcessor to encode the texts and images separately, which gives you all the benefits of batch encoding and lets you set the parameters yourself.
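A minimal sketch of that suggestion, not a tested recipe for this checkpoint. The helper names load_processors and encode_batch are made up for illustration, and it assumes the checkpoint repository ships both tokenizer and image-processor files. The tokenizer pads and truncates every text to the same length (40 here, matching the limit mentioned in the question), so the encodings stack into one batch without any manual slicing.

```python
def load_processors(checkpoint="dandelin/vilt-b32-finetuned-coco"):
    """Load the tokenizer and image processor separately (hypothetical helper).

    Assumes the checkpoint repository contains both tokenizer and
    image-processor configuration files.
    """
    # Imported lazily so encode_batch can be used without transformers installed.
    from transformers import BertTokenizerFast, ViltImageProcessor

    tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
    image_processor = ViltImageProcessor.from_pretrained(checkpoint)
    return tokenizer, image_processor


def encode_batch(tokenizer, image_processor, images, texts, max_length=40):
    """Encode equally long lists of images and texts into one batch (hypothetical helper)."""
    # Pad short texts and truncate long ones so every row has max_length tokens.
    text_inputs = tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    # The image processor batches the images on its own.
    image_inputs = image_processor(images, return_tensors="pt")
    # Merge both encodings into the keyword arguments the model expects.
    return {**text_inputs, **image_inputs}


# Usage sketch (assumes `model` is the ViltForImageAndTextRetrieval instance
# and `images` / `texts` are lists of the same length):
#
#   tokenizer, image_processor = load_processors()
#   batch = encode_batch(tokenizer, image_processor, images, texts)
#   with torch.no_grad():
#       scores = model(**batch).logits
```

With padding handled by the tokenizer, the attention mask marks the padded positions, so short and long captions can share a batch.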