This is a NanoVLM (Nano Vision-Language Model) fine-tuned on a mixed dataset (COCO Captions + VQAv2) using Modal.com's cloud infrastructure.
from models.vision_language_model import VisionLanguageModel
from PIL import Image
import requests
# Load the model
model = VisionLanguageModel.from_pretrained("pgryko/nanovlm-COCO-VQAv2")
# Load an image
url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Generate a response
response = model.generate(
    image=image,
    prompt="What do you see in this image?",
    max_length=50,
)
print(response)
This model was trained on Modal.com's serverless GPU infrastructure.
To reproduce this training:
# Using the integrated Modal approach
python modal/submit_modal_training.py \
    --build_dataset \
    --dataset_type mixed \
    --dataset_limit 5000 \
    --batch_size 8 \
    --max_training_steps 500 \
    --compile \
    --push_to_hub \
    --hub_model_id your-username/your-model-name
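If you launch several runs with different settings, it can help to assemble the command above programmatically. The following is a minimal sketch, not part of the nanoVLM repository: `build_training_command` and its keyword arguments are hypothetical names introduced here, and the only flags it emits are the ones shown in the command above, with the same default values.

```python
import shlex
from typing import List, Optional


def build_training_command(
    hub_model_id: Optional[str] = None,
    dataset_type: str = "mixed",
    dataset_limit: int = 5000,
    batch_size: int = 8,
    max_training_steps: int = 500,
    build_dataset: bool = True,
    compile_model: bool = True,
    push_to_hub: bool = True,
) -> List[str]:
    """Assemble the argv list for modal/submit_modal_training.py.

    Hypothetical helper: it only mirrors the flags shown in the model
    card's command and does not validate them against the script itself.
    """
    if min(dataset_limit, batch_size, max_training_steps) <= 0:
        raise ValueError("numeric settings must be positive")
    if push_to_hub and not hub_model_id:
        raise ValueError("hub_model_id is required when push_to_hub is set")

    cmd = ["python", "modal/submit_modal_training.py"]
    if build_dataset:
        cmd.append("--build_dataset")
    cmd += ["--dataset_type", dataset_type]
    cmd += ["--dataset_limit", str(dataset_limit)]
    cmd += ["--batch_size", str(batch_size)]
    cmd += ["--max_training_steps", str(max_training_steps)]
    if compile_model:
        cmd.append("--compile")
    if push_to_hub:
        cmd += ["--push_to_hub", "--hub_model_id", hub_model_id]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    command = build_training_command("your-username/your-model-name")
    print(shlex.join(command))
```

Returning a list rather than a single string lets you pass the result directly to `subprocess.run(command, check=True)` from the repository root without shell quoting issues.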
Training metrics and logs are available on Weights & Biases.
This model inherits potential biases from its training datasets (COCO, VQAv2), and users should keep the resulting limitations in mind when interpreting its outputs.
@misc{pgryko_nanovlm_COCO_VQAv2,
title={NanoVLM Fine-tuned on Mixed (COCO Captions + VQAv2)},
author={Modal.com Training Pipeline},
year={2024},
url={https://huggingface.co/pgryko/nanovlm-COCO-VQAv2}
}
This model was trained using an automated pipeline on Modal.com. For questions or issues, please refer to the nanoVLM repository.
Base model: lusxvr/nanoVLM-222M