
V-Do

This repository hosts V-Do, a Vision-Language Model (VLM) trained with the nanoVLM library. The model processes both visual and textual input, enabling tasks such as Visual Question Answering (VQA).

Model Overview

V-Do is built upon the nanoVLM architecture, which integrates key components for multimodal understanding:

  • Vision Encoder (ViT): Processes input images to extract visual features.
  • Language Model (LM): Handles textual input and generates text outputs.
  • Multimodal Projector (MP): Bridges the gap between the visual and textual modalities, allowing the LM to incorporate visual context (a schematic sketch of this data flow follows this list).
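
To make the data flow concrete, here is a minimal, self-contained sketch of how the three components compose. This is not the nanoVLM implementation; the module internals and dimensions are illustrative placeholders.

import torch
import torch.nn as nn

class TinyVLMSketch(nn.Module):
    """Schematic only: linear layers stand in for the real ViT and LM."""
    def __init__(self, vision_dim=768, lm_dim=576, vocab_size=49152):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # stand-in for the ViT
        self.projector = nn.Linear(vision_dim, lm_dim)           # multimodal projector (MP)
        self.lm_head = nn.Linear(lm_dim, vocab_size)             # stand-in for the LM

    def forward(self, image_patches, text_embeds):
        vis = self.projector(self.vision_encoder(image_patches))  # (B, N_img, lm_dim)
        seq = torch.cat([vis, text_embeds], dim=1)  # image tokens prepended to text tokens
        return self.lm_head(seq)                    # next-token logits over the vocabulary

logits = TinyVLMSketch()(torch.randn(1, 49, 768), torch.randn(1, 16, 576))
print(logits.shape)  # torch.Size([1, 65, 49152])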

The model has roughly 222M parameters (F32 tensors), with the weights provided in the efficient Safetensors format.
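
If you only need the raw tensors (for example, to inspect shapes or convert precision), the safetensors library can read the weight file directly. A minimal sketch, assuming the weights file is named model.safetensors and has already been downloaded locally:

from safetensors.torch import load_file

state_dict = load_file("model.safetensors")  # assumed local filename
total = sum(t.numel() for t in state_dict.values())
print(f"{total / 1e6:.0f}M parameters")  # should be roughly 222M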

Repository Structure

This repository is expected to contain the necessary files to load and use the VLM, including:

  • Model weights (in safetensors format)
  • Configuration files
  • Any additional files required by the nanoVLM library
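
You can inspect and fetch these files without loading the model, using the huggingface_hub client. A sketch, assuming huggingface_hub is installed and the checkpoint includes a config.json (typical for nanoVLM checkpoints):

from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "Duino/V-Do"
print(list_repo_files(repo_id))  # e.g. config.json, model.safetensors

# Download a single file and get its local cache path
cfg_path = hf_hub_download(repo_id=repo_id, filename="config.json")  # assumed filename
print(cfg_path)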

How to Use V-Do

To use the V-Do model for inference, follow these steps:

1. Install Dependencies and Get the nanoVLM Code

The model code lives in the nanoVLM library rather than in this model repository, so first install the required packages and clone nanoVLM into your local machine or Colab environment:

pip install torch datasets tqdm transformers accelerate -q
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM

2. Load the Model

Run the following from the nanoVLM root so its modules are importable. The from_pretrained method handles downloading from the Hub automatically:

import torch
# Imports follow nanoVLM's repository layout
from models.config import VLMConfig
from models.vision_language_model import VisionLanguageModel
from data.processors import get_tokenizer, get_image_processor

# Instantiate the VLMConfig, using this repo's Hub ID as the checkpoint path
vlm_cfg = VLMConfig(vlm_checkpoint_path="Duino/V-Do")

# Load the model directly from the Hugging Face Hub
model = None
try:
    model = VisionLanguageModel.from_pretrained(vlm_cfg.vlm_checkpoint_path)
    print(f"Successfully loaded model from {vlm_cfg.vlm_checkpoint_path}")
except Exception as e:
    print(f"Error loading model: {e}")

# Load the tokenizer and image processor
tokenizer = get_tokenizer(vlm_cfg.lm_tokenizer)
image_processor = get_image_processor(vlm_cfg.vit_img_size)

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if model is not None:
    model.to(device)
    print("\nModel, tokenizer, and image processor loaded successfully.")
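
3. Run Inference

With everything loaded, you can ask a question about an image. The sketch below follows the prompt-and-generate pattern used by nanoVLM's generate script; the exact generate signature and prompt template can vary between nanoVLM versions, and image.jpg is a placeholder path:

from PIL import Image

question = "What is in this image?"
prompt = f"Question: {question} Answer:"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)

img = Image.open("image.jpg").convert("RGB")                  # placeholder path
pixel_values = image_processor(img).unsqueeze(0).to(device)   # add batch dimension

# generate(input_ids, image, max_new_tokens) per nanoVLM's VisionLanguageModel
out = model.generate(input_ids, pixel_values, max_new_tokens=32)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])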