Duino committed (verified) · Commit 78551ee · Parent: efb81cc

Create README.md

Files changed (1): README.md (+56 −0)
# V-Do

This repository hosts a Vision-Language Model (VLM) trained using the nanoVLM library. The model is designed to understand and process both visual and textual information, making it capable of performing tasks such as Visual Question Answering (VQA).

## Model Overview

V-Do is built upon the nanoVLM architecture, which integrates three key components for multimodal understanding:

* **Vision Encoder (ViT):** Processes input images to extract visual features.
* **Language Model (LM):** Handles textual input and generates text outputs.
* **Multimodal Projector (MP):** Bridges the gap between the visual and textual modalities, allowing the LM to incorporate visual context.

The model weights are provided in the efficient [Safetensors](https://huggingface.co/docs/safetensors/index) format.
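The way these three components compose can be sketched abstractly: the projector maps ViT features into the LM's embedding space, and the resulting visual tokens are prepended to the text embeddings before the LM consumes the joint sequence. The stand-in functions, names, and dimensions below are illustrative only, not the actual nanoVLM API:

```python
# Toy stand-ins for the three nanoVLM components (illustrative, not the real API).

def vision_encoder(patches):
    """ViT stand-in: turn image patches into visual features."""
    return [[0.5 * x for x in patch] for patch in patches]

def multimodal_projector(features):
    """MP stand-in: map visual features into the LM embedding dimension (3 here)."""
    return [feat + [0.0] for feat in features]

def language_model(sequence):
    """LM stand-in: consume the joint visual + text sequence."""
    return f"saw {len(sequence)} embeddings"

image_patches = [[1.0, 2.0], [3.0, 4.0]]  # 2 patches, dim 2
text_embeddings = [[0.1, 0.2, 0.3]]       # 1 text token, dim 3

visual_tokens = multimodal_projector(vision_encoder(image_patches))
joint = visual_tokens + text_embeddings   # visual context prepended to the text
print(language_model(joint))              # prints "saw 3 embeddings"
```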

## Repository Structure

This repository is expected to contain the files needed to load and use the VLM, including:

* Model weights (in Safetensors format)
* Configuration files
* Any other files required by the nanoVLM library

## How to Use V-Do

To use the V-Do model for inference, follow these steps:

### 1. Install Dependencies and Clone nanoVLM

First, install the required Python packages and clone the nanoVLM library (which provides the model and processor classes) in your local machine or Colab environment:

```shell
pip install torch datasets tqdm transformers accelerate -q
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM
```

### 2. Load the Model and Run Inference

The snippet below loads V-Do from the Hub and runs a simple VQA query. The import paths and the `generate` arguments follow the nanoVLM repository layout; check the version you cloned if they differ.

```python
import torch
from PIL import Image

# These modules come from the nanoVLM repository cloned above;
# run this script from inside the nanoVLM directory (or add it to PYTHONPATH).
from models.config import VLMConfig
from models.vision_language_model import VisionLanguageModel
from data.processors import get_tokenizer, get_image_processor

# Instantiate the VLMConfig, using the repo ID for loading.
# The from_pretrained method will automatically handle loading from the Hub.
vlm_cfg = VLMConfig(vlm_checkpoint_path="Duino/V-Do")

# Load the model directly from the Hugging Face Hub
model = None
try:
    model = VisionLanguageModel.from_pretrained(vlm_cfg.vlm_checkpoint_path)
    print(f"Successfully loaded model from {vlm_cfg.vlm_checkpoint_path}")
except Exception as e:
    print(f"Error loading model: {e}")

# Load the tokenizer and image processor
tokenizer = get_tokenizer(vlm_cfg.lm_tokenizer)
image_processor = get_image_processor(vlm_cfg.vit_img_size)

# Move model to the device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if model is not None:
    model.to(device)
    model.eval()

print("\nModel, tokenizer, and image processor loaded successfully.")

# Run a VQA query. The generate call mirrors nanoVLM's generate.py;
# adjust the arguments if your nanoVLM version differs.
prompt = "Question: What is in this image? Answer:"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)

image = Image.open("example.jpg").convert("RGB")  # path to your test image
pixel_values = image_processor(image).unsqueeze(0).to(device)

generated = model.generate(input_ids, pixel_values, max_new_tokens=20)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```