.gitattributes CHANGED
@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
- REPORT_Benchmarking[[:space:]]the[[:space:]]AI[[:space:]]advantage[[:space:]]in[[:space:]]finance.pdf filter=lfs diff=lfs merge=lfs -text
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
README.md CHANGED
@@ -3,70 +3,62 @@ license: apache-2.0
3
  language:
4
  - en
5
  base_model:
6
- - ibm-granite/granite-vision-3.3-2b
7
  library_name: transformers
8
  ---
9
- # granite-vision-3.3-2b-embedding
10
  **Model Summary:**
11
- Granite-vision-3.3-2b-embedding is an efficient embedding model based on [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b). This model is specifically designed for multimodal document retrieval, enabling queries on documents with tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.
12
- By removing the need for OCR-based text extractions, granite-vision-3.3-2b-embedding can help simplify and accelerate RAG pipelines.
 
13
 
14
  **Evaluations:**
15
- We evaluated granite-vision-3.3-2b-embedding alongside other top ColBERT-style multi-modal embedding models in the 1B-4B parameter range using two benchmarks: [ViDoRe V2](https://github.com/illuin-tech/vidore-benchmark/) and [Real-MM-RAG-Bench](https://huggingface.co/collections/ibm-research/real-mm-rag-bench-67d2dc0ddf2dfafe66f09d34), which specifically address complex multimodal document retrieval tasks.
16
 
17
  ## **NDCG@5 - ViDoRe V2**
18
- | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
19
- |----------------------------------------|--------------|------------------|-------------|-------------------|-----------
20
- | ESG Restaurant Human | 51.1 | 68.4 | 65.8 | 62.4 | 65.3 |
21
- | Economics Macro Multilingual | 49.9 | 56.5 | 55.4 | 47.4 | 51.2 |
22
- | MIT Biomedical | 59.7 | 63.6 | 63.5 | 58.1 |61.5 |
23
- | ESG Restaurant Synthetic | 57.0 | 57.4 | 56.6 | 51.1 |56.6 |
24
- | ESG Restaurant Synthetic Multilingual | 55.7 | 57.4 | 57.2 | 47.6 |55.7 |
25
- | MIT Biomedical Multilingual | 56.5 | 61.1 | 62.5 | 50.5 | 55.5 |
26
- | Economics Macro | 51.6 | 59.8 | 60.2 | 60.9 |58.3 |
27
- | **Avg (ViDoRe2)** | **54.5** | **60.6** | **60.2** | **54.0** |**57.7** |
28
 
29
  ## **NDCG@5 - REAL-MM-RAG**
30
- | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
31
- |----------------------------------------|--------------|------------------|-------------|--------------------------| ------------------
32
- | FinReport | 55 | 66 | 78 | 65 |73
33
- | FinSlides | 68 | 79 | 81 | 55 |79
34
- | TechReport | 78 | 86 | 88 | 83 |87
35
- | TechSlides | 90 | 93 | 92 | 91 |93
36
- | **Avg (REAL-MM-RAG)** | **73** | **81** | **85** | **74** |**83**
37
-
38
- - **Release Date**: June 11th 2025
39
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
40
- - **Supported Input Format:** Currently the model supports English instructions and images (png, jpeg) as input format.
41
-
42
  **Intended Use:**
43
  The model is intended to be used in enterprise applications that involve retrieval of visual and text data. In particular, the model is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever, or alongside a text-based retriever.
44
-
45
  ### Usage
 
46
  ```shell
47
  pip install -q torch torchvision torchaudio
48
- pip install transformers==4.50
49
  ```
50
  Then run the code:
51
  ```python
52
- from io import BytesIO
53
-
54
- import requests
55
- import torch
56
- from PIL import Image
57
  from transformers import AutoProcessor, AutoModel
58
- from transformers.utils.import_utils import is_flash_attn_2_available
 
59
 
60
  device = "cuda" if torch.cuda.is_available() else "cpu"
61
- model_name = "ibm-granite/granite-vision-3.3-2b-embedding"
62
- model = AutoModel.from_pretrained(
63
- model_name,
64
- trust_remote_code=True,
65
- torch_dtype=torch.float16,
66
- device_map=device,
67
- attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None
68
- ).eval()
69
- processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
70
 
71
  # ─────────────────────────────────────────────
72
  # Inputs: Image + Text
@@ -106,35 +98,24 @@ similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)
106
  print("\n" + "=" * 50)
107
  print(f"📊 Similarity between image and text: {similarity.item():.4f}")
108
  print("=" * 50)
 
109
  ```
110
  ### Use granite-vision-embedding-3.3-2b for MM RAG
111
- For an example of MM-RAG using granite-vision-3.3-2b-embedding refer to [this notebook](https://github.com/ibm-granite/granite-vision-models/blob/main/cookbooks/GraniteVisionEmbedding_MM-RAG_Notebook.ipynb).
112
 
113
  **Model Architecture:**
114
- The architecture of granite-vision-3.3-2b-embedding follows the ColPali (https://arxiv.org/abs/2407.01449) approach and consists of the following components:
115
-
116
- (1) Vision-Language model : granite-vision-3.3-2b (https://huggingface.co/ibm-granite/granite-vision-3.3-2b).
117
-
118
- (2) Projection layer: linear layer that projects the hidden layer dimension of Vision-Language model to 128 and outputs 729 embedding vectors per image.
119
-
120
- The scoring is computed using MaxSim-based late interaction mechanism.
121
-
122
  **Training Data:**
123
- Our training data is composed entirely of DocFM. DocFM is a large-scale comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF
124
  documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance)
125
- reports.
126
-
127
  **Infrastructure:**
128
- We train granite-vision-3.3-2b-embedding on IBM’s cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
129
-
130
  **Ethical Considerations and Limitations:**
131
- The use of Large Vision and Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. Granite-vision-3.3-2b-embedding is not the exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.
132
- Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use granite-vision-3.3-2b-embedding with ethical intentions and in a responsible way.
133
-
134
  **Resources**
135
- - 📄 Granite Vision technical report [here](https://arxiv.org/abs/2502.09927)
136
- - 📄 Real-MM-RAG-Bench paper (ACL 2025) [here](https://arxiv.org/abs/2502.12342)
137
- - 📄 Vidore 2 paper [here](https://www.arxiv.org/pdf/2505.17166)
138
- - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
139
- - 🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
140
- - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
 
3
  language:
4
  - en
5
  base_model:
6
+ - ibm-granite/granite-vision-3.3-2b-preview
7
  library_name: transformers
8
  ---
9
+ # granite-vision-embedding-3.3-2b
10
  **Model Summary:**
11
+ granite-vision-embedding-3.3-2b is an efficient embedding model based on the granite-vision Vision-Language Model (VLM). It is specifically designed for multi-modal document retrieval, enabling queries on documents with tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.
12
+ The model eliminates the need for OCR-based text extraction and related preprocessing steps.
13
+
14
 
15
  **Evaluations:**
16
+ We evaluated granite-vision-embedding-3.3-2b alongside other top ColBERT-style multi-modal embedding models in the 1B-3B parameter range using two benchmarks: [ViDoRe V2](https://github.com/illuin-tech/vidore-benchmark/) and [Real-MM-RAG-Bench](https://arxiv.org/abs/2502.12342), which specifically address complex multi-modal document retrieval tasks.
17
 
18
  ## **NDCG@5 - ViDoRe V2**
19
+ | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
20
+ |----------------------------------------|--------------|------------------|-------------|--------------------------|
21
+ | ESG Restaurant Human | 51.10 | 68.40 | 65.80 | 60.00 |
22
+ | Economics Macro Multilingual | 49.90 | 56.50 | 55.40 | 50.13 |
23
+ | MIT Biomedical | 59.70 | 63.60 | 63.50 | 60.00 |
24
+ | ESG Restaurant Synthetic | 57.00 | 57.40 | 56.60 | 54.00 |
25
+ | ESG Restaurant Synthetic Multilingual | 55.70 | 57.40 | 57.20 | 52.00 |
26
+ | MIT Biomedical Multilingual | 56.50 | 61.10 | 62.50 | 54.00 |
27
+ | Economics Macro | 51.60 | 59.80 | 60.20 | 57.00 |
28
+ | **Avg (ViDoRe2)** | **54.50** | **60.60** | **60.17** | **55.20** |
29
 
30
  ## **NDCG@5 - REAL-MM-RAG**
31
+ | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
32
+ |----------------------------------------|--------------|------------------|-------------|--------------------------|
33
+ | FinReport | 0.55 | 0.66 | 0.78 | 0.60 |
34
+ | FinSlides | 0.68 | 0.79 | 0.81 | 0.72 |
35
+ | TechReport | 0.78 | 0.86 | 0.88 | 0.80 |
36
+ | TechSlides | 0.90 | 0.93 | 0.92 | 0.92 |
37
+ | **Avg (REAL-MM-RAG)** | **0.73** | **0.81** | **0.85** | **0.79** |
38
+
39
+ - **Release Date**: June 2025
40
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
41
+ - **Supported Input Format:** Currently the model supports English queries and images (png, jpeg, etc.) as input.
42
+
43
  **Intended Use:**
44
  The model is intended to be used in enterprise applications that involve retrieval of visual and text data. In particular, the model is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever, or alongside a text-based retriever.
 
45
  ### Usage
46
+ First, make sure to install a recent version of transformers:
47
  ```shell
48
  pip install -q torch torchvision torchaudio
49
+ pip install "transformers>=4.49"
50
  ```
51
  Then run the code:
52
  ```python
 
 
 
 
 
53
  from transformers import AutoProcessor, AutoModel
54
+ from PIL import Image
55
+ import torch
56
 
57
  device = "cuda" if torch.cuda.is_available() else "cpu"
58
+ model_name = "ibm-granite/granite-vision-embedding-3.3-2b"
59
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
60
+ processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
61
+
 
 
 
 
 
62
 
63
  # ─────────────────────────────────────────────
64
  # Inputs: Image + Text
 
98
  print("\n" + "=" * 50)
99
  print(f"📊 Similarity between image and text: {similarity.item():.4f}")
100
  print("=" * 50)
101
+
102
  ```
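For a compact end-to-end flow, the sketch below embeds one page image and one text query and scores them with the processor's MaxSim scorer. This is a minimal sketch: the `process_images` and `process_queries` helper names are assumed to follow the ColPali-style processor API shipped with this repository, and the image URL and query are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ibm-granite/granite-vision-embedding-3.3-2b"

model = AutoModel.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.float16
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Placeholder inputs: any document-page image and a natural-language query.
image = Image.open(requests.get("https://example.com/page.png", stream=True).raw)
query = "What is the revenue trend shown in the chart?"

# Assumed ColPali-style helpers exposed by the custom processor.
image_inputs = processor.process_images([image]).to(device)
query_inputs = processor.process_queries([query]).to(device)

with torch.no_grad():
    img_emb = model(**image_inputs)   # multi-vector page embeddings
    txt_emb = model(**query_inputs)   # multi-vector query embeddings

# MaxSim late-interaction similarity (higher means more relevant).
similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)
print(f"Similarity: {similarity.item():.4f}")
```

When scoring multiple queries against multiple pages, the same `processor.score` call returns one row of scores per query, which can then be ranked to retrieve the most relevant pages.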
103
  ### Use granite-vision-embedding-3.3-2b for MM RAG
104
+ For an example of MM-RAG using granite-vision-embedding-3.3-2b, refer to [this notebook](......).
105
 
106
  **Model Architecture:**
107
+ We built our model upon [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b), following the [ColPali](https://arxiv.org/abs/2407.01449) approach: an additional linear projection layer maps the VLM hidden states to 128-dimensional embeddings, yielding a multi-vector representation per page, and relevance is scored with a MaxSim-based late-interaction mechanism.
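As an illustration of MaxSim late-interaction scoring, here is a minimal sketch for a single query/page pair (not the repository's exact implementation):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction for one query/page pair.

    query_emb: (num_query_tokens, 128) multi-vector query embedding
    page_emb:  (num_page_vectors, 128) multi-vector page embedding
    """
    # Pairwise similarities between every query token and every page vector.
    sim = query_emb @ page_emb.T            # (num_query_tokens, num_page_vectors)
    # For each query token, keep its best-matching page vector, then sum over tokens.
    return sim.max(dim=1).values.sum()

# Toy example with random, L2-normalized vectors; the real model emits 128-dim
# embeddings (e.g. 729 vectors per page, as noted in the original model card).
q = torch.nn.functional.normalize(torch.randn(16, 128), dim=-1)
p = torch.nn.functional.normalize(torch.randn(729, 128), dim=-1)
print(maxsim_score(q, p))
```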
 
 
 
 
 
 
 
108
  **Training Data:**
109
+ The model was trained on a random subset of DOCFM, a large-scale comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF
110
  documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance)
111
+ reports. For each image in the dataset, pseudo-questions were generated using the Pixtral12B VLM.
 
112
  **Infrastructure:**
113
+ We train granite-vision-embedding-3.3-2b on IBM’s cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
 
114
  **Ethical Considerations and Limitations:**
115
+ The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. granite-vision-embedding-3.3-2b is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.
116
+ Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use granite-vision-embedding-3.3-2b with ethical intentions and in a responsible way.
 
117
  **Resources**
118
+ - 📄 Granite Vision technical report [here](https://arxiv.org/abs/2502.09927)
119
+ - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
120
+ - 🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
121
+ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
 
 
REPORT_Benchmarking the AI advantage in finance.pdf DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:4e6da951c55eef3fd52aa41543f3b4377ab26e2758c579aec2d11068a66b3d20
3
- size 1746880
 
 
 
 
granite_vision_embedding_config.py → colgranitevision_config.py RENAMED
@@ -1,8 +1,8 @@
1
  from transformers import LlavaNextConfig
2
 
3
 
4
- class GraniteVisionEmbConfig(LlavaNextConfig):
5
- model_type = "granitevisionemb"
6
 
7
  def __init__(self, **kwargs):
8
  self.base_model = kwargs.get("base_model", None)
@@ -11,5 +11,3 @@ class GraniteVisionEmbConfig(LlavaNextConfig):
11
  self.base_image_feature_location = kwargs.get("base_image_feature_location", "last")
12
  self.adapter_path = kwargs.get("adapter_path", None)
13
  super().__init__(**kwargs)
14
-
15
-
 
1
  from transformers import LlavaNextConfig
2
 
3
 
4
+ class ColGraniteVisionConfig(LlavaNextConfig):
5
+ model_type = "colgranitevision"
6
 
7
  def __init__(self, **kwargs):
8
  self.base_model = kwargs.get("base_model", None)
 
11
  self.base_image_feature_location = kwargs.get("base_image_feature_location", "last")
12
  self.adapter_path = kwargs.get("adapter_path", None)
13
  super().__init__(**kwargs)
 
 
config.json CHANGED
@@ -1,18 +1,18 @@
1
  {
2
- "_name_or_path": "ibm_granite/granite-vision-3.3-2b",
3
  "adapter_path": null,
4
- "auto_map": {
5
- "AutoModel": "modeling_granite_vision_embedding.GraniteVisionEmb",
6
- "AutoProcessor": "processing_granite_vision_embedding.GraniteVisionEmbProcessor",
7
- "AutoConfig": "granite_vision_embedding_config.GraniteVisionEmbConfig"
8
  },
9
  "architectures": [
10
- "GraniteVisionEmb"
11
  ],
12
- "base_image_feature_location": "last",
13
  "base_model": null,
14
  "emb_dim_doc": 128,
15
  "emb_dim_query": 128,
 
16
  "image_grid_pinpoints": [
17
  [
18
  384,
@@ -121,7 +121,7 @@
121
  ],
122
  "image_seq_length": 576,
123
  "image_token_index": 49155,
124
- "model_type": "granitevisionemb",
125
  "multimodal_projector_bias": true,
126
  "pretrained_language_model": "",
127
  "pretrained_vision_tower": "",
@@ -149,12 +149,12 @@
149
  "rms_norm_eps": 1e-05,
150
  "rope_theta": 300000,
151
  "tie_word_embeddings": true,
152
- "torch_dtype": "bfloat16",
153
  "vocab_size": 49156
154
  },
155
  "tie_word_embeddings": true,
156
  "torch_dtype": "float32",
157
- "transformers_version": "4.49.0",
158
  "use_image_newline_parameter": true,
159
  "vision_config": {
160
  "_attn_implementation_autoset": true,
@@ -167,7 +167,7 @@
167
  "num_attention_heads": 16,
168
  "num_hidden_layers": 27,
169
  "patch_size": 14,
170
- "torch_dtype": "bfloat16"
171
  },
172
  "vision_feature_layer": [
173
  -24,
 
1
  {
2
+ "_name_or_path": "ibm-granite/granite-vision-3.3-2b",
3
  "adapter_path": null,
4
+ "auto_map": {
5
+ "AutoModel": "modeling_colgranitevision.ColGraniteVision",
6
+ "AutoProcessor": "processing_colgranitevision.ColGraniteVisionProcessor",
7
+ "AutoConfig": "colgranitevision_config.ColGraniteVisionConfig"
8
  },
9
  "architectures": [
10
+ "ColGraniteVision"
11
  ],
 
12
  "base_model": null,
13
  "emb_dim_doc": 128,
14
  "emb_dim_query": 128,
15
+ "base_image_feature_location": "last",
16
  "image_grid_pinpoints": [
17
  [
18
  384,
 
121
  ],
122
  "image_seq_length": 576,
123
  "image_token_index": 49155,
124
+ "model_type": "colgranitevision",
125
  "multimodal_projector_bias": true,
126
  "pretrained_language_model": "",
127
  "pretrained_vision_tower": "",
 
149
  "rms_norm_eps": 1e-05,
150
  "rope_theta": 300000,
151
  "tie_word_embeddings": true,
152
+ "torch_dtype": "float32",
153
  "vocab_size": 49156
154
  },
155
  "tie_word_embeddings": true,
156
  "torch_dtype": "float32",
157
+ "transformers_version": "4.50.0.dev0",
158
  "use_image_newline_parameter": true,
159
  "vision_config": {
160
  "_attn_implementation_autoset": true,
 
167
  "num_attention_heads": 16,
168
  "num_hidden_layers": 27,
169
  "patch_size": 14,
170
+ "torch_dtype": "float32"
171
  },
172
  "vision_feature_layer": [
173
  -24,
modeling_granite_vision_embedding.py → modeling_colgranitevision.py RENAMED
@@ -7,10 +7,11 @@ from transformers import LlavaNextPreTrainedModel
7
  from transformers.models.llava_next.modeling_llava_next import LlavaNextForConditionalGeneration
8
  from transformers.models.llava_next.modeling_llava_next import unpad_image, get_anyres_image_grid_shape
9
 
10
- from .granite_vision_embedding_config import GraniteVisionEmbConfig
11
 
12
- class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
13
 
 
 
14
  def pack_image_features(
15
  self,
16
  image_features,
@@ -92,15 +93,15 @@ class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
92
  return image_features, feature_lens
93
 
94
 
95
- class GraniteVisionEmb(LlavaNextPreTrainedModel):
96
  """
97
- GraniteVisionEmb model implementation.
98
  """
99
 
100
  main_input_name: ClassVar[str] = "doc_input_ids" # transformers-related
101
- config_class = GraniteVisionEmbConfig
102
 
103
- def __init__(self, config: GraniteVisionEmbConfig):
104
  super().__init__(config=config)
105
 
106
  model = LlavaNextWithCustomPacking(config=config)
@@ -108,6 +109,8 @@ class GraniteVisionEmb(LlavaNextPreTrainedModel):
108
  self._tied_weights_keys = [f"model.language_model.{k}" for k in model.language_model._tied_weights_keys]
109
  self.model = model
110
 
 
 
111
  self.dim = 128
112
  self.custom_text_proj = nn.Linear(self.model.config.text_config.hidden_size, self.dim)
113
 
 
7
  from transformers.models.llava_next.modeling_llava_next import LlavaNextForConditionalGeneration
8
  from transformers.models.llava_next.modeling_llava_next import unpad_image, get_anyres_image_grid_shape
9
 
10
+ from .colgranitevision_config import ColGraniteVisionConfig
11
 
 
12
 
13
+ class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
14
+
15
  def pack_image_features(
16
  self,
17
  image_features,
 
93
  return image_features, feature_lens
94
 
95
 
96
+ class ColGraniteVision(LlavaNextPreTrainedModel):
97
  """
98
+ ColGraniteVision model implementation.
99
  """
100
 
101
  main_input_name: ClassVar[str] = "doc_input_ids" # transformers-related
102
+ config_class = ColGraniteVisionConfig
103
 
104
+ def __init__(self, config: ColGraniteVisionConfig):
105
  super().__init__(config=config)
106
 
107
  model = LlavaNextWithCustomPacking(config=config)
 
109
  self._tied_weights_keys = [f"model.language_model.{k}" for k in model.language_model._tied_weights_keys]
110
  self.model = model
111
 
112
+ # TODO: Wait for ColPali2 to create a ColPaliConfig to allow specifying the embedding dimension.
113
+ # We could do it now but it would break all the models trying to load the model from the checkpoint.
114
  self.dim = 128
115
  self.custom_text_proj = nn.Linear(self.model.config.text_config.hidden_size, self.dim)
116
 
preprocessor_config.json CHANGED
@@ -127,7 +127,7 @@
127
  0.5,
128
  0.5
129
  ],
130
- "processor_class": "GraniteVisionEmbProcessor",
131
  "resample": 3,
132
  "rescale_factor": 0.00392156862745098,
133
  "size": {
 
127
  0.5,
128
  0.5
129
  ],
130
+ "processor_class": "ColGraniteVisionProcessor",
131
  "resample": 3,
132
  "rescale_factor": 0.00392156862745098,
133
  "size": {
processing_granite_vision_embedding.py → processing_colgranitevision.py RENAMED
@@ -21,9 +21,9 @@ def floor_by_factor(number: float, factor: int) -> int:
21
  return math.floor(number / factor) * factor
22
 
23
 
24
- class GraniteVisionEmbProcessor(LlavaNextProcessor):
25
  """
26
- Processor for GraniteVisionEmb.
27
  """
28
 
29
  visual_prompt_prefix: ClassVar[str] = "<|user|>\n<image>\nDescribe the image.\n"
@@ -140,14 +140,14 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
140
  max_size=self.max_size,
141
  fill_color=0
142
  )
143
-
144
  def resize_and_pad_centered_to_long_side(
145
- self,
146
- image: Image.Image,
147
- factor: int,
148
- min_size: int,
149
- max_size: int,
150
- fill_color=0
151
  ) -> Image.Image:
152
  """
153
  Resizes and pads an image such that:
@@ -183,10 +183,10 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
183
 
184
  # Resize the image
185
  resized_image = image.resize((target_width, target_height), Image.LANCZOS)
186
- final_image = resized_image.convert("RGB")
187
 
188
  return final_image
189
-
190
  def resize_and_pad_centered(self,
191
  image: Image.Image,
192
  factor: int,
@@ -300,7 +300,7 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
300
  images: List[Image.Image],
301
  ) -> BatchFeature:
302
  """
303
- Process images.
304
  """
305
  # texts_doc = [self.apply_chat_template(self.format_data_wo_role(self.visual_prompt_prefix, img),tokenize=False ) for img in images]
306
  texts_doc = [self.visual_prompt_prefix for _ in images]
@@ -320,7 +320,10 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
320
 
321
  processed = []
322
  for q in queries:
323
- q = self.query_start + self.query_prefix + q + ' ' + q
 
 
 
324
  q += suffix + "\n"
325
  processed.append(q)
326
 
@@ -391,7 +394,7 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
391
  ) -> torch.Tensor:
392
  """
393
  Compute the late-interaction/MaxSim score (ColBERT-like) for the given multi-vector
394
- query embeddings (`qs`) and passage embeddings (`ps`). For us, a passage is the
395
  image of a document page.
396
 
397
  Because the embedding tensors are multi-vector and can thus have different shapes, they
@@ -436,4 +439,4 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
436
  assert scores.shape[0] == len(qs), f"Expected {len(qs)} scores, got {scores.shape[0]}"
437
 
438
  scores = scores.to(torch.float32)
439
- return scores
 
21
  return math.floor(number / factor) * factor
22
 
23
 
24
+ class ColGraniteVisionProcessor(LlavaNextProcessor):
25
  """
26
+ Processor for ColGraniteVision.
27
  """
28
 
29
  visual_prompt_prefix: ClassVar[str] = "<|user|>\n<image>\nDescribe the image.\n"
 
140
  max_size=self.max_size,
141
  fill_color=0
142
  )
143
+
144
  def resize_and_pad_centered_to_long_side(
145
+ self,
146
+ image: Image.Image,
147
+ factor: int,
148
+ min_size: int,
149
+ max_size: int,
150
+ fill_color=0
151
  ) -> Image.Image:
152
  """
153
  Resizes and pads an image such that:
 
183
 
184
  # Resize the image
185
  resized_image = image.resize((target_width, target_height), Image.LANCZOS)
186
+ final_image = resized_image.convert("RGB")
187
 
188
  return final_image
189
+
190
  def resize_and_pad_centered(self,
191
  image: Image.Image,
192
  factor: int,
 
300
  images: List[Image.Image],
301
  ) -> BatchFeature:
302
  """
303
+ Process images for ColGraniteVision.
304
  """
305
  # texts_doc = [self.apply_chat_template(self.format_data_wo_role(self.visual_prompt_prefix, img),tokenize=False ) for img in images]
306
  texts_doc = [self.visual_prompt_prefix for _ in images]
 
320
 
321
  processed = []
322
  for q in queries:
323
+ q = self.query_start + self.query_prefix + q
324
+ # truncate before it eats actual query content
325
+ if len(q) + len(suffix) > max_length:
326
+ q = q[: max_length - len(suffix) - 1]
327
  q += suffix + "\n"
328
  processed.append(q)
329
 
 
394
  ) -> torch.Tensor:
395
  """
396
  Compute the late-interaction/MaxSim score (ColBERT-like) for the given multi-vector
397
+ query embeddings (`qs`) and passage embeddings (`ps`). For ColGraniteVision, a passage is the
398
  image of a document page.
399
 
400
  Because the embedding tensors are multi-vector and can thus have different shapes, they
 
439
  assert scores.shape[0] == len(qs), f"Expected {len(qs)} scores, got {scores.shape[0]}"
440
 
441
  scores = scores.to(torch.float32)
442
+ return scores
processor_config.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
- "processor_class": "GraniteVisionEmbProcessor",
3
  "auto_map": {
4
- "AutoProcessor": "processing_granite_vision_embedding.GraniteVisionEmbProcessor"
5
  }
6
  }
 
1
  {
2
+ "processor_class": "ColGraniteVisionProcessor",
3
  "auto_map": {
4
+ "AutoProcessor": "processing_colgranitevision.ColGraniteVisionProcessor"
5
  }
6
  }