.gitattributes CHANGED
@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
- REPORT_Benchmarking[[:space:]]the[[:space:]]AI[[:space:]]advantage[[:space:]]in[[:space:]]finance.pdf filter=lfs diff=lfs merge=lfs -text
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
README.md CHANGED
@@ -3,70 +3,62 @@ license: apache-2.0
3
  language:
4
  - en
5
  base_model:
6
- - ibm-granite/granite-vision-3.3-2b
7
  library_name: transformers
8
  ---
9
- # granite-vision-3.3-2b-embedding
10
  **Model Summary:**
11
- Granite-vision-3.3-2b-embedding is an efficient embedding model based on [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b). This model is specifically designed for multimodal document retrieval, enabling queries on documents with tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.
12
- By removing the need for OCR-based text extractions, granite-vision-3.3-2b-embedding can help simplify and accelerate RAG pipelines.
 
13
 
14
  **Evaluations:**
15
- We evaluated granite-vision-3.3-2b-embedding alongside other top ColBERT-style multi-modal embedding models in the 1B-4B parameter range using two benchmarks: [ViDoRe V2](https://github.com/illuin-tech/vidore-benchmark/) and [Real-MM-RAG-Bench](https://huggingface.co/collections/ibm-research/real-mm-rag-bench-67d2dc0ddf2dfafe66f09d34), which specifically address complex multimodal document retrieval tasks.
16
 
17
  ## **NDCG@5 - ViDoRe V2**
18
- | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
19
- |----------------------------------------|--------------|------------------|-------------|-------------------|-----------
20
- | ESG Restaurant Human | 51.1 | 68.4 | 65.8 | 62.4 | 65.3 |
21
- | Economics Macro Multilingual | 49.9 | 56.5 | 55.4 | 47.4 | 51.2 |
22
- | MIT Biomedical | 59.7 | 63.6 | 63.5 | 58.1 |61.5 |
23
- | ESG Restaurant Synthetic | 57.0 | 57.4 | 56.6 | 51.1 |56.6 |
24
- | ESG Restaurant Synthetic Multilingual | 55.7 | 57.4 | 57.2 | 47.6 |55.7 |
25
- | MIT Biomedical Multilingual | 56.5 | 61.1 | 62.5 | 50.5 | 55.5 |
26
- | Economics Macro | 51.6 | 59.8 | 60.2 | 60.9 |58.3 |
27
- | **Avg (ViDoRe2)** | **54.5** | **60.6** | **60.2** | **54.0** |**57.7** |
28
 
29
  ## **NDCG@5 - REAL-MM-RAG**
30
- | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
31
- |----------------------------------------|--------------|------------------|-------------|--------------------------| ------------------
32
- | FinReport | 55 | 66 | 78 | 65 |73
33
- | FinSlides | 68 | 79 | 81 | 55 |79
34
- | TechReport | 78 | 86 | 88 | 83 |87
35
- | TechSlides | 90 | 93 | 92 | 91 |93
36
- | **Avg (REAL-MM-RAG)** | **73** | **81** | **85** | **74** |**83**
37
-
38
- - **Release Date**: June 11th 2025
39
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
40
- - **Supported Input Format:** Currently the model supports English instructions and images (png, jpeg) as input format.
41
-
42
  **Intended Use:**
43
  The model is intended to be used in enterprise applications that involve retrieval of visual and text data. In particular, the model is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever, or alongside a text-based retriever.
44
-
45
  ### Usage
 
46
  ```shell
47
  pip install -q torch torchvision torchaudio
48
- pip install transformers==4.50
49
  ```
50
  Then run the code:
51
  ```python
52
- from io import BytesIO
53
-
54
- import requests
55
- import torch
56
- from PIL import Image
57
  from transformers import AutoProcessor, AutoModel
58
- from transformers.utils.import_utils import is_flash_attn_2_available
 
59
 
60
  device = "cuda" if torch.cuda.is_available() else "cpu"
61
- model_name = "ibm-granite/granite-vision-3.3-2b-embedding"
62
- model = AutoModel.from_pretrained(
63
- model_name,
64
- trust_remote_code=True,
65
- torch_dtype=torch.float16,
66
- device_map=device,
67
- attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None
68
- ).eval()
69
- processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
70
 
71
  # ─────────────────────────────────────────────
72
  # Inputs: Image + Text
@@ -106,35 +98,24 @@ similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)
106
  print("\n" + "=" * 50)
107
  print(f"📊 Similarity between image and text: {similarity.item():.4f}")
108
  print("=" * 50)
 
109
  ```
110
  ### Use granite-vision-embedding-3.3-2b for MM RAG
111
- For an example of MM-RAG using granite-vision-3.3-2b-embedding refer to [this notebook](https://github.com/ibm-granite/granite-vision-models/blob/main/cookbooks/GraniteVisionEmbedding_MM-RAG_Notebook.ipynb).
112
 
113
  **Model Architecture:**
114
- The architecture of granite-vision-3.3-2b-embedding follows the ColPali (https://arxiv.org/abs/2407.01449) approach and consists of the following components:
115
-
116
- (1) Vision-Language model : granite-vision-3.3-2b (https://huggingface.co/ibm-granite/granite-vision-3.3-2b).
117
-
118
- (2) Projection layer: linear layer that projects the hidden layer dimension of Vision-Language model to 128 and outputs 729 embedding vectors per image.
119
-
120
- The scoring is computed using MaxSim-based late interaction mechanism.
121
-
122
  **Training Data:**
123
- Our training data is composed entirely of DocFM. DocFM is a large-scale comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF
124
  documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance)
125
- reports.
126
-
127
  **Infrastructure:**
128
- We train granite-vision-3.3-2b-embedding on IBM’s cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
129
-
130
  **Ethical Considerations and Limitations:**
131
- The use of Large Vision and Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. Granite-vision-3.3-2b-embedding is not the exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.
132
- Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use granite-vision-3.3-2b-embedding with ethical intentions and in a responsible way.
133
-
134
  **Resources**
135
- - 📄 Granite Vision technical report [here](https://arxiv.org/abs/2502.09927)
136
- - 📄 Real-MM-RAG-Bench paper (ACL 2025) [here](https://arxiv.org/abs/2502.12342)
137
- - 📄 Vidore 2 paper [here](https://www.arxiv.org/pdf/2505.17166)
138
- - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
139
- - 🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
140
- - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
 
3
  language:
4
  - en
5
  base_model:
6
+ - ibm-granite/granite-vision-3.3-2b-preview
7
  library_name: transformers
8
  ---
9
+ # granite-vision-embedding-3.3-2b
10
  **Model Summary:**
11
+ granite-vision-embedding-3.3-2b is an efficient embedding model based on the granite-vision Vision-Language Model (VLM). It is specifically designed for multi-modal document retrieval, enabling queries on documents with tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.
12
+ The model eliminates the need for OCR-based text extraction and related preprocessing steps.
13
+
14
 
15
  **Evaluations:**
16
+ We evaluated granite-vision-embedding-3.3-2b alongside other top ColBERT-style multi-modal embedding models in the 1B-3B parameter range using two benchmarks: [ViDoRe V2](https://github.com/illuin-tech/vidore-benchmark/) and [Real-MM-RAG-Bench](https://arxiv.org/abs/2502.12342), which specifically address complex multi-modal document retrieval tasks.
17
 
18
  ## **NDCG@5 - ViDoRe V2**
19
+ | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
20
+ |----------------------------------------|--------------|------------------|-------------|--------------------------|
21
+ | ESG Restaurant Human | 51.10 | 68.40 | 65.80 | 60.00 |
22
+ | Economics Macro Multilingual | 49.90 | 56.50 | 55.40 | 50.13 |
23
+ | MIT Biomedical | 59.70 | 63.60 | 63.50 | 60.00 |
24
+ | ESG Restaurant Synthetic | 57.00 | 57.40 | 56.60 | 54.00 |
25
+ | ESG Restaurant Synthetic Multilingual | 55.70 | 57.40 | 57.20 | 52.00 |
26
+ | MIT Biomedical Multilingual | 56.50 | 61.10 | 62.50 | 54.00 |
27
+ | Economics Macro | 51.60 | 59.80 | 60.20 | 57.00 |
28
+ | **Avg (ViDoRe2)** | **54.50** | **60.60** | **60.17** | **55.20** |
29
 
30
  ## **NDCG@5 - REAL-MM-RAG**
31
+ | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
32
+ |----------------------------------------|--------------|------------------|-------------|--------------------------|
33
+ | FinReport | 0.55 | 0.66 | 0.78 | 0.60 |
34
+ | FinSlides | 0.68 | 0.79 | 0.81 | 0.72 |
35
+ | TechReport | 0.78 | 0.86 | 0.88 | 0.80 |
36
+ | TechSlides | 0.90 | 0.93 | 0.92 | 0.92 |
37
+ | **Avg (REAL-MM-RAG)** | **0.73** | **0.81** | **0.85** | **0.79** |
38
+
39
+ - **Release Date**: June 2025
40
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
41
+ - **Supported Input Format:** Currently the model supports English queries and images (png, jpeg, etc.) as input.
42
+
43
  **Intended Use:**
44
  The model is intended to be used in enterprise applications that involve retrieval of visual and text data. In particular, the model is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever, or alongside a text-based retriever.
 
45
  ### Usage
46
+ First, make sure to install a recent version of transformers:
47
  ```shell
48
  pip install -q torch torchvision torchaudio
49
+ pip install "transformers>=4.49"
50
  ```
51
  Then run the code:
52
  ```python
 
 
 
 
 
53
  from transformers import AutoProcessor, AutoModel
54
+ from PIL import Image
55
+ import torch
56
 
57
  device = "cuda" if torch.cuda.is_available() else "cpu"
58
+ model_name = "ibm-granite/granite-vision-embedding-3.3-2b"
59
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
60
+ processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
61
+
 
 
 
 
 
62
 
63
  # ─────────────────────────────────────────────
64
  # Inputs: Image + Text
 
98
  print("\n" + "=" * 50)
99
  print(f"📊 Similarity between image and text: {similarity.item():.4f}")
100
  print("=" * 50)
101
+
102
  ```
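For a compact end-to-end flow, the sketch below embeds one page image and one text query and scores them with the processor's MaxSim scorer. This is a minimal sketch: the `process_images` and `process_queries` helper names are assumed to follow the ColPali-style processor API shipped with this repository, and the image URL and query are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "ibm-granite/granite-vision-embedding-3.3-2b"

model = AutoModel.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.float16
).to(device).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Placeholder inputs: any document-page image and a natural-language query.
image = Image.open(requests.get("https://example.com/page.png", stream=True).raw)
query = "What is the revenue trend shown in the chart?"

# Assumed ColPali-style helpers exposed by the custom processor.
image_inputs = processor.process_images([image]).to(device)
query_inputs = processor.process_queries([query]).to(device)

with torch.no_grad():
    img_emb = model(**image_inputs)   # multi-vector page embeddings
    txt_emb = model(**query_inputs)   # multi-vector query embeddings

# MaxSim late-interaction similarity (higher means more relevant).
similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)
print(f"Similarity: {similarity.item():.4f}")
```

When scoring multiple queries against multiple pages, the same `processor.score` call returns one row of scores per query, which can then be ranked to retrieve the most relevant pages.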
103
  ### Use granite-vision-embedding-3.3-2b for MM RAG
104
+ For an example of MM-RAG using granite-vision-embedding-3.3-2b, refer to [this notebook](......).
105
 
106
  **Model Architecture:**
107
+ We built our model upon [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b), following the [ColPali](https://arxiv.org/abs/2407.01449) approach: an additional linear projection layer maps the VLM hidden states to 128-dimensional embeddings, yielding a multi-vector representation per page, and relevance is scored with a MaxSim-based late-interaction mechanism.
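As an illustration of MaxSim late-interaction scoring, here is a minimal sketch for a single query/page pair (not the repository's exact implementation):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction for one query/page pair.

    query_emb: (num_query_tokens, 128) multi-vector query embedding
    page_emb:  (num_page_vectors, 128) multi-vector page embedding
    """
    # Pairwise similarities between every query token and every page vector.
    sim = query_emb @ page_emb.T            # (num_query_tokens, num_page_vectors)
    # For each query token, keep its best-matching page vector, then sum over tokens.
    return sim.max(dim=1).values.sum()

# Toy example with random, L2-normalized vectors; the real model emits 128-dim
# embeddings (e.g. 729 vectors per page, as noted in the original model card).
q = torch.nn.functional.normalize(torch.randn(16, 128), dim=-1)
p = torch.nn.functional.normalize(torch.randn(729, 128), dim=-1)
print(maxsim_score(q, p))
```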
 
 
 
 
 
 
 
108
  **Training Data:**
109
+ The model was trained on a random subset of DOCFM, a large-scale comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF
110
  documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance)
111
+ reports. For each image in the dataset, pseudo-questions were generated using the Pixtral12B VLM.
 
112
  **Infrastructure:**
113
+ We train granite-vision-embedding-3.3-2b on IBM’s cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
 
114
  **Ethical Considerations and Limitations:**
115
+ The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. granite-vision-embedding-3.3-2b is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.
116
+ Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use granite-vision-embedding-3.3-2b with ethical intentions and in a responsible way.
 
117
  **Resources**
118
+ - 📄 Granite Vision technical report [here](https://arxiv.org/abs/2502.09927)
119
+ - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
120
+ - 🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
121
+ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
 
 
REPORT_Benchmarking the AI advantage in finance.pdf DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:4e6da951c55eef3fd52aa41543f3b4377ab26e2758c579aec2d11068a66b3d20
3
- size 1746880
 
 
 
 
granite_vision_embedding_config.py → colgranitevision_config.py RENAMED
@@ -1,8 +1,8 @@
1
  from transformers import LlavaNextConfig
2
 
3
 
4
- class GraniteVisionEmbConfig(LlavaNextConfig):
5
- model_type = "granitevisionemb"
6
 
7
  def __init__(self, **kwargs):
8
  self.base_model = kwargs.get("base_model", None)
@@ -11,5 +11,3 @@ class GraniteVisionEmbConfig(LlavaNextConfig):
11
  self.base_image_feature_location = kwargs.get("base_image_feature_location", "last")
12
  self.adapter_path = kwargs.get("adapter_path", None)
13
  super().__init__(**kwargs)
14
-
15
-
 
1
  from transformers import LlavaNextConfig
2
 
3
 
4
+ class ColGraniteVisionConfig(LlavaNextConfig):
5
+ model_type = "colgranitevision"
6
 
7
  def __init__(self, **kwargs):
8
  self.base_model = kwargs.get("base_model", None)
 
11
  self.base_image_feature_location = kwargs.get("base_image_feature_location", "last")
12
  self.adapter_path = kwargs.get("adapter_path", None)
13
  super().__init__(**kwargs)
 
 
config.json CHANGED
@@ -1,18 +1,18 @@
1
  {
2
- "_name_or_path": "ibm_granite/granite-vision-3.3-2b",
3
  "adapter_path": null,
4
- "auto_map": {
5
- "AutoModel": "modeling_granite_vision_embedding.GraniteVisionEmb",
6
- "AutoProcessor": "processing_granite_vision_embedding.GraniteVisionEmbProcessor",
7
- "AutoConfig": "granite_vision_embedding_config.GraniteVisionEmbConfig"
8
  },
9
  "architectures": [
10
- "GraniteVisionEmb"
11
  ],
12
- "base_image_feature_location": "last",
13
  "base_model": null,
14
  "emb_dim_doc": 128,
15
  "emb_dim_query": 128,
 
16
  "image_grid_pinpoints": [
17
  [
18
  384,
@@ -121,7 +121,7 @@
121
  ],
122
  "image_seq_length": 576,
123
  "image_token_index": 49155,
124
- "model_type": "granitevisionemb",
125
  "multimodal_projector_bias": true,
126
  "pretrained_language_model": "",
127
  "pretrained_vision_tower": "",
@@ -149,12 +149,12 @@
149
  "rms_norm_eps": 1e-05,
150
  "rope_theta": 300000,
151
  "tie_word_embeddings": true,
152
- "torch_dtype": "bfloat16",
153
  "vocab_size": 49156
154
  },
155
  "tie_word_embeddings": true,
156
  "torch_dtype": "float32",
157
- "transformers_version": "4.49.0",
158
  "use_image_newline_parameter": true,
159
  "vision_config": {
160
  "_attn_implementation_autoset": true,
@@ -167,7 +167,7 @@
167
  "num_attention_heads": 16,
168
  "num_hidden_layers": 27,
169
  "patch_size": 14,
170
- "torch_dtype": "bfloat16"
171
  },
172
  "vision_feature_layer": [
173
  -24,
 
1
  {
2
+ "_name_or_path": "ibm-granite/granite-vision-3.3-2b",
3
  "adapter_path": null,
4
+ "auto_map": {
5
+ "AutoModel": "modeling_colgranitevision.ColGraniteVision",
6
+ "AutoProcessor": "processing_colgranitevision.ColGraniteVisionProcessor",
7
+ "AutoConfig": "colgranitevision_config.ColGraniteVisionConfig"
8
  },
9
  "architectures": [
10
+ "ColGraniteVision"
11
  ],
 
12
  "base_model": null,
13
  "emb_dim_doc": 128,
14
  "emb_dim_query": 128,
15
+ "base_image_feature_location": "last",
16
  "image_grid_pinpoints": [
17
  [
18
  384,
 
121
  ],
122
  "image_seq_length": 576,
123
  "image_token_index": 49155,
124
+ "model_type": "colgranitevision",
125
  "multimodal_projector_bias": true,
126
  "pretrained_language_model": "",
127
  "pretrained_vision_tower": "",
 
149
  "rms_norm_eps": 1e-05,
150
  "rope_theta": 300000,
151
  "tie_word_embeddings": true,
152
+ "torch_dtype": "float32",
153
  "vocab_size": 49156
154
  },
155
  "tie_word_embeddings": true,
156
  "torch_dtype": "float32",
157
+ "transformers_version": "4.50.0.dev0",
158
  "use_image_newline_parameter": true,
159
  "vision_config": {
160
  "_attn_implementation_autoset": true,
 
167
  "num_attention_heads": 16,
168
  "num_hidden_layers": 27,
169
  "patch_size": 14,
170
+ "torch_dtype": "float32"
171
  },
172
  "vision_feature_layer": [
173
  -24,
modeling_granite_vision_embedding.py → modeling_colgranitevision.py RENAMED
@@ -7,10 +7,11 @@ from transformers import LlavaNextPreTrainedModel
7
  from transformers.models.llava_next.modeling_llava_next import LlavaNextForConditionalGeneration
8
  from transformers.models.llava_next.modeling_llava_next import unpad_image, get_anyres_image_grid_shape
9
 
10
- from .granite_vision_embedding_config import GraniteVisionEmbConfig
11
 
12
- class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
13
 
 
 
14
  def pack_image_features(
15
  self,
16
  image_features,
@@ -92,15 +93,15 @@ class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
92
  return image_features, feature_lens
93
 
94
 
95
- class GraniteVisionEmb(LlavaNextPreTrainedModel):
96
  """
97
- GraniteVisionEmb model implementation.
98
  """
99
 
100
  main_input_name: ClassVar[str] = "doc_input_ids" # transformers-related
101
- config_class = GraniteVisionEmbConfig
102
 
103
- def __init__(self, config: GraniteVisionEmbConfig):
104
  super().__init__(config=config)
105
 
106
  model = LlavaNextWithCustomPacking(config=config)
@@ -108,6 +109,8 @@ class GraniteVisionEmb(LlavaNextPreTrainedModel):
108
  self._tied_weights_keys = [f"model.language_model.{k}" for k in model.language_model._tied_weights_keys]
109
  self.model = model
110
 
 
 
111
  self.dim = 128
112
  self.custom_text_proj = nn.Linear(self.model.config.text_config.hidden_size, self.dim)
113
 
 
7
  from transformers.models.llava_next.modeling_llava_next import LlavaNextForConditionalGeneration
8
  from transformers.models.llava_next.modeling_llava_next import unpad_image, get_anyres_image_grid_shape
9
 
10
+ from .colgranitevision_config import ColGraniteVisionConfig
11
 
 
12
 
13
+ class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
14
+
15
  def pack_image_features(
16
  self,
17
  image_features,
 
93
  return image_features, feature_lens
94
 
95
 
96
+ class ColGraniteVision(LlavaNextPreTrainedModel):
97
  """
98
+ ColGraniteVision model implementation.
99
  """
100
 
101
  main_input_name: ClassVar[str] = "doc_input_ids" # transformers-related
102
+ config_class = ColGraniteVisionConfig
103
 
104
+ def __init__(self, config: ColGraniteVisionConfig):
105
  super().__init__(config=config)
106
 
107
  model = LlavaNextWithCustomPacking(config=config)
 
109
  self._tied_weights_keys = [f"model.language_model.{k}" for k in model.language_model._tied_weights_keys]
110
  self.model = model
111
 
112
+ # TODO: Wait for ColPali2 to create a ColPaliConfig to allow specifying the embedding dimension.
113
+ # We could do it now but it would break all the models trying to load the model from the checkpoint.
114
  self.dim = 128
115
  self.custom_text_proj = nn.Linear(self.model.config.text_config.hidden_size, self.dim)
116
 
preprocessor_config.json CHANGED
@@ -127,7 +127,7 @@
127
  0.5,
128
  0.5
129
  ],
130
- "processor_class": "GraniteVisionEmbProcessor",
131
  "resample": 3,
132
  "rescale_factor": 0.00392156862745098,
133
  "size": {
 
127
  0.5,
128
  0.5
129
  ],
130
+ "processor_class": "ColGraniteVisionProcessor",
131
  "resample": 3,
132
  "rescale_factor": 0.00392156862745098,
133
  "size": {
processing_granite_vision_embedding.py → processing_colgranitevision.py RENAMED
@@ -21,9 +21,9 @@ def floor_by_factor(number: float, factor: int) -> int:
21
  return math.floor(number / factor) * factor
22
 
23
 
24
- class GraniteVisionEmbProcessor(LlavaNextProcessor):
25
  """
26
- Processor for GraniteVisionEmb.
27
  """
28
 
29
  visual_prompt_prefix: ClassVar[str] = "<|user|>\n<image>\nDescribe the image.\n"
@@ -140,14 +140,14 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
140
  max_size=self.max_size,
141
  fill_color=0
142
  )
143
-
144
  def resize_and_pad_centered_to_long_side(
145
- self,
146
- image: Image.Image,
147
- factor: int,
148
- min_size: int,
149
- max_size: int,
150
- fill_color=0
151
  ) -> Image.Image:
152
  """
153
  Resizes and pads an image such that:
@@ -183,10 +183,10 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
183
 
184
  # Resize the image
185
  resized_image = image.resize((target_width, target_height), Image.LANCZOS)
186
- final_image = resized_image.convert("RGB")
187
 
188
  return final_image
189
-
190
  def resize_and_pad_centered(self,
191
  image: Image.Image,
192
  factor: int,
@@ -300,7 +300,7 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
300
  images: List[Image.Image],
301
  ) -> BatchFeature:
302
  """
303
- Process images.
304
  """
305
  # texts_doc = [self.apply_chat_template(self.format_data_wo_role(self.visual_prompt_prefix, img),tokenize=False ) for img in images]
306
  texts_doc = [self.visual_prompt_prefix for _ in images]
@@ -320,7 +320,10 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
320
 
321
  processed = []
322
  for q in queries:
323
- q = self.query_start + self.query_prefix + q + ' ' + q
 
 
 
324
  q += suffix + "\n"
325
  processed.append(q)
326
 
@@ -391,7 +394,7 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
391
  ) -> torch.Tensor:
392
  """
393
  Compute the late-interaction/MaxSim score (ColBERT-like) for the given multi-vector
394
- query embeddings (`qs`) and passage embeddings (`ps`). For us, a passage is the
395
  image of a document page.
396
 
397
  Because the embedding tensors are multi-vector and can thus have different shapes, they
@@ -436,4 +439,4 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
436
  assert scores.shape[0] == len(qs), f"Expected {len(qs)} scores, got {scores.shape[0]}"
437
 
438
  scores = scores.to(torch.float32)
439
- return scores
 
21
  return math.floor(number / factor) * factor
22
 
23
 
24
+ class ColGraniteVisionProcessor(LlavaNextProcessor):
25
  """
26
+ Processor for ColGraniteVision.
27
  """
28
 
29
  visual_prompt_prefix: ClassVar[str] = "<|user|>\n<image>\nDescribe the image.\n"
 
140
  max_size=self.max_size,
141
  fill_color=0
142
  )
143
+
144
  def resize_and_pad_centered_to_long_side(
145
+ self,
146
+ image: Image.Image,
147
+ factor: int,
148
+ min_size: int,
149
+ max_size: int,
150
+ fill_color=0
151
  ) -> Image.Image:
152
  """
153
  Resizes and pads an image such that:
 
183
 
184
  # Resize the image
185
  resized_image = image.resize((target_width, target_height), Image.LANCZOS)
186
+ final_image = resized_image.convert("RGB")
187
 
188
  return final_image
189
+
190
  def resize_and_pad_centered(self,
191
  image: Image.Image,
192
  factor: int,
 
300
  images: List[Image.Image],
301
  ) -> BatchFeature:
302
  """
303
+ Process images for ColGraniteVision.
304
  """
305
  # texts_doc = [self.apply_chat_template(self.format_data_wo_role(self.visual_prompt_prefix, img),tokenize=False ) for img in images]
306
  texts_doc = [self.visual_prompt_prefix for _ in images]
 
320
 
321
  processed = []
322
  for q in queries:
323
+ q = self.query_start + self.query_prefix + q
324
+ # truncate before it eats actual query content
325
+ if len(q) + len(suffix) > max_length:
326
+ q = q[: max_length - len(suffix) - 1]
327
  q += suffix + "\n"
328
  processed.append(q)
329
 
 
394
  ) -> torch.Tensor:
395
  """
396
  Compute the late-interaction/MaxSim score (ColBERT-like) for the given multi-vector
397
+ query embeddings (`qs`) and passage embeddings (`ps`). For ColGraniteVision, a passage is the
398
  image of a document page.
399
 
400
  Because the embedding tensors are multi-vector and can thus have different shapes, they
 
439
  assert scores.shape[0] == len(qs), f"Expected {len(qs)} scores, got {scores.shape[0]}"
440
 
441
  scores = scores.to(torch.float32)
442
+ return scores
processor_config.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
- "processor_class": "GraniteVisionEmbProcessor",
3
  "auto_map": {
4
- "AutoProcessor": "processing_granite_vision_embedding.GraniteVisionEmbProcessor"
5
  }
6
  }
 
1
  {
2
+ "processor_class": "ColGraniteVisionProcessor",
3
  "auto_map": {
4
+ "AutoProcessor": "processing_colgranitevision.ColGraniteVisionProcessor"
5
  }
6
  }