.gitattributes CHANGED
@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
- REPORT_Benchmarking[[:space:]]the[[:space:]]AI[[:space:]]advantage[[:space:]]in[[:space:]]finance.pdf filter=lfs diff=lfs merge=lfs -text
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
README.md CHANGED
@@ -3,70 +3,62 @@ license: apache-2.0
3
  language:
4
  - en
5
  base_model:
6
- - ibm-granite/granite-vision-3.3-2b
7
  library_name: transformers
8
  ---
9
- # granite-vision-3.3-2b-embedding
10
  **Model Summary:**
11
- Granite-vision-3.3-2b-embedding is an efficient embedding model based on [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b). This model is specifically designed for multimodal document retrieval, enabling queries on documents with tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.
12
- By removing the need for OCR-based text extraction, granite-vision-3.3-2b-embedding can help simplify and accelerate RAG pipelines.
 
13
 
14
  **Evaluations:**
15
- We evaluated granite-vision-3.3-2b-embedding alongside other top ColBERT-style multi-modal embedding models in the 1B-4B parameter range using two benchmarks: [Vidore2](https://github.com/illuin-tech/vidore-benchmark/) and [Real-MM-RAG-Bench](https://huggingface.co/collections/ibm-research/real-mm-rag-bench-67d2dc0ddf2dfafe66f09d34), which specifically target complex multimodal document retrieval tasks.
16
 
17
  ## **NDCG@5 - ViDoRe V2**
18
- | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
19
- |----------------------------------------|--------------|------------------|-------------|-------------------|---------------------------------|
20
- | ESG Restaurant Human | 51.1 | 68.4 | 65.8 | 62.4 | 65.3 |
21
- | Economics Macro Multilingual | 49.9 | 56.5 | 55.4 | 47.4 | 51.2 |
22
- | MIT Biomedical | 59.7 | 63.6 | 63.5 | 58.1 |61.5 |
23
- | ESG Restaurant Synthetic | 57.0 | 57.4 | 56.6 | 51.1 |56.6 |
24
- | ESG Restaurant Synthetic Multilingual | 55.7 | 57.4 | 57.2 | 47.6 |55.7 |
25
- | MIT Biomedical Multilingual | 56.5 | 61.1 | 62.5 | 50.5 | 55.5 |
26
- | Economics Macro | 51.6 | 59.8 | 60.2 | 60.9 |58.3 |
27
- | **Avg (ViDoRe2)** | **54.5** | **60.6** | **60.2** | **54.0** |**57.7** |
28
 
29
  ## **NDCG@5 - REAL-MM-RAG**
30
- | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
31
- |----------------------------------------|--------------|------------------|-------------|-------------------|---------------------------------|
32
- | FinReport | 55 | 66 | 78 | 65 |73
33
- | FinSlides | 68 | 79 | 81 | 55 |79
34
- | TechReport | 78 | 86 | 88 | 83 |87
35
- | TechSlides | 90 | 93 | 92 | 91 |93
36
- | **Avg (REAL-MM-RAG)** | **73** | **81** | **85** | **74** |**83**
37
-
38
- - **Release Date**: June 11th 2025
39
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
40
- - **Supported Input Format:** Currently, the model supports English instructions and images (PNG, JPEG) as input.
41
-
42
  **Intended Use:**
43
  The model is intended to be used in enterprise applications that involve retrieval of visual and text data. In particular, the model is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever, or alongside a text-based retriever.
44
-
45
  ### Usage
 
46
  ```shell
47
  pip install -q torch torchvision torchaudio
48
- pip install transformers==4.50
49
  ```
50
  Then run the code:
51
  ```python
52
- from io import BytesIO
53
-
54
- import requests
55
- import torch
56
- from PIL import Image
57
  from transformers import AutoProcessor, AutoModel
58
- from transformers.utils.import_utils import is_flash_attn_2_available
 
59
 
60
  device = "cuda" if torch.cuda.is_available() else "cpu"
61
- model_name = "ibm-granite/granite-vision-3.3-2b-embedding"
62
- model = AutoModel.from_pretrained(
63
- model_name,
64
- trust_remote_code=True,
65
- torch_dtype=torch.float16,
66
- device_map=device,
67
- attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None
68
- ).eval()
69
- processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
70
 
71
  # ─────────────────────────────────────────────
72
  # Inputs: Image + Text
@@ -106,35 +98,24 @@ similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)
106
  print("\n" + "=" * 50)
107
  print(f"📊 Similarity between image and text: {similarity.item():.4f}")
108
  print("=" * 50)
 
109
  ```
110
  ### Use granite-vision-embedding-3.3-2b for MM RAG
111
- For an example of MM-RAG using granite-vision-3.3-2b-embedding refer to [this notebook](https://github.com/ibm-granite/granite-vision-models/blob/main/cookbooks/GraniteVisionEmbedding_MM-RAG_Notebook.ipynb).
112
 
113
  **Model Architecture:**
114
- The architecture of granite-vision-3.3-2b-embedding follows the ColPali (https://arxiv.org/abs/2407.01449) approach and consists of the following components:
115
-
116
- (1) Vision-Language model: granite-vision-3.3-2b (https://huggingface.co/ibm-granite/granite-vision-3.3-2b).
117
-
118
- (2) Projection layer: a linear layer that projects the hidden dimension of the Vision-Language model down to 128, producing 729 embedding vectors per image.
119
-
120
- Scoring is computed using a MaxSim-based late-interaction mechanism.
121
-
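To make the MaxSim late interaction above concrete, here is a minimal PyTorch sketch (an illustration only, not the repository's implementation; the processor's `score` method is the supported entry point): for each query token vector, take its maximum similarity over the page's token vectors, then sum over query tokens.

```python
import torch

def maxsim_score(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim late interaction for one query/page pair.

    q_emb: (num_query_tokens, 128) multi-vector query embedding
    p_emb: (num_page_tokens, 128)  multi-vector page embedding
    """
    # Token-to-token similarity matrix: (num_query_tokens, num_page_tokens).
    # (Implementations typically L2-normalize the projected vectors first.)
    sim = q_emb @ p_emb.T
    # For each query token, keep its best-matching page token, then sum.
    return sim.max(dim=1).values.sum()

# Illustrative shapes: 128-dim projections, 729 vectors per page as described above.
query_emb = torch.randn(16, 128)
page_emb = torch.randn(729, 128)
print(maxsim_score(query_emb, page_emb))
```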
122
  **Training Data:**
123
- Our training data comes entirely from DocFM, a large-scale, comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF
124
  documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance)
125
- reports.
126
-
127
  **Infrastructure:**
128
- We train granite-vision-3.3-2b-embedding on IBM’s cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
129
-
130
  **Ethical Considerations and Limitations:**
131
- The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to bias and fairness, misinformation, and autonomous decision-making. Granite-vision-3.3-2b-embedding is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.
132
- Regarding ethics, a latent risk associated with all Large Language Models is their malicious use. We urge the community to use granite-vision-3.3-2b-embedding ethically and responsibly.
133
-
134
  **Resources**
135
- - 📄 Granite Vision technical report [here](https://arxiv.org/abs/2502.09927)
136
- - 📄 Real-MM-RAG-Bench paper (ACL 2025) [here](https://arxiv.org/abs/2502.12342)
137
- - 📄 Vidore 2 paper [here](https://www.arxiv.org/pdf/2505.17166)
138
- - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
139
- - 🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
140
- - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
 
3
  language:
4
  - en
5
  base_model:
6
+ - ibm-granite/granite-vision-3.3-2b-preview
7
  library_name: transformers
8
  ---
9
+ # granite-vision-embedding-3.3-2b
10
  **Model Summary:**
11
+ granite-vision-embedding-3.3-2b is an efficient embedding model based on the granite-vision Vision Language Model (VLM). It is specifically designed for multi-modal document retrieval, enabling queries on documents with tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.
12
+ The model eliminates the need for OCR-based text extraction and related preprocessing steps.
13
+
14
 
15
  **Evaluations:**
16
+ We evaluated granite-vision-embedding-3.3-2b alongside other top ColBERT-style multi-modal embedding models in the 1B-3B parameter range using two benchmarks: Vidore2 and [Real-MM-RAG-Bench](https://arxiv.org/abs/2502.12342), which specifically target complex multi-modal document retrieval tasks.
17
 
18
  ## **NDCG@5 - ViDoRe V2**
19
+ | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
20
+ |----------------------------------------|--------------|------------------|-------------|--------------------------|
21
+ | ESG Restaurant Human | 51.10 | 68.40 | 65.80 | 60.00 |
22
+ | Economics Macro Multilingual | 49.90 | 56.50 | 55.40 | 50.13 |
23
+ | MIT Biomedical | 59.70 | 63.60 | 63.50 | 60.00 |
24
+ | ESG Restaurant Synthetic | 57.00 | 57.40 | 56.60 | 54.00 |
25
+ | ESG Restaurant Synthetic Multilingual | 55.70 | 57.40 | 57.20 | 52.00 |
26
+ | MIT Biomedical Multilingual | 56.50 | 61.10 | 62.50 | 54.00 |
27
+ | Economics Macro | 51.60 | 59.80 | 60.20 | 57.00 |
28
+ | **Avg (ViDoRe2)** | **54.50** | **60.60** | **60.17** | **55.30** |
29
 
30
  ## **NDCG@5 - REAL-MM-RAG**
31
+ | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
32
+ |----------------------------------------|--------------|------------------|-------------|--------------------------|
33
+ | FinReport | 0.55 | 0.66 | 0.78 | 0.60 |
34
+ | FinSlides | 0.68 | 0.79 | 0.81 | 0.72 |
35
+ | TechReport | 0.78 | 0.86 | 0.88 | 0.80 |
36
+ | TechSlides | 0.90 | 0.93 | 0.92 | 0.92 |
37
+ | **Avg (REAL-MM-RAG)** | **0.73** | **0.81** | **0.85** | **0.76** |
38
+
39
+ - **Release Date**: June 2025
40
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
41
+ **Supported Input Format:**
42
+ Currently, the model supports English queries and images (PNG, JPEG, etc.) as input.
43
  **Intended Use:**
44
  The model is intended to be used in enterprise applications that involve retrieval of visual and text data. In particular, the model is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever, or alongside a text-based retriever.
 
45
  ### Usage
46
+ First, make sure to install the latest version of transformers:
47
  ```shell
48
  pip install -q torch torchvision torchaudio
49
+ pip install "transformers>=4.49"
50
  ```
51
  Then run the code:
52
  ```python
 
 
 
 
 
53
  from transformers import AutoProcessor, AutoModel
54
+ from PIL import Image
55
+ import torch
56
 
57
  device = "cuda" if torch.cuda.is_available() else "cpu"
58
+ model_name = "ibm-granite/granite-vision-embedding-3.3-2b"
59
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
60
+ processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
61
+
 
 
 
 
 
62
 
63
  # ─────────────────────────────────────────────
64
  # Inputs: Image + Text
 
98
  print("\n" + "=" * 50)
99
  print(f"📊 Similarity between image and text: {similarity.item():.4f}")
100
  print("=" * 50)
101
+
102
  ```
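The diff hunks above omit the middle of this example (building the image and text inputs and producing the embeddings). The following is only a hedged sketch of that part: the helper names `process_images` and `process_queries`, the direct `model(**inputs)` call, and the file path and query string are assumptions for illustration; only the final `processor.score(...)` call is taken from the visible code.

```python
# Hypothetical continuation of the snippet above (model, processor, device already defined).
# NOTE: process_images / process_queries are assumed helper names, not a verified API.
image = Image.open("page.png")               # illustrative path to a document page image
query = "What does the revenue chart show?"  # illustrative English query

img_inputs = processor.process_images([image]).to(device)   # assumed helper
txt_inputs = processor.process_queries([query]).to(device)  # assumed helper

with torch.no_grad():
    img_emb = model(**img_inputs)  # multi-vector page embedding
    txt_emb = model(**txt_inputs)  # multi-vector query embedding

# Final scoring call, as shown in the visible part of the example:
similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)
print(f"Similarity: {similarity.item():.4f}")
```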
103
  ### Use granite-vision-embedding-3.3-2b for MM RAG
104
+ For an example of MM RAG using col-granite-vision, refer to [this notebook](......).
105
 
106
  **Model Architecture:**
107
+ We built our model upon [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b) with an additional projection layer.
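The projection layer itself is small: the modeling code in this repo adds a single linear layer (`custom_text_proj`) that maps the language model's hidden size to 128-dimensional output vectors. Below is a minimal sketch of that idea, using the hidden size 2048 from config.json and the 729 vectors per image mentioned earlier in this card; shapes are illustrative only.

```python
import torch
import torch.nn as nn

hidden_size, emb_dim = 2048, 128          # values taken from config.json in this repo
proj = nn.Linear(hidden_size, emb_dim)    # mirrors custom_text_proj in the modeling code

# Illustrative last hidden states for one page: 729 visual tokens of width 2048.
last_hidden_states = torch.randn(1, 729, hidden_size)
page_multi_vectors = proj(last_hidden_states)  # -> (1, 729, 128) ColBERT-style embedding
print(page_multi_vectors.shape)
```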
 
 
 
 
 
 
 
108
  **Training Data:**
109
+ The model was trained on a random subset of DOCFM, a large-scale, comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF
110
  documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance)
111
+ reports. For each image in the dataset, pseudo-questions were generated using the Pixtral 12B VLM.
 
112
  **Infrastructure:**
113
+ We train granite-vision-embedding-3.3-2b on IBM’s cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
 
114
  **Ethical Considerations and Limitations:**
115
+ The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to bias and fairness, misinformation, and autonomous decision-making. col-granite-vision-1.0-2b is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.
116
+ Regarding ethics, a latent risk associated with all Large Language Models is their malicious use. We urge the community to use col-granite-vision-1.0-2b ethically and responsibly.
 
117
  **Resources**
118
+ - :page_facing_up: Granite Vision technical report [here](https://arxiv.org/abs/2502.09927)
119
+ - :star:️ Learn about the latest updates with Granite: https://www.ibm.com/granite
120
+ - :rocket: Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
121
+ - :bulb: Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
 
 
REPORT_Benchmarking the AI advantage in finance.pdf DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:4e6da951c55eef3fd52aa41543f3b4377ab26e2758c579aec2d11068a66b3d20
3
- size 1746880
 
 
 
 
added_tokens.json CHANGED
@@ -1,6 +1,6 @@
1
- {
2
- "<image>": 49155,
3
- "<|end_of_role|>": 49153,
4
- "<|start_of_role|>": 49152,
5
- "<|tool_call|>": 49154
6
- }
 
1
+ {
2
+ "<image>": 49155,
3
+ "<|end_of_role|>": 49153,
4
+ "<|start_of_role|>": 49152,
5
+ "<|tool_call|>": 49154
6
+ }
granite_vision_embedding_config.py → colgranitevision_config.py RENAMED
@@ -1,15 +1,12 @@
1
  from transformers import LlavaNextConfig
2
 
3
 
4
- class GraniteVisionEmbConfig(LlavaNextConfig):
5
- model_type = "granitevisionemb"
6
 
7
  def __init__(self, **kwargs):
8
  self.base_model = kwargs.get("base_model", None)
9
  self.emb_dim_query = kwargs.get("emb_dim_query", 128)
10
  self.emb_dim_doc = kwargs.get("emb_dim_doc", 128)
11
- self.base_image_feature_location = kwargs.get("base_image_feature_location", "last")
12
  self.adapter_path = kwargs.get("adapter_path", None)
13
  super().__init__(**kwargs)
14
-
15
-
 
1
  from transformers import LlavaNextConfig
2
 
3
 
4
+ class ColGraniteVisionConfig(LlavaNextConfig):
5
+ model_type = "colgranitevision"
6
 
7
  def __init__(self, **kwargs):
8
  self.base_model = kwargs.get("base_model", None)
9
  self.emb_dim_query = kwargs.get("emb_dim_query", 128)
10
  self.emb_dim_doc = kwargs.get("emb_dim_doc", 128)
 
11
  self.adapter_path = kwargs.get("adapter_path", None)
12
  super().__init__(**kwargs)
 
 
config.json CHANGED
@@ -1,15 +1,14 @@
1
  {
2
- "_name_or_path": "ibm_granite/granite-vision-3.3-2b",
3
- "adapter_path": null,
4
- "auto_map": {
5
- "AutoModel": "modeling_granite_vision_embedding.GraniteVisionEmb",
6
- "AutoProcessor": "processing_granite_vision_embedding.GraniteVisionEmbProcessor",
7
- "AutoConfig": "granite_vision_embedding_config.GraniteVisionEmbConfig"
8
- },
9
  "architectures": [
10
- "GraniteVisionEmb"
11
  ],
12
- "base_image_feature_location": "last",
13
  "base_model": null,
14
  "emb_dim_doc": 128,
15
  "emb_dim_query": 128,
@@ -121,32 +120,28 @@
121
  ],
122
  "image_seq_length": 576,
123
  "image_token_index": 49155,
124
- "model_type": "granitevisionemb",
125
  "multimodal_projector_bias": true,
126
- "pretrained_language_model": "",
127
- "pretrained_vision_tower": "",
128
  "projector_hidden_act": "gelu",
129
  "text_config": {
130
- "_attn_implementation_autoset": true,
131
- "_name_or_path": "ibm-granite/granite-3.1-2b-instruct",
132
  "architectures": [
133
  "GraniteForCausalLM"
134
  ],
135
  "attention_dropout": 0.1,
136
  "attention_multiplier": 0.015625,
137
  "bos_token_id": 0,
138
- "embedding_multiplier": 12.0,
139
  "eos_token_id": 0,
140
  "hidden_size": 2048,
141
  "intermediate_size": 8192,
142
- "logits_scaling": 8.0,
143
- "max_position_embeddings": 131072,
144
  "model_type": "granite",
145
  "num_hidden_layers": 40,
146
  "num_key_value_heads": 8,
147
  "pad_token_id": 0,
148
  "residual_multiplier": 0.22,
149
- "rms_norm_eps": 1e-05,
150
  "rope_theta": 300000,
151
  "tie_word_embeddings": true,
152
  "torch_dtype": "bfloat16",
@@ -154,20 +149,18 @@
154
  },
155
  "tie_word_embeddings": true,
156
  "torch_dtype": "float32",
157
- "transformers_version": "4.49.0",
158
  "use_image_newline_parameter": true,
159
  "vision_config": {
160
- "_attn_implementation_autoset": true,
161
  "hidden_act": "gelu_pytorch_tanh",
162
  "hidden_size": 1152,
163
  "image_size": 384,
164
  "intermediate_size": 4304,
165
- "layer_norm_eps": 1e-06,
166
  "model_type": "siglip_vision_model",
167
  "num_attention_heads": 16,
168
  "num_hidden_layers": 27,
169
- "patch_size": 14,
170
- "torch_dtype": "bfloat16"
171
  },
172
  "vision_feature_layer": [
173
  -24,
@@ -176,4 +169,4 @@
176
  -1
177
  ],
178
  "vision_feature_select_strategy": "full"
179
- }
 
1
  {
2
+ "_name_or_path": "ibm-granite/granite-vision-3.1-2b-preview",
3
+ "_class_name": "ColGraniteVisionConfig",
4
+ "auto_map": {
5
+ "AutoModel": "modeling_colgranitevision.ColGraniteVision",
6
+ "AutoProcessor": "processing_colgranitevision.ColGraniteVisionProcessor",
7
+ "AutoConfig": "colgranitevision_config.ColGraniteVisionConfig"
8
+ },
9
  "architectures": [
10
+ "ColGraniteVision"
11
  ],
 
12
  "base_model": null,
13
  "emb_dim_doc": 128,
14
  "emb_dim_query": 128,
 
120
  ],
121
  "image_seq_length": 576,
122
  "image_token_index": 49155,
123
+ "model_type": "colgranitevision",
124
  "multimodal_projector_bias": true,
 
 
125
  "projector_hidden_act": "gelu",
126
  "text_config": {
 
 
127
  "architectures": [
128
  "GraniteForCausalLM"
129
  ],
130
  "attention_dropout": 0.1,
131
  "attention_multiplier": 0.015625,
132
  "bos_token_id": 0,
133
+ "embedding_multiplier": 12,
134
  "eos_token_id": 0,
135
  "hidden_size": 2048,
136
  "intermediate_size": 8192,
137
+ "logits_scaling": 8,
138
+ "max_position_embeddings": 16384,
139
  "model_type": "granite",
140
  "num_hidden_layers": 40,
141
  "num_key_value_heads": 8,
142
  "pad_token_id": 0,
143
  "residual_multiplier": 0.22,
144
+ "rms_norm_eps": 0.00001,
145
  "rope_theta": 300000,
146
  "tie_word_embeddings": true,
147
  "torch_dtype": "bfloat16",
 
149
  },
150
  "tie_word_embeddings": true,
151
  "torch_dtype": "float32",
152
+ "transformers_version": "4.50.0.dev0",
153
  "use_image_newline_parameter": true,
154
  "vision_config": {
 
155
  "hidden_act": "gelu_pytorch_tanh",
156
  "hidden_size": 1152,
157
  "image_size": 384,
158
  "intermediate_size": 4304,
159
+ "layer_norm_eps": 0.000001,
160
  "model_type": "siglip_vision_model",
161
  "num_attention_heads": 16,
162
  "num_hidden_layers": 27,
163
+ "patch_size": 14
 
164
  },
165
  "vision_feature_layer": [
166
  -24,
 
169
  -1
170
  ],
171
  "vision_feature_select_strategy": "full"
172
+ }
model-00001-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:4e838b6d98f48fbf45ae6c0d9c74cba649fd06b27ed78ced3971efbab7e16a69
3
  size 4955415688
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ec8a694db663b30616ff06812d60256bb474c52051df2003faaec47c42b9a556
3
  size 4955415688
model-00002-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d6bf1675fc15977b4d8f37ea1d4960ca2750e6793a80da9771e4693ae8cb13d6
3
  size 4999979448
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9df92d92a0d79465e4ee5eb57a51ee1630b159dc5833e26af9ca7bc9b3788d24
3
  size 4999979448
model-00003-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:15978cba0606360676faad5c3cf486a58e6d78a1352dbfcd1db51a7410a574d5
3
  size 1947355456
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:42ebe0fe87507de69b86074af24513756fbcc205e83ccb2ee7bbe9238a751f29
3
  size 1947355456
modeling_granite_vision_embedding.py → modeling_colgranitevision.py RENAMED
@@ -7,16 +7,17 @@ from transformers import LlavaNextPreTrainedModel
7
  from transformers.models.llava_next.modeling_llava_next import LlavaNextForConditionalGeneration
8
  from transformers.models.llava_next.modeling_llava_next import unpad_image, get_anyres_image_grid_shape
9
 
10
- from .granite_vision_embedding_config import GraniteVisionEmbConfig
11
 
12
- class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
13
 
 
14
  def pack_image_features(
15
  self,
16
  image_features,
17
  image_sizes,
18
  vision_feature_select_strategy,
19
- image_newline=None
 
20
  ):
21
  """
22
  Reshape, unpad and then pack each image_feature into a single image_features tensor containing all visual vectors.
@@ -36,7 +37,6 @@ class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
36
  token length of each image in image_features
37
  """
38
 
39
- base_image_feature_location = self.config.base_image_feature_location
40
  new_image_features = []
41
  feature_lens = []
42
  for image_idx, image_feature in enumerate(image_features):
@@ -92,15 +92,15 @@ class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
92
  return image_features, feature_lens
93
 
94
 
95
- class GraniteVisionEmb(LlavaNextPreTrainedModel):
96
  """
97
- GraniteVisionEmb model implementation.
98
  """
99
 
100
  main_input_name: ClassVar[str] = "doc_input_ids" # transformers-related
101
- config_class = GraniteVisionEmbConfig
102
 
103
- def __init__(self, config: GraniteVisionEmbConfig):
104
  super().__init__(config=config)
105
 
106
  model = LlavaNextWithCustomPacking(config=config)
@@ -108,6 +108,8 @@ class GraniteVisionEmb(LlavaNextPreTrainedModel):
108
  self._tied_weights_keys = [f"model.language_model.{k}" for k in model.language_model._tied_weights_keys]
109
  self.model = model
110
 
 
 
111
  self.dim = 128
112
  self.custom_text_proj = nn.Linear(self.model.config.text_config.hidden_size, self.dim)
113
 
 
7
  from transformers.models.llava_next.modeling_llava_next import LlavaNextForConditionalGeneration
8
  from transformers.models.llava_next.modeling_llava_next import unpad_image, get_anyres_image_grid_shape
9
 
10
+ from .colgranitevision_config import ColGraniteVisionConfig
11
 
 
12
 
13
+ class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
14
  def pack_image_features(
15
  self,
16
  image_features,
17
  image_sizes,
18
  vision_feature_select_strategy,
19
+ image_newline=None,
20
+ base_image_feature_location="last",
21
  ):
22
  """
23
  Reshape, unpad and then pack each image_feature into a single image_features tensor containing all visual vectors.
 
37
  token length of each image in image_features
38
  """
39
 
 
40
  new_image_features = []
41
  feature_lens = []
42
  for image_idx, image_feature in enumerate(image_features):
 
92
  return image_features, feature_lens
93
 
94
 
95
+ class ColGraniteVision(LlavaNextPreTrainedModel):
96
  """
97
+ ColGraniteVision model implementation.
98
  """
99
 
100
  main_input_name: ClassVar[str] = "doc_input_ids" # transformers-related
101
+ config_class = ColGraniteVisionConfig
102
 
103
+ def __init__(self, config: ColGraniteVisionConfig):
104
  super().__init__(config=config)
105
 
106
  model = LlavaNextWithCustomPacking(config=config)
 
108
  self._tied_weights_keys = [f"model.language_model.{k}" for k in model.language_model._tied_weights_keys]
109
  self.model = model
110
 
111
+ # TODO: Wait for ColPali2 to create a ColPaliConfig to allow specifying the embedding dimension.
112
+ # We could do it now but it would break all the models trying to load the model from the checkpoint.
113
  self.dim = 128
114
  self.custom_text_proj = nn.Linear(self.model.config.text_config.hidden_size, self.dim)
115
 
preprocessor_config.json CHANGED
@@ -1,137 +1,136 @@
1
- {
2
- "crop_size": {
3
- "height": 384,
4
- "width": 384
5
- },
6
- "default_to_square": false,
7
- "do_center_crop": true,
8
- "do_convert_rgb": null,
9
- "do_normalize": true,
10
- "do_pad": true,
11
- "do_rescale": true,
12
- "do_resize": true,
13
- "image_grid_pinpoints": [
14
- [
15
- 384,
16
- 768
17
- ],
18
- [
19
- 384,
20
- 1152
21
- ],
22
- [
23
- 384,
24
- 1536
25
- ],
26
- [
27
- 384,
28
- 1920
29
- ],
30
- [
31
- 384,
32
- 2304
33
- ],
34
- [
35
- 384,
36
- 2688
37
- ],
38
- [
39
- 384,
40
- 3072
41
- ],
42
- [
43
- 384,
44
- 3456
45
- ],
46
- [
47
- 384,
48
- 3840
49
- ],
50
- [
51
- 768,
52
- 384
53
- ],
54
- [
55
- 768,
56
- 768
57
- ],
58
- [
59
- 768,
60
- 1152
61
- ],
62
- [
63
- 768,
64
- 1536
65
- ],
66
- [
67
- 768,
68
- 1920
69
- ],
70
- [
71
- 1152,
72
- 384
73
- ],
74
- [
75
- 1152,
76
- 768
77
- ],
78
- [
79
- 1152,
80
- 1152
81
- ],
82
- [
83
- 1536,
84
- 384
85
- ],
86
- [
87
- 1536,
88
- 768
89
- ],
90
- [
91
- 1920,
92
- 384
93
- ],
94
- [
95
- 1920,
96
- 768
97
- ],
98
- [
99
- 2304,
100
- 384
101
- ],
102
- [
103
- 2688,
104
- 384
105
- ],
106
- [
107
- 3072,
108
- 384
109
- ],
110
- [
111
- 3456,
112
- 384
113
- ],
114
- [
115
- 3840,
116
- 384
117
- ]
118
- ],
119
- "image_mean": [
120
- 0.5,
121
- 0.5,
122
- 0.5
123
- ],
124
- "image_processor_type": "LlavaNextImageProcessor",
125
- "image_std": [
126
- 0.5,
127
- 0.5,
128
- 0.5
129
- ],
130
- "processor_class": "GraniteVisionEmbProcessor",
131
- "resample": 3,
132
- "rescale_factor": 0.00392156862745098,
133
- "size": {
134
- "height": 384,
135
- "width": 384
136
- }
137
- }
 
1
+ {
2
+ "crop_size": {
3
+ "height": 384,
4
+ "width": 384
5
+ },
6
+ "do_center_crop": true,
7
+ "do_convert_rgb": null,
8
+ "do_normalize": true,
9
+ "do_pad": true,
10
+ "do_rescale": true,
11
+ "do_resize": true,
12
+ "image_grid_pinpoints": [
13
+ [
14
+ 384,
15
+ 768
16
+ ],
17
+ [
18
+ 384,
19
+ 1152
20
+ ],
21
+ [
22
+ 384,
23
+ 1536
24
+ ],
25
+ [
26
+ 384,
27
+ 1920
28
+ ],
29
+ [
30
+ 384,
31
+ 2304
32
+ ],
33
+ [
34
+ 384,
35
+ 2688
36
+ ],
37
+ [
38
+ 384,
39
+ 3072
40
+ ],
41
+ [
42
+ 384,
43
+ 3456
44
+ ],
45
+ [
46
+ 384,
47
+ 3840
48
+ ],
49
+ [
50
+ 768,
51
+ 384
52
+ ],
53
+ [
54
+ 768,
55
+ 768
56
+ ],
57
+ [
58
+ 768,
59
+ 1152
60
+ ],
61
+ [
62
+ 768,
63
+ 1536
64
+ ],
65
+ [
66
+ 768,
67
+ 1920
68
+ ],
69
+ [
70
+ 1152,
71
+ 384
72
+ ],
73
+ [
74
+ 1152,
75
+ 768
76
+ ],
77
+ [
78
+ 1152,
79
+ 1152
80
+ ],
81
+ [
82
+ 1536,
83
+ 384
84
+ ],
85
+ [
86
+ 1536,
87
+ 768
88
+ ],
89
+ [
90
+ 1920,
91
+ 384
92
+ ],
93
+ [
94
+ 1920,
95
+ 768
96
+ ],
97
+ [
98
+ 2304,
99
+ 384
100
+ ],
101
+ [
102
+ 2688,
103
+ 384
104
+ ],
105
+ [
106
+ 3072,
107
+ 384
108
+ ],
109
+ [
110
+ 3456,
111
+ 384
112
+ ],
113
+ [
114
+ 3840,
115
+ 384
116
+ ]
117
+ ],
118
+ "image_mean": [
119
+ 0.5,
120
+ 0.5,
121
+ 0.5
122
+ ],
123
+ "image_processor_type": "LlavaNextImageProcessor",
124
+ "image_std": [
125
+ 0.5,
126
+ 0.5,
127
+ 0.5
128
+ ],
129
+ "processor_class": "ColGraniteVisionProcessor",
130
+ "resample": 3,
131
+ "rescale_factor": 0.00392156862745098,
132
+ "size": {
133
+ "height": 384,
134
+ "width": 384
135
+ }
136
+ }
 
processing_granite_vision_embedding.py → processing_colgranitevision.py RENAMED
@@ -21,9 +21,9 @@ def floor_by_factor(number: float, factor: int) -> int:
21
  return math.floor(number / factor) * factor
22
 
23
 
24
- class GraniteVisionEmbProcessor(LlavaNextProcessor):
25
  """
26
- Processor for GraniteVisionEmb.
27
  """
28
 
29
  visual_prompt_prefix: ClassVar[str] = "<|user|>\n<image>\nDescribe the image.\n"
@@ -133,7 +133,7 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
133
  """
134
  Resize and pad the image to the required format.
135
  """
136
- return self.resize_and_pad_centered_to_long_side(
137
  image=image,
138
  factor=self.factor,
139
  min_size=self.min_size,
@@ -141,52 +141,6 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
141
  fill_color=0
142
  )
143
 
144
- def resize_and_pad_centered_to_long_side(
145
- self,
146
- image: Image.Image,
147
- factor: int,
148
- min_size: int,
149
- max_size: int,
150
- fill_color=0
151
- ) -> Image.Image:
152
- """
153
- Resizes and pads an image such that:
154
- - The long side is set to `max_size`.
155
- - The short side is scaled proportionally but not below `min_size`.
156
- - The image is centered within the final padded area.
157
-
158
- :param image: PIL Image
159
- :param factor: Factor to make dimensions divisible by
160
- :param min_size: Minimum allowed size for the short side
161
- :param max_size: Target size for the long side
162
- :param fill_color: Background padding color (default black)
163
- :return: Resized and padded image
164
- """
165
-
166
- # Get original size
167
- width, height = image.size
168
-
169
- if min_size == -1 or max_size == -1:
170
- return image.convert("RGB")
171
-
172
- # Step 1: scale long side to max_size, keep aspect ratio
173
- if width > height:
174
- scale_factor = max_size / width
175
- target_width = max_size
176
- max_scale_factor = max(min_size / height, scale_factor)
177
- target_height = round(height * max_scale_factor)
178
- else:
179
- scale_factor = max_size / height
180
- target_height = max_size
181
- max_scale_factor = max(min_size / width, scale_factor)
182
- target_width = round(width * max_scale_factor)
183
-
184
- # Resize the image
185
- resized_image = image.resize((target_width, target_height), Image.LANCZOS)
186
- final_image = resized_image.convert("RGB")
187
-
188
- return final_image
189
-
190
  def resize_and_pad_centered(self,
191
  image: Image.Image,
192
  factor: int,
@@ -300,7 +254,7 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
300
  images: List[Image.Image],
301
  ) -> BatchFeature:
302
  """
303
- Process images.
304
  """
305
  # texts_doc = [self.apply_chat_template(self.format_data_wo_role(self.visual_prompt_prefix, img),tokenize=False ) for img in images]
306
  texts_doc = [self.visual_prompt_prefix for _ in images]
@@ -320,7 +274,10 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
320
 
321
  processed = []
322
  for q in queries:
323
- q = self.query_start + self.query_prefix + q + ' ' + q
 
 
 
324
  q += suffix + "\n"
325
  processed.append(q)
326
 
@@ -391,7 +348,7 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
391
  ) -> torch.Tensor:
392
  """
393
  Compute the late-interaction/MaxSim score (ColBERT-like) for the given multi-vector
394
- query embeddings (`qs`) and passage embeddings (`ps`). For us, a passage is the
395
  image of a document page.
396
 
397
  Because the embedding tensors are multi-vector and can thus have different shapes, they
@@ -436,4 +393,4 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
436
  assert scores.shape[0] == len(qs), f"Expected {len(qs)} scores, got {scores.shape[0]}"
437
 
438
  scores = scores.to(torch.float32)
439
- return scores
 
21
  return math.floor(number / factor) * factor
22
 
23
 
24
+ class ColGraniteVisionProcessor(LlavaNextProcessor):
25
  """
26
+ Processor for ColPali.
27
  """
28
 
29
  visual_prompt_prefix: ClassVar[str] = "<|user|>\n<image>\nDescribe the image.\n"
 
133
  """
134
  Resize and pad the image to the required format.
135
  """
136
+ return self.resize_and_pad_centered(
137
  image=image,
138
  factor=self.factor,
139
  min_size=self.min_size,
 
141
  fill_color=0
142
  )
143
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
144
  def resize_and_pad_centered(self,
145
  image: Image.Image,
146
  factor: int,
 
254
  images: List[Image.Image],
255
  ) -> BatchFeature:
256
  """
257
+ Process images for ColPali.
258
  """
259
  # texts_doc = [self.apply_chat_template(self.format_data_wo_role(self.visual_prompt_prefix, img),tokenize=False ) for img in images]
260
  texts_doc = [self.visual_prompt_prefix for _ in images]
 
274
 
275
  processed = []
276
  for q in queries:
277
+ q = self.query_start + self.query_prefix + q
278
+ # truncate before it eats actual query content
279
+ if len(q) + len(suffix) > max_length:
280
+ q = q[: max_length - len(suffix) - 1]
281
  q += suffix + "\n"
282
  processed.append(q)
283
 
 
348
  ) -> torch.Tensor:
349
  """
350
  Compute the late-interaction/MaxSim score (ColBERT-like) for the given multi-vector
351
+ query embeddings (`qs`) and passage embeddings (`ps`). For ColPali, a passage is the
352
  image of a document page.
353
 
354
  Because the embedding tensors are multi-vector and can thus have different shapes, they
 
393
  assert scores.shape[0] == len(qs), f"Expected {len(qs)} scores, got {scores.shape[0]}"
394
 
395
  scores = scores.to(torch.float32)
396
+ return scores
processor_config.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
- "processor_class": "GraniteVisionEmbProcessor",
3
  "auto_map": {
4
- "AutoProcessor": "processing_granite_vision_embedding.GraniteVisionEmbProcessor"
5
  }
6
  }
 
1
  {
2
+ "processor_class": "ColGraniteVisionProcessor",
3
  "auto_map": {
4
+ "AutoProcessor": "processing_colgranitevision.ColGraniteVisionProcessor"
5
  }
6
  }
special_tokens_map.json CHANGED
@@ -1,35 +1,35 @@
1
- {
2
- "additional_special_tokens": [
3
- "<|start_of_role|>",
4
- "<|end_of_role|>",
5
- "<|tool_call|>"
6
- ],
7
- "bos_token": {
8
- "content": "<|end_of_text|>",
9
- "lstrip": false,
10
- "normalized": false,
11
- "rstrip": false,
12
- "single_word": false
13
- },
14
- "eos_token": {
15
- "content": "<|end_of_text|>",
16
- "lstrip": false,
17
- "normalized": false,
18
- "rstrip": false,
19
- "single_word": false
20
- },
21
- "pad_token": {
22
- "content": "<|end_of_text|>",
23
- "lstrip": false,
24
- "normalized": false,
25
- "rstrip": false,
26
- "single_word": false
27
- },
28
- "unk_token": {
29
- "content": "<|end_of_text|>",
30
- "lstrip": false,
31
- "normalized": false,
32
- "rstrip": false,
33
- "single_word": false
34
- }
35
- }
 
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|start_of_role|>",
4
+ "<|end_of_role|>",
5
+ "<|tool_call|>"
6
+ ],
7
+ "bos_token": {
8
+ "content": "<|end_of_text|>",
9
+ "lstrip": false,
10
+ "normalized": false,
11
+ "rstrip": false,
12
+ "single_word": false
13
+ },
14
+ "eos_token": {
15
+ "content": "<|end_of_text|>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false
20
+ },
21
+ "pad_token": {
22
+ "content": "<|end_of_text|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false
27
+ },
28
+ "unk_token": {
29
+ "content": "<|end_of_text|>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false
34
+ }
35
+ }
tokenizer_config.json CHANGED
@@ -1,208 +1,207 @@
1
- {
2
- "add_bos_token": false,
3
- "add_prefix_space": false,
4
- "added_tokens_decoder": {
5
- "0": {
6
- "content": "<|end_of_text|>",
7
- "lstrip": false,
8
- "normalized": false,
9
- "rstrip": false,
10
- "single_word": false,
11
- "special": true
12
- },
13
- "1": {
14
- "content": "<fim_prefix>",
15
- "lstrip": false,
16
- "normalized": false,
17
- "rstrip": false,
18
- "single_word": false,
19
- "special": true
20
- },
21
- "2": {
22
- "content": "<fim_middle>",
23
- "lstrip": false,
24
- "normalized": false,
25
- "rstrip": false,
26
- "single_word": false,
27
- "special": true
28
- },
29
- "3": {
30
- "content": "<fim_suffix>",
31
- "lstrip": false,
32
- "normalized": false,
33
- "rstrip": false,
34
- "single_word": false,
35
- "special": true
36
- },
37
- "4": {
38
- "content": "<fim_pad>",
39
- "lstrip": false,
40
- "normalized": false,
41
- "rstrip": false,
42
- "single_word": false,
43
- "special": true
44
- },
45
- "5": {
46
- "content": "<filename>",
47
- "lstrip": false,
48
- "normalized": false,
49
- "rstrip": false,
50
- "single_word": false,
51
- "special": true
52
- },
53
- "6": {
54
- "content": "<gh_stars>",
55
- "lstrip": false,
56
- "normalized": false,
57
- "rstrip": false,
58
- "single_word": false,
59
- "special": true
60
- },
61
- "7": {
62
- "content": "<issue_start>",
63
- "lstrip": false,
64
- "normalized": false,
65
- "rstrip": false,
66
- "single_word": false,
67
- "special": true
68
- },
69
- "8": {
70
- "content": "<issue_comment>",
71
- "lstrip": false,
72
- "normalized": false,
73
- "rstrip": false,
74
- "single_word": false,
75
- "special": true
76
- },
77
- "9": {
78
- "content": "<issue_closed>",
79
- "lstrip": false,
80
- "normalized": false,
81
- "rstrip": false,
82
- "single_word": false,
83
- "special": true
84
- },
85
- "10": {
86
- "content": "<jupyter_start>",
87
- "lstrip": false,
88
- "normalized": false,
89
- "rstrip": false,
90
- "single_word": false,
91
- "special": true
92
- },
93
- "11": {
94
- "content": "<jupyter_text>",
95
- "lstrip": false,
96
- "normalized": false,
97
- "rstrip": false,
98
- "single_word": false,
99
- "special": true
100
- },
101
- "12": {
102
- "content": "<jupyter_code>",
103
- "lstrip": false,
104
- "normalized": false,
105
- "rstrip": false,
106
- "single_word": false,
107
- "special": true
108
- },
109
- "13": {
110
- "content": "<jupyter_output>",
111
- "lstrip": false,
112
- "normalized": false,
113
- "rstrip": false,
114
- "single_word": false,
115
- "special": true
116
- },
117
- "14": {
118
- "content": "<empty_output>",
119
- "lstrip": false,
120
- "normalized": false,
121
- "rstrip": false,
122
- "single_word": false,
123
- "special": true
124
- },
125
- "15": {
126
- "content": "<commit_before>",
127
- "lstrip": false,
128
- "normalized": false,
129
- "rstrip": false,
130
- "single_word": false,
131
- "special": true
132
- },
133
- "16": {
134
- "content": "<commit_msg>",
135
- "lstrip": false,
136
- "normalized": false,
137
- "rstrip": false,
138
- "single_word": false,
139
- "special": true
140
- },
141
- "17": {
142
- "content": "<commit_after>",
143
- "lstrip": false,
144
- "normalized": false,
145
- "rstrip": false,
146
- "single_word": false,
147
- "special": true
148
- },
149
- "18": {
150
- "content": "<reponame>",
151
- "lstrip": false,
152
- "normalized": false,
153
- "rstrip": false,
154
- "single_word": false,
155
- "special": true
156
- },
157
- "49152": {
158
- "content": "<|start_of_role|>",
159
- "lstrip": false,
160
- "normalized": false,
161
- "rstrip": false,
162
- "single_word": false,
163
- "special": true
164
- },
165
- "49153": {
166
- "content": "<|end_of_role|>",
167
- "lstrip": false,
168
- "normalized": false,
169
- "rstrip": false,
170
- "single_word": false,
171
- "special": true
172
- },
173
- "49154": {
174
- "content": "<|tool_call|>",
175
- "lstrip": false,
176
- "normalized": false,
177
- "rstrip": false,
178
- "single_word": false,
179
- "special": true
180
- },
181
- "49155": {
182
- "content": "<image>",
183
- "lstrip": false,
184
- "normalized": false,
185
- "rstrip": false,
186
- "single_word": false,
187
- "special": true
188
- }
189
- },
190
- "additional_special_tokens": [
191
- "<|start_of_role|>",
192
- "<|end_of_role|>",
193
- "<|tool_call|>"
194
- ],
195
- "bos_token": "<|end_of_text|>",
196
- "chat_template": "{%- if tools %}\n {{- '<|start_of_role|>available_tools<|end_of_role|>\n' }}\n {%- for tool in tools %}\n {{- tool | tojson(indent=4) }}\n {%- if not loop.last %}\n {{- '\n\n' }}\n {%- endif %}\n {%- endfor %}\n {{- '<|end_of_text|>\n' }}\n{%- endif %}\n{%- for message in messages if message['role'] == 'system'%}{% else %}<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n{% endfor %}{%- for message in messages %}\n {%- if message['role'] == 'system' %}\n {{- '<|system|>\n' + message['content'] + '\n' }}\n {%- elif message['role'] == 'user' %}\n {{- '<|user|>\n' + message['content'] + '\n' }}\n {%- elif message['role'] == 'assistant' %}\n {{- '<|assistant|>\n' + message['content'] + '<|end_of_text|>' }}\n {%- elif message['role'] == 'assistant_tool_call' %}\n {{- '<|start_of_role|>assistant<|end_of_role|><|tool_call|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'tool_response' %}\n {{- '<|start_of_role|>tool_response<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- endif %}\n {%- if loop.last and add_generation_prompt %}\n {{- '<|assistant|>\n' }}\n {%- endif %}\n{%- endfor %}",
197
- "clean_up_tokenization_spaces": true,
198
- "do_image_splitting": false,
199
- "eos_token": "<|end_of_text|>",
200
- "errors": "replace",
201
- "extra_special_tokens": {},
202
- "model_max_length": 131072,
203
- "pad_token": "<|end_of_text|>",
204
- "padding_side": "right",
205
- "tokenizer_class": "GPT2Tokenizer",
206
- "unk_token": "<|end_of_text|>",
207
- "vocab_size": 49152
208
  }
 
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "0": {
6
+ "content": "<|end_of_text|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "1": {
14
+ "content": "<fim_prefix>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "2": {
22
+ "content": "<fim_middle>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "3": {
30
+ "content": "<fim_suffix>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "4": {
38
+ "content": "<fim_pad>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "5": {
46
+ "content": "<filename>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "6": {
54
+ "content": "<gh_stars>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "7": {
62
+ "content": "<issue_start>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "8": {
70
+ "content": "<issue_comment>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "9": {
78
+ "content": "<issue_closed>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "10": {
86
+ "content": "<jupyter_start>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "11": {
94
+ "content": "<jupyter_text>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "12": {
102
+ "content": "<jupyter_code>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "13": {
110
+ "content": "<jupyter_output>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "14": {
118
+ "content": "<empty_output>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": true
124
+ },
125
+ "15": {
126
+ "content": "<commit_before>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": true
132
+ },
133
+ "16": {
134
+ "content": "<commit_msg>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": true
140
+ },
141
+ "17": {
142
+ "content": "<commit_after>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": true
148
+ },
149
+ "18": {
150
+ "content": "<reponame>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": true
156
+ },
157
+ "49152": {
158
+ "content": "<|start_of_role|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": true
164
+ },
165
+ "49153": {
166
+ "content": "<|end_of_role|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": true
172
+ },
173
+ "49154": {
174
+ "content": "<|tool_call|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": true
180
+ },
181
+ "49155": {
182
+ "content": "<image>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": true
188
+ }
189
+ },
190
+ "additional_special_tokens": [
191
+ "<|start_of_role|>",
192
+ "<|end_of_role|>",
193
+ "<|tool_call|>"
194
+ ],
195
+ "bos_token": "<|end_of_text|>",
196
+ "chat_template": "{%- if tools %}\n {{- '<|start_of_role|>available_tools<|end_of_role|>\n' }}\n {%- for tool in tools %}\n {{- tool | tojson(indent=4) }}\n {%- if not loop.last %}\n {{- '\n\n' }}\n {%- endif %}\n {%- endfor %}\n {{- '<|end_of_text|>\n' }}\n{%- endif %}\n{%- for message in messages if message['role'] == 'system'%}{% else %}<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n{% endfor %}{%- for message in messages %}\n {%- if message['role'] == 'system' %}\n {{- '<|system|>\n' + message['content'] + '\n' }}\n {%- elif message['role'] == 'user' %}\n {{- '<|user|>\n' + message['content'] + '\n' }}\n {%- elif message['role'] == 'assistant' %}\n {{- '<|assistant|>\n' + message['content'] + '<|end_of_text|>' }}\n {%- elif message['role'] == 'assistant_tool_call' %}\n {{- '<|start_of_role|>assistant<|end_of_role|><|tool_call|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'tool_response' %}\n {{- '<|start_of_role|>tool_response<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- endif %}\n {%- if loop.last and add_generation_prompt %}\n {{- '<|assistant|>\n' }}\n {%- endif %}\n{%- endfor %}",
197
+ "clean_up_tokenization_spaces": true,
198
+ "eos_token": "<|end_of_text|>",
199
+ "errors": "replace",
200
+ "extra_special_tokens": {},
201
+ "model_max_length": 16384,
202
+ "pad_token": "<|end_of_text|>",
203
+ "padding_side": "right",
204
+ "tokenizer_class": "GPT2Tokenizer",
205
+ "unk_token": "<|end_of_text|>",
206
+ "vocab_size": 49152
 
207
  }