.gitattributes CHANGED
@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
- REPORT_Benchmarking[[:space:]]the[[:space:]]AI[[:space:]]advantage[[:space:]]in[[:space:]]finance.pdf filter=lfs diff=lfs merge=lfs -text
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
README.md CHANGED
@@ -3,70 +3,62 @@ license: apache-2.0
3
  language:
4
  - en
5
  base_model:
6
- - ibm-granite/granite-vision-3.3-2b
7
  library_name: transformers
8
  ---
9
- # granite-vision-3.3-2b-embedding
10
  **Model Summary:**
11
- Granite-vision-3.3-2b-embedding is an efficient embedding model based on [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b). This model is specifically designed for multimodal document retrieval, enabling queries on documents with tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.
12
- By removing the need for OCR-based text extraction, granite-vision-3.3-2b-embedding can help simplify and accelerate RAG pipelines.
 
13
 
14
  **Evaluations:**
15
- We evaluated granite-vision-3.3-2b-embedding alongside other top ColBERT-style multi-modal embedding models in the 1B-4B parameter range using two benchmarks: [Vidore2](https://github.com/illuin-tech/vidore-benchmark/) and [Real-MM-RAG-Bench](https://huggingface.co/collections/ibm-research/real-mm-rag-bench-67d2dc0ddf2dfafe66f09d34), which specifically target complex multimodal document retrieval tasks.
16
 
17
  ## **NDCG@5 - ViDoRe V2**
18
- | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
19
- |----------------------------------------|--------------|------------------|-------------|-------------------|---------------------------------|
20
- | ESG Restaurant Human | 51.1 | 68.4 | 65.8 | 62.4 | 65.3 |
21
- | Economics Macro Multilingual | 49.9 | 56.5 | 55.4 | 47.4 | 51.2 |
22
- | MIT Biomedical | 59.7 | 63.6 | 63.5 | 58.1 |61.5 |
23
- | ESG Restaurant Synthetic | 57.0 | 57.4 | 56.6 | 51.1 |56.6 |
24
- | ESG Restaurant Synthetic Multilingual | 55.7 | 57.4 | 57.2 | 47.6 |55.7 |
25
- | MIT Biomedical Multilingual | 56.5 | 61.1 | 62.5 | 50.5 | 55.5 |
26
- | Economics Macro | 51.6 | 59.8 | 60.2 | 60.9 |58.3 |
27
- | **Avg (ViDoRe2)** | **54.5** | **60.6** | **60.2** | **54.0** |**57.7** |
28
 
29
  ## **NDCG@5 - REAL-MM-RAG**
30
- | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
31
- |----------------------------------------|--------------|------------------|-------------|-------------------|---------------------------------|
32
- | FinReport | 55 | 66 | 78 | 65 |73
33
- | FinSlides | 68 | 79 | 81 | 55 |79
34
- | TechReport | 78 | 86 | 88 | 83 |87
35
- | TechSlides | 90 | 93 | 92 | 91 |93
36
- | **Avg (REAL-MM-RAG)** | **73** | **81** | **85** | **74** |**83**
37
-
38
- - **Release Date**: June 11th 2025
39
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
40
- - **Supported Input Format:** Currently, the model supports English instructions and images (PNG, JPEG) as input.
41
-
42
  **Intended Use:**
43
  The model is intended to be used in enterprise applications that involve retrieval of visual and text data. In particular, the model is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever, or alongside a text-based retriever.
44
-
45
  ### Usage
 
46
  ```shell
47
  pip install -q torch torchvision torchaudio
48
- pip install transformers==4.50
49
  ```
50
  Then run the code:
51
  ```python
52
- from io import BytesIO
53
-
54
- import requests
55
- import torch
56
- from PIL import Image
57
  from transformers import AutoProcessor, AutoModel
58
- from transformers.utils.import_utils import is_flash_attn_2_available
 
59
 
60
  device = "cuda" if torch.cuda.is_available() else "cpu"
61
- model_name = "ibm-granite/granite-vision-3.3-2b-embedding"
62
- model = AutoModel.from_pretrained(
63
- model_name,
64
- trust_remote_code=True,
65
- torch_dtype=torch.float16,
66
- device_map=device,
67
- attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None
68
- ).eval()
69
- processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
70
 
71
  # ─────────────────────────────────────────────
72
  # Inputs: Image + Text
@@ -106,35 +98,24 @@ similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)
106
  print("\n" + "=" * 50)
107
  print(f"📊 Similarity between image and text: {similarity.item():.4f}")
108
  print("=" * 50)
 
109
  ```
110
  ### Use granite-vision-embedding-3.3-2b for MM RAG
111
- For an example of MM-RAG using granite-vision-3.3-2b-embedding refer to [this notebook](https://github.com/ibm-granite/granite-vision-models/blob/main/cookbooks/GraniteVisionEmbedding_MM-RAG_Notebook.ipynb).
112
 
113
  **Model Architecture:**
114
- The architecture of granite-vision-3.3-2b-embedding follows the ColPali (https://arxiv.org/abs/2407.01449) approach and consists of the following components:
115
-
116
- (1) Vision-Language model: granite-vision-3.3-2b (https://huggingface.co/ibm-granite/granite-vision-3.3-2b).
117
-
118
- (2) Projection layer: a linear layer that projects the hidden dimension of the Vision-Language model down to 128, producing 729 embedding vectors per image.
119
-
120
- Scoring is computed using a MaxSim-based late-interaction mechanism.
121
-
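To make the MaxSim late interaction above concrete, here is a minimal PyTorch sketch (an illustration only, not the repository's implementation; the processor's `score` method is the supported entry point): for each query token vector, take its maximum similarity over the page's token vectors, then sum over query tokens.

```python
import torch

def maxsim_score(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim late interaction for one query/page pair.

    q_emb: (num_query_tokens, 128) multi-vector query embedding
    p_emb: (num_page_tokens, 128)  multi-vector page embedding
    """
    # Token-to-token similarity matrix: (num_query_tokens, num_page_tokens).
    # (Implementations typically L2-normalize the projected vectors first.)
    sim = q_emb @ p_emb.T
    # For each query token, keep its best-matching page token, then sum.
    return sim.max(dim=1).values.sum()

# Illustrative shapes: 128-dim projections, 729 vectors per page as described above.
query_emb = torch.randn(16, 128)
page_emb = torch.randn(729, 128)
print(maxsim_score(query_emb, page_emb))
```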
122
  **Training Data:**
123
- Our training data comes entirely from DocFM, a large-scale, comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF
124
  documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance)
125
- reports.
126
-
127
  **Infrastructure:**
128
- We train granite-vision-3.3-2b-embedding on IBM’s cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
129
-
130
  **Ethical Considerations and Limitations:**
131
- The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to bias and fairness, misinformation, and autonomous decision-making. Granite-vision-3.3-2b-embedding is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.
132
- Regarding ethics, a latent risk associated with all Large Language Models is their malicious use. We urge the community to use granite-vision-3.3-2b-embedding ethically and responsibly.
133
-
134
  **Resources**
135
- - 📄 Granite Vision technical report [here](https://arxiv.org/abs/2502.09927)
136
- - 📄 Real-MM-RAG-Bench paper (ACL 2025) [here](https://arxiv.org/abs/2502.12342)
137
- - 📄 Vidore 2 paper [here](https://www.arxiv.org/pdf/2505.17166)
138
- - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
139
- - 🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
140
- - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
 
3
  language:
4
  - en
5
  base_model:
6
+ - ibm-granite/granite-vision-3.3-2b-preview
7
  library_name: transformers
8
  ---
9
+ # granite-vision-embedding-3.3-2b
10
  **Model Summary:**
11
+ granite-vision-embedding-3.3-2b is an efficient embedding model based on the granite-vision Vision Language Model (VLM). It is specifically designed for multi-modal document retrieval, enabling queries on documents with tables, charts, infographics, and complex layouts. The model generates ColBERT-style multi-vector representations of pages.
12
+ The model eliminates the need for OCR-based text extraction and related preprocessing steps.
13
+
14
 
15
  **Evaluations:**
16
+ We evaluated granite-vision-embedding-3.3-2b alongside other top ColBERT-style multi-modal embedding models in the 1B-3B parameter range using two benchmarks: Vidore2 and [Real-MM-RAG-Bench](https://arxiv.org/abs/2502.12342), which specifically target complex multi-modal document retrieval tasks.
17
 
18
  ## **NDCG@5 - ViDoRe V2**
19
+ | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
20
+ |----------------------------------------|--------------|------------------|-------------|--------------------------|
21
+ | ESG Restaurant Human | 51.10 | 68.40 | 65.80 | 60.00 |
22
+ | Economics Macro Multilingual | 49.90 | 56.50 | 55.40 | 50.13 |
23
+ | MIT Biomedical | 59.70 | 63.60 | 63.50 | 60.00 |
24
+ | ESG Restaurant Synthetic | 57.00 | 57.40 | 56.60 | 54.00 |
25
+ | ESG Restaurant Synthetic Multilingual | 55.70 | 57.40 | 57.20 | 52.00 |
26
+ | MIT Biomedical Multilingual | 56.50 | 61.10 | 62.50 | 54.00 |
27
+ | Economics Macro | 51.60 | 59.80 | 60.20 | 57.00 |
28
+ | **Avg (ViDoRe2)** | **54.50** | **60.60** | **60.17** | **55.30** |
29
 
30
  ## **NDCG@5 - REAL-MM-RAG**
31
+ | Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColGraniteVision-3.3-2b |
32
+ |----------------------------------------|--------------|------------------|-------------|--------------------------|
33
+ | FinReport | 0.55 | 0.66 | 0.78 | 0.60 |
34
+ | FinSlides | 0.68 | 0.79 | 0.81 | 0.72 |
35
+ | TechReport | 0.78 | 0.86 | 0.88 | 0.80 |
36
+ | TechSlides | 0.90 | 0.93 | 0.92 | 0.92 |
37
+ | **Avg (REAL-MM-RAG)** | **0.73** | **0.81** | **0.85** | **0.76** |
38
+
39
+ - **Release Date**: June 2025
40
  - **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
41
+ **Supported Input Format:**
42
+ Currently, the model supports English queries and images (PNG, JPEG, etc.) as input.
43
  **Intended Use:**
44
  The model is intended to be used in enterprise applications that involve retrieval of visual and text data. In particular, the model is well-suited for multi-modal RAG systems where the knowledge base is composed of complex enterprise documents, such as reports, slides, images, scanned documents, manuals, and more. The model can be used as a standalone retriever, or alongside a text-based retriever.
 
45
  ### Usage
46
+ First, make sure to install the latest version of transformers:
47
  ```shell
48
  pip install -q torch torchvision torchaudio
49
+ pip install "transformers>=4.49"
50
  ```
51
  Then run the code:
52
  ```python
 
 
 
 
 
53
  from transformers import AutoProcessor, AutoModel
54
+ from PIL import Image
55
+ import torch
56
 
57
  device = "cuda" if torch.cuda.is_available() else "cpu"
58
+ model_name = "ibm-granite/granite-vision-embedding-3.3-2b"
59
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to(device).eval()
60
+ processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
61
+
 
 
 
 
 
62
 
63
  # ─────────────────────────────────────────────
64
  # Inputs: Image + Text
 
98
  print("\n" + "=" * 50)
99
  print(f"📊 Similarity between image and text: {similarity.item():.4f}")
100
  print("=" * 50)
101
+
102
  ```
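The diff hunks above omit the middle of this example (building the image and text inputs and producing the embeddings). The following is only a hedged sketch of that part: the helper names `process_images` and `process_queries`, the direct `model(**inputs)` call, and the file path and query string are assumptions for illustration; only the final `processor.score(...)` call is taken from the visible code.

```python
# Hypothetical continuation of the snippet above (model, processor, device already defined).
# NOTE: process_images / process_queries are assumed helper names, not a verified API.
image = Image.open("page.png")               # illustrative path to a document page image
query = "What does the revenue chart show?"  # illustrative English query

img_inputs = processor.process_images([image]).to(device)   # assumed helper
txt_inputs = processor.process_queries([query]).to(device)  # assumed helper

with torch.no_grad():
    img_emb = model(**img_inputs)  # multi-vector page embedding
    txt_emb = model(**txt_inputs)  # multi-vector query embedding

# Final scoring call, as shown in the visible part of the example:
similarity = processor.score(txt_emb, img_emb, batch_size=1, device=device)
print(f"Similarity: {similarity.item():.4f}")
```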
103
  ### Use granite-vision-embedding-3.3-2b for MM RAG
104
+ For an example of MM RAG using col-granite-vision, refer to [this notebook](......).
105
 
106
  **Model Architecture:**
107
+ We built our model upon [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b) with an additional projection layer.
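The projection layer itself is small: the modeling code in this repo adds a single linear layer (`custom_text_proj`) that maps the language model's hidden size to 128-dimensional output vectors. Below is a minimal sketch of that idea, using the hidden size 2048 from config.json and the 729 vectors per image mentioned earlier in this card; shapes are illustrative only.

```python
import torch
import torch.nn as nn

hidden_size, emb_dim = 2048, 128          # values taken from config.json in this repo
proj = nn.Linear(hidden_size, emb_dim)    # mirrors custom_text_proj in the modeling code

# Illustrative last hidden states for one page: 729 visual tokens of width 2048.
last_hidden_states = torch.randn(1, 729, hidden_size)
page_multi_vectors = proj(last_hidden_states)  # -> (1, 729, 128) ColBERT-style embedding
print(page_multi_vectors.shape)
```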
 
 
 
 
 
 
 
108
  **Training Data:**
109
+ The model was trained on a random subset of DOCFM, a large-scale, comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF
110
  documents sourced from Common Crawl, Wikipedia, and ESG (Environmental, Social, and Governance)
111
+ reports. For each image in the dataset, pseudo-questions were generated using the Pixtral 12B VLM.
 
112
  **Infrastructure:**
113
+ We train granite-vision-embedding-3.3-2b on IBM’s cognitive computing cluster, which is outfitted with NVIDIA A100 GPUs.
 
114
  **Ethical Considerations and Limitations:**
115
+ The use of Large Vision and Language Models involves risks and ethical considerations that people must be aware of, including but not limited to bias and fairness, misinformation, and autonomous decision-making. col-granite-vision-1.0-2b is no exception in this regard. Although our alignment processes include safety considerations, the model may in some cases produce inaccurate or biased responses.
116
+ Regarding ethics, a latent risk associated with all Large Language Models is their malicious use. We urge the community to use col-granite-vision-1.0-2b ethically and responsibly.
 
117
  **Resources**
118
+ - :page_facing_up: Granite Vision technical report [here](https://arxiv.org/abs/2502.09927)
119
+ - :star:️ Learn about the latest updates with Granite: https://www.ibm.com/granite
120
+ - :rocket: Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
121
+ - :bulb: Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
 
 
REPORT_Benchmarking the AI advantage in finance.pdf DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:4e6da951c55eef3fd52aa41543f3b4377ab26e2758c579aec2d11068a66b3d20
3
- size 1746880
 
 
 
 
added_tokens.json CHANGED
@@ -1,6 +1,6 @@
1
- {
2
- "<image>": 49155,
3
- "<|end_of_role|>": 49153,
4
- "<|start_of_role|>": 49152,
5
- "<|tool_call|>": 49154
6
- }
 
1
+ {
2
+ "<image>": 49155,
3
+ "<|end_of_role|>": 49153,
4
+ "<|start_of_role|>": 49152,
5
+ "<|tool_call|>": 49154
6
+ }
granite_vision_embedding_config.py → colgranitevision_config.py RENAMED
@@ -1,15 +1,12 @@
1
  from transformers import LlavaNextConfig
2
 
3
 
4
- class GraniteVisionEmbConfig(LlavaNextConfig):
5
- model_type = "granitevisionemb"
6
 
7
  def __init__(self, **kwargs):
8
  self.base_model = kwargs.get("base_model", None)
9
  self.emb_dim_query = kwargs.get("emb_dim_query", 128)
10
  self.emb_dim_doc = kwargs.get("emb_dim_doc", 128)
11
- self.base_image_feature_location = kwargs.get("base_image_feature_location", "last")
12
  self.adapter_path = kwargs.get("adapter_path", None)
13
  super().__init__(**kwargs)
14
-
15
-
 
1
  from transformers import LlavaNextConfig
2
 
3
 
4
+ class ColGraniteVisionConfig(LlavaNextConfig):
5
+ model_type = "colgranitevision"
6
 
7
  def __init__(self, **kwargs):
8
  self.base_model = kwargs.get("base_model", None)
9
  self.emb_dim_query = kwargs.get("emb_dim_query", 128)
10
  self.emb_dim_doc = kwargs.get("emb_dim_doc", 128)
 
11
  self.adapter_path = kwargs.get("adapter_path", None)
12
  super().__init__(**kwargs)
 
 
config.json CHANGED
@@ -1,15 +1,14 @@
1
  {
2
- "_name_or_path": "ibm_granite/granite-vision-3.3-2b",
3
- "adapter_path": null,
4
- "auto_map": {
5
- "AutoModel": "modeling_granite_vision_embedding.GraniteVisionEmb",
6
- "AutoProcessor": "processing_granite_vision_embedding.GraniteVisionEmbProcessor",
7
- "AutoConfig": "granite_vision_embedding_config.GraniteVisionEmbConfig"
8
- },
9
  "architectures": [
10
- "GraniteVisionEmb"
11
  ],
12
- "base_image_feature_location": "last",
13
  "base_model": null,
14
  "emb_dim_doc": 128,
15
  "emb_dim_query": 128,
@@ -121,32 +120,28 @@
121
  ],
122
  "image_seq_length": 576,
123
  "image_token_index": 49155,
124
- "model_type": "granitevisionemb",
125
  "multimodal_projector_bias": true,
126
- "pretrained_language_model": "",
127
- "pretrained_vision_tower": "",
128
  "projector_hidden_act": "gelu",
129
  "text_config": {
130
- "_attn_implementation_autoset": true,
131
- "_name_or_path": "ibm-granite/granite-3.1-2b-instruct",
132
  "architectures": [
133
  "GraniteForCausalLM"
134
  ],
135
  "attention_dropout": 0.1,
136
  "attention_multiplier": 0.015625,
137
  "bos_token_id": 0,
138
- "embedding_multiplier": 12.0,
139
  "eos_token_id": 0,
140
  "hidden_size": 2048,
141
  "intermediate_size": 8192,
142
- "logits_scaling": 8.0,
143
- "max_position_embeddings": 131072,
144
  "model_type": "granite",
145
  "num_hidden_layers": 40,
146
  "num_key_value_heads": 8,
147
  "pad_token_id": 0,
148
  "residual_multiplier": 0.22,
149
- "rms_norm_eps": 1e-05,
150
  "rope_theta": 300000,
151
  "tie_word_embeddings": true,
152
  "torch_dtype": "bfloat16",
@@ -154,20 +149,18 @@
154
  },
155
  "tie_word_embeddings": true,
156
  "torch_dtype": "float32",
157
- "transformers_version": "4.49.0",
158
  "use_image_newline_parameter": true,
159
  "vision_config": {
160
- "_attn_implementation_autoset": true,
161
  "hidden_act": "gelu_pytorch_tanh",
162
  "hidden_size": 1152,
163
  "image_size": 384,
164
  "intermediate_size": 4304,
165
- "layer_norm_eps": 1e-06,
166
  "model_type": "siglip_vision_model",
167
  "num_attention_heads": 16,
168
  "num_hidden_layers": 27,
169
- "patch_size": 14,
170
- "torch_dtype": "bfloat16"
171
  },
172
  "vision_feature_layer": [
173
  -24,
@@ -176,4 +169,4 @@
176
  -1
177
  ],
178
  "vision_feature_select_strategy": "full"
179
- }
 
1
  {
2
+ "_name_or_path": "ibm-granite/granite-vision-3.1-2b-preview",
3
+ "_class_name": "ColGraniteVisionConfig",
4
+ "auto_map": {
5
+ "AutoModel": "modeling_colgranitevision.ColGraniteVision",
6
+ "AutoProcessor": "processing_colgranitevision.ColGraniteVisionProcessor",
7
+ "AutoConfig": "colgranitevision_config.ColGraniteVisionConfig"
8
+ },
9
  "architectures": [
10
+ "ColGraniteVision"
11
  ],
 
12
  "base_model": null,
13
  "emb_dim_doc": 128,
14
  "emb_dim_query": 128,
 
120
  ],
121
  "image_seq_length": 576,
122
  "image_token_index": 49155,
123
+ "model_type": "colgranitevision",
124
  "multimodal_projector_bias": true,
 
 
125
  "projector_hidden_act": "gelu",
126
  "text_config": {
 
 
127
  "architectures": [
128
  "GraniteForCausalLM"
129
  ],
130
  "attention_dropout": 0.1,
131
  "attention_multiplier": 0.015625,
132
  "bos_token_id": 0,
133
+ "embedding_multiplier": 12,
134
  "eos_token_id": 0,
135
  "hidden_size": 2048,
136
  "intermediate_size": 8192,
137
+ "logits_scaling": 8,
138
+ "max_position_embeddings": 16384,
139
  "model_type": "granite",
140
  "num_hidden_layers": 40,
141
  "num_key_value_heads": 8,
142
  "pad_token_id": 0,
143
  "residual_multiplier": 0.22,
144
+ "rms_norm_eps": 0.00001,
145
  "rope_theta": 300000,
146
  "tie_word_embeddings": true,
147
  "torch_dtype": "bfloat16",
 
149
  },
150
  "tie_word_embeddings": true,
151
  "torch_dtype": "float32",
152
+ "transformers_version": "4.50.0.dev0",
153
  "use_image_newline_parameter": true,
154
  "vision_config": {
 
155
  "hidden_act": "gelu_pytorch_tanh",
156
  "hidden_size": 1152,
157
  "image_size": 384,
158
  "intermediate_size": 4304,
159
+ "layer_norm_eps": 0.000001,
160
  "model_type": "siglip_vision_model",
161
  "num_attention_heads": 16,
162
  "num_hidden_layers": 27,
163
+ "patch_size": 14
 
164
  },
165
  "vision_feature_layer": [
166
  -24,
 
169
  -1
170
  ],
171
  "vision_feature_select_strategy": "full"
172
+ }
model-00001-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:4e838b6d98f48fbf45ae6c0d9c74cba649fd06b27ed78ced3971efbab7e16a69
3
  size 4955415688
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ec8a694db663b30616ff06812d60256bb474c52051df2003faaec47c42b9a556
3
  size 4955415688
model-00002-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d6bf1675fc15977b4d8f37ea1d4960ca2750e6793a80da9771e4693ae8cb13d6
3
  size 4999979448
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9df92d92a0d79465e4ee5eb57a51ee1630b159dc5833e26af9ca7bc9b3788d24
3
  size 4999979448
model-00003-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:15978cba0606360676faad5c3cf486a58e6d78a1352dbfcd1db51a7410a574d5
3
  size 1947355456
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:42ebe0fe87507de69b86074af24513756fbcc205e83ccb2ee7bbe9238a751f29
3
  size 1947355456
modeling_granite_vision_embedding.py → modeling_colgranitevision.py RENAMED
@@ -7,16 +7,17 @@ from transformers import LlavaNextPreTrainedModel
7
  from transformers.models.llava_next.modeling_llava_next import LlavaNextForConditionalGeneration
8
  from transformers.models.llava_next.modeling_llava_next import unpad_image, get_anyres_image_grid_shape
9
 
10
- from .granite_vision_embedding_config import GraniteVisionEmbConfig
11
 
12
- class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
13
 
 
14
  def pack_image_features(
15
  self,
16
  image_features,
17
  image_sizes,
18
  vision_feature_select_strategy,
19
- image_newline=None
 
20
  ):
21
  """
22
  Reshape, unpad and then pack each image_feature into a single image_features tensor containing all visual vectors.
@@ -36,7 +37,6 @@ class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
36
  token length of each image in image_features
37
  """
38
 
39
- base_image_feature_location = self.config.base_image_feature_location
40
  new_image_features = []
41
  feature_lens = []
42
  for image_idx, image_feature in enumerate(image_features):
@@ -92,15 +92,15 @@ class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
92
  return image_features, feature_lens
93
 
94
 
95
- class GraniteVisionEmb(LlavaNextPreTrainedModel):
96
  """
97
- GraniteVisionEmb model implementation.
98
  """
99
 
100
  main_input_name: ClassVar[str] = "doc_input_ids" # transformers-related
101
- config_class = GraniteVisionEmbConfig
102
 
103
- def __init__(self, config: GraniteVisionEmbConfig):
104
  super().__init__(config=config)
105
 
106
  model = LlavaNextWithCustomPacking(config=config)
@@ -108,6 +108,8 @@ class GraniteVisionEmb(LlavaNextPreTrainedModel):
108
  self._tied_weights_keys = [f"model.language_model.{k}" for k in model.language_model._tied_weights_keys]
109
  self.model = model
110
 
 
 
111
  self.dim = 128
112
  self.custom_text_proj = nn.Linear(self.model.config.text_config.hidden_size, self.dim)
113
 
 
7
  from transformers.models.llava_next.modeling_llava_next import LlavaNextForConditionalGeneration
8
  from transformers.models.llava_next.modeling_llava_next import unpad_image, get_anyres_image_grid_shape
9
 
10
+ from .colgranitevision_config import ColGraniteVisionConfig
11
 
 
12
 
13
+ class LlavaNextWithCustomPacking(LlavaNextForConditionalGeneration):
14
  def pack_image_features(
15
  self,
16
  image_features,
17
  image_sizes,
18
  vision_feature_select_strategy,
19
+ image_newline=None,
20
+ base_image_feature_location="last",
21
  ):
22
  """
23
  Reshape, unpad and then pack each image_feature into a single image_features tensor containing all visual vectors.
 
37
  token length of each image in image_features
38
  """
39
 
 
40
  new_image_features = []
41
  feature_lens = []
42
  for image_idx, image_feature in enumerate(image_features):
 
92
  return image_features, feature_lens
93
 
94
 
95
+ class ColGraniteVision(LlavaNextPreTrainedModel):
96
  """
97
+ ColGraniteVision model implementation.
98
  """
99
 
100
  main_input_name: ClassVar[str] = "doc_input_ids" # transformers-related
101
+ config_class = ColGraniteVisionConfig
102
 
103
+ def __init__(self, config: ColGraniteVisionConfig):
104
  super().__init__(config=config)
105
 
106
  model = LlavaNextWithCustomPacking(config=config)
 
108
  self._tied_weights_keys = [f"model.language_model.{k}" for k in model.language_model._tied_weights_keys]
109
  self.model = model
110
 
111
+ # TODO: Wait for ColPali2 to create a ColPaliConfig to allow specifying the embedding dimension.
112
+ # We could do it now but it would break all the models trying to load the model from the checkpoint.
113
  self.dim = 128
114
  self.custom_text_proj = nn.Linear(self.model.config.text_config.hidden_size, self.dim)
115
 
preprocessor_config.json CHANGED
@@ -1,137 +1,136 @@
1
- {
2
- "crop_size": {
3
- "height": 384,
4
- "width": 384
5
- },
6
- "default_to_square": false,
7
- "do_center_crop": true,
8
- "do_convert_rgb": null,
9
- "do_normalize": true,
10
- "do_pad": true,
11
- "do_rescale": true,
12
- "do_resize": true,
13
- "image_grid_pinpoints": [
14
- [
15
- 384,
16
- 768
17
- ],
18
- [
19
- 384,
20
- 1152
21
- ],
22
- [
23
- 384,
24
- 1536
25
- ],
26
- [
27
- 384,
28
- 1920
29
- ],
30
- [
31
- 384,
32
- 2304
33
- ],
34
- [
35
- 384,
36
- 2688
37
- ],
38
- [
39
- 384,
40
- 3072
41
- ],
42
- [
43
- 384,
44
- 3456
45
- ],
46
- [
47
- 384,
48
- 3840
49
- ],
50
- [
51
- 768,
52
- 384
53
- ],
54
- [
55
- 768,
56
- 768
57
- ],
58
- [
59
- 768,
60
- 1152
61
- ],
62
- [
63
- 768,
64
- 1536
65
- ],
66
- [
67
- 768,
68
- 1920
69
- ],
70
- [
71
- 1152,
72
- 384
73
- ],
74
- [
75
- 1152,
76
- 768
77
- ],
78
- [
79
- 1152,
80
- 1152
81
- ],
82
- [
83
- 1536,
84
- 384
85
- ],
86
- [
87
- 1536,
88
- 768
89
- ],
90
- [
91
- 1920,
92
- 384
93
- ],
94
- [
95
- 1920,
96
- 768
97
- ],
98
- [
99
- 2304,
100
- 384
101
- ],
102
- [
103
- 2688,
104
- 384
105
- ],
106
- [
107
- 3072,
108
- 384
109
- ],
110
- [
111
- 3456,
112
- 384
113
- ],
114
- [
115
- 3840,
116
- 384
117
- ]
118
- ],
119
- "image_mean": [
120
- 0.5,
121
- 0.5,
122
- 0.5
123
- ],
124
- "image_processor_type": "LlavaNextImageProcessor",
125
- "image_std": [
126
- 0.5,
127
- 0.5,
128
- 0.5
129
- ],
130
- "processor_class": "GraniteVisionEmbProcessor",
131
- "resample": 3,
132
- "rescale_factor": 0.00392156862745098,
133
- "size": {
134
- "height": 384,
135
- "width": 384
136
- }
137
- }
 
1
+ {
2
+ "crop_size": {
3
+ "height": 384,
4
+ "width": 384
5
+ },
6
+ "do_center_crop": true,
7
+ "do_convert_rgb": null,
8
+ "do_normalize": true,
9
+ "do_pad": true,
10
+ "do_rescale": true,
11
+ "do_resize": true,
12
+ "image_grid_pinpoints": [
13
+ [
14
+ 384,
15
+ 768
16
+ ],
17
+ [
18
+ 384,
19
+ 1152
20
+ ],
21
+ [
22
+ 384,
23
+ 1536
24
+ ],
25
+ [
26
+ 384,
27
+ 1920
28
+ ],
29
+ [
30
+ 384,
31
+ 2304
32
+ ],
33
+ [
34
+ 384,
35
+ 2688
36
+ ],
37
+ [
38
+ 384,
39
+ 3072
40
+ ],
41
+ [
42
+ 384,
43
+ 3456
44
+ ],
45
+ [
46
+ 384,
47
+ 3840
48
+ ],
49
+ [
50
+ 768,
51
+ 384
52
+ ],
53
+ [
54
+ 768,
55
+ 768
56
+ ],
57
+ [
58
+ 768,
59
+ 1152
60
+ ],
61
+ [
62
+ 768,
63
+ 1536
64
+ ],
65
+ [
66
+ 768,
67
+ 1920
68
+ ],
69
+ [
70
+ 1152,
71
+ 384
72
+ ],
73
+ [
74
+ 1152,
75
+ 768
76
+ ],
77
+ [
78
+ 1152,
79
+ 1152
80
+ ],
81
+ [
82
+ 1536,
83
+ 384
84
+ ],
85
+ [
86
+ 1536,
87
+ 768
88
+ ],
89
+ [
90
+ 1920,
91
+ 384
92
+ ],
93
+ [
94
+ 1920,
95
+ 768
96
+ ],
97
+ [
98
+ 2304,
99
+ 384
100
+ ],
101
+ [
102
+ 2688,
103
+ 384
104
+ ],
105
+ [
106
+ 3072,
107
+ 384
108
+ ],
109
+ [
110
+ 3456,
111
+ 384
112
+ ],
113
+ [
114
+ 3840,
115
+ 384
116
+ ]
117
+ ],
118
+ "image_mean": [
119
+ 0.5,
120
+ 0.5,
121
+ 0.5
122
+ ],
123
+ "image_processor_type": "LlavaNextImageProcessor",
124
+ "image_std": [
125
+ 0.5,
126
+ 0.5,
127
+ 0.5
128
+ ],
129
+ "processor_class": "ColGraniteVisionProcessor",
130
+ "resample": 3,
131
+ "rescale_factor": 0.00392156862745098,
132
+ "size": {
133
+ "height": 384,
134
+ "width": 384
135
+ }
136
+ }
 
processing_granite_vision_embedding.py → processing_colgranitevision.py RENAMED
@@ -21,9 +21,9 @@ def floor_by_factor(number: float, factor: int) -> int:
21
  return math.floor(number / factor) * factor
22
 
23
 
24
- class GraniteVisionEmbProcessor(LlavaNextProcessor):
25
  """
26
- Processor for GraniteVisionEmb.
27
  """
28
 
29
  visual_prompt_prefix: ClassVar[str] = "<|user|>\n<image>\nDescribe the image.\n"
@@ -133,7 +133,7 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
133
  """
134
  Resize and pad the image to the required format.
135
  """
136
- return self.resize_and_pad_centered_to_long_side(
137
  image=image,
138
  factor=self.factor,
139
  min_size=self.min_size,
@@ -141,52 +141,6 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
141
  fill_color=0
142
  )
143
 
144
- def resize_and_pad_centered_to_long_side(
145
- self,
146
- image: Image.Image,
147
- factor: int,
148
- min_size: int,
149
- max_size: int,
150
- fill_color=0
151
- ) -> Image.Image:
152
- """
153
- Resizes and pads an image such that:
154
- - The long side is set to `max_size`.
155
- - The short side is scaled proportionally but not below `min_size`.
156
- - The image is centered within the final padded area.
157
-
158
- :param image: PIL Image
159
- :param factor: Factor to make dimensions divisible by
160
- :param min_size: Minimum allowed size for the short side
161
- :param max_size: Target size for the long side
162
- :param fill_color: Background padding color (default black)
163
- :return: Resized and padded image
164
- """
165
-
166
- # Get original size
167
- width, height = image.size
168
-
169
- if min_size == -1 or max_size == -1:
170
- return image.convert("RGB")
171
-
172
- # Step 1: scale long side to max_size, keep aspect ratio
173
- if width > height:
174
- scale_factor = max_size / width
175
- target_width = max_size
176
- max_scale_factor = max(min_size / height, scale_factor)
177
- target_height = round(height * max_scale_factor)
178
- else:
179
- scale_factor = max_size / height
180
- target_height = max_size
181
- max_scale_factor = max(min_size / width, scale_factor)
182
- target_width = round(width * max_scale_factor)
183
-
184
- # Resize the image
185
- resized_image = image.resize((target_width, target_height), Image.LANCZOS)
186
- final_image = resized_image.convert("RGB")
187
-
188
- return final_image
189
-
190
  def resize_and_pad_centered(self,
191
  image: Image.Image,
192
  factor: int,
@@ -300,7 +254,7 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
300
  images: List[Image.Image],
301
  ) -> BatchFeature:
302
  """
303
- Process images.
304
  """
305
  # texts_doc = [self.apply_chat_template(self.format_data_wo_role(self.visual_prompt_prefix, img),tokenize=False ) for img in images]
306
  texts_doc = [self.visual_prompt_prefix for _ in images]
@@ -320,7 +274,10 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
320
 
321
  processed = []
322
  for q in queries:
323
- q = self.query_start + self.query_prefix + q + ' ' + q
 
 
 
324
  q += suffix + "\n"
325
  processed.append(q)
326
 
@@ -391,7 +348,7 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
391
  ) -> torch.Tensor:
392
  """
393
  Compute the late-interaction/MaxSim score (ColBERT-like) for the given multi-vector
394
- query embeddings (`qs`) and passage embeddings (`ps`). For us, a passage is the
395
  image of a document page.
396
 
397
  Because the embedding tensors are multi-vector and can thus have different shapes, they
@@ -436,4 +393,4 @@ class GraniteVisionEmbProcessor(LlavaNextProcessor):
436
  assert scores.shape[0] == len(qs), f"Expected {len(qs)} scores, got {scores.shape[0]}"
437
 
438
  scores = scores.to(torch.float32)
439
- return scores
 
21
  return math.floor(number / factor) * factor
22
 
23
 
24
+ class ColGraniteVisionProcessor(LlavaNextProcessor):
25
  """
26
+ Processor for ColPali.
27
  """
28
 
29
  visual_prompt_prefix: ClassVar[str] = "<|user|>\n<image>\nDescribe the image.\n"
 
133
  """
134
  Resize and pad the image to the required format.
135
  """
136
+ return self.resize_and_pad_centered(
137
  image=image,
138
  factor=self.factor,
139
  min_size=self.min_size,
 
141
  fill_color=0
142
  )
143
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
144
  def resize_and_pad_centered(self,
145
  image: Image.Image,
146
  factor: int,
 
254
  images: List[Image.Image],
255
  ) -> BatchFeature:
256
  """
257
+ Process images for ColPali.
258
  """
259
  # texts_doc = [self.apply_chat_template(self.format_data_wo_role(self.visual_prompt_prefix, img),tokenize=False ) for img in images]
260
  texts_doc = [self.visual_prompt_prefix for _ in images]
 
274
 
275
  processed = []
276
  for q in queries:
277
+ q = self.query_start + self.query_prefix + q
278
+ # truncate before it eats actual query content
279
+ if len(q) + len(suffix) > max_length:
280
+ q = q[: max_length - len(suffix) - 1]
281
  q += suffix + "\n"
282
  processed.append(q)
283
 
 
348
  ) -> torch.Tensor:
349
  """
350
  Compute the late-interaction/MaxSim score (ColBERT-like) for the given multi-vector
351
+ query embeddings (`qs`) and passage embeddings (`ps`). For ColPali, a passage is the
352
  image of a document page.
353
 
354
  Because the embedding tensors are multi-vector and can thus have different shapes, they
 
393
  assert scores.shape[0] == len(qs), f"Expected {len(qs)} scores, got {scores.shape[0]}"
394
 
395
  scores = scores.to(torch.float32)
396
+ return scores
processor_config.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
- "processor_class": "GraniteVisionEmbProcessor",
3
  "auto_map": {
4
- "AutoProcessor": "processing_granite_vision_embedding.GraniteVisionEmbProcessor"
5
  }
6
  }
 
1
  {
2
+ "processor_class": "ColGraniteVisionProcessor",
3
  "auto_map": {
4
+ "AutoProcessor": "processing_colgranitevision.ColGraniteVisionProcessor"
5
  }
6
  }
special_tokens_map.json CHANGED
@@ -1,35 +1,35 @@
1
- {
2
- "additional_special_tokens": [
3
- "<|start_of_role|>",
4
- "<|end_of_role|>",
5
- "<|tool_call|>"
6
- ],
7
- "bos_token": {
8
- "content": "<|end_of_text|>",
9
- "lstrip": false,
10
- "normalized": false,
11
- "rstrip": false,
12
- "single_word": false
13
- },
14
- "eos_token": {
15
- "content": "<|end_of_text|>",
16
- "lstrip": false,
17
- "normalized": false,
18
- "rstrip": false,
19
- "single_word": false
20
- },
21
- "pad_token": {
22
- "content": "<|end_of_text|>",
23
- "lstrip": false,
24
- "normalized": false,
25
- "rstrip": false,
26
- "single_word": false
27
- },
28
- "unk_token": {
29
- "content": "<|end_of_text|>",
30
- "lstrip": false,
31
- "normalized": false,
32
- "rstrip": false,
33
- "single_word": false
34
- }
35
- }
 
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|start_of_role|>",
4
+ "<|end_of_role|>",
5
+ "<|tool_call|>"
6
+ ],
7
+ "bos_token": {
8
+ "content": "<|end_of_text|>",
9
+ "lstrip": false,
10
+ "normalized": false,
11
+ "rstrip": false,
12
+ "single_word": false
13
+ },
14
+ "eos_token": {
15
+ "content": "<|end_of_text|>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false
20
+ },
21
+ "pad_token": {
22
+ "content": "<|end_of_text|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false
27
+ },
28
+ "unk_token": {
29
+ "content": "<|end_of_text|>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false
34
+ }
35
+ }
tokenizer_config.json CHANGED
@@ -1,208 +1,207 @@
1
- {
2
- "add_bos_token": false,
3
- "add_prefix_space": false,
4
- "added_tokens_decoder": {
5
- "0": {
6
- "content": "<|end_of_text|>",
7
- "lstrip": false,
8
- "normalized": false,
9
- "rstrip": false,
10
- "single_word": false,
11
- "special": true
12
- },
13
- "1": {
14
- "content": "<fim_prefix>",
15
- "lstrip": false,
16
- "normalized": false,
17
- "rstrip": false,
18
- "single_word": false,
19
- "special": true
20
- },
21
- "2": {
22
- "content": "<fim_middle>",
23
- "lstrip": false,
24
- "normalized": false,
25
- "rstrip": false,
26
- "single_word": false,
27
- "special": true
28
- },
29
- "3": {
30
- "content": "<fim_suffix>",
31
- "lstrip": false,
32
- "normalized": false,
33
- "rstrip": false,
34
- "single_word": false,
35
- "special": true
36
- },
37
- "4": {
38
- "content": "<fim_pad>",
39
- "lstrip": false,
40
- "normalized": false,
41
- "rstrip": false,
42
- "single_word": false,
43
- "special": true
44
- },
45
- "5": {
46
- "content": "<filename>",
47
- "lstrip": false,
48
- "normalized": false,
49
- "rstrip": false,
50
- "single_word": false,
51
- "special": true
52
- },
53
- "6": {
54
- "content": "<gh_stars>",
55
- "lstrip": false,
56
- "normalized": false,
57
- "rstrip": false,
58
- "single_word": false,
59
- "special": true
60
- },
61
- "7": {
62
- "content": "<issue_start>",
63
- "lstrip": false,
64
- "normalized": false,
65
- "rstrip": false,
66
- "single_word": false,
67
- "special": true
68
- },
69
- "8": {
70
- "content": "<issue_comment>",
71
- "lstrip": false,
72
- "normalized": false,
73
- "rstrip": false,
74
- "single_word": false,
75
- "special": true
76
- },
77
- "9": {
78
- "content": "<issue_closed>",
79
- "lstrip": false,
80
- "normalized": false,
81
- "rstrip": false,
82
- "single_word": false,
83
- "special": true
84
- },
85
- "10": {
86
- "content": "<jupyter_start>",
87
- "lstrip": false,
88
- "normalized": false,
89
- "rstrip": false,
90
- "single_word": false,
91
- "special": true
92
- },
93
- "11": {
94
- "content": "<jupyter_text>",
95
- "lstrip": false,
96
- "normalized": false,
97
- "rstrip": false,
98
- "single_word": false,
99
- "special": true
100
- },
101
- "12": {
102
- "content": "<jupyter_code>",
103
- "lstrip": false,
104
- "normalized": false,
105
- "rstrip": false,
106
- "single_word": false,
107
- "special": true
108
- },
109
- "13": {
110
- "content": "<jupyter_output>",
111
- "lstrip": false,
112
- "normalized": false,
113
- "rstrip": false,
114
- "single_word": false,
115
- "special": true
116
- },
117
- "14": {
118
- "content": "<empty_output>",
119
- "lstrip": false,
120
- "normalized": false,
121
- "rstrip": false,
122
- "single_word": false,
123
- "special": true
124
- },
125
- "15": {
126
- "content": "<commit_before>",
127
- "lstrip": false,
128
- "normalized": false,
129
- "rstrip": false,
130
- "single_word": false,
131
- "special": true
132
- },
133
- "16": {
134
- "content": "<commit_msg>",
135
- "lstrip": false,
136
- "normalized": false,
137
- "rstrip": false,
138
- "single_word": false,
139
- "special": true
140
- },
141
- "17": {
142
- "content": "<commit_after>",
143
- "lstrip": false,
144
- "normalized": false,
145
- "rstrip": false,
146
- "single_word": false,
147
- "special": true
148
- },
149
- "18": {
150
- "content": "<reponame>",
151
- "lstrip": false,
152
- "normalized": false,
153
- "rstrip": false,
154
- "single_word": false,
155
- "special": true
156
- },
157
- "49152": {
158
- "content": "<|start_of_role|>",
159
- "lstrip": false,
160
- "normalized": false,
161
- "rstrip": false,
162
- "single_word": false,
163
- "special": true
164
- },
165
- "49153": {
166
- "content": "<|end_of_role|>",
167
- "lstrip": false,
168
- "normalized": false,
169
- "rstrip": false,
170
- "single_word": false,
171
- "special": true
172
- },
173
- "49154": {
174
- "content": "<|tool_call|>",
175
- "lstrip": false,
176
- "normalized": false,
177
- "rstrip": false,
178
- "single_word": false,
179
- "special": true
180
- },
181
- "49155": {
182
- "content": "<image>",
183
- "lstrip": false,
184
- "normalized": false,
185
- "rstrip": false,
186
- "single_word": false,
187
- "special": true
188
- }
189
- },
190
- "additional_special_tokens": [
191
- "<|start_of_role|>",
192
- "<|end_of_role|>",
193
- "<|tool_call|>"
194
- ],
195
- "bos_token": "<|end_of_text|>",
196
- "chat_template": "{%- if tools %}\n {{- '<|start_of_role|>available_tools<|end_of_role|>\n' }}\n {%- for tool in tools %}\n {{- tool | tojson(indent=4) }}\n {%- if not loop.last %}\n {{- '\n\n' }}\n {%- endif %}\n {%- endfor %}\n {{- '<|end_of_text|>\n' }}\n{%- endif %}\n{%- for message in messages if message['role'] == 'system'%}{% else %}<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n{% endfor %}{%- for message in messages %}\n {%- if message['role'] == 'system' %}\n {{- '<|system|>\n' + message['content'] + '\n' }}\n {%- elif message['role'] == 'user' %}\n {{- '<|user|>\n' + message['content'] + '\n' }}\n {%- elif message['role'] == 'assistant' %}\n {{- '<|assistant|>\n' + message['content'] + '<|end_of_text|>' }}\n {%- elif message['role'] == 'assistant_tool_call' %}\n {{- '<|start_of_role|>assistant<|end_of_role|><|tool_call|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'tool_response' %}\n {{- '<|start_of_role|>tool_response<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- endif %}\n {%- if loop.last and add_generation_prompt %}\n {{- '<|assistant|>\n' }}\n {%- endif %}\n{%- endfor %}",
197
- "clean_up_tokenization_spaces": true,
198
- "do_image_splitting": false,
199
- "eos_token": "<|end_of_text|>",
200
- "errors": "replace",
201
- "extra_special_tokens": {},
202
- "model_max_length": 131072,
203
- "pad_token": "<|end_of_text|>",
204
- "padding_side": "right",
205
- "tokenizer_class": "GPT2Tokenizer",
206
- "unk_token": "<|end_of_text|>",
207
- "vocab_size": 49152
208
  }
 
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "0": {
6
+ "content": "<|end_of_text|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "1": {
14
+ "content": "<fim_prefix>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "2": {
22
+ "content": "<fim_middle>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "3": {
30
+ "content": "<fim_suffix>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "4": {
38
+ "content": "<fim_pad>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "5": {
46
+ "content": "<filename>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "6": {
54
+ "content": "<gh_stars>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "7": {
62
+ "content": "<issue_start>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "8": {
70
+ "content": "<issue_comment>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "9": {
78
+ "content": "<issue_closed>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "10": {
86
+ "content": "<jupyter_start>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "11": {
94
+ "content": "<jupyter_text>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "12": {
102
+ "content": "<jupyter_code>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "13": {
110
+ "content": "<jupyter_output>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "14": {
118
+ "content": "<empty_output>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": true
124
+ },
125
+ "15": {
126
+ "content": "<commit_before>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": true
132
+ },
133
+ "16": {
134
+ "content": "<commit_msg>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": true
140
+ },
141
+ "17": {
142
+ "content": "<commit_after>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": true
148
+ },
149
+ "18": {
150
+ "content": "<reponame>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": true
156
+ },
157
+ "49152": {
158
+ "content": "<|start_of_role|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": true
164
+ },
165
+ "49153": {
166
+ "content": "<|end_of_role|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": true
172
+ },
173
+ "49154": {
174
+ "content": "<|tool_call|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": true
180
+ },
181
+ "49155": {
182
+ "content": "<image>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": true
188
+ }
189
+ },
190
+ "additional_special_tokens": [
191
+ "<|start_of_role|>",
192
+ "<|end_of_role|>",
193
+ "<|tool_call|>"
194
+ ],
195
+ "bos_token": "<|end_of_text|>",
196
+ "chat_template": "{%- if tools %}\n {{- '<|start_of_role|>available_tools<|end_of_role|>\n' }}\n {%- for tool in tools %}\n {{- tool | tojson(indent=4) }}\n {%- if not loop.last %}\n {{- '\n\n' }}\n {%- endif %}\n {%- endfor %}\n {{- '<|end_of_text|>\n' }}\n{%- endif %}\n{%- for message in messages if message['role'] == 'system'%}{% else %}<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n{% endfor %}{%- for message in messages %}\n {%- if message['role'] == 'system' %}\n {{- '<|system|>\n' + message['content'] + '\n' }}\n {%- elif message['role'] == 'user' %}\n {{- '<|user|>\n' + message['content'] + '\n' }}\n {%- elif message['role'] == 'assistant' %}\n {{- '<|assistant|>\n' + message['content'] + '<|end_of_text|>' }}\n {%- elif message['role'] == 'assistant_tool_call' %}\n {{- '<|start_of_role|>assistant<|end_of_role|><|tool_call|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- elif message['role'] == 'tool_response' %}\n {{- '<|start_of_role|>tool_response<|end_of_role|>' + message['content'] + '<|end_of_text|>\n' }}\n {%- endif %}\n {%- if loop.last and add_generation_prompt %}\n {{- '<|assistant|>\n' }}\n {%- endif %}\n{%- endfor %}",
197
+ "clean_up_tokenization_spaces": true,
198
+ "eos_token": "<|end_of_text|>",
199
+ "errors": "replace",
200
+ "extra_special_tokens": {},
201
+ "model_max_length": 16384,
202
+ "pad_token": "<|end_of_text|>",
203
+ "padding_side": "right",
204
+ "tokenizer_class": "GPT2Tokenizer",
205
+ "unk_token": "<|end_of_text|>",
206
+ "vocab_size": 49152
 
207
  }