README.md CHANGED
@@ -6,7 +6,7 @@ library_name: transformers
6
  ---
7
 
8
 
9
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6512d9827fccffe1e9e28fa7/Lra7yfdthGdKcNk7vP5RS.png)
10
 
11
 
12
  ## **Overview**
@@ -18,12 +18,6 @@ The model is primarily designed with a focus on lightweight architecture, optimi
18
  Particularly, the model shows relative strengths in handling Korean-language inputs and outperforms similarly sized open-source models in related benchmarks. As the first open-source vision-language model in Korea capable of visual understanding, it is expected to significantly contribute to strengthening Korea's sovereign AI capabilities.
19
 
20
 
21
- ## **Updates**
22
- - **(2025.07.25)**: vLLM engine is available with [our repository](https://github.com/NAVER-Cloud-HyperCLOVA-X/vllm/tree/v0.9.2rc2_hyperclovax_vision_seed)
23
- - **(2025.07.08)**: Major code update for supporting vLLM engine ([link - related_discussion](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B/discussions/27))
24
- - **(2025.04.22)**: Initial release of the repository.
25
-
26
-
27
  ## **Basic Information**
28
 
29
  - **Model Architecture**: LLaVA-based Vision-Language Model
@@ -83,7 +77,6 @@ Although HyperCLOVAX-SEED-Vision-Instruct-3B is a lightweight model, it is capab
83
  - [decord](https://github.com/dmlc/decord)
84
 
85
  ## Example
86
- **(code & benchmark score) checked with transformers 4.52.4**
87
 
88
  ```python
89
 
@@ -91,115 +84,9 @@ from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
91
 
92
  model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"
93
  model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device="cuda")
94
- processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
95
  tokenizer = AutoTokenizer.from_pretrained(model_name)
96
 
97
- # LLM Example
98
- # It is recommended to use the chat template with HyperCLOVAX models.
99
- # Using the chat template allows you to easily format your input in ChatML style.
100
- llm_chat = [
101
- {"role": "system", "content": [{"type": "text", "text": "you are helpful assistant!"}]},
102
- {
103
- "role": "user",
104
- "content": [
105
- {"type": "text", "text": "Hello, how are you?"},
106
- {"type": "text", "text": "I said. Hello, how are you today?"},
107
- ]
108
- },
109
- {"role": "assistant", "content": [{"type": "text", "text": "I'm doing great. How can I help you today?"}]},
110
- {"role": "user", "content": [{"type": "text", "text": "I'd like to show off how chat templating works!"}]},
111
- ]
112
- model_inputs = processor.apply_chat_template(
113
- llm_chat, tokenize=True, return_dict=True, return_tensors="pt", add_generation_prompt=True
114
- )
115
- model_inputs = model_inputs.to(device="cuda")
116
-
117
- # Please adjust parameters like top_p appropriately for your use case.
118
- output_ids = model.generate(
119
- **model_inputs,
120
- max_new_tokens=64,
121
- do_sample=True,
122
- top_p=0.6,
123
- temperature=0.5,
124
- repetition_penalty=1.0,
125
- )
126
- print("=" * 80)
127
- print("LLM EXAMPLE")
128
- print(processor.batch_decode(output_ids)[0])
129
- print("=" * 80)
130
-
131
- # VLM Example
132
- # For images and videos, you can use url, local_path, base64, or bytes as input sources.
133
- vlm_chat = [
134
- {"role": "system", "content": [{"text": "System Prompt", "type": "text"}]},
135
- {"role": "user", "content": [{"text": "User Text Prompt 1", "type": "text"}]},
136
- {
137
- "role": "user",
138
- "content": [{
139
- "filename": "tradeoff_sota.png",
140
- "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff_sota.png?raw=true",
141
- "lens_keywords": "Gucci Ophidia, cross bag, Ophidia small, GG, Supreme shoulder bag",
142
- "lens_local_keywords": "[0.07, 0.21, 0.92, 0.90] Gucci Ophidia",
143
- "ocr": "List the words in the image in raster order. Even if the word order feels unnatural for reading, the model will handle it as long as it follows raster order.", "type": "image",
144
- }],
145
- },
146
- {
147
- "role": "user",
148
- "content": [{
149
- "filename": "tradeoff.png",
150
- "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff.png?raw=true",
151
- "type": "image",
152
- }],
153
- },
154
- {"role": "assistant", "content": [{"text": "Assistant Text Prompt 1", "type": "text"}]},
155
- {"role": "user", "content": [{"text": "User Text Prompt 2", "type": "text"}]},
156
- {
157
- "role": "user",
158
- "content": [
159
- {
160
- "type": "video",
161
- "video": "freenaturestock-rolling-mist-clouds.mp4",
162
- "lens_keywords": "Prada re-edition, nylon bag, mini cross bag, logo strap, essential shoulder bag",
163
- "lens_local_keywords": "[0.12, 0.34, 0.85, 0.76] Prada re-edition",
164
- "speech_to_text": "Please enter the dialogue, voice, sound, lines, and words in the video in text format.",
165
- },
166
- {"text": "User Text Prompt 3", "type": "text"},
167
- ]
168
- },
169
- ]
170
-
171
- model_inputs = processor.apply_chat_template(
172
- vlm_chat, tokenize=True, return_dict=True, return_tensors="pt", add_generation_prompt=True,
173
- )
174
- model_inputs = model_inputs.to(device="cuda")
175
- output_ids = model.generate(
176
- **model_inputs,
177
- max_new_tokens=64,
178
- do_sample=True,
179
- top_p=0.6,
180
- temperature=0.5,
181
- repetition_penalty=1.0,
182
- )
183
- print("=" * 80)
184
- print("VLM EXAMPLE")
185
- print(processor.batch_decode(output_ids)[0])
186
- print("=" * 80)
187
-
188
- ```
189
-
190
- ## Example for v0.1.0
191
- **(code & benchmark score) checked with transformers 4.45.0**
192
-
193
- ```python
194
-
195
- from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
196
-
197
- model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"
198
- revision="v0.1.0"
199
- model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, revision=revision).to(device="cuda")
200
- preprocessor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True, revision=revision)
201
- tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
202
-
203
  # LLM Example
204
  # It is recommended to use the chat template with HyperCLOVAX models.
205
  # Using the chat template allows you to easily format your input in ChatML style.
@@ -278,25 +165,7 @@ output_ids = model.generate(
278
  repetition_penalty=1.0,
279
  **preprocessed,
280
  )
281
- print("=" * 80)
282
- print("VLM EXAMPLE")
283
  print(tokenizer.batch_decode(output_ids)[0])
284
- print("=" * 80)
285
  ```
286
 
287
  - To ensure the highest level of image understanding performance, it is recommended to include additional information such as Optical Character Recognition (OCR) results and entity recognition (Lens). The provided usage examples are written under the assumption that OCR and Lens results are available. If you input data in this format, you can expect significantly improved output quality.
288
-
289
- ## vLLM
290
- To speed up your inference, you can use the vLLM engine from [our repository](https://github.com/NAVER-Cloud-HyperCLOVA-X/vllm/tree/v0.9.2rc2_hyperclovax_vision_seed).
291
-
292
- Make sure to switch to the `v0.9.2rc2_hyperclovax_vision_seed` branch.
293
-
294
- **Launch API server**:
295
- - https://oss.navercorp.com/HYPERSCALE-AI-VISION/vllm/blob/main/README.md
296
-
297
- **Request Example**:
298
- - https://github.com/vllm-project/vllm/pull/20931#issue-3229161410
299
-
300
- **Offline Inference Examples**:
301
- - https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/vision_language.py
302
- - https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/vision_language_multi_image.py
 
6
  ---
7
 
8
 
9
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65265ab8f8db96cffcb969dc/RD1HOJJnDQbz6IvNngiIV.png)
10
 
11
 
12
  ## **Overview**
 
18
  Particularly, the model shows relative strengths in handling Korean-language inputs and outperforms similarly sized open-source models in related benchmarks. As the first open-source vision-language model in Korea capable of visual understanding, it is expected to significantly contribute to strengthening Korea's sovereign AI capabilities.
19
 
20
 
 
 
 
 
 
 
21
  ## **Basic Information**
22
 
23
  - **Model Architecture**: LLaVA-based Vision-Language Model
 
77
  - [decord](https://github.com/dmlc/decord)
78
 
79
  ## Example
 
80
 
81
  ```python
82
 
 
84
 
85
  model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"
86
  model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device="cuda")
87
+ preprocessor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
88
  tokenizer = AutoTokenizer.from_pretrained(model_name)
89

90
  # LLM Example
91
  # It is recommended to use the chat template with HyperCLOVAX models.
92
  # Using the chat template allows you to easily format your input in ChatML style.
 
165
  repetition_penalty=1.0,
166
  **preprocessed,
167
  )
 
 
168
  print(tokenizer.batch_decode(output_ids)[0])
 
169
  ```
170
 
171
  - To ensure the highest level of image understanding performance, it is recommended to include additional information such as Optical Character Recognition (OCR) results and entity recognition (Lens). The provided usage examples are written under the assumption that OCR and Lens results are available. If you input data in this format, you can expect significantly improved output quality.
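
For reference, a minimal sketch of a single image turn carrying these auxiliary fields (the field names follow the VLM example above; the file name, URL, and values below are placeholders):

```python
# Illustrative only: ocr / lens_keywords / lens_local_keywords are the auxiliary fields
# consumed by the chat template; replace the placeholder values with your own results.
image_turn = {
    "role": "user",
    "content": [{
        "type": "image",
        "filename": "receipt.png",                    # placeholder file name
        "image": "https://example.com/receipt.png",   # url, local path, base64, or bytes
        "ocr": "AMERICANO 4,500 TOTAL 4,500",          # detected words, in raster order
        "lens_keywords": "paper receipt, cafe, card payment",
        "lens_local_keywords": "[0.10, 0.15, 0.90, 0.85] paper receipt",  # normalized [x1, y1, x2, y2] + keyword
    }],
}
# Append this turn to the chat list used in the example above before preprocessing and generation.
```
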
added_tokens.json DELETED
@@ -1,35 +0,0 @@
1
- {
2
- "<EMAIL>": 110521,
3
- "<KEY>": 110522,
4
- "<NAME>": 110520,
5
- "<PASSWORD>": 110523,
6
- "<code_to_intermediate>": 110502,
7
- "<empty_output>": 110501,
8
- "<file_sep>": 110492,
9
- "<intermediate_to_code>": 110503,
10
- "<issue_closed>": 110495,
11
- "<issue_comment>": 110494,
12
- "<issue_start>": 110493,
13
- "<jupyter_code>": 110498,
14
- "<jupyter_output>": 110499,
15
- "<jupyter_script>": 110500,
16
- "<jupyter_start>": 110496,
17
- "<jupyter_text>": 110497,
18
- "<pr>": 110504,
19
- "<pr_base>": 110507,
20
- "<pr_base_code>": 110509,
21
- "<pr_comment>": 110512,
22
- "<pr_diff>": 110510,
23
- "<pr_diff_hunk>": 110511,
24
- "<pr_diff_hunk_comment_line>": 110519,
25
- "<pr_event_id>": 110513,
26
- "<pr_file>": 110508,
27
- "<pr_in_reply_to_comment_id>": 110518,
28
- "<pr_in_reply_to_review_id>": 110517,
29
- "<pr_is_merged>": 110506,
30
- "<pr_review>": 110514,
31
- "<pr_review_comment>": 110516,
32
- "<pr_review_state>": 110515,
33
- "<pr_status>": 110505,
34
- "<repo_name>": 110491
35
- }

chat_template.jinja DELETED
@@ -1,65 +0,0 @@
1
- <|im_start|>tool_list
2
- <|im_end|>
3
- {% for message in messages %}
4
- {% set content = message['content'] %}
5
- {% set role = message['role'] %}
6
- {% if loop.first and role != 'system' %}
7
- <|im_start|>system
8
- You are a helpful assistant.<|im_end|>
9
- {% endif %}
10
- {% if message['content'] is string %}
11
- <|im_start|>{{ role }}
12
- {{ message['content'] }}<|im_end|>
13
- {% elif message['content'] is mapping %}
14
- {% if content['type'] == 'image' %}
15
- <|im_start|>{{ role }} (mime)
16
- {"type": "image/jpeg", "filename": "{{ content['filename'] }}"}<|im_end|>
17
- <|im_start|>{{ role }} (vector)
18
- <|dummy3|><|im_end|>
19
- <|im_start|>image/aux
20
- 다음 중 ocr은 사진에서 검출된 글자이고, lens_keyword는 사진에서 추출된 keyword와 bbox 위치입니다. bbox는 0~1 사이로 정규화된 [x1, y1, x2, y2]의 형태입니다. 참고하여 답변하세요. {"ocr": "{{ content['ocr'] or '' }}", "lens_keywords": "{{ content['lens_keywords'] or '' }}", "lens_local_keywords": "{{ content['lens_local_keywords'] or '' }}"}<|im_end|>
21
- {% elif content['type'] == 'video' %}
22
- <|im_start|>{{ role }} (mime)
23
- {"type": "video/mp4", "filename": "{{ content['filename'] }}"}<|im_end|>
24
- <|im_start|>{{ role }} (vector)
25
- <|_unuse_missing_100270|><|im_end|>
26
- <|im_start|>image/aux
27
- {% if content.get('is_final_grid') %}
28
- 다음 중 lens_keyword는 사진에서 추출된 keyword와 bbox 위치입니다. bbox는 0~1 사이로 정규화된 [x1, y1, x2, y2]의 형태입니다. video_time_stamp는 비디오에서 해당 구간의 시간 정보입니다. speech_to_text는 비디오 속에서의 대화, 음성, 소리, 대사, 그리고 말을 전부 글로 받아 적은 것 입니다. 참고하여 답변하세요. {"video_time_stamp": "{{ content['video_time_stamp'] }}", "lens_keywords": "{{ content.get('lens_keywords', '') }}", "lens_local_keywords": "{{ content.get('lens_local_keywords', '') }}", "speech_to_text": "{{ content.get('speech_to_text', '') }}"}
29
- {% else %}
30
- 다음 중 video_time_stamp는 비디오에서 해당 구간의 시간 정보입니다. 참고하여 답변하세요. {"video_time_stamp": "{{ content['video_time_stamp'] }}"}
31
- {% endif %}<|im_end|>
32
- {% elif content['type'] == 'text' %}
33
- <|im_start|>{{ role }}
34
- {{ content['text'] }}<|im_end|>
35
- {% endif %}
36
- {% elif message['content'] is sequence %}
37
- {% for content in message['content'] %}
38
- {% if content['type'] == 'image' %}
39
- <|im_start|>{{ role }} (mime)
40
- {"type": "image/jpeg", "filename": "{{ content['filename'] }}"}<|im_end|>
41
- <|im_start|>{{ role }} (vector)
42
- <|dummy3|><|im_end|>
43
- <|im_start|>image/aux
44
- 다음 중 ocr은 사진에서 검출된 글자이고, lens_keyword는 사진에서 추출된 keyword와 bbox 위치입니다. bbox는 0~1 사이로 정규화된 [x1, y1, x2, y2]의 형태입니다. 참고하여 답변하세요. {"ocr": "{{ content['ocr'] or '' }}", "lens_keywords": "{{ content['lens_keywords'] or '' }}", "lens_local_keywords": "{{ content['lens_local_keywords'] or '' }}"}<|im_end|>
45
- {% elif content['type'] == 'video' %}
46
- <|im_start|>{{ role }} (mime)
47
- {"type": "video/mp4", "filename": "{{ content['filename'] }}"}<|im_end|>
48
- <|im_start|>{{ role }} (vector)
49
- <|_unuse_missing_100270|><|im_end|>
50
- <|im_start|>image/aux
51
- {% if content.get('is_final_grid') %}
52
- 다음 중 lens_keyword는 사진에서 추출된 keyword와 bbox 위치입니다. bbox는 0~1 사이로 정규화된 [x1, y1, x2, y2]의 형태입니다. video_time_stamp는 비디오에서 해당 구간의 시간 정보입니다. speech_to_text는 비디오 속에서의 대화, 음성, 소리, 대사, 그리고 말을 전부 글로 받아 적은 것 입니다. 참고하여 답변하세요. {"video_time_stamp": "{{ content['video_time_stamp'] }}", "lens_keywords": "{{ content.get('lens_keywords', '') }}", "lens_local_keywords": "{{ content.get('lens_local_keywords', '') }}", "speech_to_text": "{{ content.get('speech_to_text', '') }}"}
53
- {% else %}
54
- 다음 중 video_time_stamp는 비디오에서 해당 구간의 시간 정보입니다. 참고하여 답변하세요. {"video_time_stamp": "{{ content['video_time_stamp'] }}"}
55
- {% endif %}<|im_end|>
56
- {% elif content['type'] == 'text' %}
57
- <|im_start|>{{ role }}
58
- {{ content['text'] }}<|im_end|>
59
- {% endif %}
60
- {% endfor %}
61
- {% endif %}
62
- {% endfor %}
63
- {% if add_generation_prompt %}
64
- <|im_start|>assistant
65
- {% endif %}

config.json CHANGED
@@ -13,10 +13,8 @@
13
  "freeze_mm_projector": false,
14
  "hidden_size": 3072,
15
  "ignore_index": -100,
16
- "video_token_id": 100270,
17
- "image_token_id": 100271,
18
- "mm_projector_type": "cabstractor",
19
- "text_config": {
20
  "_attn_implementation_autoset": true,
21
  "_name_or_path": "",
22
  "add_cross_attention": false,
@@ -98,7 +96,7 @@
98
  "top_p": 1.0,
99
  "torch_dtype": "bfloat16",
100
  "torchscript": false,
101
- "transformers_version": "4.52.4",
102
  "typical_p": 1.0,
103
  "use_bfloat16": false,
104
  "use_cache": true,
@@ -107,15 +105,12 @@
107
  "max_image_cnt": 12,
108
  "max_num_grids": 9,
109
  "model_type": "hyperclovax_vlm",
110
- "num_queries_vis_abstractor_image": 81,
111
- "num_queries_vis_abstractor_video_slow": 81,
112
- "num_queries_vis_abstractor_video_fast": 9,
113
- "first_last_frames_slow": false,
114
  "proj_pos_emb": true,
115
  "proj_prenorm": false,
116
  "q_former_model_name_or_path": null,
117
- "torch_dtype": "bfloat16",
118
- "transformers_version": "4.52.4",
119
  "unpad": true,
120
  "use_1x1_grid": true,
121
  "use_nth_layer": -2,
@@ -123,6 +118,7 @@
123
  "_attn_implementation_autoset": true,
124
  "_name_or_path": "",
125
  "add_cross_attention": false,
 
126
  "architectures": [
127
  "SiglipVisionModel"
128
  ],
@@ -195,8 +191,8 @@
195
  "top_p": 1.0,
196
  "torch_dtype": "bfloat16",
197
  "torchscript": false,
198
- "transformers_version": "4.52.4",
199
  "typical_p": 1.0,
200
  "use_bfloat16": true
201
  }
202
- }
 
13
  "freeze_mm_projector": false,
14
  "hidden_size": 3072,
15
  "ignore_index": -100,
16
+ "img_start_id": 100271,
17
+ "language_config": {
 
 
18
  "_attn_implementation_autoset": true,
19
  "_name_or_path": "",
20
  "add_cross_attention": false,
 
96
  "top_p": 1.0,
97
  "torch_dtype": "bfloat16",
98
  "torchscript": false,
99
+ "transformers_version": "4.48.2",
100
  "typical_p": 1.0,
101
  "use_bfloat16": false,
102
  "use_cache": true,
 
105
  "max_image_cnt": 12,
106
  "max_num_grids": 9,
107
  "model_type": "hyperclovax_vlm",
108
+ "num_queries_vis_abstractor": 81,
 
 
 
109
  "proj_pos_emb": true,
110
  "proj_prenorm": false,
111
  "q_former_model_name_or_path": null,
112
+ "torch_dtype": "float32",
113
+ "transformers_version": "4.48.2",
114
  "unpad": true,
115
  "use_1x1_grid": true,
116
  "use_nth_layer": -2,
 
118
  "_attn_implementation_autoset": true,
119
  "_name_or_path": "",
120
  "add_cross_attention": false,
121
+ "anyres": true,
122
  "architectures": [
123
  "SiglipVisionModel"
124
  ],
 
191
  "top_p": 1.0,
192
  "torch_dtype": "bfloat16",
193
  "torchscript": false,
194
+ "transformers_version": "4.48.2",
195
  "typical_p": 1.0,
196
  "use_bfloat16": true
197
  }
198
+ }
configuration_hyperclovax.py CHANGED
@@ -1,4 +1,3 @@
1
- from transformers import AutoConfig
2
  from transformers.configuration_utils import PretrainedConfig
3
  from transformers.utils import logging
4
 
@@ -10,7 +9,7 @@ class HCXVisionConfig(PretrainedConfig):
10
  keys_to_ignore_at_inference = ["past_key_values"]
11
 
12
  # The `gpt2` class has a different name, so it needs to be updated accordingly.
13
- text_config_attribute_map = {
14
  "n_embd": "hidden_size",
15
  "n_positions": "max_position_embeddings",
16
  "n_head": "num_attention_heads",
@@ -19,7 +18,7 @@ class HCXVisionConfig(PretrainedConfig):
19
 
20
  def __init__(
21
  self,
22
- text_config=None,
23
  vision_config=None,
24
  use_nth_layer=-2,
25
  img_start_id=100009, # <|dummy3|>
@@ -34,20 +33,18 @@ class HCXVisionConfig(PretrainedConfig):
34
  use_1x1_grid=False,
35
  **kwargs,
36
  ):
37
- for key, val in self.text_config_attribute_map.items():
38
- if text_config is not None and key in text_config:
39
- text_config[val] = text_config.pop(key)
40
 
41
- if text_config is not None:
42
- _text_config = AutoConfig.for_model(text_config["model_type"])
43
- self.text_config = _text_config.from_dict(text_config)
44
 
 
45
  # In DeepSpeed ZeRO-3, the memory size is automatically determined based on the `hidden_size` specified in the config.
46
- self.hidden_size = text_config["hidden_size"] if "hidden_size" in text_config else text_config["n_embd"]
47
- if vision_config is not None:
48
- _vision_config = AutoConfig.for_model(vision_config["model_type"])
49
- self.vision_config = _vision_config.from_dict(vision_config)
50
-
51
  # add VLM configs
52
  self.use_nth_layer = use_nth_layer
53
  self.decoder_max_length = decoder_max_length
@@ -61,6 +58,3 @@ class HCXVisionConfig(PretrainedConfig):
61
  self.proj_prenorm = proj_prenorm
62
  self.use_1x1_grid = use_1x1_grid
63
  super().__init__(**kwargs)
64
-
65
- def get_text_config(self, decoder=False):
66
- return self.text_config
 
 
1
  from transformers.configuration_utils import PretrainedConfig
2
  from transformers.utils import logging
3
 
 
9
  keys_to_ignore_at_inference = ["past_key_values"]
10
 
11
  # The `gpt2` class has a different name, so it needs to be updated accordingly.
12
+ language_config_attribute_map = {
13
  "n_embd": "hidden_size",
14
  "n_positions": "max_position_embeddings",
15
  "n_head": "num_attention_heads",
 
18
 
19
  def __init__(
20
  self,
21
+ language_config=None,
22
  vision_config=None,
23
  use_nth_layer=-2,
24
  img_start_id=100009, # <|dummy3|>
 
33
  use_1x1_grid=False,
34
  **kwargs,
35
  ):
36
+ for key, val in self.language_config_attribute_map.items():
37
+ if language_config is not None and key in language_config:
38
+ language_config[val] = language_config.pop(key)
39
 
40
+ self.language_config = language_config
41
+ self.vision_config = vision_config
 
42
 
43
+ if language_config is not None:
44
  # In DeepSpeed ZeRO-3, the memory size is automatically determined based on the `hidden_size` specified in the config.
45
+ self.hidden_size = (
46
+ language_config["hidden_size"] if "hidden_size" in language_config else language_config["n_embd"]
47
+ )
 
 
48
  # add VLM configs
49
  self.use_nth_layer = use_nth_layer
50
  self.decoder_max_length = decoder_max_length
 
58
  self.proj_prenorm = proj_prenorm
59
  self.use_1x1_grid = use_1x1_grid
60
  super().__init__(**kwargs)
 
 
 
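For context on the `language_config` rename above, a minimal self-contained sketch of what the `language_config_attribute_map` remapping in `HCXVisionConfig.__init__` does (the map entries are taken from the file; the sample dict and its values are hypothetical):

```python
# Sketch of the GPT-2-style key remapping performed in HCXVisionConfig.__init__.
# Only the mapped keys shown in the diff are used; the sample values are made up.
language_config_attribute_map = {
    "n_embd": "hidden_size",
    "n_positions": "max_position_embeddings",
    "n_head": "num_attention_heads",
}

language_config = {"model_type": "gpt2", "n_embd": 3072, "n_positions": 4096, "n_head": 24}
for key, val in language_config_attribute_map.items():
    if key in language_config:
        language_config[val] = language_config.pop(key)

# -> {'model_type': 'gpt2', 'hidden_size': 3072, 'max_position_embeddings': 4096, 'num_attention_heads': 24}
```
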
generation_config.json ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {
2
+ "_from_model_config": true,
3
+ "transformers_version": "4.48.2"
4
+ }
merges.txt DELETED
The diff for this file is too large to render. See raw diff
 
modeling_hyperclovax.py CHANGED
@@ -2,6 +2,7 @@ import ast
2
  import contextlib
3
  import gc
4
  import json
 
5
  import os
6
  from dataclasses import dataclass
7
  from functools import partial
@@ -32,11 +33,10 @@ from transformers.models.auto import CONFIG_MAPPING
32
  from transformers.utils import ModelOutput
33
 
34
  from .configuration_hyperclovax import HCXVisionConfig
35
- from .image_processing_hyperclovax import select_best_resolution
36
 
37
  EOT = "<|endofturn|>"
38
- IMAGE_LOC = "<|dummy3|>"
39
- VIDEO_LOC = "<|_unuse_missing_100270|>"
40
 
41
 
42
  def get_rank():
@@ -220,9 +220,11 @@ def reshape_and_unpad_image_features(
220
 
221
 
222
  def anyres_postprocessing(
223
- image_forward_outs: List[torch.FloatTensor],
 
224
  image_sizes: List[List[int]],
225
  possible_resolutions: List[Tuple[int, int]],
 
226
  patch_size: int,
227
  grid_size: int,
228
  image_newline: torch.FloatTensor,
@@ -245,6 +247,8 @@ def anyres_postprocessing(
245
  dimensions of the corresponding image sample. Used for unpadding.
246
  possible_resolutions (List[Tuple[int, int]]): A list of supported resolution tuples `(height, width)` used by
247
  `reshape_and_unpad_image_features` for spatial reconstruction, especially during unpadding.
 
 
248
  patch_size (int): The spatial dimension (height and width) of the square patches the image was divided into.
249
  grid_size (int): The spatial dimension (height and width) of the square grid onto which patches are mapped.
250
  `grid_size` should be divisible by `patch_size`.
@@ -270,28 +274,102 @@ def anyres_postprocessing(
270
  assert (num_queries_vis_abstractor**0.5).is_integer(), "n_queries must be square number"
271
  height = width = int(num_queries_vis_abstractor**0.5)
272
 
 
 
273
  # post-processing (unpad, add newline)
274
  new_image_features = []
275
- for image_idx, image_feature in enumerate(image_forward_outs):
276
  if image_feature.shape[0] > 1:
277
- image_feature = reshape_and_unpad_image_features(
278
- image_feature=image_feature,
279
- height=height,
280
- width=width,
281
- image_size=image_sizes[image_idx],
282
- possible_resolutions=possible_resolutions,
283
- grid_size=grid_size, # Pass grid info if needed by helper
284
- unpad=unpad,
285
- image_newline=image_newline,
286
- )
 
 
 
287
  else:
288
  image_feature = image_feature[0]
289
- image_feature = torch.cat((image_feature, image_newline[None].to(image_feature.device)), dim=0)
 
290
  new_image_features.append(image_feature)
291
  image_features = new_image_features
292
  return image_features
293
 
294

295
  @dataclass
296
  class HCXVisionOutput(ModelOutput):
297
  """Output class for vision models, containing various computation results.
@@ -335,11 +413,9 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
335
 
336
  config_class = HCXVisionConfig
337
  vision_model_name = "vision_model"
338
- _no_split_modules = ["SiglipEncoderLayer", "LlamaDecoderLayer", "HyperCLOVAXDecoderLayer"]
339
  supports_gradient_checkpointing = True
340
  _skip_keys_device_placement = "past_key_values"
341
- _supports_flash_attn_2 = True
342
- _supports_sdpa = True
343
 
344
  def __init__(
345
  self,
@@ -358,57 +434,98 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
358
  - is_safetensor_save: Whether to save model using safetensors format.
359
 
360
  Raises:
361
- ValueError: If vision_config is not defined or if text_config is not defined.
362
  """
363
- super().__init__(config) # self.config = config
364
 
365
- # init configs
366
- text_config = self._init_text_config(config)
367
- vision_config = self._init_vision_config(config)
368
-
369
- ## possible_resolution should be matched with preprocessor_config.json
370
- config.possible_resolutions = self._init_possible_resolutions(config, vision_config)
371
 
372
- # init models & parameters
373
- with no_init_weights(): # weight will be loaded in from_pretrained
374
- self.vision_model = AutoModel.from_config(vision_config, trust_remote_code=True)
 
 
 
 
 
 
 
 
 
 
375
 
376
- self.mm_projector = self._init_mm_projector(config, text_config, vision_config)
 
 
377
 
378
- self.language_model = AutoModelForCausalLM.from_config(text_config)
379
- self.lm_head_vocab_size = getattr(text_config, "padded_vocab_size", text_config.vocab_size)
380
- self.language_model.lm_head = nn.Linear(text_config.hidden_size, self.lm_head_vocab_size, bias=False)
381
 
 
382
  if config.anyres:
383
- self.image_newline = nn.Parameter(torch.empty(text_config.hidden_size, dtype=self.dtype))
 
 
 
 
 
 
 
 
 
 
384
 
385
- # modify configs or model settings
386
- if text_config.model_type in ["llama", "hyperclovax", "gpt2"]:
387
- self.language_model.gradient_checkpointing_enable()
388
- if text_config.model_type == "hyperclovax" and self.use_liger:
389
- self.language_model._get_apply_liger_kernel_converter()(model=self.language_model)
390
 
391
- # update configs
392
- self.vision_config = vision_config = self.vision_model.config
393
- self.text_config = text_config = self.language_model.config
394
- config.update({"vision_config": vision_config})
395
- config.update({"text_config": text_config})
396
 
397
- # etc
398
- self.use_liger = kwargs.pop("use_liger", False)
399
- self.use_fused_ce = kwargs.pop("use_fused_ce", False)
400
- self.use_meansum_loss = kwargs.pop("use_meansum_loss", False)
401
- self.freeze_before_sampler = kwargs.pop("freeze_before_sampler", False)
402
- self.use_turnmeansum_loss = kwargs.pop("use_turnmeansum_loss", False)
403
- self.vision_input_chunk_size = kwargs.pop("vision_input_chunk_size", None)
404
- self.is_safetensor_save = kwargs.get("is_safetensor_save", True)
405
 
406
- use_sum_loss = True if kwargs.pop("use_sum_loss", False) else False
407
- self.reduction = self._init_reduction_type(use_sum_loss)
408
 
409
- self.vision_model_use_no_grad = None # forward 체크 및 할당
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
410
 
411
- self._backward_compatibility_gradient_checkpointing() # self.post_init() 에 포함되어 있는 gc 가능한지 확인하고 켜주는 함수
 
 
 
 
 
 
412
 
413
  def _init_weights(self, module):
414
  # copies from https://github.com/kakaobrain/honeybee/blob/main/honeybee/common_layers.py#L55
@@ -428,105 +545,26 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
428
  embed_std = 1 / torch.sqrt(torch.tensor(module.size(0), dtype=torch.float)).to(module.dtype)
429
  module.data.normal_(mean=0.0, std=embed_std)
430
 
431
- def _init_reduction_type(self, use_sum_loss):
432
- assert not (
433
- self.use_meansum_loss and self.use_turnmeansum_loss
434
- ), "use_meansum_loss and use_turnmeansum_loss cannot both be True; only one or neither may be True."
435
- if self.use_meansum_loss or self.use_turnmeansum_loss:
436
- reduction = "none"
437
- elif use_sum_loss:
438
- reduction = "sum"
439
- else:
440
- reduction = "mean"
441
- return reduction
442
-
443
- def _init_vision_config(self, config):
444
- vision_model_type = config.vision_config.model_type
445
- if vision_model_type in CONFIG_MAPPING:
446
- vision_config = CONFIG_MAPPING[vision_model_type](**config.vision_config.to_dict())
447
- vision_config.auto_map = {}
448
- else:
449
- if config.vision_model_name_or_path is not None:
450
- vision_config = AutoConfig.from_pretrained(config.vision_model_name_or_path, trust_remote_code=True)
451
- elif config.vision_config._name_or_path is not None:
452
- vision_config = AutoConfig.from_pretrained(config.vision_config._name_or_path, trust_remote_code=True)
453
- else:
454
- raise ValueError("vision_config is not defined")
455
-
456
- vision_config.anyres = config.anyres
457
- vision_config.max_num_grids = config.max_num_grids
458
- return vision_config
459
-
460
- def _init_text_config(self, config):
461
- if hasattr(config, "text_config") and config.text_config is not None:
462
- model_type = config.text_config.model_type
463
- text_config = CONFIG_MAPPING[model_type](**config.text_config.to_dict())
464
- else:
465
- raise ValueError("text_config is not defined")
466
- text_config._attn_implementation = config._attn_implementation
467
- if text_config.model_type != "hyperclovax":
468
- text_config.logits_scaling = 1.0
469
- return text_config
470
-
471
- def _init_possible_resolutions(self, config, vision_config):
472
- """possible_resolution should be matched with preprocessor_config.json"""
473
- if not getattr(config, "possible_resolutions", []):
474
- possible_resolutions = []
475
- if config.anyres:
476
- assert config.max_num_grids > 0
477
- for i in range(1, config.max_num_grids + 1):
478
- for j in range(1, config.max_num_grids + 1):
479
- if i == 1 and j == 1 and not config.use_1x1_grid:
480
- continue
481
- if i * j <= config.max_num_grids:
482
- possible_resolutions.append([i, j])
483
-
484
- possible_resolutions = [
485
- [ys * vision_config.image_size, xs * vision_config.image_size] for ys, xs in possible_resolutions
486
- ]
487
- return possible_resolutions
488
- else:
489
- return config.possible_resolutions
490
-
491
- def _init_mm_projector(self, config, text_config, vision_config):
492
- input_hidden_size = vision_config.hidden_size
493
- if config.mm_projector_type == "linear":
494
- mm_projector = nn.Linear(input_hidden_size, text_config.hidden_size)
495
- mm_projector.dtype = next(mm_projector.parameters()).dtype
496
- elif config.mm_projector_type == "cabstractor":
497
- mm_projector = HCXVisionCAbstractor(
498
- num_queries=config.num_queries_vis_abstractor_image,
499
- num_input_tokens=(vision_config.image_size // vision_config.patch_size) ** 2,
500
- encoder_hidden_size=input_hidden_size,
501
- hidden_size=input_hidden_size,
502
- output_hidden_size=text_config.hidden_size,
503
- pos_emb=config.proj_pos_emb,
504
- prenorm=config.proj_prenorm,
505
- )
506
- else:
507
- mm_projector = HCXVisionMlp(
508
- config.mm_projector_type,
509
- input_hidden_size,
510
- hidden_features=input_hidden_size, # TODO: llava 처럼 hidden_size 를 input_hidden_size 가 아니라 LLM embedding size 로 바꿔주기
511
- out_features=self.text_config.hidden_size,
512
- )
513
- return mm_projector
514
-
515
  def forward(
516
  self,
517
  input_ids: Optional[torch.LongTensor] = None,
518
- pixel_values_images: Optional[List[List[torch.FloatTensor]]] = None,
519
- image_sizes_images: Optional[List[List[Tuple[int, int]]]] = None,
520
- pixel_values_videos: Optional[List[List[torch.FloatTensor]]] = None,
521
  past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
522
  attention_mask: Optional[torch.FloatTensor] = None,
523
- position_ids: Optional[torch.LongTensor] = None,
524
  inputs_embeds: Optional[torch.FloatTensor] = None,
525
  labels: Optional[torch.LongTensor] = None,
526
  use_cache: Optional[bool] = None,
527
  output_attentions: Optional[bool] = None,
528
  output_hidden_states: Optional[bool] = None,
529
  return_dict: Optional[bool] = None,
 
 
 
 
 
 
 
 
530
  **kwargs,
531
  ) -> Union[Tuple, HCXVisionOutput]:
532
  """Forward pass of the model.
@@ -570,34 +608,38 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
570
  If return_dict=False, returns a tuple containing the above items except loss_per_sample.
571
  """
572
  output_attentions = (
573
- output_attentions if output_attentions is not None else self.config.vision_config.output_attentions
574
  )
575
  output_hidden_states = (
576
- output_hidden_states if output_hidden_states is not None else self.config.vision_config.output_hidden_states
 
 
577
  )
578
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
579
 
580
  if inputs_embeds is None and past_key_values is None:
581
- if pixel_values_images is not None or pixel_values_videos is not None:
582
- inputs_embeds = self.extract_inputs_embeds(
583
- input_ids=input_ids,
584
- pixel_values_images=pixel_values_images,
585
- image_sizes_images=image_sizes_images,
586
- pixel_values_videos=pixel_values_videos,
587
- )
588
- else:
589
- inputs_embeds = self.get_input_embeddings()(input_ids)
 
 
 
 
590
 
591
  if inputs_embeds is not None:
592
  input_ids = None
593
 
594
- ################################
595
  # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
596
  outputs = self.language_model.base_model(
597
  input_ids=input_ids,
598
  inputs_embeds=inputs_embeds,
599
  attention_mask=attention_mask,
600
- position_ids=position_ids,
601
  past_key_values=past_key_values,
602
  use_cache=use_cache,
603
  output_attentions=output_attentions,
@@ -606,7 +648,7 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
606
  )
607
 
608
  hidden_states = outputs[0]
609
- hidden_states = hidden_states * self.text_config.logits_scaling
610
 
611
  loss = None
612
  loss_per_sample = None
@@ -615,12 +657,10 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
615
  # Shift so that tokens < n predict n
616
  shift_logits = logits[..., :-1, :].contiguous()
617
  shift_labels = labels[..., 1:].contiguous()
618
-
619
  # Flatten the tokens
620
  loss_fct = CrossEntropyLoss(reduction="none") # ignore IGNORE_INDEX(-100)
621
  shift_logits = shift_logits.view(-1, self.lm_head_vocab_size)
622
  shift_labels = shift_labels.view(-1)
623
-
624
  # Enable model/pipeline parallelism
625
  shift_labels = shift_labels.to(shift_logits.device)
626
  loss = loss_fct(shift_logits, shift_labels)
@@ -642,6 +682,66 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
642
  attentions=outputs.attentions,
643
  )
644

645
  # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_input_embeddings
646
  def get_input_embeddings(self):
647
  return self.language_model.get_input_embeddings()
@@ -680,9 +780,16 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
680
  def extract_inputs_embeds(
681
  self,
682
  input_ids: Optional[torch.LongTensor] = None,
683
- pixel_values_images: Optional[List[List[torch.FloatTensor]]] = None,
684
- image_sizes_images: Optional[List[List[Tuple[int, int]]]] = None,
685
- pixel_values_videos: Optional[List[List[torch.FloatTensor]]] = None,
 
 
 
 
 
 
 
686
  ):
687
  """Extract input embeddings by processing text tokens and visual features.
688
 
@@ -698,6 +805,9 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
698
  vision_query_lengths: List of lists of lengths when each image is converted to visual tokens.
699
  non_vision_query_lengths: List of lengths of text tokens (excluding visual tokens) for each sample.
700
  img_start_ids_list: List of lists containing indices of img_start_id tokens for each sample.
 
 
 
701
  first_last_frames_slows: List of booleans indicating whether the slowfast algorithm is
702
  applied to the first or last frames of the video.
703
  is_videos: List of booleans indicating which inputs are videos.
@@ -705,193 +815,241 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
705
  Returns:
706
  Combined embeddings of text tokens and visual features.
707
  """
708
- # for convert back to List of List format
709
- len_pixel_values_images = [len(pixel_value) for pixel_value in pixel_values_images] if pixel_values_images else []
710
- len_pixel_values_videos = [len(pixel_value) for pixel_value in pixel_values_videos] if pixel_values_videos else []
711
-
712
- if sum(len_pixel_values_images) + sum(len_pixel_values_videos) == 0:
713
- return None
714
-
715
- inputs_embeds = self.get_input_embeddings()(input_ids)
716
-
717
- if sum(len_pixel_values_images) > 0:
718
- image_features_batch = self.forward_images(
719
- pixel_values_images, image_sizes_images, len_pixel_values_images
720
- )
721
- for i, image_features in enumerate(image_features_batch):
722
- if len(image_features) > 0:
723
- image_token_indices = (input_ids[i] == self.config.image_token_id).nonzero().squeeze()
724
- inputs_embeds[i][image_token_indices] = torch.cat(image_features).to(inputs_embeds.dtype)
725
-
726
- if sum(len_pixel_values_videos) > 0:
727
- video_features_batch = self.forward_videos(pixel_values_videos, len_pixel_values_videos)
728
- for i, video_features in enumerate(video_features_batch):
729
- if len(video_features) > 0:
730
- video_token_indices = (input_ids[i] == self.config.video_token_id).nonzero().squeeze()
731
- inputs_embeds[i][video_token_indices] = torch.cat(video_features).to(inputs_embeds.dtype)
732
-
733
- return inputs_embeds
734
-
735
- def forward_images(
736
- self,
737
- pixel_values_images: List[List[torch.FloatTensor]],
738
- image_sizes_images: List[List[Tuple[int, int]]],
739
- len_pixel_values_images: List[int],
740
- ) -> List[List[torch.Tensor]]:
741
- if sum(len_pixel_values_images) == 0:
742
- return None
743
-
744
- concat_pixel_values_images = torch.cat(list(chain(*pixel_values_images)), dim=0)
745
-
746
- visual_token_idx = 0 if "siglip" in self.vision_config.model_type else 1
747
- context_vision_model = torch.no_grad() if self.vision_model_use_no_grad else contextlib.nullcontext()
748
- with context_vision_model:
749
- if self.config.use_nth_layer == -1:
750
- # Replace post_layernorm of the last layer with Identity
751
- self.vision_model.vision_model.post_layernorm = nn.Identity()
752
- image_forward_outs = self.vision_model(concat_pixel_values_images)
753
- image_forward_outs = image_forward_outs.last_hidden_state[:, visual_token_idx:]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
754
  else:
755
- image_forward_outs = self.vision_model(concat_pixel_values_images, output_hidden_states=True)
756
- image_forward_outs = image_forward_outs.hidden_states[self.config.use_nth_layer][:, visual_token_idx:]
757
-
758
- image_forward_outs = image_forward_outs.to(dtype=self.mm_projector.dtype)
759
- image_forward_outs = self.mm_projector(image_forward_outs) # b (h w) d
760
-
761
- # feature 를 분할. e.g. torch.Size([18, 81, 3072]) -> [torch.Size([9, 81, 3072]), torch.Size([9, 81, 3072])]
762
- split_sizes = [pixel_value.shape[0] for pixel_value in chain(*pixel_values_images)]
763
- image_forward_outs = torch.split(image_forward_outs, split_sizes, dim=0)
764
-
765
- # newline 붙여주기 (anyres postprocessing)
766
- image_features = anyres_postprocessing(
767
- image_forward_outs=image_forward_outs,
768
- image_sizes=[image_size for image_sizes in image_sizes_images for image_size in image_sizes],
769
- num_queries_vis_abstractor=self.config.num_queries_vis_abstractor_image,
770
- unpad=self.config.unpad,
771
- patch_size=self.vision_config.patch_size,
772
- grid_size=self.vision_config.image_size,
773
- image_newline=self.image_newline,
774
- possible_resolutions=self.config.possible_resolutions,
775
- )
776
 
777
- # 원래 pixel_values_images 형태로 복원
778
- image_features = [
779
- image_features[sum(len_pixel_values_images[:i]) : sum(len_pixel_values_images[: i + 1])]
780
- for i in range(len(len_pixel_values_images))
781
- ]
 
782
 
783
- return image_features

 
785
- def forward_videos(
786
- self,
787
- pixel_values_videos: List[List[torch.FloatTensor]],
788
- len_pixel_values_videos: List[int],
789
- ) -> List[torch.Tensor]:
790
 
791
- len_video_grids = sum(len_pixel_values_videos)
792
- if len_video_grids == 0:
793
- return None
794
-
795
- # Run Vision Model
796
- concat_pixel_values_videos = torch.cat(list(chain(*pixel_values_videos)), dim=0)
797
-
798
- visual_token_idx = 0 if "siglip" in self.vision_config.model_type else 1
799
- context_vision_model = torch.no_grad() if self.vision_model_use_no_grad else contextlib.nullcontext()
800
- with context_vision_model:
801
- if self.config.use_nth_layer == -1:
802
- # Replace post_layernorm of the last layer with Identity
803
- self.vision_model.vision_model.post_layernorm = nn.Identity()
804
- video_forward_outs = self.vision_model(concat_pixel_values_videos)
805
- video_forward_outs = video_forward_outs.last_hidden_state[:, visual_token_idx:]
806
  else:
807
- video_forward_outs = self.vision_model(concat_pixel_values_videos, output_hidden_states=True)
808
- video_forward_outs = video_forward_outs.hidden_states[self.config.use_nth_layer][:, visual_token_idx:]
809
-
810
- video_forward_outs = video_forward_outs.to(dtype=self.mm_projector.dtype)
811
-
812
- # Run MM-Projector
813
- # len(num_grids) == len(num_queries_vis_abstractors) + 1
814
- grid_idx = 0
815
- num_grids = [grid_idx] # e.g. [0, 9, 18, 19, 27, 28, 36, 37, 45, 46, 54, 55, 56]
816
- num_queries_vis_abstractors = [] # e.g. [81, 81, 81, 9, 81, 9, 81, 9, 81, 9, 81, 9]
817
- len_total_frames = video_forward_outs.shape[0]
818
-
819
- if self.config.first_last_frames_slow:
820
- # TODO: 동작 확인 안 했음. 해야 함.
821
- # slowfast (first_last_frames_slow)
822
- assert len_total_frames != 0
823
- if len_total_frames <= 2:
824
- num_queries_vis_abstractors.append(self.config.num_queries_vis_abstractor_video_slow)
825
- grid_idx += len_total_frames
826
- num_grids.append(grid_idx)
827
- else:
828
- num_queries_vis_abstractors.append(self.config.num_queries_vis_abstractor_video_slow)
829
- grid_idx += 1
830
- num_grids.append(grid_idx)
831
 
832
- num_queries_vis_abstractors.append(self.config.num_queries_vis_abstractor_video_fast)
833
- grid_idx += len_total_frames - 2
834
- num_grids.append(grid_idx)
 
835
 
836
- num_queries_vis_abstractors.append(self.config.num_queries_vis_abstractor_video_slow)
837
- grid_idx += 1
838
- num_grids.append(grid_idx)
839
- else:
840
- # slowfast
841
- for pixel_values_frames in pixel_values_videos:
842
- for pixel_values_frame in pixel_values_frames:
843
- if len(pixel_values_frame) > 0:
844
- num_queries_vis_abstractors.append(self.config.num_queries_vis_abstractor_video_slow)
845
- grid_idx += 1
846
- num_grids.append(grid_idx)
847
- num_queries_vis_abstractors.append(self.config.num_queries_vis_abstractor_video_fast)
848
- grid_idx = grid_idx + len(pixel_values_frame) - 1
849
- num_grids.append(grid_idx)
850
-
851
- video_forward_outs = self.mm_projector(video_forward_outs, num_queries_vis_abstractors, num_grids)
852
-
853
- # video_group 별로 concat 처리.
854
- # 예를 들어, 3x3 grid 를 사용했을 경우, 총 9개의 feature 가 모일 때까지, grouped_features 에 리스트를 모아주고, concat 처리.
855
- video_features = [] # what we want to return
856
- target_features = []
857
- target_group_size = 0
858
- group_counter = 0
859
- video_groups = [
860
- len(frame) for frames in pixel_values_videos for frame in frames
861
- ] # for concat video features after projector
862
-
863
- for forward_out in video_forward_outs:
864
- target_group_size += len(forward_out)
865
- target_features.append(forward_out.flatten(0, 1))
866
-
867
- video_group_size = video_groups[group_counter]
868
- if video_group_size == target_group_size:
869
- video_features.append(torch.cat(target_features, dim=0))
870
- target_features = []
871
- group_counter += 1
872
- target_group_size = 0
873
-
874
- elif video_group_size < target_group_size:
875
- raise RuntimeError(f"video_group_size < target_group_size!! [{video_group_size} < {target_group_size}]")
876
-
877
- assert len(target_features) == 0, f"target_features is not empty!! {target_features}"
878
- assert len(video_groups) == len(video_features)
879
-
880
- # 원래 pixel_values_videos 형태로 복원
881
- video_features = [
882
- video_features[sum(len_pixel_values_videos[:i]) : sum(len_pixel_values_videos[: i + 1])]
883
- for i in range(len(len_pixel_values_videos))
884
- ]
885
 
886
- return video_features

887
 
888
  @torch.no_grad()
889
  def generate(
890
  self,
891
  input_ids: Optional[torch.LongTensor] = None,
892
- pixel_values_images: Optional[List[List[torch.FloatTensor]]] = None,
893
- image_sizes_images: Optional[List[List[Tuple[int, int]]]] = None,
894
- pixel_values_videos: Optional[List[List[torch.FloatTensor]]] = None,
 
 
 
 
 
 
895
  pad_token_id: Optional[int] = None,
896
  eos_token_id: Optional[int] = None,
897
  bad_words_ids: Optional[List[List[int]]] = None,
@@ -905,7 +1063,6 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
905
  repetition_penalty: float = 1.0,
906
  length_penalty: int = 1,
907
  use_cache: bool = True,
908
- verbose: bool = False,
909
  **kwargs,
910
  ) -> torch.LongTensor:
911
  """Generate text based on input tokens and images.
@@ -952,27 +1109,29 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
952
  if bad_words_ids is None:
953
  bad_words_ids = [
954
  [
955
- self.config.text_config.bos_token_id,
956
  ],
957
  [
958
- self.config.text_config.eos_token_id,
959
  ],
960
  ]
961
 
962
- if (pixel_values_images is None or all(len(pixel_values) == 0 for pixel_values in pixel_values_images)) and (
963
- pixel_values_videos is None or all(len(pixel_values) == 0 for pixel_values in pixel_values_videos)
964
- ):
965
  return self.language_model.generate(
966
  input_ids, pad_token_id=pad_token_id, eos_token_id=eos_token_id, bad_words_ids=bad_words_ids, **kwargs
967
  )
968
-
969
  inputs_embeds = self.extract_inputs_embeds(
970
  input_ids=input_ids,
971
- pixel_values_images=pixel_values_images,
972
- image_sizes_images=image_sizes_images,
973
- pixel_values_videos=pixel_values_videos,
 
 
 
 
 
 
974
  )
975
-
976
  inputs_embeds = inputs_embeds.to(device=self.language_model.device, dtype=self.language_model.dtype)
977
 
978
  # pred : torch.int64 : [batchsize, generated token_length]
@@ -981,7 +1140,7 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
981
  pad_token_id=pad_token_id,
982
  eos_token_id=eos_token_id,
983
  bad_words_ids=bad_words_ids,
984
- max_new_tokens=max_length,
985
  min_length=min_length,
986
  num_beams=num_beams,
987
  do_sample=(False if temperature == 0.0 else do_sample), # set do_sample=False if invalid temperature
@@ -992,26 +1151,9 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
992
  length_penalty=length_penalty,
993
  early_stopping=(False if num_beams <= 1 else True), # set early_stopping=False when not beam_search
994
  use_cache=use_cache,
 
995
  )
996
 
997
- if verbose:
998
- llm_query = self.tokenizer.batch_decode(
999
- [
1000
- [token_id for token_id in input_ids_row if token_id != self.tokenizer.pad_token_id]
1001
- for input_ids_row in input_ids.detach().cpu().tolist()
1002
- ],
1003
- skip_special_tokens=False,
1004
- )[0]
1005
- llm_pred = self.tokenizer.batch_decode(
1006
- [
1007
- [token_id for token_id in pred_row if token_id != self.tokenizer.pad_token_id]
1008
- for pred_row in pred.detach().cpu().tolist()
1009
- ],
1010
- skip_special_tokens=False,
1011
- )[0]
1012
- print(f"# [info] llm_query: {llm_query}")
1013
- print(f"# [info] llm_pred: {llm_pred}")
1014
-
1015
  return pred
1016
 
1017
  def to_vision_model_device(self, input_tensor: Union[torch.Tensor, List]) -> Union[torch.Tensor, List]:
@@ -1098,17 +1240,11 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
1098
  model: HCXVisionForCausalLM = super().from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
1099
  model.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
1100
 
1101
- image_token_id = model.tokenizer.encode(IMAGE_LOC, add_special_tokens=False)
1102
- assert (
1103
- len(image_token_id) == 1
1104
- ), f'"<|dummy3|>" was not encoded into a single special token. Encoding result: {image_token_id}'
1105
- model.config.image_token_id = image_token_id[0]
1106
-
1107
- video_token_id = model.tokenizer.encode(VIDEO_LOC, add_special_tokens=False)
1108
  assert (
1109
- len(video_token_id) == 1
1110
- ), f'"<|_unuse_missing_100270|>" was not encoded into a single special token. Encoding result: {video_token_id}'
1111
- model.config.video_token_id = video_token_id[0]
1112
 
1113
  model.save_only_vision = save_only_vision
1114
  model.save_only_qformer = save_only_qformer
@@ -1157,37 +1293,212 @@ class HCXVisionForCausalLM(PreTrainedModel, GenerationMixin):
1157
 
1158
  return state_dict
1159
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1160
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1161
 
1162
- class HCXVisionMlp(nn.Module):
1163
- def __init__(
1164
- self,
1165
- mm_projector_type,
1166
- in_features,
1167
- hidden_features=None,
1168
- out_features=None,
1169
- act_layer=nn.GELU,
1170
- ):
1171
- super().__init__()
1172
- out_features = out_features or in_features
1173
- hidden_features = hidden_features or in_features
1174
- self.mm_projector_type = mm_projector_type
1175
- if self.mm_projector_type == "mlp":
1176
- self.fc1 = nn.Linear(in_features, hidden_features)
1177
- self.act = act_layer()
1178
- self.fc2 = nn.Linear(hidden_features, out_features)
1179
- elif self.mm_projector_type == "inverted_mlp":
1180
- self.fc1 = nn.Linear(in_features, 2 * hidden_features)
1181
- self.act = act_layer()
1182
- self.fc2 = nn.Linear(2 * hidden_features, out_features)

1183
  else:
1184
- raise NotImplementedError("{} is not implemented".format(self.mm_projector_type))
 
1185
 
1186
- def forward(self, x):
1187
- x = self.fc1(x)
1188
- x = self.act(x)
1189
- x = self.fc2(x)
1190
- return x
1191
 
1192
 
1193
  class HCXVisionCAbstractor(nn.Module):
@@ -1259,7 +1570,7 @@ class HCXVisionCAbstractor(nn.Module):
1259
  ) -> torch.Tensor:
1260
  # x: [B, L, dim]
1261
  B, L, dim = x.shape
1262
- hw = int(L**0.5)
1263
  x = rearrange(x, "b (h w) d -> b d h w", h=hw, w=hw)
1264
 
1265
  if num_queries_vis_abstractors is not None:
@@ -1285,7 +1596,7 @@ class HCXVisionCAbstractor(nn.Module):
1285
  for i, num_queries in enumerate(num_queries_vis_abstractors):
1286
  hw = int(num_queries**0.5)
1287
  sampler = nn.AdaptiveAvgPool2d((hw, hw))
1288
- out = sampler(x[num_grids[i] : num_grids[i + 1], :])
1289
  out = self.net[2](out) # s2
1290
 
1291
  out = rearrange(out, "b d h w -> b (h w) d")
@@ -1303,8 +1614,8 @@ class HCXVisionCAbstractor(nn.Module):
1303
  depth: int = 3,
1304
  mlp_depth: int = 2,
1305
  ):
1306
- assert (n_queries**0.5).is_integer(), f"n_queries must be square number. n_queries: {n_queries}"
1307
- hw = int(n_queries**0.5)
1308
 
1309
  # RegBlock = ResBlock + SE
1310
  RegBlock = partial(
@@ -1342,3 +1653,89 @@ class HCXVisionCAbstractor(nn.Module):
1342
  layers.append(nn.Linear(output_hidden_size, output_hidden_size))
1343
  return nn.Sequential(*layers)
1344

2
  import contextlib
3
  import gc
4
  import json
5
+ import math
6
  import os
7
  from dataclasses import dataclass
8
  from functools import partial
 
33
  from transformers.utils import ModelOutput
34
 
35
  from .configuration_hyperclovax import HCXVisionConfig
36
+ from .preprocessor import select_best_resolution
37
 
38
  EOT = "<|endofturn|>"
39
+ IMG_LOC = "<|dummy3|>"
 
40
 
41
 
42
  def get_rank():
 
220
 
221
 
222
  def anyres_postprocessing(
223
+ image_forward_outs: torch.FloatTensor,
224
+ split_sizes: List[int],
225
  image_sizes: List[List[int]],
226
  possible_resolutions: List[Tuple[int, int]],
227
+ is_videos: List[bool],
228
  patch_size: int,
229
  grid_size: int,
230
  image_newline: torch.FloatTensor,
 
247
  dimensions of the corresponding image sample. Used for unpadding.
248
  possible_resolutions (List[Tuple[int, int]]): A list of supported resolution tuples `(height, width)` used by
249
  `reshape_and_unpad_image_features` for spatial reconstruction, especially during unpadding.
250
+ is_videos (List[bool]): A list of boolean flags indicating whether each corresponding sample in the batch is a
251
+ video [`True`] or an image [`False`].
252
  patch_size (int): The spatial dimension (height and width) of the square patches the image was divided into.
253
  grid_size (int): The spatial dimension (height and width) of the square grid onto which patches are mapped.
254
  `grid_size` should be divisible by `patch_size`.
 
274
  assert (num_queries_vis_abstractor**0.5).is_integer(), "n_queries must be square number"
275
  height = width = int(num_queries_vis_abstractor**0.5)
276
 
277
+ image_features = torch.split(image_forward_outs, split_sizes, dim=0)
278
+
279
  # post-processing (unpad, add newline)
280
  new_image_features = []
281
+ for image_idx, (image_feature, is_video) in enumerate(zip(image_features, is_videos)):
282
  if image_feature.shape[0] > 1:
283
+ if not is_video:
284
+ image_feature = reshape_and_unpad_image_features(
285
+ image_feature=image_feature,
286
+ height=height,
287
+ width=width,
288
+ image_size=image_sizes[image_idx],
289
+ possible_resolutions=possible_resolutions,
290
+ grid_size=grid_size, # Pass grid info if needed by helper
291
+ unpad=unpad,
292
+ image_newline=image_newline,
293
+ )
294
+ else:
295
+ image_feature = image_feature.flatten(0, 1)
296
  else:
297
  image_feature = image_feature[0]
298
+ if unpad and not is_video:
299
+ image_feature = torch.cat((image_feature, image_newline[None].to(image_feature.device)), dim=0)
300
  new_image_features.append(image_feature)
301
  image_features = new_image_features
302
  return image_features
303
 
304
 
305
+ def adaptive_anyres_postprocessing(
306
+ image_forward_outs: torch.FloatTensor,
307
+ image_sizes: List[List[int]],
308
+ possible_resolutions: List[Tuple[int, int]],
309
+ is_videos: List[bool],
310
+ group_ids: List[List[int]],
311
+ num_queries_vis_abstractors: List[List[int]],
312
+ grid_size: int,
313
+ image_newline: torch.FloatTensor,
314
+ unpad: bool = False,
315
+ ) -> List[torch.FloatTensor]:
316
+ """Adaptive AnyRes postprocessing for multi-group feature aggregation.
317
+
318
+ Processes 2D visual features into 1D sequences with group-wise adaptive processing. Each image can belong to
319
+ multiple processing groups with different query configurations. Features are processed per group and aggregated
320
+ according to group_ids.
321
+
322
+ Args:
323
+ image_forward_outs (List[torch.FloatTensor]): List of input tensors with shape
324
+ (number_of_images_in_grid, total_patches, feature_dim) containing visual features.
325
+ image_sizes (List[List[int]]): Original image dimensions for each sample. [[width, height], ... ]
326
+ possible_resolutions (List[Tuple[int, int]]): Supported resolutions. [[height, width], ... ]
327
+ is_videos (List[bool]): Flags indicating video inputs
328
+ group_ids (List[List[int]]): Group indices for feature aggregation. Each group means a single grid.
329
+ num_queries_vis_abstractors (List[List[int]]): Query numbers per group
330
+ grid_size (int): Total grid size for spatial processing
331
+ image_newline (torch.FloatTensor): Sample-wise config. Newline embedding tensor
332
+ unpad (bool, optional): Sample-wise config. Enable padding removal. Defaults to False.
333
+
334
+ Returns:
335
+ List[torch.FloatTensor]: Aggregated features per group
336
+
337
+ Raises:
338
+ AssertionError: If num_queries is not square number in any group
339
+ """
340
+ # post-processing (unpad, add newline)
341
+ new_image_features = []
342
+ for image_idx, (image_feature, is_video) in enumerate(zip(image_forward_outs, is_videos)):
343
+ num_queries_vis_abstractor = num_queries_vis_abstractors[image_idx]
344
+ assert (num_queries_vis_abstractor**0.5).is_integer(), "n_queries must be square number"
345
+ height = width = int(num_queries_vis_abstractor**0.5)
346
+
347
+ if image_feature.shape[0] > 1:
348
+ if not is_video:
349
+ image_feature = reshape_and_unpad_image_features(
350
+ image_feature=image_feature,
351
+ height=height,
352
+ width=width,
353
+ image_size=image_sizes[image_idx],
354
+ possible_resolutions=possible_resolutions,
355
+ grid_size=grid_size,
356
+ unpad=unpad,
357
+ image_newline=image_newline,
358
+ )
359
+ else:
360
+ image_feature = image_feature.flatten(0, 1)
361
+ else:
362
+ image_feature = image_feature[0]
363
+ if unpad and not is_video:
364
+ image_feature = torch.cat((image_feature, image_newline[None].to(image_feature.device)), dim=0)
365
+ new_image_features.append(image_feature)
366
+
367
+ image_features = [
368
+ torch.cat([new_image_features[group_id] for group_id in group_ids_list], dim=0) for group_ids_list in group_ids
369
+ ]
370
+ return image_features
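A tiny illustration (made-up feature sizes, matching the `group_ids` example in the docstring above) of the final aggregation step, where slow and fast features from the same frame are concatenated back into one sequence:

```python
import torch

dim = 4
new_image_features = [
    torch.randn(81, dim),   # image 0
    torch.randn(81, dim),   # image 1
    torch.randn(81, dim),   # video frame 0, slow path
    torch.randn(9, dim),    # video frame 0, fast path
]
group_ids = [[0], [1], [2, 3]]   # matches the docstring example

image_features = [
    torch.cat([new_image_features[g] for g in group], dim=0) for group in group_ids
]
print([f.shape[0] for f in image_features])   # [81, 81, 90]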
371
+
372
+
373
  @dataclass
374
  class HCXVisionOutput(ModelOutput):
375
  """Output class for vision models, containing various computation results.
 
413
 
414
  config_class = HCXVisionConfig
415
  vision_model_name = "vision_model"
416
+ _no_split_modules = ["CLIPAttention", "SiglipVisionModel"]
417
  supports_gradient_checkpointing = True
418
  _skip_keys_device_placement = "past_key_values"
 
 
419
 
420
  def __init__(
421
  self,
 
434
  - is_safetensor_save: Whether to save model using safetensors format.
435
 
436
  Raises:
437
+ ValueError: If vision_config is not defined or if language_config is not defined.
438
  """
439
+ super().__init__(config)
440
 
441
+ self.flag_changed_max_position_embeddings = False
 
 
 
 
 
442
 
443
+ vision_model_type = config.vision_config["model_type"]
444
+ if vision_model_type in CONFIG_MAPPING:
445
+ vision_config = CONFIG_MAPPING[vision_model_type](**config.vision_config)
446
+ vision_config.auto_map = {}
447
+ else:
448
+ if config.vision_model_name_or_path is not None:
449
+ vision_config = AutoConfig.from_pretrained(config.vision_model_name_or_path, trust_remote_code=True)
450
+ elif config.vision_config["_name_or_path"] is not None:
451
+ vision_config = AutoConfig.from_pretrained(
452
+ config.vision_config["_name_or_path"], trust_remote_code=True
453
+ )
454
+ else:
455
+ raise ValueError("vision_config is not defined")
456
 
457
+ self.use_liger = kwargs.pop("use_liger", False)
458
+ self.use_fused_ce = kwargs.pop("use_fused_ce", False)
459
+ self.reduction = "sum" if kwargs.pop("use_sum_loss", False) else "mean"
460
 
461
+ self.vision_config = vision_config
462
+ vision_config.anyres = config.anyres
463
+ vision_config.max_num_grids = config.max_num_grids
464
 
465
+ possible_resolutions = []
466
  if config.anyres:
467
+ assert config.max_num_grids > 0
468
+ for i in range(1, config.max_num_grids + 1):
469
+ for j in range(1, config.max_num_grids + 1):
470
+ if i == 1 and j == 1 and not config.use_1x1_grid:
471
+ continue
472
+ if i * j <= config.max_num_grids:
473
+ possible_resolutions.append([i, j])
474
+
475
+ possible_resolutions = [
476
+ [ys * vision_config.image_size, xs * vision_config.image_size] for ys, xs in possible_resolutions
477
+ ]
478
 
479
+ self.possible_resolutions = possible_resolutions
 
 
 
 
480
 
481
+ with no_init_weights():
482
+ self.vision_model = AutoModel.from_config(
483
+ vision_config, trust_remote_code=True
484
+ ) # weight will be loaded in from_pretrained
 
485
 
486
+ assert config.language_config["model_type"] == "llama"
487
+ language_config = CONFIG_MAPPING["llama"](**config.language_config)
488
+ language_config._attn_implementation = kwargs.get("attn_implementation", "sdpa") # attention backend; defaults to "sdpa", pass "flash_attention_2" to enable FlashAttention
489
+ language_config.logits_scaling = 1.0
490
+
491
+ self.language_config = language_config
492
+ self.language_model = AutoModelForCausalLM.from_config(language_config)
 
493
 
494
+ self.language_model.gradient_checkpointing_enable()
495
+ self.num_queries_vis_abstractor = config.num_queries_vis_abstractor
496
 
497
+ # mm_projector (== connector): vision_model_hidden_size -> LLM embedding size
498
+ input_hidden_size = vision_config.hidden_size
499
+ self.mm_projector = HCXVisionCAbstractor(
500
+ num_queries=self.num_queries_vis_abstractor,
501
+ num_input_tokens=(self.vision_config.image_size // self.vision_config.patch_size) ** 2,
502
+ encoder_hidden_size=input_hidden_size,
503
+ hidden_size=input_hidden_size,
504
+ output_hidden_size=language_config.hidden_size,
505
+ pos_emb=config.proj_pos_emb,
506
+ prenorm=config.proj_prenorm,
507
+ )
508
+ self.use_nth_layer = config.use_nth_layer
509
+ self.config.update({"vision_config": self.vision_model.config.to_dict()})
510
+ self.config.update({"language_config": self.language_model.config.to_dict()})
511
+ self.lm_head_vocab_size = (
512
+ language_config.padded_vocab_size
513
+ if hasattr(language_config, "padded_vocab_size")
514
+ else language_config.vocab_size
515
+ )
516
+ self.language_model.lm_head = nn.Linear(language_config.hidden_size, self.lm_head_vocab_size, bias=False)
517
+ self.model_parallel = False
518
+ self.device_map = None
519
+ self.use_no_grad = None
520
+ self.decoder_max_length = config.decoder_max_length
521
 
522
+ self.anyres = config.anyres
523
+ self.unpad = config.unpad
524
+ if self.anyres:
525
+ self.image_newline = nn.Parameter(torch.empty(language_config.hidden_size, dtype=self.dtype))
526
+
527
+ self.is_safetensor_save = kwargs.get("is_safetensor_save", True)
528
+ self._backward_compatibility_gradient_checkpointing()
529
 
530
  def _init_weights(self, module):
531
  # copies from https://github.com/kakaobrain/honeybee/blob/main/honeybee/common_layers.py#L55
 
545
  embed_std = 1 / torch.sqrt(torch.tensor(module.size(0), dtype=torch.float)).to(module.dtype)
546
  module.data.normal_(mean=0.0, std=embed_std)
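A quick check of the initialization scale used above for the `image_newline` parameter, i.e. std = 1 / sqrt(hidden_size); the hidden size of 3072 is taken from the shape comments elsewhere in this file and is an assumption here:

```python
import torch

hidden_size = 3072  # assumed LLM hidden size for this model
embed_std = 1 / torch.sqrt(torch.tensor(hidden_size, dtype=torch.float))
image_newline = torch.empty(hidden_size).normal_(mean=0.0, std=embed_std.item())
print(round(embed_std.item(), 5))  # ~0.01804
```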
547
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
548
  def forward(
549
  self,
550
  input_ids: Optional[torch.LongTensor] = None,
551
+ pixel_values: Optional[List[List[torch.FloatTensor]]] = None,
 
 
552
  past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
553
  attention_mask: Optional[torch.FloatTensor] = None,
 
554
  inputs_embeds: Optional[torch.FloatTensor] = None,
555
  labels: Optional[torch.LongTensor] = None,
556
  use_cache: Optional[bool] = None,
557
  output_attentions: Optional[bool] = None,
558
  output_hidden_states: Optional[bool] = None,
559
  return_dict: Optional[bool] = None,
560
+ image_sizes: Optional[List[List[List[int]]]] = None,
561
+ vision_query_lengths: Optional[List[List[int]]] = None,
562
+ non_vision_query_lengths: Optional[List[int]] = None,
563
+ img_start_ids_list: Optional[List[List[int]]] = None,
564
+ num_queries_vis_abstractors: Optional[List[List[int]]] = None,
565
+ num_queries_vis_abstractors_slow: Optional[List[List[int]]] = None,
566
+ first_last_frames_slows: Optional[List[bool]] = None,
567
+ is_video_list: Optional[List[bool]] = None,
568
  **kwargs,
569
  ) -> Union[Tuple, HCXVisionOutput]:
570
  """Forward pass of the model.
 
608
  If return_dict=False, returns a tuple containing the above items except loss_per_sample.
609
  """
610
  output_attentions = (
611
+ output_attentions if output_attentions is not None else self.config.vision_config["output_attentions"]
612
  )
613
  output_hidden_states = (
614
+ output_hidden_states
615
+ if output_hidden_states is not None
616
+ else self.config.vision_config["output_hidden_states"]
617
  )
618
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
619
 
620
  if inputs_embeds is None and past_key_values is None:
621
+ inputs_embeds = self.extract_inputs_embeds(
622
+ input_ids=input_ids,
623
+ pixel_values=pixel_values,
624
+ past_key_values=past_key_values,
625
+ image_sizes=image_sizes,
626
+ vision_query_lengths=vision_query_lengths,
627
+ non_vision_query_lengths=non_vision_query_lengths,
628
+ img_start_ids_list=img_start_ids_list,
629
+ num_queries_vis_abstractors=num_queries_vis_abstractors,
630
+ num_queries_vis_abstractors_slow=num_queries_vis_abstractors_slow,
631
+ first_last_frames_slows=first_last_frames_slows,
632
+ is_videos=is_video_list,
633
+ )
634
 
635
  if inputs_embeds is not None:
636
  input_ids = None
637
 
 
638
  # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
639
  outputs = self.language_model.base_model(
640
  input_ids=input_ids,
641
  inputs_embeds=inputs_embeds,
642
  attention_mask=attention_mask,
 
643
  past_key_values=past_key_values,
644
  use_cache=use_cache,
645
  output_attentions=output_attentions,
 
648
  )
649
 
650
  hidden_states = outputs[0]
651
+ hidden_states = hidden_states * self.language_config.logits_scaling
652
 
653
  loss = None
654
  loss_per_sample = None
 
657
  # Shift so that tokens < n predict n
658
  shift_logits = logits[..., :-1, :].contiguous()
659
  shift_labels = labels[..., 1:].contiguous()
 
660
  # Flatten the tokens
661
  loss_fct = CrossEntropyLoss(reduction="none") # ignore IGNORE_INDEX(-100)
662
  shift_logits = shift_logits.view(-1, self.lm_head_vocab_size)
663
  shift_labels = shift_labels.view(-1)
 
664
  # Enable model/pipeline parallelism
665
  shift_labels = shift_labels.to(shift_logits.device)
666
  loss = loss_fct(shift_logits, shift_labels)
 
682
  attentions=outputs.attentions,
683
  )
684
 
685
+ def determine_non_vision_query_lengths(
686
+ self, input_ids: torch.LongTensor, pad_id: int, img_start_id: int
687
+ ) -> List[int]:
688
+ """Calculate the lengths of non-vision query parts in the input.
689
+
690
+ This method calculates the length of text tokens (excluding visual tokens) for each sample.
691
+ When input_ids are collated, they are padded with pad_id on the right, so this method finds
692
+ these values by identifying pad tokens and img_start_id tokens.
693
+
694
+ Args:
695
+ input_ids: Input token IDs with img_start_id markers for image positions.
696
+ pad_id: Token ID used for padding.
697
+ img_start_id: Token ID marking the start of image data.
698
+
699
+ Returns:
700
+ List of lengths of non-vision query parts for each sample in the batch.
701
+ """
702
+ non_vision_query_lengths = []
703
+ batch_size, len_seq = input_ids.size(0), input_ids.size(1)
704
+
705
+ for i in range(batch_size):
706
+ temp_idx = (input_ids[i] == pad_id).nonzero()
707
+ eos_idx = temp_idx[0, 0].item() if len(temp_idx) > 0 else len_seq
708
+ num_imgs = (input_ids[i] == img_start_id).sum().item()
709
+ non_vision_query_lengths.append(eos_idx - num_imgs)
710
+
711
+ if all([pad_id in input_id for input_id in input_ids.tolist()]):
712
+ non_vision_query_lengths = [
713
+ non_vision_query_length + 1 for non_vision_query_length in non_vision_query_lengths
714
+ ]
715
+
716
+ return non_vision_query_lengths
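A standalone sketch of the same counting logic with toy token ids (`PAD` and `IMG` are placeholders, not the real tokenizer ids); the real method additionally adds 1 to every length when every sample in the batch contains padding:

```python
import torch

PAD, IMG = 0, 99   # placeholder ids, not the real tokenizer's
input_ids = torch.tensor([
    [5, 6, IMG, 7, 8, PAD, PAD],   # 4 text tokens + 1 image marker, right-padded
    [5, IMG, 6, IMG, 7, 8, 9],     # 5 text tokens + 2 image markers, no padding
])

non_vision_query_lengths = []
for row in input_ids:
    pad_pos = (row == PAD).nonzero()
    eos_idx = pad_pos[0, 0].item() if len(pad_pos) > 0 else row.numel()
    num_imgs = (row == IMG).sum().item()
    non_vision_query_lengths.append(eos_idx - num_imgs)

print(non_vision_query_lengths)   # [4, 5]
```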
717
+
718
+ def determine_vision_query_lengths(
719
+ self, image_features: List[List[torch.Tensor]], image_cnts: List[int]
720
+ ) -> List[List[int]]:
721
+ """Calculate the lengths of vision query parts in the input.
722
+
723
+ This method calculates the lengths of visual tokens for each image in each sample based on
724
+ the shapes of image feature tensors. For samples without any images, a dummy image is included
725
+ but then converted to an empty list.
726
+
727
+ Args:
728
+ image_features: List of lists of image features tensors.
729
+ image_cnts: List of counts of images for each sample in the batch.
730
+
731
+ Returns:
732
+ List of lists of lengths of visual tokens for each image in each sample.
733
+ """
734
+ vision_query_lengths = [
735
+ [image_feature.size(0) for image_feature in image_feature_list] for image_feature_list in image_features
736
+ ]
737
+
738
+ for i, image_cnt in enumerate(image_cnts):
739
+ if image_cnt == 0:
740
+ assert len(vision_query_lengths[i]) == 1 # currently a single dummy (black) image is included
740
+ vision_query_lengths[i] = [] # convert to an empty list
742
+
743
+ return vision_query_lengths
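A toy illustration (made-up shapes) of how the per-image visual token counts are read off the feature tensors, and how a sample whose only image is the dummy black image ends up with an empty list:

```python
import torch

image_features = [
    [torch.randn(81, 4)],                      # sample 0: only the dummy black image
    [torch.randn(90, 4), torch.randn(81, 4)],  # sample 1: two real images
]
image_cnts = [0, 2]                            # img_start tokens per sample

vision_query_lengths = [[f.size(0) for f in feats] for feats in image_features]
for i, cnt in enumerate(image_cnts):
    if cnt == 0:
        vision_query_lengths[i] = []           # drop the dummy entry

print(vision_query_lengths)   # [[], [90, 81]]
```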
744
+
745
  # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_input_embeddings
746
  def get_input_embeddings(self):
747
  return self.language_model.get_input_embeddings()
 
780
  def extract_inputs_embeds(
781
  self,
782
  input_ids: Optional[torch.LongTensor] = None,
783
+ pixel_values: Optional[List[List[torch.FloatTensor]]] = None, # list of list of 4D tensors
784
+ past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
785
+ image_sizes: Optional[List[List[List[int]]]] = None,
786
+ vision_query_lengths: Optional[List[List[int]]] = None,
787
+ non_vision_query_lengths: Optional[List[int]] = None,
788
+ img_start_ids_list: Optional[List[List[int]]] = None,
789
+ num_queries_vis_abstractors: Optional[List[List[int]]] = None,
790
+ num_queries_vis_abstractors_slow: Optional[List[List[int]]] = None,
791
+ first_last_frames_slows: Optional[List[bool]] = None,
792
+ is_videos: Optional[List[bool]] = None,
793
  ):
794
  """Extract input embeddings by processing text tokens and visual features.
795
 
 
805
  vision_query_lengths: List of lists of lengths when each image is converted to visual tokens.
806
  non_vision_query_lengths: List of lengths of text tokens (excluding visual tokens) for each sample.
807
  img_start_ids_list: List of lists containing indices of img_start_id tokens for each sample.
808
+ num_queries_vis_abstractors: List of lists containing number of visual tokens for each image grid.
809
+ num_queries_vis_abstractors_slow: List of lists containing number of visual tokens for
810
+ the slow part when applying the slowfast algorithm to video frames.
811
  first_last_frames_slows: List of booleans indicating whether the slowfast algorithm is
812
  applied to the first or last frames of the video.
813
  is_videos: List of booleans indicating which inputs are videos.
 
815
  Returns:
816
  Combined embeddings of text tokens and visual features.
817
  """
818
+ inputs_embeds = None
819
+ if past_key_values:
820
+ pass
821
+ else:
822
+ # Flatten CLIP and connector for feature encoding, then convert back to List of List format
823
+ len_pixel_values = [len(pixel_value) for pixel_value in pixel_values]
824
+ concat_pixel_values = torch.cat(list(chain(*pixel_values)), dim=0) # list of list of 4D Tensor
825
+ visual_token_idx = 0 if "siglip" in self.vision_config.model_type else 1
826
+ # Check if all parameters of the model require_grad=False
827
+ if self.use_no_grad is None:
828
+ self.use_no_grad = all(not p.requires_grad for p in self.vision_model.vision_model.encoder.parameters())
829
+ context = torch.no_grad() if self.use_no_grad else contextlib.nullcontext()
830
+ with context:
831
+ if self.use_no_grad:
832
+ # Chunked encoding (a fixed number of chunks, e.g. 10) was considered here,
833
+ # but no memory benefit was observed, so a single chunk is used for now.
834
+ n_chunks = 1
835
+ else:
836
+ n_chunks = 1
837
+ total_len = concat_pixel_values.size(0)
838
+ # Calculate the chunk size from the total data length (currently a single chunk)
839
+ chunk_size = math.ceil(total_len / n_chunks) if total_len > 0 else 1
840
+ image_forward_outs_chunks = []
841
+
842
+ for i in range(n_chunks):
843
+ start = i * chunk_size
844
+ end = (i + 1) * chunk_size
845
+ # Current chunk slice (could be an empty tensor if there's no data)
846
+ chunk = concat_pixel_values[start:end].to(self.vision_model.dtype)
847
+ # If the current chunk size is smaller than chunk_size, pad with dummy data
848
+ if chunk.size(0) < chunk_size:
849
+ # print(f"chunk.size(0): {chunk.size(0)}, chunk_size: {chunk_size}")
850
+ pad_size = chunk_size - chunk.size(0)
851
+ # Create dummy tensor based on concat_pixel_values shape
852
+ dummy_shape = (pad_size,) + tuple(concat_pixel_values.shape[1:])
853
+ dummy = torch.zeros(
854
+ dummy_shape,
855
+ dtype=concat_pixel_values.dtype,
856
+ device=concat_pixel_values.device,
857
+ )
858
+ chunk = torch.cat([chunk, dummy], dim=0)
859
+
860
+ # Pass the chunk through the vision model (processed according to use_nth_layer)
861
+ if self.use_nth_layer == -1:
862
+ # Replace post_layernorm of the last layer with Identity
863
+ self.vision_model.vision_model.post_layernorm = nn.Identity()
864
+ outs = self.vision_model(chunk)
865
+ outs = outs.last_hidden_state[:, visual_token_idx:]
866
+ else:
867
+ outs = self.vision_model(chunk, output_hidden_states=True)
868
+ outs = outs.hidden_states[self.use_nth_layer][:, visual_token_idx:]
869
+ image_forward_outs_chunks.append(outs)
870
+
871
+ # Concatenate results from all chunks
872
+ image_forward_outs = torch.cat(image_forward_outs_chunks, dim=0).to(image_forward_outs_chunks[0].dtype)
873
+
874
+ if num_queries_vis_abstractors is None:
875
+ assert num_queries_vis_abstractors_slow is None
876
+ image_sizes = list(chain(*image_sizes))
877
+ if is_videos is not None:
878
+ is_videos = list(chain(*is_videos))
879
+ group_ids = None
880
+ image_forward_outs = image_forward_outs.to(dtype=self.mm_projector.dtype)
881
+ image_forward_outs = self.mm_projector(image_forward_outs)
882
  else:
883
+ # adaptive anyres is only implemented in HCXVisionCAbstractor
884
+ assert isinstance(self.mm_projector, HCXVisionCAbstractor)
885
+
886
+ (
887
+ num_queries_vis_abstractors,
888
+ num_grids,
889
+ image_sizes,
890
+ is_videos,
891
+ group_ids,
892
+ ) = self.compute_adaptive_params(
893
+ pixel_values,
894
+ num_queries_vis_abstractors,
895
+ num_queries_vis_abstractors_slow,
896
+ image_sizes,
897
+ is_videos,
898
+ first_last_frames_slows,
899
+ )
 
 
 
 
900
 
901
+ image_forward_outs = image_forward_outs.to(dtype=self.mm_projector.dtype)
902
+ image_forward_outs = self.mm_projector(
903
+ image_forward_outs,
904
+ num_queries_vis_abstractors=num_queries_vis_abstractors,
905
+ num_grids=num_grids,
906
+ )
907
 
908
+ if self.anyres:
909
+ split_sizes = [pixel_value.shape[0] for pixel_value in chain(*pixel_values)]
910
+
911
+ if num_queries_vis_abstractors is None:
912
+ image_features = anyres_postprocessing(
913
+ image_forward_outs=image_forward_outs,
914
+ split_sizes=split_sizes,
915
+ image_sizes=image_sizes,
916
+ num_queries_vis_abstractor=self.num_queries_vis_abstractor,
917
+ unpad=self.unpad,
918
+ is_videos=is_videos,
919
+ patch_size=self.vision_model.config.patch_size,
920
+ grid_size=self.vision_model.config.image_size,
921
+ image_newline=self.image_newline,
922
+ possible_resolutions=self.possible_resolutions,
923
+ )
924
+ else:
925
+ image_features = adaptive_anyres_postprocessing(
926
+ image_forward_outs=image_forward_outs,
927
+ image_sizes=image_sizes,
928
+ num_queries_vis_abstractors=num_queries_vis_abstractors,
929
+ unpad=self.unpad,
930
+ is_videos=is_videos,
931
+ grid_size=self.vision_model.config.image_size,
932
+ image_newline=self.image_newline,
933
+ possible_resolutions=self.possible_resolutions,
934
+ group_ids=group_ids,
935
+ )
936
+ else:
937
+ if num_queries_vis_abstractors is None:
938
+ image_features = [image_forward_out for image_forward_out in image_forward_outs]
939
+ else:
940
+ image_features = [image_forward_out.unsqueeze(0) for image_forward_out in image_forward_outs]
941
+
942
+ # print(f"BEFORE GROUPING: len(image_features): {len(image_features)}")
943
+ image_features = [
944
+ image_features[sum(len_pixel_values[:i]) : sum(len_pixel_values[: i + 1])]
945
+ for i in range(len(len_pixel_values))
946
+ ]
947
 
948
+ batch_size = input_ids.size(0)
949
+ image_feature_dim = image_features[0][0].size(1)
950
+ image_feature_dtype = image_features[0][0].dtype
 
 
951
 
952
+ if img_start_ids_list is None:
953
+ image_cnts = (input_ids == self.config.img_start_id).sum(dim=1).tolist()
 
 
 
 
 
 
 
 
 
 
 
 
 
954
  else:
955
+ image_cnts = [len(img_start_ids) for img_start_ids in img_start_ids_list]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
956
 
957
+ if non_vision_query_lengths is None:
958
+ non_vision_query_lengths = self.determine_non_vision_query_lengths(
959
+ input_ids, self.tokenizer.pad_token_id, self.config.img_start_id
960
+ )
961
 
962
+ if vision_query_lengths is None:
963
+ vision_query_lengths = self.determine_vision_query_lengths(image_features, image_cnts)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
964
 
965
+ # Slicing is faster than concatenation
966
+ len_inputs_embeds = max(
967
+ [
968
+ sum(vision_query_length) + non_vision_query_length
969
+ for non_vision_query_length, vision_query_length in zip(
970
+ non_vision_query_lengths, vision_query_lengths
971
+ )
972
+ ]
973
+ )
974
+ len_inputs_embeds = min(self.decoder_max_length, len_inputs_embeds)
975
+
976
+ inputs_embeds = torch.zeros(
977
+ [batch_size, len_inputs_embeds, image_feature_dim],
978
+ dtype=image_feature_dtype,
979
+ device=self.device,
980
+ requires_grad=True,
981
+ ).clone()
982
+ # temp_embeds : torch.bfloat16 : [batchsize, 174, 3072]
983
+ temp_embeds = self.get_input_embeddings()(input_ids)
984
+
985
+ # The complete format is <PROMPT><USER_PREFIX><VISION_QUERIES>Sentence
986
+ for batch_idx, sample in enumerate(input_ids):
987
+ # Concatenate with visual tokens and then slice
988
+ non_vision_query_length = non_vision_query_lengths[batch_idx]
989
+ # Keep only this sample's text tokens plus its image-start markers
990
+ sample = sample[: non_vision_query_length + image_cnts[batch_idx]]
991
+
992
+ if image_cnts[batch_idx] == 0: # Text instruction data doesn't insert image features
993
+ temp_idx = 0
994
+ # Reference: https://github.com/haotian-liu/LLaVA/commit/44e0562f9497fb79f042427307472a87d266d90a#diff-4477387d506ccb1897a13972cba26c9da3fad4d3e1c32ec4b8bd8ff7acd3f292
995
+ # https://github.com/intel/intel-extension-for-transformers/issues/1201#issuecomment-1915875119
996
+ inputs_embeds[batch_idx, :non_vision_query_length] = temp_embeds[batch_idx][
997
+ :non_vision_query_length
998
+ ]
999
+ inputs_embeds[batch_idx, temp_idx:temp_idx] = image_features[batch_idx][0][
1000
+ 0:0
1001
+ ] # First image of batch_idx sample (dummy image)
1002
+ else:
1003
+ if img_start_ids_list is None:
1004
+ img_start_ids = (sample == self.config.img_start_id).nonzero()
1005
+ else:
1006
+ img_start_ids = img_start_ids_list[batch_idx]
1007
+ assert len(img_start_ids) == image_cnts[batch_idx] == len(image_features[batch_idx])
1008
+ # Initialize starting points for input embeddings and temporary embeddings
1009
+ input_start, temp_start = 0, 0
1010
+
1011
+ # Iterate through each image starting point in the batch
1012
+ for multi_img_idx, img_start_idx in enumerate(img_start_ids):
1013
+ # Calculate token length up to the current image starting point
1014
+ token_len = img_start_idx - temp_start
1015
+
1016
+ # Copy tokens to inputs_embeds
1017
+ inputs_embeds[batch_idx, input_start : input_start + token_len] = temp_embeds[
1018
+ batch_idx, temp_start : temp_start + token_len
1019
+ ]
1020
+
1021
+ inputs_embeds[
1022
+ batch_idx,
1023
+ input_start
1024
+ + token_len : input_start
1025
+ + token_len
1026
+ + vision_query_lengths[batch_idx][multi_img_idx],
1027
+ ] = image_features[batch_idx][multi_img_idx]
1028
+
1029
+ # Update starting points for next token processing
1030
+ input_start += token_len + vision_query_lengths[batch_idx][multi_img_idx]
1031
+ temp_start += token_len + 1 # Increase by 1 to skip the image start token
1032
+
1033
+ # Process tokens after the last image end token
1034
+ token_len = min(sample[temp_start:].size(0), inputs_embeds.size(1) - input_start)
1035
+ inputs_embeds[batch_idx, input_start : input_start + token_len] = temp_embeds[
1036
+ batch_idx, temp_start : temp_start + token_len
1037
+ ]
1038
+ return inputs_embeds
1039
 
1040
  @torch.no_grad()
1041
  def generate(
1042
  self,
1043
  input_ids: Optional[torch.LongTensor] = None,
1044
+ pixel_values: Optional[List[List[torch.FloatTensor]]] = None,
1045
+ image_sizes: Optional[List[List[List[int]]]] = None,
1046
+ vision_query_lengths: Optional[List[List[int]]] = None,
1047
+ non_vision_query_lengths: Optional[List[int]] = None,
1048
+ num_queries_vis_abstractors: Optional[List[List[int]]] = None,
1049
+ num_queries_vis_abstractors_slow: Optional[List[List[int]]] = None,
1050
+ first_last_frames_slows: Optional[List[bool]] = None,
1051
+ is_videos: Optional[List[bool]] = None,
1052
+ img_start_ids_list: Optional[List[List[int]]] = None,
1053
  pad_token_id: Optional[int] = None,
1054
  eos_token_id: Optional[int] = None,
1055
  bad_words_ids: Optional[List[List[int]]] = None,
 
1063
  repetition_penalty: float = 1.0,
1064
  length_penalty: int = 1,
1065
  use_cache: bool = True,
 
1066
  **kwargs,
1067
  ) -> torch.LongTensor:
1068
  """Generate text based on input tokens and images.
 
1109
  if bad_words_ids is None:
1110
  bad_words_ids = [
1111
  [
1112
+ self.config.language_config["bos_token_id"],
1113
  ],
1114
  [
1115
+ self.config.language_config["eos_token_id"],
1116
  ],
1117
  ]
1118
 
1119
+ if pixel_values is None:
 
 
1120
  return self.language_model.generate(
1121
  input_ids, pad_token_id=pad_token_id, eos_token_id=eos_token_id, bad_words_ids=bad_words_ids, **kwargs
1122
  )
 
1123
  inputs_embeds = self.extract_inputs_embeds(
1124
  input_ids=input_ids,
1125
+ pixel_values=self.to_vision_model_device(pixel_values),
1126
+ image_sizes=image_sizes,
1127
+ vision_query_lengths=vision_query_lengths,
1128
+ non_vision_query_lengths=non_vision_query_lengths,
1129
+ img_start_ids_list=img_start_ids_list,
1130
+ num_queries_vis_abstractors=num_queries_vis_abstractors,
1131
+ num_queries_vis_abstractors_slow=num_queries_vis_abstractors_slow,
1132
+ first_last_frames_slows=first_last_frames_slows,
1133
+ is_videos=is_videos,
1134
  )
 
1135
  inputs_embeds = inputs_embeds.to(device=self.language_model.device, dtype=self.language_model.dtype)
1136
 
1137
  # pred : torch.int64 : [batchsize, generated token_length]
 
1140
  pad_token_id=pad_token_id,
1141
  eos_token_id=eos_token_id,
1142
  bad_words_ids=bad_words_ids,
1143
+ max_length=max_length,
1144
  min_length=min_length,
1145
  num_beams=num_beams,
1146
  do_sample=(False if temperature == 0.0 else do_sample), # set do_sample=False if invalid temperature
 
1151
  length_penalty=length_penalty,
1152
  early_stopping=(False if num_beams <= 1 else True), # set early_stopping=False when not beam_search
1153
  use_cache=use_cache,
1154
+ **kwargs,
1155
  )
1156
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1157
  return pred
1158
 
1159
  def to_vision_model_device(self, input_tensor: Union[torch.Tensor, List]) -> Union[torch.Tensor, List]:
 
1240
  model: HCXVisionForCausalLM = super().from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
1241
  model.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
1242
 
1243
+ img_start_id = model.tokenizer.encode(IMG_LOC, add_special_tokens=False)
 
 
 
 
 
 
1244
  assert (
1245
+ len(img_start_id) == 1
1246
+ ), f'"<|dummy3|>" was not encoded into a single special token. Encoding result: {img_start_id}'
1247
+ model.config.img_start_id = img_start_id[0]
1248
 
1249
  model.save_only_vision = save_only_vision
1250
  model.save_only_qformer = save_only_qformer
 
1293
 
1294
  return state_dict
1295
 
1296
+ def compute_adaptive_params(
1297
+ self,
1298
+ pixel_values: Optional[List[List[torch.FloatTensor]]] = None,
1299
+ num_queries_vis_abstractors: Optional[List[List[int]]] = None,
1300
+ num_queries_vis_abstractors_slow: Optional[List[List[int]]] = None,
1301
+ image_sizes: Optional[List[List[List[int]]]] = None,
1302
+ is_videos: Optional[List[bool]] = None,
1303
+ first_last_frames_slows: Optional[List[bool]] = None,
1304
+ ) -> Tuple[List[int], List[int], List[List[int]], List[bool], List[List[int]]]:
1305
+ """Compute adaptive parameters for processing different image and video inputs.
1306
+
1307
+ This method calculates parameters needed for adaptive processing, especially when handling
1308
+ variable resolutions or applying the slowfast algorithm to video frames. It flattens
1309
+ batch-level inputs (lists of lists) into single lists representing all images/frames
1310
+ in the batch. Based on slowfast configuration, it may split video frames into 'slow'
1311
+ and 'fast' components, adjusting query counts and grid indices accordingly.
1312
 
1313
+ Args:
1314
+ pixel_values: List of lists of image tensors (per sample). Used to determine the initial number of grids per
1315
+ image/frame.
1316
+ num_queries_vis_abstractors: List of lists (per sample) containing the base number of visual tokens
1317
+ generated by the visual abstractor for each image grid
1318
+ (e.g., 81 for a full grid, 9 for a subsampled/fast grid).
1319
+ num_queries_vis_abstractors_slow: List of lists (per sample) containing the number of visual tokens for the
1320
+ 'slow' path when applying slowfast. Non-zero values here trigger the slowfast processing logic.
1321
+ image_sizes: List of lists (per sample) of original image dimensions ([width, height]).
1322
+ is_videos: List of lists (per sample) of booleans indicating if each input item is part of a video sequence.
1323
+ first_last_frames_slows: List (per sample) of booleans. If True, slowfast logic
1324
+ (if active based on `num_queries_vis_abstractors_slow`) is applied only to the first or last frame(s)
1325
+ within each video sequence.
1326
 
1327
+ Returns:
1328
+ Tuple containing:
1329
+ - num_queries_vis_abstractors: Flattened list of final query counts per processed grid.
1330
+ Values might be adjusted based on slow/fast splitting
1331
+ (e.g., using values from `num_queries_vis_abstractors_slow` for slow frames).
1332
+ Example: [81, 81, 81, 9, 81, 9, ...] (Image, Image, Vid_Slow, Vid_Fast, Vid_Slow, Vid_Fast...)
1333
+ - num_grids: Flattened list representing cumulative grid counts, acting as end indices for slicing the
1334
+ flattened `image_forward_outs`. Adjusted for slow/fast splits.
1335
+ Example: [0, 1, 9, 10, 18, 19, 27, ...] (Indices after Grid0_Slow(1),
1336
+ Grid1_Fast(8), Grid2_Slow(1), Grid3_Fast(8)...).
1337
+ - image_sizes: Flattened list of image dimensions ([width, height]), potentially duplicated if slow/fast
1338
+ splitting occurred.
1339
+ - is_videos: Flattened list of booleans indicating video status, potentially duplicated for
1340
+ slow/fast splits. Example: [False, False, True, True, True, True, ...]
1341
+ (Image1, Image2, Vid_grid1_slow, Vid_grid1_fast, Vid_grid2_slow, Vid_grid2_fast...)
1342
+ - group_ids: List of lists, grouping indices that correspond to the same original image or frame.
1343
+ If a frame is split into slow/fast, its group will contain multiple indices.
1344
+ Example: [[0], [1], [2, 3], [4, 5], ...]
1345
+ (Group for Image1, Group for Image2, Group for Vid1_Slow+Fast, Group for Vid2_Slow+Fast...).
1346
+
1347
+ Raises:
1348
+ AssertionError: If input validation fails (e.g., negative query counts).
1349
+ Exception: If an unexpected case is encountered during slowfast processing.
1350
+ """
1351
+
1352
+ # Check if all elements are integers greater than or equal to 0
1353
+ assert all(
1354
+ all(isinstance(value, int) and value >= 0 for value in sublist) for sublist in num_queries_vis_abstractors
1355
+ ), "All values in num_queries_vis_abstractors must be integers >= 0."
1356
+
1357
+ assert all(
1358
+ all(isinstance(value, int) and value >= 0 for value in sublist)
1359
+ for sublist in num_queries_vis_abstractors_slow
1360
+ ), "All values in num_queries_vis_abstractors_slow must be integers >= 0."
1361
+
1362
+ assert is_videos is not None
1363
+
1364
+ # Is it the first or last image? (for applying slowfast to video processing)
1365
+ is_first_images = []
1366
+ is_last_images = []
1367
+ for is_video in is_videos:
1368
+ for idx, is_video_item in enumerate(is_video):
1369
+ if idx == 0:
1370
+ is_first_images.append(True)
1371
+ else:
1372
+ is_first_images.append(False)
1373
+ if idx == len(is_video) - 1:
1374
+ is_last_images.append(True)
1375
+ else:
1376
+ is_last_images.append(False)
1377
+
1378
+ num_queries_vis_abstractors = list(chain(*num_queries_vis_abstractors))
1379
+ num_queries_vis_abstractors_slow = list(chain(*num_queries_vis_abstractors_slow))
1380
+ image_sizes = list(chain(*image_sizes))
1381
+ is_videos = list(chain(*is_videos))
1382
+ first_last_frames_slows = list(chain(*first_last_frames_slows))
1383
+
1384
+ # Use slowfast mode if there's at least one visual token count greater than 0 in num_queries_vis_abstractors_slow
1385
+ use_slowfast = any([num_query > 0 for num_query in num_queries_vis_abstractors_slow])
1386
+ num_grids = [pixel_value.shape[0] for pixel_value in chain(*pixel_values)]
1387
+ num_grids = [0] + num_grids
1388
+ group_ids = []
1389
+
1390
+ if use_slowfast:
1391
+ new_num_grids = [num_grids[0]]
1392
+ new_num_queries = []
1393
+ new_image_sizes = []
1394
+ new_is_videos = []
1395
+
1396
+ # When using slowfast, split more finely
1397
+ # 0th local grid is slow frame, remaining local grids are fast frames
1398
+ for (
1399
+ num_query,
1400
+ num_query_slow,
1401
+ num_grid,
1402
+ image_size,
1403
+ is_video,
1404
+ first_last_frames_slow,
1405
+ is_first_image,
1406
+ is_last_image,
1407
+ ) in zip(
1408
+ num_queries_vis_abstractors,
1409
+ num_queries_vis_abstractors_slow,
1410
+ num_grids[1:],
1411
+ image_sizes,
1412
+ is_videos,
1413
+ first_last_frames_slows,
1414
+ is_first_images,
1415
+ is_last_images,
1416
+ ):
1417
+
1418
+ if not first_last_frames_slow and num_query_slow > 0: # Process all frames in slowfast mode
1419
+ assert is_video # slowfast mode is only applied to videos
1420
+
1421
+ this_group_ids = [group_ids[-1][-1] + 1 if group_ids else 0]
1422
+
1423
+ # slow frame (first grid)
1424
+ new_num_grids.append(new_num_grids[-1] + 1)
1425
+ new_num_queries.append(num_query_slow)
1426
+ new_image_sizes.append(image_size)
1427
+ new_is_videos.append(is_video)
1428
+
1429
+ if num_grid >= 2:
1430
+ # fast frames
1431
+ new_num_grids.append(new_num_grids[-1] + num_grid - 1)
1432
+ new_num_queries.append(num_query)
1433
+ new_image_sizes.append(image_size)
1434
+ new_is_videos.append(is_video)
1435
+ this_group_ids.append(this_group_ids[-1] + 1)
1436
+
1437
+ group_ids.append(this_group_ids)
1438
+ elif (
1439
+ first_last_frames_slow and num_query_slow > 0 and (is_first_image or is_last_image)
1440
+ ): # Process only first/last image in slowfast mode
1441
+ # Case for special treatment of first/last frames in slow mode
1442
+ assert is_video # slowfast mode is only applied to videos
1443
+
1444
+ this_group_ids = [group_ids[-1][-1] + 1 if group_ids else 0]
1445
+
1446
+ if num_grid == 1:
1447
+ # Simply process with slow since there's only one grid
1448
+ new_num_grids.append(new_num_grids[-1] + 1)
1449
+ new_num_queries.append(num_query_slow)
1450
+ new_image_sizes.append(image_size)
1451
+ new_is_videos.append(is_video)
1452
+
1453
+ if num_grid >= 2:
1454
+ # Special treatment for first or last grid depending on is_first_image or is_last_image
1455
+
1456
+ if is_first_image: # also covers a frame that is both first and last
1457
+ # slow frame (first grid)
1458
+ new_num_grids.append(new_num_grids[-1] + 1)
1459
+ new_num_queries.append(num_query_slow)
1460
+ new_image_sizes.append(image_size)
1461
+ new_is_videos.append(is_video)
1462
+ # fast frames
1463
+ new_num_grids.append(new_num_grids[-1] + num_grid - 1)
1464
+ new_num_queries.append(num_query)
1465
+ new_image_sizes.append(image_size)
1466
+ new_is_videos.append(is_video)
1467
+ this_group_ids.append(this_group_ids[-1] + 1)
1468
+ elif is_last_image:
1469
+ # fast frames
1470
+ new_num_grids.append(new_num_grids[-1] + num_grid - 1)
1471
+ new_num_queries.append(num_query)
1472
+ new_image_sizes.append(image_size)
1473
+ new_is_videos.append(is_video)
1474
+ # slow frame (last grid)
1475
+ new_num_grids.append(new_num_grids[-1] + 1)
1476
+ new_num_queries.append(num_query_slow)
1477
+ new_image_sizes.append(image_size)
1478
+ new_is_videos.append(is_video)
1479
+ this_group_ids.append(this_group_ids[-1] + 1)
1480
+ else:
1481
+ raise Exception("This case should not be reached.")
1482
+ group_ids.append(this_group_ids)
1483
+ else:
1484
+ # Not in slowfast mode, so every grid uses num_query (the fast setting)
1485
+ new_num_grids.append(new_num_grids[-1] + num_grid)
1486
+ new_num_queries.append(num_query)
1487
+ new_image_sizes.append(image_size)
1488
+ new_is_videos.append(is_video)
1489
+
1490
+ start_group_id = group_ids[-1][-1] + 1 if group_ids else 0
1491
+ group_ids.append([start_group_id])
1492
+
1493
+ num_grids = new_num_grids
1494
+ num_queries_vis_abstractors = new_num_queries
1495
+ image_sizes = new_image_sizes
1496
+ is_videos = new_is_videos
1497
  else:
1498
+ num_grids = [sum(num_grids[:i]) for i in range(1, len(num_grids) + 1)]
1499
+ group_ids = [[group_id] for group_id in range(len(is_videos))]
1500
 
1501
+ return num_queries_vis_abstractors, num_grids, image_sizes, is_videos, group_ids
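A short sketch of the simple branch (no slowfast) with toy grid counts, showing how `num_grids` becomes cumulative slice boundaries into the flattened grid features and how each image/frame forms its own group:

```python
num_grids = [0, 1, 5, 1]         # [0] + grids per input (image0: 1 grid, image1: 5 grids, frame0: 1 grid)
is_videos = [False, False, True]

num_grids = [sum(num_grids[:i]) for i in range(1, len(num_grids) + 1)]
group_ids = [[g] for g in range(len(is_videos))]

print(num_grids)    # [0, 1, 6, 7] -> slices features[0:1], features[1:6], features[6:7]
print(group_ids)    # [[0], [1], [2]]
```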
 
 
 
 
1502
 
1503
 
1504
  class HCXVisionCAbstractor(nn.Module):
 
1570
  ) -> torch.Tensor:
1571
  # x: [B, L, dim]
1572
  B, L, dim = x.shape
1573
+ hw = int(L ** 0.5)
1574
  x = rearrange(x, "b (h w) d -> b d h w", h=hw, w=hw)
1575
 
1576
  if num_queries_vis_abstractors is not None:
 
1596
  for i, num_queries in enumerate(num_queries_vis_abstractors):
1597
  hw = int(num_queries**0.5)
1598
  sampler = nn.AdaptiveAvgPool2d((hw, hw))
1599
+ out = sampler(x[num_grids[i]:num_grids[i + 1], :])
1600
  out = self.net[2](out) # s2
1601
 
1602
  out = rearrange(out, "b d h w -> b (h w) d")
 
1614
  depth: int = 3,
1615
  mlp_depth: int = 2,
1616
  ):
1617
+ assert (n_queries ** 0.5).is_integer(), f"n_queries must be square number. n_queries: {n_queries}"
1618
+ hw = int(n_queries ** 0.5)
1619
 
1620
  # RegBlock = ResBlock + SE
1621
  RegBlock = partial(
 
1653
  layers.append(nn.Linear(output_hidden_size, output_hidden_size))
1654
  return nn.Sequential(*layers)
1655
 
1656
+ def load_sharded_checkpoint(
1657
+ model, folder, pick_prefix="", replace_prefix_list=[], replace_prefix_dict={}, print_info=True
1658
+ ):
1659
+ if folder is None:
1660
+ return {}
1661
+
1662
+ files = os.listdir(folder)
1663
+
1664
+ # find relevant files
1665
+ pytorch_bin_files = [file for file in files if file.startswith("pytorch_model") and file.endswith(".bin")]
1666
+ safetensor_files = [file for file in files if file.endswith(".safetensors")]
1667
+ shard_index_file = [file for file in files if file.endswith(".index.json")]
1668
+
1669
+ # check if sharded
1670
+ index_present = len(shard_index_file) > 0
1671
+ index_file = os.path.join(folder, shard_index_file[0]) if index_present else []
1672
+
1673
+ # check if safetensor
1674
+ is_safetensor = len(safetensor_files) > 0
1675
+
1676
+ model_keys = model.state_dict().keys()
1677
+
1678
+ if is_safetensor:
1679
+ from safetensors.torch import load_file
1680
+
1681
+ load_function = load_file
1682
+ shard_files = safetensor_files
1683
+ else:
1684
+ load_function = partial(torch.load, map_location="cpu")
1685
+ shard_files = pytorch_bin_files
1686
+
1687
+ # sharded case
1688
+ if index_present:
1689
+ with open(index_file, "r", encoding="utf-8") as f:
1690
+ index = json.load(f)
1691
+ loaded_keys = index["weight_map"].keys()
1692
+ if pick_prefix:
1693
+ loaded_keys = [k[len(pick_prefix) :] for k in loaded_keys if k.startswith(pick_prefix)]
1694
+ if replace_prefix_list:
1695
+ for rep_prefix in replace_prefix_list:
1696
+ loaded_keys = [k[len(rep_prefix) :] if k.startswith(rep_prefix) else k for k in loaded_keys]
1697
+ if replace_prefix_dict:
1698
+ for rep_prefix in replace_prefix_dict:
1699
+ loaded_keys = [
1700
+ k.replace(rep_prefix, replace_prefix_dict[rep_prefix]) if k.startswith(rep_prefix) else k
1701
+ for k in loaded_keys
1702
+ ]
1703
+
1704
+ for i, shard_file in enumerate(shard_files):
1705
+ state_dict = load_function(os.path.join(folder, shard_file))
1706
+
1707
+ # if pick_prefix, use only pick
1708
+ if pick_prefix:
1709
+ state_dict = {k[len(pick_prefix) :]: v for k, v in state_dict.items() if k.startswith(pick_prefix)}
1710
+
1711
+ for rep_prefix in replace_prefix_list:
1712
+ state_dict = {k[len(rep_prefix) :] if k.startswith(rep_prefix) else k: v for k, v in state_dict.items()}
1713
+
1714
+ for rep_prefix in replace_prefix_dict:
1715
+ state_dict = {
1716
+ k.replace(rep_prefix, replace_prefix_dict[rep_prefix]) if k.startswith(rep_prefix) else k: v
1717
+ for k, v in state_dict.items()
1718
+ }
1719
+
1720
+ if is_fsdp_enabled():
1721
+ if is_local_dist_rank_0():
1722
+ model.load_state_dict(state_dict, strict=False)
1723
+ else:
1724
+ model.load_state_dict(state_dict, strict=False)
1725
+ # Make sure memory is freed before we load the next state dict.
1726
+
1727
+ if not index_present:
1728
+ loaded_keys = state_dict.keys()
1729
+
1730
+ del state_dict
1731
+ gc.collect()
1732
+
1733
+ # missing keys
1734
+ missing_keys = [key for key in model_keys if key not in loaded_keys]
1735
+ unexpected_keys = [key for key in loaded_keys if key not in model_keys]
1736
+
1737
+ if get_rank() == 0 and print_info:
1738
+ print(f"[info] missing_keys: {missing_keys}")
1739
+ print(f"[info] unexpected_keys: {unexpected_keys}")
1740
+
1741
+ return {"missing_keys": missing_keys, "unexpected_keys": unexpected_keys}
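A standalone sketch (hypothetical prefixes, plain dicts instead of real tensors) of the key remapping that `load_sharded_checkpoint` applies before calling `load_state_dict`:

```python
state_dict = {
    "module.vision_model.embeddings.weight": 1,
    "module.language_model.lm_head.weight": 2,
    "optimizer.step": 3,                           # dropped: not under pick_prefix
}
pick_prefix = "module."
replace_prefix_dict = {"language_model.": "lm."}   # hypothetical rename

picked = {k[len(pick_prefix):]: v for k, v in state_dict.items() if k.startswith(pick_prefix)}
for old, new in replace_prefix_dict.items():
    picked = {(new + k[len(old):] if k.startswith(old) else k): v for k, v in picked.items()}

print(picked)   # {'vision_model.embeddings.weight': 1, 'lm.lm_head.weight': 2}
```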
image_processing_hyperclovax.py → preprocessor.py RENAMED
@@ -1,14 +1,22 @@
 
1
  import copy
 
2
  import math
3
  import os
 
4
  from typing import Dict, List, Optional, Union
 
5
 
 
 
6
  import numpy as np
 
7
  import torch
8
- from PIL import Image
9
- from transformers.feature_extraction_utils import BatchFeature
10
  from transformers.image_processing_utils import (
11
  BaseImageProcessor,
 
12
  get_size_dict,
13
  )
14
  from transformers.image_transforms import (
@@ -35,16 +43,401 @@ from transformers.utils import TensorType, logging
35
  logger = logging.get_logger(__name__)
36
 
37
 
38
- class HCXImageProcessor(BaseImageProcessor):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
39
  r"""
40
- Constructs a VLM image processor. Based on [`CLIPImageProcessor`], with additional techniques for processing high-resolution images.
 
 
 
 
 
41
  Args:
42
- anyres: (bool) anyres 기능을 사용할지 안할지
43
- unpad: (bool) anyres 사용시, unpad 기능 (순수 pad 영역에 해당하는 visual tokens LLM input 에서 제거) 을 사용할지 안할지
44
- num_queries_vis_abstractor: (int) grid 대해서 resampler 사용하는 경우, visual query
45
- possible_resolutions: (List) anyres 기능 사용시, 가능한 resolution 조합, 예: [[336, 336], [336, 672], [672, 336]]
46
- patch_size: (int) ViT patch size
47
- pad_to_square: (bool) 정사각형으로 padding 수행할지, 안할지를 결정. False 이면 정사각형이 아니기 때문에 center crop 을 거쳐 ViT 의 입력으로 들어감
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
48
  """
49
 
50
  model_input_names = ["pixel_values"]
@@ -55,10 +448,11 @@ class HCXImageProcessor(BaseImageProcessor):
55
  size: Dict[str, int] = None,
56
  anyres: bool = False,
57
  unpad: bool = False,
58
- num_queries_vis_abstractor_image: int = 81,
59
- num_queries_vis_abstractor_video_slow: int = 81,
60
- num_queries_vis_abstractor_video_fast: int = 9,
61
- first_last_frames_slow_video: bool = False,
 
62
  possible_resolutions: List = [],
63
  patch_size: int = 14,
64
  pad_to_square: bool = True,
@@ -71,22 +465,24 @@ class HCXImageProcessor(BaseImageProcessor):
71
  image_mean: Optional[Union[float, List[float]]] = None,
72
  image_std: Optional[Union[float, List[float]]] = None,
73
  do_convert_rgb: bool = True,
 
74
  **kwargs,
75
  ) -> None:
76
  super().__init__(**kwargs)
77
- size = size if size is not None else {"shortest_edge": 336}
78
  size = get_size_dict(size, default_to_square=False)
79
- crop_size = crop_size if crop_size is not None else {"height": 336, "width": 336}
80
  crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
81
 
82
  self.do_resize = do_resize
83
  self.size = size
84
  self.anyres = anyres
85
  self.unpad = unpad
86
- self.num_queries_vis_abstractor_image = num_queries_vis_abstractor_image
87
- self.num_queries_vis_abstractor_video_slow = num_queries_vis_abstractor_video_slow
 
88
  self.num_queries_vis_abstractor_video_fast = num_queries_vis_abstractor_video_fast
89
- self.first_last_frames_slow_video = first_last_frames_slow_video
90
  self.possible_resolutions = [_resolution for _resolution in possible_resolutions]
91
  self.patch_size = patch_size
92
  self.pad_to_square = pad_to_square
@@ -99,6 +495,9 @@ class HCXImageProcessor(BaseImageProcessor):
99
  self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
100
  self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
101
  self.do_convert_rgb = do_convert_rgb
 
 
 
102
 
103
  def resize(
104
  self,
@@ -109,6 +508,20 @@ class HCXImageProcessor(BaseImageProcessor):
109
  input_data_format: Optional[Union[str, ChannelDimension]] = None,
110
  **kwargs,
111
  ) -> np.ndarray:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
112
  default_to_square = True
113
  if "shortest_edge" in size:
114
  size = size["shortest_edge"]
@@ -150,11 +563,40 @@ class HCXImageProcessor(BaseImageProcessor):
150
  data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
151
  input_data_format: Optional[Union[str, ChannelDimension]] = None,
152
  ) -> Image.Image:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
153
  images = make_list_of_images(images)
154
 
155
  if do_resize:
156
  images = [
157
- self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
 
 
 
 
 
158
  for image in images
159
  ]
160
 
@@ -165,12 +607,22 @@ class HCXImageProcessor(BaseImageProcessor):
165
 
166
  if do_rescale:
167
  images = [
168
- self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format) for image in images
 
 
 
 
 
169
  ]
170
 
171
  if do_normalize:
172
  images = [
173
- self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
 
 
 
 
 
174
  for image in images
175
  ]
176
 
@@ -181,20 +633,59 @@ class HCXImageProcessor(BaseImageProcessor):
181
  return images
182
 
183
  def _resize_for_local_grids(
184
- self, image: np.array, target_resolution: tuple, resample, input_data_format: ChannelDimension
 
 
 
 
185
  ) -> np.array:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
186
  new_height, new_width = _get_local_grids_output_size(image, target_resolution, input_data_format)
187
 
188
  # Resize the image
189
- resized_image = resize(image, (new_height, new_width), resample=resample, input_data_format=input_data_format)
 
 
 
 
 
190
 
191
  return resized_image
192
 
193
  def _pad_for_patching(
194
- self, image: np.array, target_resolution: tuple, input_data_format: ChannelDimension
 
 
 
195
  ) -> np.array:
196
  """
197
- Pad an image to a target resolution while maintaining aspect ratio.
 
 
 
 
 
 
 
 
 
 
 
198
  """
199
  target_height, target_width = target_resolution
200
 
@@ -217,13 +708,34 @@ class HCXImageProcessor(BaseImageProcessor):
217
  data_format: ChannelDimension,
218
  input_data_format: ChannelDimension,
219
  ) -> List[np.array]:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
220
  if not isinstance(possible_resolutions, list):
221
  raise ValueError("possible_resolutions must be a list of possible resolutions.")
222
 
223
  image_size = get_image_size(image, channel_dim=input_data_format)
224
  best_resolution = select_best_resolution(image_size, possible_resolutions)
225
  resized_image = self._resize_for_local_grids(
226
- image, best_resolution, resample=resample, input_data_format=input_data_format
 
 
 
227
  )
228
  padded_image = self._pad_for_patching(resized_image, best_resolution, input_data_format=input_data_format)
229
  local_grids = divide_to_grids(padded_image, grid_size=grid_size, input_data_format=input_data_format)
@@ -243,11 +755,7 @@ class HCXImageProcessor(BaseImageProcessor):
243
  size: Dict[str, int] = None,
244
  anyres: bool = None,
245
  unpad: bool = None,
246
- is_video: bool = False,
247
- num_queries_vis_abstractor_image: int = None,
248
- num_queries_vis_abstractor_video_slow: int = None,
249
- num_queries_vis_abstractor_video_fast: int = None,
250
- first_last_frames_slow_video: bool = None,
251
  possible_resolutions: List = None,
252
  patch_size: int = None,
253
  pad_to_square: bool = None,
@@ -263,43 +771,52 @@ class HCXImageProcessor(BaseImageProcessor):
263
  return_tensors: Optional[Union[str, TensorType]] = None,
264
  data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
265
  input_data_format: Optional[Union[str, ChannelDimension]] = None,
266
- return_dummy_image: bool = False,
267
- first_last_frames_slow: bool = False,
268
- is_first_or_last_frames: bool = False,
269
- **kwargs,
270
  ):
271
  """
272
- HCXVisionImageProcessor returns the image tensors, the original image sizes (width, height), and the visual token counts
273
- :return pixel_values: image tensors as a List of 4D tensors
274
- :return image_sizes: List of Dicts with each image's width and height, e.g. [{"width": width of image 1, "height": height of image 1}, {"width": width of image 2, "height": height of image 2}, ...]
275
- :return vision_query_lengths: List of ints, the number of visual tokens each image is converted to when passed as LLM input
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
276
  """
277
-
278
  do_resize = do_resize if do_resize is not None else self.do_resize
279
  size = size if size is not None else self.size
280
  size = get_size_dict(size, param_name="size", default_to_square=False)
281
  anyres = anyres if anyres is not None else self.anyres
282
  unpad = unpad if unpad is not None else self.unpad
283
- num_queries_vis_abstractor_image = (
284
- num_queries_vis_abstractor_image
285
- if num_queries_vis_abstractor_image is not None
286
- else self.num_queries_vis_abstractor_image
287
- )
288
- num_queries_vis_abstractor_video_slow = (
289
- num_queries_vis_abstractor_video_slow
290
- if num_queries_vis_abstractor_video_slow is not None
291
- else self.num_queries_vis_abstractor_video_slow
292
- )
293
- num_queries_vis_abstractor_video_fast = (
294
- num_queries_vis_abstractor_video_fast
295
- if num_queries_vis_abstractor_video_fast is not None
296
- else self.num_queries_vis_abstractor_video_fast
297
- )
298
- first_last_frames_slow_video = (
299
- first_last_frames_slow_video
300
- if first_last_frames_slow_video is not None
301
- else self.first_last_frames_slow_video
302
- )
303
  possible_resolutions = possible_resolutions if possible_resolutions is not None else self.possible_resolutions
304
  patch_size = patch_size if patch_size is not None else self.patch_size
305
  pad_to_square = pad_to_square if pad_to_square is not None else self.pad_to_square
@@ -314,17 +831,6 @@ class HCXImageProcessor(BaseImageProcessor):
314
  image_std = image_std if image_std is not None else self.image_std
315
  do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
316
 
317
- if is_video:
318
- num_queries_vis_abstractor = num_queries_vis_abstractor_video_fast
319
- num_queries_vis_abstractor_slow = num_queries_vis_abstractor_video_slow
320
- unpad = False
321
- else:
322
- num_queries_vis_abstractor = num_queries_vis_abstractor_image
323
- num_queries_vis_abstractor_slow = 0
324
-
325
- if return_dummy_image:
326
- images = Image.new("RGB", (224, 224), (0, 0, 0))
327
-
328
  images = make_list_of_images(images)
329
 
330
  if not valid_images(images):
@@ -355,25 +861,38 @@ class HCXImageProcessor(BaseImageProcessor):
355
 
356
  assert crop_size["height"] == crop_size["width"]
357
 
358
- # The global-image padding operation is bottlenecked by the original image width/height,
359
- # so resize the longer side to size["shortest_edge"] first, then pad
 
360
  if anyres:
361
  anyres_global_images = copy.deepcopy(images)
362
  if pad_to_square:
363
  background_color = tuple(int(x * 255) for x in self.image_mean)
364
  anyres_global_images = [
365
- resize_longside(copy.deepcopy(image), size["shortest_edge"], resample, input_data_format)
 
 
 
 
 
366
  for image in anyres_global_images
367
  ]
368
  anyres_global_images = [
369
- expand2square(image, background_color=background_color, input_data_format=input_data_format)[0]
 
 
 
 
370
  for image in anyres_global_images
371
  ]
372
  else:
373
  anyres_global_images = [
374
  self.resize(
375
  image=image,
376
- size={"height": size["shortest_edge"], "width": size["shortest_edge"]},
 
 
 
377
  resample=resample,
378
  input_data_format=input_data_format,
379
  )
@@ -387,11 +906,32 @@ class HCXImageProcessor(BaseImageProcessor):
387
  resize_longside(image, size["shortest_edge"], resample, input_data_format) for image in images
388
  ]
389
  images = [
390
- expand2square(image, background_color=background_color, input_data_format=input_data_format)[0]
 
 
 
 
391
  for image in images
392
  ]
393
 
394
- for image, anyres_global_image, image_size in zip(images, anyres_global_images, image_sizes):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
395
  if anyres:
396
  # convert image into a list of grids
397
  # we intentionally use the same data format as the input data format
@@ -403,7 +943,7 @@ class HCXImageProcessor(BaseImageProcessor):
403
  data_format=input_data_format,
404
  input_data_format=input_data_format,
405
  )
406
- # The global image (thumbnail) is not used for videos
407
  if not is_video:
408
  image_grids = [anyres_global_image] + image_grids
409
  else:
@@ -428,362 +968,617 @@ class HCXImageProcessor(BaseImageProcessor):
428
  pixel_values = np.array(pixel_values)
429
  new_images.append(pixel_values)
430
 
 
 
431
  vision_query_length = determine_anyres_num_vision_patches(
 
432
  image_size=image_size,
433
  grid_size=crop_size["height"],
434
  patch_size=patch_size,
435
  possible_resolutions=possible_resolutions,
436
  anyres=anyres,
437
- unpad=unpad,
438
  num_queries_vis_abstractor=num_queries_vis_abstractor,
439
  num_queries_vis_abstractor_slow=num_queries_vis_abstractor_slow,
440
  is_video=is_video,
441
- first_last_frames_slow=first_last_frames_slow,
442
- is_first_or_last_frames=is_first_or_last_frames,
443
  )
444
 
445
  vision_query_lengths.append(vision_query_length)
446
 
447
- if return_dummy_image:
448
- vision_query_lengths = []
449
-
450
  data = {
451
- "pixel_values": [torch.tensor(new_image) for new_image in new_images],
452
- "image_sizes": [{"width": image_size[1], "height": image_size[0]} for image_size in image_sizes],
453
- "vision_query_lengths": vision_query_lengths,
 
 
 
 
454
  }
455
 
456
- return BatchFeature(data=data, tensor_type=return_tensors)
457
-
458
- def save_pretrained(
459
- self,
460
- save_directory: Union[str, os.PathLike],
461
- *args,
462
- **kwargs,
463
- ):
464
- self.register_for_auto_class()
465
- super().save_pretrained(save_directory, *args, **kwargs)
466
-
467
 
468
- def determine_anyres_num_vision_patches(
469
- image_size,
470
- grid_size,
471
- patch_size,
472
- possible_resolutions,
473
- anyres=False,
474
- unpad=True,
475
- num_queries_vis_abstractor=0,
476
- num_queries_vis_abstractor_slow=0,
477
- is_video=False,
478
- first_last_frames_slow=False, # sample-wise option
479
- is_first_or_last_frames=False, # grid-wise option
480
- ):
481
- """
482
- Computes the number of visual tokens (patches) based on image resolution, grid configuration, and patch size.
483
-
484
- This function supports both fixed-size and any-resolution settings, as well as video-specific configurations
485
- such as handling slow frames and frame position flags.
486
-
487
- Args:
488
- num_grids (int): Number of grids per image (e.g., 1 for 1x1, 4 for 2x2, etc.).
489
- image_size (tuple): The original image size as (height, width).
490
- grid_size (int): Size of each grid in pixels (e.g., 336).
491
- patch_size (int): Size of each vision patch (e.g., 14 for ViT models).
492
- possible_resolutions (list): List of possible resolution tuples [(h1, w1), (h2, w2), ...].
493
- anyres (bool, optional): Whether to use any-resolution mode. Defaults to False.
494
- unpad (bool, optional): Whether to unpad the image before computing patches. Defaults to True.
495
- num_queries_vis_abstractor (int, optional): Number of query tokens for vision abstractor (fast path).
496
- num_queries_vis_abstractor_slow (int, optional): Number of query tokens for vision abstractor (slow path).
497
- is_video (bool, optional): Whether the input is a video. Defaults to False.
498
- first_last_frames_slow (bool, optional): Whether to treat first/last video frames as "slow". Defaults to False.
499
- is_first_or_last_frames (bool, optional): Whether current grid corresponds to first/last frame. Defaults to False.
500
 
501
- Returns:
502
- int: Total number of visual tokens (patches) after processing.
503
- """
504
 
505
- if not anyres:
506
- return num_queries_vis_abstractor if num_queries_vis_abstractor > 0 else (grid_size // patch_size) ** 2
 
 
507
 
508
- if num_queries_vis_abstractor > 0:
509
- num_patch_per_grid = int(num_queries_vis_abstractor**0.5)
510
- else:
511
- num_patch_per_grid = grid_size // patch_size
512
 
513
- num_global_per_grid = num_patch_per_grid
 
514
 
515
- # In anyres mode, a global image is included, so there are always at least 2 grids.
516
- # However, for video inputs, there is no global image, so it's possible to have only 1 grid.
517
- # Therefore, the assertion below is commented out:
518
- # assert num_grids > 1
 
519
 
520
- # Compute the number of vision patches.
521
- height, width = select_best_resolution(image_size, possible_resolutions)
522
 
523
- num_patch_height = (height // grid_size) * num_patch_per_grid
524
- num_patch_width = (width // grid_size) * num_patch_per_grid
 
525
 
526
- # local images
527
- if unpad:
528
- original_height, original_width = image_size
529
 
530
- original_aspect_ratio = original_width / original_height
531
- current_aspect_ratio = num_patch_width / num_patch_height
532
 
533
- if original_aspect_ratio > current_aspect_ratio:
534
- scale_factor = num_patch_width / original_width
535
- new_height = int(original_height * scale_factor)
536
- padding = (num_patch_height - new_height) // 2
537
- num_patch_height = num_patch_height - padding * 2
538
- else:
539
- scale_factor = num_patch_height / original_height
540
- new_width = int(original_width * scale_factor)
541
- padding = (num_patch_width - new_width) // 2
542
- num_patch_width = num_patch_width - padding * 2
543
 
544
- num_patches = num_patch_width * num_patch_height + num_patch_height
545
- else:
546
- num_patches = num_patch_width * num_patch_height
547
 
548
- # In the "slow" strategy, when applying to first and last frames only, it is applied exclusively to those two frames.
549
- if num_queries_vis_abstractor_slow > 0:
550
- if first_last_frames_slow:
551
- if is_first_or_last_frames:
552
- num_patches += num_queries_vis_abstractor_slow - num_queries_vis_abstractor
553
- else:
554
- num_patches += num_queries_vis_abstractor_slow - num_queries_vis_abstractor
555
- # The slowfast feature is only applicable when unpad is set to False.
556
- assert unpad is False
557
 
558
- # Global image is not included for video inputs.
559
- if not is_video:
560
- num_patches += num_global_per_grid**2
 
 
561
 
562
- return num_patches
563
 
 
564
 
565
- def divide_to_grids(image: np.array, grid_size: int, input_data_format=None) -> List[np.array]:
 
566
  """
567
- Divides a local image into grids of size (grid_size x grid_size).
 
568
 
569
  Args:
570
- image (np.array): Input image as a NumPy array.
571
- grid_size (int): The size (in pixels) of each square grid.
572
- input_data_format (optional): Optional format specifier (e.g., "channels_first" or "channels_last").
 
 
573
 
574
  Returns:
575
- List[np.array]: A list of image patches, each of size (grid_size x grid_size).
576
  """
577
- grids = []
578
- height, width = get_image_size(image, channel_dim=input_data_format)
579
- for i in range(0, height, grid_size):
580
- for j in range(0, width, grid_size):
581
- if input_data_format == ChannelDimension.LAST:
582
- grid = image[i : i + grid_size, j : j + grid_size]
583
  else:
584
- grid = image[:, i : i + grid_size, j : j + grid_size]
585
- grids.append(grid)
586
 
587
- return grids
 
588
 
589
 
590
- def pad(
591
- image: np.array,
592
- target_size: tuple,
593
- background_color=(127, 127, 127),
594
- input_data_format=None,
595
- ) -> np.array:
596
  """
597
- Pads the input image on the sides (top/bottom and left/right) to match the target height and width.
 
598
 
599
  Args:
600
- image (np.array): Input image as a NumPy array.
601
- target_size (tuple): Target size as (target_height, target_width).
602
- background_color (tuple, optional): RGB color value used for padding. Defaults to (127, 127, 127).
603
- input_data_format (optional): Optional format specifier (e.g., "channels_first" or "channels_last").
 
604
 
605
  Returns:
606
- np.array: The padded image with the specified target size.
607
  """
608
- target_height, target_width = target_size
609
- height, width = get_image_size(image, channel_dim=input_data_format)
610
 
611
- # result = np.ones((target_height, target_width, image.shape[2]), dtype=image.dtype) * background_color
612
- result = np.empty((target_height, target_width, image.shape[2]), dtype=image.dtype)
613
- for i in range(image.shape[2]):
614
- result[..., i].fill(background_color[i])
615
 
616
- paste_x = (target_width - width) // 2
617
- paste_y = (target_height - height) // 2
618
 
619
- result[paste_y : paste_y + height, paste_x : paste_x + width, :] = image
620
 
621
- return result
622
 
 
 
 
623
 
624
- def expand2square(
625
- image: np.array,
626
- bboxes_dict=None,
627
- background_color=(127, 127, 127),
628
- input_data_format=None,
629
- ) -> np.array:
630
  """
631
- Expands the input image to a square shape by placing it at the center of a new square canvas,
632
- with padding added to the shorter side (either top/bottom or left/right).
633
 
634
- The image is always centered on the new canvas, and padding is applied symmetrically.
 
635
 
636
  Args:
637
- image (np.array): Input image as a NumPy array.
638
- bboxes_dict (dict, optional): A dictionary of bounding boxes, where each value is an NDArray of shape (N, 4, 2)
639
- with box coordinates in the format [[xtl, ytl], [xtr, ytr], [xbr, ybr], [xbl, ybl]].
640
- Supports multiple categories (e.g., "ocr", "html") simultaneously.
641
- background_color (tuple, optional): RGB color to fill the padding area. Defaults to (127, 127, 127).
642
- input_data_format (optional): Optional format specifier for image data (e.g., "channels_first" or "channels_last").
643
 
644
  Returns:
645
- np.array: A square-shaped image with the original image centered and padded as needed.
646
-
647
- Example:
648
- >>> _img = np.ones((80, 100), dtype=np.uint8) * 100
649
- >>> _bboxes_dict = {"words": np.array([[[10, 10], [20, 10], [20, 20], [10, 20]],
650
- ... [[30, 30], [40, 30], [40, 40], [30, 40]]])}
651
- >>> _img, _bboxes_dict = expand2square(_img, _bboxes_dict, (255, 255, 255))
652
- >>> _img.shape
653
- (100, 100)
654
- >>> guessed_ocr_bboxes = np.array([[[20, 10], [30, 10], [30, 20], [20, 20]],
655
- ... [[40, 30], [50, 30], [50, 40], [40, 40]]])
656
- >>> np.testing.assert_array_almost_equal(_bboxes_dict["words"], guessed_ocr_bboxes) is None
657
- True
658
  """
659
- height, width = get_image_size(image, channel_dim=input_data_format)
660
- if width == height:
661
- return image, bboxes_dict
662
- elif width > height:
663
- # result = np.ones((width, width, image.shape[2]), dtype=image.dtype) * background_color
664
- result = np.empty((width, width, image.shape[2]), dtype=image.dtype)
665
- for i in range(image.shape[2]):
666
- result[..., i].fill(background_color[i])
667
 
668
- result[(width - height) // 2 : (width - height) // 2 + height, :] = image
669
- if bboxes_dict is not None:
670
- for key in bboxes_dict:
671
- bboxes_dict[key][:, :, 1] += (width - height) // 2
672
- return result, bboxes_dict
673
- else:
674
- # result = np.ones((height, height, image.shape[2]), dtype=image.dtype) * background_color
675
- result = np.empty((height, height, image.shape[2]), dtype=image.dtype)
676
- for i in range(image.shape[2]):
677
- result[..., i].fill(background_color[i])
678
 
679
- result[:, (height - width) // 2 : (height - width) // 2 + width] = image
680
- if bboxes_dict is not None:
681
- for key in bboxes_dict:
682
- bboxes_dict[key][:, :, 0] += (height - width) // 2
683
- return result, bboxes_dict
684
 
 
685
 
686
- def resize_longside(
687
- image: np.array,
688
- size: int,
689
- resample: PILImageResampling = PILImageResampling.BICUBIC, # type: ignore
690
- data_format: Optional[Union[str, ChannelDimension]] = None,
691
- input_data_format: Optional[Union[str, ChannelDimension]] = None,
692
- ):
693
  """
694
- Resizes the image so that its longer side matches the specified size, maintaining the original aspect ratio.
 
 
 
695
 
696
  Args:
697
- image (np.array): Input image as a NumPy array.
698
- size (int): Target size for the longer side of the image.
699
- resample (PILImageResampling, optional): Resampling method to use during resizing. Defaults to BICUBIC.
700
- data_format (str or ChannelDimension, optional): Output data format (e.g., "channels_first" or "channels_last").
701
- input_data_format (str or ChannelDimension, optional): Input data format of the image.
702
 
703
  Returns:
704
- np.array: The resized image with its aspect ratio preserved.
705
  """
706
- height, width = get_image_size(image, channel_dim=input_data_format)
707
 
708
- if width == height:
709
- target_height, target_width = size, size
710
- elif width > height:
711
- target_width = size
712
- target_height = math.ceil(height / width * size)
713
- else:
714
- target_width = math.ceil(width / height * size)
715
- target_height = size
716
 
717
- return resize(
718
- image,
719
- size=(target_height, target_width),
720
- resample=resample,
721
- data_format=data_format,
722
- input_data_format=input_data_format,
723
- )
724
 
725
 
726
- def _get_local_grids_output_size(image: np.array, target_resolution: tuple, input_data_format=None):
727
  """
728
- Computes the number of local grids (patches) along the height and width when resizing an image
729
- to the target resolution.
 
 
730
 
731
  Args:
732
- image (np.array): Input image as a NumPy array.
733
- target_resolution (tuple): Target resolution in the format (target_height, target_width).
734
- input_data_format (optional): Optional format specifier (e.g., "channels_first" or "channels_last").
 
735
 
736
  Returns:
737
- tuple: A tuple (grid_h, grid_w) representing the number of grids along the height and width.
 
 
 
738
  """
739
- original_height, original_width = get_image_size(image, channel_dim=input_data_format)
740
- target_height, target_width = target_resolution
741
 
742
- scale_w = target_width / original_width
743
- scale_h = target_height / original_height
 
744
 
745
- if scale_w < scale_h:
746
- new_width = target_width
747
- new_height = min(math.ceil(original_height * scale_w), target_height)
748
- else:
749
- new_height = target_height
750
- new_width = min(math.ceil(original_width * scale_h), target_width)
751
 
752
- return new_height, new_width
 
753
 
754
 
755
- def select_best_resolution(original_size: tuple, possible_resolutions: list) -> tuple:
756
  """
757
- Selects the best-fit resolution from a list of possible resolutions based on the original image size.
758
 
759
- This function, adapted from LLaVA-Next
760
- (https://github.com/huggingface/transformers/blob/v4.40.2/src/transformers/models/llava_next/image_processing_llava_next.py),
761
- evaluates each resolution by computing its effective and wasted area compared to the original size.
762
- The optimal resolution is the one that maximizes the effective area while minimizing unused (wasted) space.
763
 
764
  Args:
765
- original_size (tuple): The original image size in the format (height, width).
766
- possible_resolutions (list): A list of candidate resolutions in the format [(height1, width1), (height2, width2), ...].
767
 
768
  Returns:
769
- tuple: The best-fit resolution in the format (height, width).
 
 
770
  """
771
- original_height, original_width = original_size
772
- best_fit = None
773
- max_effective_resolution = 0
774
- min_wasted_resolution = float("inf")
775
 
776
- for height, width in possible_resolutions:
777
- scale = min(width / original_width, height / original_height)
778
- downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
779
- effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
780
- wasted_resolution = (width * height) - effective_resolution
781
 
782
- if effective_resolution > max_effective_resolution or (
783
- effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution
784
- ):
785
- max_effective_resolution = effective_resolution
786
- min_wasted_resolution = wasted_resolution
787
- best_fit = (height, width)
788
 
789
- return best_fit
1
+ import base64
2
  import copy
3
+ import io
4
  import math
5
  import os
6
+ import uuid
7
  from typing import Dict, List, Optional, Union
8
+ from urllib.parse import urlparse
9
 
10
+ import av
11
+ import cv2
12
  import numpy as np
13
+ import requests
14
  import torch
15
+ from decord import VideoReader, cpu
16
+ from PIL import Image, UnidentifiedImageError
17
  from transformers.image_processing_utils import (
18
  BaseImageProcessor,
19
+ BatchFeature,
20
  get_size_dict,
21
  )
22
  from transformers.image_transforms import (
 
43
  logger = logging.get_logger(__name__)
44
 
45
 
46
+ def determine_possible_resolutions(anyres: bool, max_num_grids: int, grid_size: int, use_1x1_grid: bool = False):
47
+ """
48
+ Finds and returns possible resolution combinations with a total number of grids less than or equal to max_num_grids.
49
+
50
+ For example, if max_num_grids is 4, the possible grid combinations are:
51
+ [1x1, 1x2, 1x3, 1x4, 2x1, 2x2, 3x1, 4x1], and the resolutions are calculated accordingly.
52
+
53
+ Example:
54
+ >>> possible_resolutions = determine_possible_resolutions(anyres=True, max_num_grids=4, grid_size=336)
55
+ >>> print(possible_resolutions)
56
+ [[336, 336], [336, 672], [336, 1008], [336, 1344], [672, 336], [672, 672], [1008, 336], [1344, 336]]
57
+
58
+ Args:
59
+ anyres (bool): Whether to allow any resolution combinations up to the maximum grid count.
60
+ max_num_grids (int): The maximum number of grids allowed (height x width must be ≤ this value).
61
+ grid_size (int): The size of each grid in pixels (e.g., 336).
62
+ use_1x1_grid (bool, optional): Whether to include the 1x1 grid as a valid resolution. Defaults to False.
63
+
64
+ Returns:
65
+ List[List[int]]: A list of possible [height, width] resolution pairs.
66
+ """
67
+ possible_resolutions = []
68
+ if anyres:
69
+ assert max_num_grids > 0
70
+ for i in range(1, max_num_grids + 1):
71
+ for j in range(1, max_num_grids + 1):
72
+ if i == 1 and j == 1 and not use_1x1_grid:
73
+ continue
74
+ if i * j <= max_num_grids:
75
+ possible_resolutions.append([i, j])
76
+
77
+ possible_resolutions = [[ys * grid_size, xs * grid_size] for ys, xs in possible_resolutions]
78
+
79
+ return possible_resolutions
80
+
81
+
82
+ def divide_to_grids(image: np.array, grid_size: int, input_data_format=None) -> List[np.array]:
83
+ """
84
+ Divides a local image into grids of size (grid_size x grid_size).
85
+
86
+ Args:
87
+ image (np.array): Input image as a NumPy array.
88
+ grid_size (int): The size (in pixels) of each square grid.
89
+ input_data_format (optional): Optional format specifier (e.g., "channels_first" or "channels_last").
90
+
91
+ Returns:
92
+ List[np.array]: A list of image patches, each of size (grid_size x grid_size).
93
+ """
94
+ grids = []
95
+ height, width = get_image_size(image, channel_dim=input_data_format)
96
+ for i in range(0, height, grid_size):
97
+ for j in range(0, width, grid_size):
98
+ if input_data_format == ChannelDimension.LAST:
99
+ grid = image[i : i + grid_size, j : j + grid_size]
100
+ else:
101
+ grid = image[:, i : i + grid_size, j : j + grid_size]
102
+ grids.append(grid)
103
+
104
+ return grids
105
+
106
+
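# Illustrative usage sketch (assumed example values, not from the original module):
# a 336x672 channels-last image split with grid_size=336 yields two 336x336 tiles,
# scanned row by row, then column by column.
_demo_img = np.zeros((336, 672, 3), dtype=np.uint8)
_demo_grids = divide_to_grids(_demo_img, grid_size=336, input_data_format=ChannelDimension.LAST)
assert len(_demo_grids) == 2 and _demo_grids[0].shape == (336, 336, 3)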
107
+ def pad(
108
+ image: np.array,
109
+ target_size: tuple,
110
+ background_color=(127, 127, 127),
111
+ input_data_format=None,
112
+ ) -> np.array:
113
+ """
114
+ Pads the input image on the sides (top/bottom and left/right) to match the target height and width.
115
+
116
+ Args:
117
+ image (np.array): Input image as a NumPy array.
118
+ target_size (tuple): Target size as (target_height, target_width).
119
+ background_color (tuple, optional): RGB color value used for padding. Defaults to (127, 127, 127).
120
+ input_data_format (optional): Optional format specifier (e.g., "channels_first" or "channels_last").
121
+
122
+ Returns:
123
+ np.array: The padded image with the specified target size.
124
+ """
125
+ target_height, target_width = target_size
126
+ height, width = get_image_size(image, channel_dim=input_data_format)
127
+
128
+ # result = np.ones((target_height, target_width, image.shape[2]), dtype=image.dtype) * background_color
129
+ result = np.empty((target_height, target_width, image.shape[2]), dtype=image.dtype)
130
+ for i in range(image.shape[2]):
131
+ result[..., i].fill(background_color[i])
132
+
133
+ paste_x = (target_width - width) // 2
134
+ paste_y = (target_height - height) // 2
135
+
136
+ result[paste_y : paste_y + height, paste_x : paste_x + width, :] = image
137
+
138
+ return result
139
+
140
+
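# Illustrative usage sketch (assumed example values, not from the original module):
# centre-pad a 100x80 HWC image onto a 128x128 canvas filled with the default grey background.
_demo_img = np.zeros((100, 80, 3), dtype=np.uint8)
assert pad(_demo_img, target_size=(128, 128)).shape == (128, 128, 3)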
141
+ def expand2square(
142
+ image: np.array,
143
+ bboxes_dict=None,
144
+ background_color=(127, 127, 127),
145
+ input_data_format=None,
146
+ ) -> np.array:
147
+ """
148
+ Expands the input image to a square shape by placing it at the center of a new square canvas,
149
+ with padding added to the shorter side (either top/bottom or left/right).
150
+
151
+ The image is always centered on the new canvas, and padding is applied symmetrically.
152
+
153
+ Args:
154
+ image (np.array): Input image as a NumPy array.
155
+ bboxes_dict (dict, optional): A dictionary of bounding boxes, where each value is an NDArray of shape (N, 4, 2)
156
+ with box coordinates in the format [[xtl, ytl], [xtr, ytr], [xbr, ybr], [xbl, ybl]].
157
+ Supports multiple categories (e.g., "ocr", "html") simultaneously.
158
+ background_color (tuple, optional): RGB color to fill the padding area. Defaults to (127, 127, 127).
159
+ input_data_format (optional): Optional format specifier for image data (e.g., "channels_first" or "channels_last").
160
+
161
+ Returns:
162
+ np.array: A square-shaped image with the original image centered and padded as needed.
163
+
164
+ Example:
165
+ >>> _img = np.ones((80, 100), dtype=np.uint8) * 100
166
+ >>> _bboxes_dict = {"words": np.array([[[10, 10], [20, 10], [20, 20], [10, 20]],
167
+ ... [[30, 30], [40, 30], [40, 40], [30, 40]]])}
168
+ >>> _img, _bboxes_dict = expand2square(_img, _bboxes_dict, (255, 255, 255))
169
+ >>> _img.shape
170
+ (100, 100)
171
+ >>> guessed_ocr_bboxes = np.array([[[20, 10], [30, 10], [30, 20], [20, 20]],
172
+ ... [[40, 30], [50, 30], [50, 40], [40, 40]]])
173
+ >>> np.testing.assert_array_almost_equal(_bboxes_dict["words"], guessed_ocr_bboxes) is None
174
+ True
175
+ """
176
+ height, width = get_image_size(image, channel_dim=input_data_format)
177
+ if width == height:
178
+ return image, bboxes_dict
179
+ elif width > height:
180
+ # result = np.ones((width, width, image.shape[2]), dtype=image.dtype) * background_color
181
+ result = np.empty((width, width, image.shape[2]), dtype=image.dtype)
182
+ for i in range(image.shape[2]):
183
+ result[..., i].fill(background_color[i])
184
+
185
+ result[(width - height) // 2 : (width - height) // 2 + height, :] = image
186
+ if bboxes_dict is not None:
187
+ for key in bboxes_dict:
188
+ bboxes_dict[key][:, :, 1] += (width - height) // 2
189
+ return result, bboxes_dict
190
+ else:
191
+ # result = np.ones((height, height, image.shape[2]), dtype=image.dtype) * background_color
192
+ result = np.empty((height, height, image.shape[2]), dtype=image.dtype)
193
+ for i in range(image.shape[2]):
194
+ result[..., i].fill(background_color[i])
195
+
196
+ result[:, (height - width) // 2 : (height - width) // 2 + width] = image
197
+ if bboxes_dict is not None:
198
+ for key in bboxes_dict:
199
+ bboxes_dict[key][:, :, 0] += (height - width) // 2
200
+ return result, bboxes_dict
201
+
202
+
203
+ def resize_longside(
204
+ image: np.array,
205
+ size: int,
206
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
207
+ data_format: Optional[Union[str, ChannelDimension]] = None,
208
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
209
+ ):
210
+ """
211
+ Resizes the image so that its longer side matches the specified size, maintaining the original aspect ratio.
212
+
213
+ Args:
214
+ image (np.array): Input image as a NumPy array.
215
+ size (int): Target size for the longer side of the image.
216
+ resample (PILImageResampling, optional): Resampling method to use during resizing. Defaults to BICUBIC.
217
+ data_format (str or ChannelDimension, optional): Output data format (e.g., "channels_first" or "channels_last").
218
+ input_data_format (str or ChannelDimension, optional): Input data format of the image.
219
+
220
+ Returns:
221
+ np.array: The resized image with its aspect ratio preserved.
222
+ """
223
+ height, width = get_image_size(image, channel_dim=input_data_format)
224
+
225
+ if width == height:
226
+ target_height, target_width = size, size
227
+ elif width > height:
228
+ target_width = size
229
+ target_height = math.ceil(height / width * size)
230
+ else:
231
+ target_width = math.ceil(width / height * size)
232
+ target_height = size
233
+
234
+ return resize(
235
+ image,
236
+ size=(target_height, target_width),
237
+ resample=resample,
238
+ data_format=data_format,
239
+ input_data_format=input_data_format,
240
+ )
241
+
242
+
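# Illustrative usage sketch (assumed example values, not from the original module):
# a 500x1000 HWC image is scaled so its longer side becomes 336 px, keeping the
# 1:2 aspect ratio (-> 168x336).
_demo_img = np.zeros((500, 1000, 3), dtype=np.uint8)
assert resize_longside(_demo_img, size=336).shape == (168, 336, 3)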
243
+ def select_best_resolution(original_size: tuple, possible_resolutions: list) -> tuple:
244
+ """
245
+ Selects the best-fit resolution from a list of possible resolutions based on the original image size.
246
+ This function evaluates each resolution by computing its effective and wasted area compared to the original size.
247
+ The optimal resolution is the one that maximizes the effective area while minimizing unused (wasted) space.
248
+
249
+ Args:
250
+ original_size (tuple): The original image size in the format (height, width).
251
+ possible_resolutions (list): A list of candidate resolutions in the format [(height1, width1), (height2, width2), ...].
252
+
253
+ Returns:
254
+ tuple: The best-fit resolution in the format (height, width).
255
+
256
+ This function includes code adapted from the file image_processing_llava_next.py in the LLaVA-Next
257
+ project (https://github.com/huggingface/transformers/blob/v4.40.2/src/transformers/models/llava_next/image_processing_llava_next.py),
258
+ which is licensed under Apache-2.0.
259
+ """
260
+ original_height, original_width = original_size
261
+ best_fit = None
262
+ max_effective_resolution = 0
263
+ min_wasted_resolution = float("inf")
264
+
265
+ for height, width in possible_resolutions:
266
+ scale = min(width / original_width, height / original_height)
267
+ downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
268
+ effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
269
+ wasted_resolution = (width * height) - effective_resolution
270
+
271
+ if effective_resolution > max_effective_resolution or (
272
+ effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution
273
+ ):
274
+ max_effective_resolution = effective_resolution
275
+ min_wasted_resolution = wasted_resolution
276
+ best_fit = (height, width)
277
+
278
+ return best_fit
279
+
280
+
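# Illustrative usage sketch (assumed example values, not from the original module):
# for a 600x800 (H x W) image, the 336x672 candidate preserves the most effective
# area with the least wasted padding, so it is selected.
assert select_best_resolution((600, 800), [(336, 336), (336, 672), (672, 336)]) == (336, 672)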
281
+ def _get_local_grids_output_size(image: np.array, target_resolution: tuple, input_data_format=None):
282
+ """
283
+ Computes the resized output size (height, width) used to lay out local grids when fitting an image
284
+ to the target resolution while preserving its aspect ratio.
285
+
286
+ Args:
287
+ image (np.array): Input image as a NumPy array.
288
+ target_resolution (tuple): Target resolution in the format (target_height, target_width).
289
+ input_data_format (optional): Optional format specifier (e.g., "channels_first" or "channels_last").
290
+
291
+ Returns:
292
+ tuple: A tuple (grid_h, grid_w) representing the number of grids along the height and width.
293
+ """
294
+ original_height, original_width = get_image_size(image, channel_dim=input_data_format)
295
+ target_height, target_width = target_resolution
296
+
297
+ scale_w = target_width / original_width
298
+ scale_h = target_height / original_height
299
+
300
+ if scale_w < scale_h:
301
+ new_width = target_width
302
+ new_height = min(math.ceil(original_height * scale_w), target_height)
303
+ else:
304
+ new_height = target_height
305
+ new_width = min(math.ceil(original_width * scale_h), target_width)
306
+
307
+ return new_height, new_width
308
+
309
+
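# Illustrative usage sketch (assumed example values, not from the original module):
# fitting a 600x800 image into a 336x672 canvas keeps the aspect ratio, giving a
# 336x448 resize target before padding.
_demo_img = np.zeros((600, 800, 3), dtype=np.uint8)
assert _get_local_grids_output_size(_demo_img, (336, 672)) == (336, 448)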
310
+ def determine_anyres_num_vision_patches(
311
+ num_grids,
312
+ image_size,
313
+ grid_size,
314
+ patch_size,
315
+ possible_resolutions,
316
+ anyres=False,
317
+ unpad=True,
318
+ num_queries_vis_abstractor=0,
319
+ num_queries_vis_abstractor_slow=0,
320
+ is_video=False,
321
+ first_last_frames_slow=False, # sample-wise option
322
+ is_first_or_last_frames=False, # grid-wise option
323
+ ):
324
+ """
325
+ Computes the number of visual tokens (patches) based on image resolution, grid configuration, and patch size.
326
+
327
+ This function supports both fixed-size and any-resolution settings, as well as video-specific configurations
328
+ such as handling slow frames and frame position flags.
329
+
330
+ Args:
331
+ num_grids (int): Number of grids per image (e.g., 1 for 1x1, 4 for 2x2, etc.).
332
+ image_size (tuple): The original image size as (height, width).
333
+ grid_size (int): Size of each grid in pixels (e.g., 336).
334
+ patch_size (int): Size of each vision patch (e.g., 14 for ViT models).
335
+ possible_resolutions (list): List of possible resolution tuples [(h1, w1), (h2, w2), ...].
336
+ anyres (bool, optional): Whether to use any-resolution mode. Defaults to False.
337
+ unpad (bool, optional): Whether to unpad the image before computing patches. Defaults to True.
338
+ num_queries_vis_abstractor (int, optional): Number of query tokens for vision abstractor (fast path).
339
+ num_queries_vis_abstractor_slow (int, optional): Number of query tokens for vision abstractor (slow path).
340
+ is_video (bool, optional): Whether the input is a video. Defaults to False.
341
+ first_last_frames_slow (bool, optional): Whether to treat first/last video frames as "slow". Defaults to False.
342
+ is_first_or_last_frames (bool, optional): Whether current grid corresponds to first/last frame. Defaults to False.
343
+
344
+ Returns:
345
+ int: Total number of visual tokens (patches) after processing.
346
+ """
347
+ if not anyres:
348
+ return num_queries_vis_abstractor if num_queries_vis_abstractor > 0 else (grid_size // patch_size) ** 2
349
+
350
+ if num_queries_vis_abstractor > 0:
351
+ num_patch_per_grid = int(num_queries_vis_abstractor**0.5)
352
+ else:
353
+ num_patch_per_grid = grid_size // patch_size
354
+
355
+ num_global_per_grid = num_patch_per_grid
356
+
357
+ # In anyres mode, a global image is included, so there are always at least 2 grids.
358
+ # However, for video inputs, there is no global image, so it's possible to have only 1 grid.
359
+ # Therefore, the assertion below is commented out:
360
+ # assert num_grids > 1
361
+
362
+ # Compute the number of vision patches.
363
+ height, width = select_best_resolution(image_size, possible_resolutions)
364
+
365
+ num_patch_height = (height // grid_size) * num_patch_per_grid
366
+ num_patch_width = (width // grid_size) * num_patch_per_grid
367
+
368
+ # local images
369
+ if unpad:
370
+ original_height, original_width = image_size
371
+
372
+ original_aspect_ratio = original_width / original_height
373
+ current_aspect_ratio = num_patch_width / num_patch_height
374
+
375
+ if original_aspect_ratio > current_aspect_ratio:
376
+ scale_factor = num_patch_width / original_width
377
+ new_height = int(original_height * scale_factor)
378
+ padding = (num_patch_height - new_height) // 2
379
+ num_patch_height = num_patch_height - padding * 2
380
+ else:
381
+ scale_factor = num_patch_height / original_height
382
+ new_width = int(original_width * scale_factor)
383
+ padding = (num_patch_width - new_width) // 2
384
+ num_patch_width = num_patch_width - padding * 2
385
+
386
+ num_patches = num_patch_width * num_patch_height + num_patch_height
387
+ else:
388
+ num_patches = num_patch_width * num_patch_height
389
+
390
+ # In the "slow" strategy, when applying to first and last frames only, it is applied exclusively to those two frames.
391
+ if num_queries_vis_abstractor_slow > 0:
392
+ if first_last_frames_slow:
393
+ if is_first_or_last_frames:
394
+ num_patches += num_queries_vis_abstractor_slow - num_queries_vis_abstractor
395
+ else:
396
+ num_patches += num_queries_vis_abstractor_slow - num_queries_vis_abstractor
397
+ # The slowfast feature is only applicable when unpad is set to False.
398
+ assert unpad is False
399
+
400
+ # Global image is not included for video inputs.
401
+ if not is_video:
402
+ num_patches += num_global_per_grid**2
403
+
404
+ return num_patches
405
+
406
+
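# Illustrative usage sketch (assumed example values, not from the original module):
# token count for a 600x800 image with a 336-px grid, 14-px ViT patches, anyres and
# unpad enabled, treated as a still image (not video). The best-fit resolution is
# (336, 672) -> 24x48 local patches; unpadding trims the width from 48 to 32, so
# 32*24 + 24 = 792 local tokens, plus 24**2 = 576 global-image tokens = 1368.
assert determine_anyres_num_vision_patches(
    num_grids=3,  # ignored by the token arithmetic itself
    image_size=(600, 800),
    grid_size=336,
    patch_size=14,
    possible_resolutions=[(336, 336), (336, 672), (672, 336)],
    anyres=True,
    unpad=True,
) == 1368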
407
+ class HCXVisionProcessor(BaseImageProcessor):
408
  r"""
409
+ Constructs a VLM image processor.
410
+
411
+ This processor is based on [`CLIPImageProcessor`] and incorporates additional techniques
412
+ for handling high-resolution images, such as flexible resolution support (`anyres`), unpadding,
413
+ square padding, and multi-grid patching strategies.
414
+
415
  Args:
416
+ do_resize (bool): Whether to resize the image.
417
+ size (Dict[str, int], optional): Target size for resizing, typically with keys `"height"` and `"width"`.
418
+ anyres (bool): Whether to enable the any-resolution (`anyres`) feature, which allows flexible resolution handling via grid division.
419
+ unpad (bool): When `anyres` is enabled, whether to remove visual tokens corresponding to pure padding regions.
420
+ max_num_grids (int): Maximum number of grids allowed per image.
421
+ max_image_cnt (int): Maximum number of images that can be processed at once (used for batching).
422
+ num_queries_vis_abstractor (int): Number of visual query tokens per grid when using a visual resampler (e.g., Perceiver).
423
+ num_queries_vis_abstractor_video_fast (int): Number of visual queries for fast-path video frames.
424
+ num_queries_vis_abstractor_video_slow (int): Number of visual queries for slow-path video frames (e.g., first/last).
425
+ possible_resolutions (List): List of allowed resolution pairs when `anyres` is enabled. Example: [[336, 336], [336, 672], [672, 336]].
426
+ patch_size (int): Patch size for the Vision Transformer (ViT).
427
+ pad_to_square (bool): Whether to pad images to a square shape. If `False`, a center crop is applied to fit ViT input.
428
+ resample (PILImageResampling): Resampling method to use for resizing. Default is `BICUBIC`.
429
+ do_center_crop (bool): Whether to apply center cropping.
430
+ crop_size (Dict[str, int], optional): Size for center cropping.
431
+ do_rescale (bool): Whether to rescale pixel values.
432
+ rescale_factor (float or int): Factor to use for rescaling pixel values (typically `1/255`).
433
+ do_normalize (bool): Whether to normalize pixel values using `image_mean` and `image_std`.
434
+ image_mean (float or List[float], optional): Mean values for normalization. Can be a single float or list of floats per channel.
435
+ image_std (float or List[float], optional): Standard deviation values for normalization. Can be a single float or list of floats per channel.
436
+ do_convert_rgb (bool): Whether to convert the input image to RGB.
437
+ first_last_frames_slow (bool): Whether to treat the first and last frames of a video as “slow path” (processed differently).
438
+
439
+ Attributes:
440
+ model_input_names (List[str]): Names of the expected model inputs. Defaults to `["pixel_values"]`.
441
  """
442
 
443
  model_input_names = ["pixel_values"]
 
448
  size: Dict[str, int] = None,
449
  anyres: bool = False,
450
  unpad: bool = False,
451
+ max_num_grids: int = 9,
452
+ max_image_cnt: int = 12,
453
+ num_queries_vis_abstractor: int = 0,
454
+ num_queries_vis_abstractor_video_fast: int = 0,
455
+ num_queries_vis_abstractor_video_slow: int = 0,
456
  possible_resolutions: List = [],
457
  patch_size: int = 14,
458
  pad_to_square: bool = True,
 
465
  image_mean: Optional[Union[float, List[float]]] = None,
466
  image_std: Optional[Union[float, List[float]]] = None,
467
  do_convert_rgb: bool = True,
468
+ first_last_frames_slow: bool = False,
469
  **kwargs,
470
  ) -> None:
471
  super().__init__(**kwargs)
472
+ size = size if size is not None else {"shortest_edge": 512}
473
  size = get_size_dict(size, default_to_square=False)
474
+ crop_size = crop_size if crop_size is not None else {"height": 512, "width": 512}
475
  crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
476
 
477
  self.do_resize = do_resize
478
  self.size = size
479
  self.anyres = anyres
480
  self.unpad = unpad
481
+ self.max_num_grids = max_num_grids
482
+ self.max_image_cnt = max_image_cnt
483
+ self.num_queries_vis_abstractor = num_queries_vis_abstractor
484
  self.num_queries_vis_abstractor_video_fast = num_queries_vis_abstractor_video_fast
485
+ self.num_queries_vis_abstractor_video_slow = num_queries_vis_abstractor_video_slow
486
  self.possible_resolutions = [_resolution for _resolution in possible_resolutions]
487
  self.patch_size = patch_size
488
  self.pad_to_square = pad_to_square
 
495
  self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
496
  self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
497
  self.do_convert_rgb = do_convert_rgb
498
+ self.first_last_frames_slow = first_last_frames_slow
499
+
500
+ assert self.crop_size["height"] == self.crop_size["width"]
501
 
502
  def resize(
503
  self,
 
508
  input_data_format: Optional[Union[str, ChannelDimension]] = None,
509
  **kwargs,
510
  ) -> np.ndarray:
511
+ """
512
+ Resizes the input image to the specified target size.
513
+
514
+ Args:
515
+ image (np.ndarray): The input image to resize.
516
+ size (Dict[str, int]): A dictionary specifying the target size with keys `"height"` and `"width"`.
517
+ resample (PILImageResampling, optional): The resampling filter to use. Defaults to `BICUBIC`.
518
+ data_format (str or ChannelDimension, optional): The desired output data format (e.g., "channels_last").
519
+ input_data_format (str or ChannelDimension, optional): The input data format of the image.
520
+ **kwargs: Additional keyword arguments, if any.
521
+
522
+ Returns:
523
+ np.ndarray: The resized image as a NumPy array.
524
+ """
525
  default_to_square = True
526
  if "shortest_edge" in size:
527
  size = size["shortest_edge"]
 
563
  data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
564
  input_data_format: Optional[Union[str, ChannelDimension]] = None,
565
  ) -> Image.Image:
566
+ """
567
+ Applies a sequence of preprocessing operations to the input image(s), including resizing, cropping, rescaling,
568
+ normalization, and format conversion.
569
+
570
+ This method is typically used internally to prepare images for model input.
571
+
572
+ Args:
573
+ images (ImageInput): A single image or a batch of images to preprocess.
574
+ do_resize (bool, optional): Whether to resize the image(s).
575
+ size (Dict[str, int], optional): Target size for resizing, with keys `"height"` and `"width"`.
576
+ resample (PILImageResampling, optional): Resampling method to use for resizing.
577
+ do_center_crop (bool, optional): Whether to apply center cropping.
578
+ crop_size (int, optional): Size of the center crop (applied to both height and width).
579
+ do_rescale (bool, optional): Whether to rescale the image pixel values.
580
+ rescale_factor (float, optional): Factor to use when rescaling pixel values (e.g., 1/255).
581
+ do_normalize (bool, optional): Whether to normalize the image using `image_mean` and `image_std`.
582
+ image_mean (float or List[float], optional): Mean value(s) used for normalization.
583
+ image_std (float or List[float], optional): Standard deviation value(s) used for normalization.
584
+ data_format (ChannelDimension, optional): The desired output data format (e.g., `ChannelDimension.FIRST`).
585
+ input_data_format (str or ChannelDimension, optional): The format of the input image(s).
586
+
587
+ Returns:
588
+ Image.Image: The preprocessed image or batch of images, ready for model input.
589
+ """
590
  images = make_list_of_images(images)
591
 
592
  if do_resize:
593
  images = [
594
+ self.resize(
595
+ image=image,
596
+ size=size,
597
+ resample=resample,
598
+ input_data_format=input_data_format,
599
+ )
600
  for image in images
601
  ]
602
 
 
607
 
608
  if do_rescale:
609
  images = [
610
+ self.rescale(
611
+ image=image,
612
+ scale=rescale_factor,
613
+ input_data_format=input_data_format,
614
+ )
615
+ for image in images
616
  ]
617
 
618
  if do_normalize:
619
  images = [
620
+ self.normalize(
621
+ image=image,
622
+ mean=image_mean,
623
+ std=image_std,
624
+ input_data_format=input_data_format,
625
+ )
626
  for image in images
627
  ]
628
 
 
633
  return images
634
 
635
  def _resize_for_local_grids(
636
+ self,
637
+ image: np.array,
638
+ target_resolution: tuple,
639
+ resample,
640
+ input_data_format: ChannelDimension,
641
  ) -> np.array:
642
+ """
643
+ Resizes the image to the given target resolution for use in local grid processing.
644
+
645
+ This function ensures that the image is properly resized to match the (height, width) specified
646
+ in `target_resolution`, using the provided resampling method. It supports channel-first and
647
+ channel-last formats based on `input_data_format`.
648
+
649
+ Args:
650
+ image (np.array): Input image as a NumPy array.
651
+ target_resolution (tuple): Target resolution as (height, width) for resizing.
652
+ resample: Resampling method to use (e.g., `PILImageResampling.BICUBIC`).
653
+ input_data_format (ChannelDimension): Format of the input image (e.g., `ChannelDimension.FIRST` or `LAST`).
654
+
655
+ Returns:
656
+ np.array: The resized image in NumPy array format.
657
+ """
658
  new_height, new_width = _get_local_grids_output_size(image, target_resolution, input_data_format)
659
 
660
  # Resize the image
661
+ resized_image = resize(
662
+ image,
663
+ (new_height, new_width),
664
+ resample=resample,
665
+ input_data_format=input_data_format,
666
+ )
667
 
668
  return resized_image
669
 
670
  def _pad_for_patching(
671
+ self,
672
+ image: np.array,
673
+ target_resolution: tuple,
674
+ input_data_format: ChannelDimension,
675
  ) -> np.array:
676
  """
677
+ Pads the image to match the target resolution, ensuring compatibility with patch-based models.
678
+
679
+ This is typically used to make sure the image dimensions are divisible by the patch size or to
680
+ meet specific model input requirements. Padding is applied symmetrically where needed.
681
+
682
+ Args:
683
+ image (np.array): Input image as a NumPy array.
684
+ target_resolution (tuple): The desired resolution after padding, in the format (height, width).
685
+ input_data_format (ChannelDimension): Format of the input image (e.g., `ChannelDimension.FIRST` or `LAST`).
686
+
687
+ Returns:
688
+ np.array: The padded image as a NumPy array.
689
  """
690
  target_height, target_width = target_resolution
691
 
 
708
  data_format: ChannelDimension,
709
  input_data_format: ChannelDimension,
710
  ) -> List[np.array]:
711
+ """
712
+ Splits the input image into multiple local grids based on possible resolutions and grid size.
713
+
714
+ The function selects the best resolution from the provided list, resizes the image accordingly,
715
+ and divides it into non-overlapping grid patches of size (grid_size x grid_size). It is commonly
716
+ used for any-resolution (anyres) visual processing.
717
+
718
+ Args:
719
+ image (np.array): Input image as a NumPy array.
720
+ possible_resolutions (List[Tuple[int, int]]): List of allowed resolutions to choose from.
721
+ grid_size (int): The size of each grid patch (e.g., 336 pixels).
722
+ resample (PILImageResampling): Resampling method used during resizing.
723
+ data_format (ChannelDimension): Output data format (e.g., `ChannelDimension.FIRST`).
724
+ input_data_format (ChannelDimension): Input data format of the image.
725
+
726
+ Returns:
727
+ List[np.array]: A list of grid image patches as NumPy arrays.
728
+ """
729
  if not isinstance(possible_resolutions, list):
730
  raise ValueError("possible_resolutions must be a list of possible resolutions.")
731
 
732
  image_size = get_image_size(image, channel_dim=input_data_format)
733
  best_resolution = select_best_resolution(image_size, possible_resolutions)
734
  resized_image = self._resize_for_local_grids(
735
+ image,
736
+ best_resolution,
737
+ resample=resample,
738
+ input_data_format=input_data_format,
739
  )
740
  padded_image = self._pad_for_patching(resized_image, best_resolution, input_data_format=input_data_format)
741
  local_grids = divide_to_grids(padded_image, grid_size=grid_size, input_data_format=input_data_format)
 
755
  size: Dict[str, int] = None,
756
  anyres: bool = None,
757
  unpad: bool = None,
758
+ is_video_list: List[bool] = None,
759
  possible_resolutions: List = None,
760
  patch_size: int = None,
761
  pad_to_square: bool = None,
 
771
  return_tensors: Optional[Union[str, TensorType]] = None,
772
  data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
773
  input_data_format: Optional[Union[str, ChannelDimension]] = None,
774
+ is_first_or_last_frames: List[bool] = False,
 
 
 
775
  ):
776
  """
777
+ Preprocesses images using HCXVisionProcessor.
778
+
779
+ This method prepares images for visual language models by applying resizing, padding, cropping,
780
+ normalization, and tokenization into visual patches. In video mode, each frame is converted to
781
+ a 1D sequence of patches. The `unpad` option is disabled when processing videos.
782
+
783
+ Args:
784
+ images (ImageInput): A single image or a batch of images (PIL, NumPy, or tensor format).
785
+ do_resize (bool, optional): Whether to resize the image(s).
786
+ size (Dict[str, int], optional): Resize target with keys `"height"` and `"width"`.
787
+ anyres (bool, optional): Whether to use any-resolution processing with grid splitting.
788
+ unpad (bool, optional): Whether to remove visual tokens that belong to padding areas (only in non-video mode).
789
+ is_video_list (List[bool], optional): A list indicating which inputs are video frames.
790
+ possible_resolutions (List, optional): List of resolution pairs allowed in `anyres` mode.
791
+ patch_size (int, optional): Patch size for the Vision Transformer (ViT).
792
+ pad_to_square (bool, optional): Whether to pad the image to a square.
793
+ resample (PILImageResampling, optional): Resampling method to use for resizing.
794
+ do_center_crop (bool, optional): Whether to apply center cropping.
795
+ crop_size (int, optional): Target crop size for center cropping.
796
+ do_rescale (bool, optional): Whether to rescale image pixel values.
797
+ rescale_factor (float, optional): Factor for pixel rescaling, e.g., `1/255`.
798
+ do_normalize (bool, optional): Whether to normalize using mean and std.
799
+ image_mean (float or List[float], optional): Mean value(s) for normalization.
800
+ image_std (float or List[float], optional): Standard deviation(s) for normalization.
801
+ do_convert_rgb (bool, optional): Whether to convert the image to RGB.
802
+ return_tensors (str or TensorType, optional): Desired output tensor type (e.g., "pt" for PyTorch).
803
+ data_format (ChannelDimension, optional): Output data format (e.g., `ChannelDimension.FIRST`).
804
+ input_data_format (str or ChannelDimension, optional): Format of the input image.
805
+ is_first_or_last_frames (List[bool], optional): Flags indicating whether each image is a first/last video frame.
806
+
807
+ Returns:
808
+ BatchFeature: A `BatchFeature` whose main fields are:
809
+ pixel_values (List[torch.Tensor]): A list of 4D image tensors ready for model input.
810
+ image_sizes (List[List[int]]): A list of lists containing the original width and height [width, height]
811
+ of each image, e.g., `[[width, height], ...]`.
812
+ vision_query_lengths (List[int]): A list of integers representing the number of visual tokens
813
+ each image contributes to the LLM input.
814
  """
 
815
  do_resize = do_resize if do_resize is not None else self.do_resize
816
  size = size if size is not None else self.size
817
  size = get_size_dict(size, param_name="size", default_to_square=False)
818
  anyres = anyres if anyres is not None else self.anyres
819
  unpad = unpad if unpad is not None else self.unpad
 
820
  possible_resolutions = possible_resolutions if possible_resolutions is not None else self.possible_resolutions
821
  patch_size = patch_size if patch_size is not None else self.patch_size
822
  pad_to_square = pad_to_square if pad_to_square is not None else self.pad_to_square
 
831
  image_std = image_std if image_std is not None else self.image_std
832
  do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
833
 
834
  images = make_list_of_images(images)
835
 
836
  if not valid_images(images):
 
861
 
862
  assert crop_size["height"] == crop_size["width"]
863
 
864
+ # Padding operations for the global image can become a bottleneck when the original image width or height is large.
865
+ # To mitigate this, the image is first resized such that the longest side is scaled proportionally based on size["shortest_edge"],
866
+ # and then padding is applied to reach the target dimensions.
867
  if anyres:
868
  anyres_global_images = copy.deepcopy(images)
869
  if pad_to_square:
870
  background_color = tuple(int(x * 255) for x in self.image_mean)
871
  anyres_global_images = [
872
+ resize_longside(
873
+ copy.deepcopy(image),
874
+ size["shortest_edge"],
875
+ resample,
876
+ input_data_format,
877
+ )
878
  for image in anyres_global_images
879
  ]
880
  anyres_global_images = [
881
+ expand2square(
882
+ image,
883
+ background_color=background_color,
884
+ input_data_format=input_data_format,
885
+ )[0]
886
  for image in anyres_global_images
887
  ]
888
  else:
889
  anyres_global_images = [
890
  self.resize(
891
  image=image,
892
+ size={
893
+ "height": size["shortest_edge"],
894
+ "width": size["shortest_edge"],
895
+ },
896
  resample=resample,
897
  input_data_format=input_data_format,
898
  )
 
906
  resize_longside(image, size["shortest_edge"], resample, input_data_format) for image in images
907
  ]
908
  images = [
909
+ expand2square(
910
+ image,
911
+ background_color=background_color,
912
+ input_data_format=input_data_format,
913
+ )[0]
914
  for image in images
915
  ]
916
 
917
+ num_queries_vis_abstractors = []
918
+ num_queries_vis_abstractors_slow = []
919
+ first_last_frames_slows = []
920
+
921
+ for image, is_video, anyres_global_image, image_size in zip(
922
+ images, is_video_list, anyres_global_images, image_sizes
923
+ ):
924
+ if is_video:
925
+ num_queries_vis_abstractor = self.num_queries_vis_abstractor_video_fast
926
+ num_queries_vis_abstractor_slow = self.num_queries_vis_abstractor_video_slow
927
+ else:
928
+ num_queries_vis_abstractor = self.num_queries_vis_abstractor
929
+ num_queries_vis_abstractor_slow = 0
930
+
931
+ num_queries_vis_abstractors.append(num_queries_vis_abstractor)
932
+ num_queries_vis_abstractors_slow.append(num_queries_vis_abstractor_slow)
933
+ first_last_frames_slows.append(self.first_last_frames_slow)
934
+
935
  if anyres:
936
  # convert image into a list of grids
937
  # we intentionally use the same data format as the input data format
 
943
  data_format=input_data_format,
944
  input_data_format=input_data_format,
945
  )
946
+ # Global image (thumbnail) is not used for video inputs.
947
  if not is_video:
948
  image_grids = [anyres_global_image] + image_grids
949
  else:
 
968
  pixel_values = np.array(pixel_values)
969
  new_images.append(pixel_values)
970
 
971
+ num_grids = pixel_values.shape[0]
972
+
973
  vision_query_length = determine_anyres_num_vision_patches(
974
+ num_grids=num_grids,
975
  image_size=image_size,
976
  grid_size=crop_size["height"],
977
  patch_size=patch_size,
978
  possible_resolutions=possible_resolutions,
979
  anyres=anyres,
980
+ unpad=False if is_video else unpad,
981
  num_queries_vis_abstractor=num_queries_vis_abstractor,
982
  num_queries_vis_abstractor_slow=num_queries_vis_abstractor_slow,
983
  is_video=is_video,
984
+ first_last_frames_slow=self.first_last_frames_slow,
985
+ is_first_or_last_frames=self.first_last_frames_slow,
986
  )
987
 
988
  vision_query_lengths.append(vision_query_length)
989
 
 
 
 
990
  data = {
991
+ "pixel_values": [[torch.tensor(new_image) for new_image in new_images]],
992
+ "image_sizes": [[[image_size[1], image_size[0]] for image_size in image_sizes]],
993
+ "vision_query_lengths": [vision_query_lengths],
994
+ "is_videos": [is_video_list],
995
+ "num_queries_vis_abstractors": [num_queries_vis_abstractors],
996
+ "num_queries_vis_abstractors_slow": [num_queries_vis_abstractors_slow],
997
+ "first_last_frames_slows": [first_last_frames_slows],
998
  }
999
 
1000
+ return BatchFeature(data=data)
1001
 
1002
+ def load_images_videos(self, vlm_chat):
1003
+ """
1004
+ Loads and prepares images or video frames from a VLM chat input.
1005
 
1006
+ This function parses the input `vlm_chat` object, extracts image or video sources,
1007
+ and loads them into memory as PIL or NumPy images, ready for preprocessing.
 
1008
 
1009
+ Args:
1010
+ vlm_chat: A VLM chat input structure containing multimodal elements
1011
+ (e.g., images, videos, URLs, or file paths). The format is typically a list of messages
1012
+ with associated media fields.
1013
 
1014
+ Returns:
1015
+ Tuple[List[dict], List[PIL.Image.Image], List[bool]]:
1016
+ The rewritten chat (with one entry per extracted video frame), the flattened list of loaded images and video frames, and a parallel list of flags marking which entries come from videos.
1017
+ """
1018
+ vlm_chat = copy.deepcopy(vlm_chat)
1019
+
1020
+ new_vlm_chat = []
1021
+ all_images = [] # images + images_from_videos
1022
+ is_video_list = []
1023
+
1024
+ for line in vlm_chat:
1025
+ if "content" in line:
1026
+ content = line["content"]
1027
+
1028
+ if "image" in content:
1029
+ if "filename" not in content:
1030
+ content["filename"] = f"{uuid.uuid4().hex}.jpg"
1031
+ image_pil = load_image(content["image"])
1032
+ all_images.append(image_pil)
1033
+ is_video_list.append(False)
1034
+ new_vlm_chat.append(line)
1035
+
1036
+ elif "video" in content:
1037
+ video_bytesio = load_video_to_bytesio(content["video"])
1038
+ pil_img_frames, video_time_stamp = process_video(
1039
+ video_bytesio, self.max_num_grids, self.max_image_cnt, self.crop_size["width"]
1040
+ )
1041
+ all_images.extend(pil_img_frames)
1042
+ is_video_list.extend([True] * len(pil_img_frames))
1043
 
1044
+ if "filename" not in content:
1045
+ content["filename"] = f"{uuid.uuid4().hex}.mp4"
1046
 
1047
+ for i, image_time_stamp in enumerate(video_time_stamp):
1048
+ new_line = copy.deepcopy(line)
1049
+ basename, ext = os.path.splitext(content["filename"])
1050
+ new_line["content"]["filename"] = f"{basename}-{i}{ext}"
1051
+ new_line["content"]["video_time_stamp"] = image_time_stamp
1052
 
1053
+ if i == len(video_time_stamp) - 1:
1054
+ new_line["content"]["is_final_grid"] = True
1055
 
1056
+ for last_frame_target_key in ["lens_keywords", "lens_local_keywords", "speech_to_text"]:
1057
+ if last_frame_target_key in content:
1058
+ new_line["content"][last_frame_target_key] = content[last_frame_target_key]
1059
 
1060
+ new_vlm_chat.append(new_line)
1061
+ else:
1062
+ new_vlm_chat.append(line)
1063
 
1064
+ return new_vlm_chat, all_images, is_video_list
 
1065
 
1066
 
1067
+ def process_video(video_bytesio, max_num_grids, max_image_cnt, vit_input_size):
1068
+ """
1069
+ Processes a video file and extracts frames suitable for vision transformer (ViT) input.
1070
 
1071
+ The function reads video data from a BytesIO object, extracts a limited number of frames
1072
+ based on `max_num_grids` and `max_image_cnt`, and resizes them to the appropriate ViT input size.
1073
 
1074
+ Args:
1075
+ video_bytesio (io.BytesIO): A BytesIO object containing the raw video file data.
1076
+ max_num_grids (int): The maximum number of grids allowed (e.g., for tiling or patching).
1077
+ max_image_cnt (int): The maximum number of frames to extract from the video.
1078
+ vit_input_size (int): The desired input size (height and width) for the ViT model.
1079
 
1080
+ Returns:
1081
+ List[np.ndarray]: A list of processed video frames as NumPy arrays, each resized to (vit_input_size, vit_input_size).
1082
+ """
1083
+ frames, time_interval = video_decoder(
1084
+ video_bytesio, max_num_grids=max_num_grids, max_image_cnt=max_image_cnt, default_interval=0.4
1085
+ )
1086
+ pil_img_frames, video_time_stamp = combine_frames_into_images(
1087
+ frames, time_interval, max_grid_shape=(max_num_grids, 1), vit_input_size=vit_input_size
1088
+ )
1089
 
1090
+ return pil_img_frames, video_time_stamp
1091
 
1092
+
1093
+ def load_image(image_src):
1094
  """
1095
+ Loads an image from various sources (file path, URL, base64 string, or raw bytes)
1096
+ and returns it as a PIL Image object.
1097
 
1098
  Args:
1099
+ image_src (str or bytes): The image source. It can be:
1100
+ - A local file path
1101
+ - A URL
1102
+ - A base64-encoded string
1103
+ - Raw image bytes
1104
 
1105
  Returns:
1106
+ PIL.Image.Image: The loaded image as a PIL Image object.
1107
+
1108
+ Raises:
1109
+ ValueError: If the image cannot be loaded or the format is unsupported.
1110
+ TypeError: If the input is not of type str or bytes.
1111
  """
1112
+ try:
1113
+ # 1. If input is bytes type
1114
+ if isinstance(image_src, bytes):
1115
+ return Image.open(io.BytesIO(image_src))
1116
+
1117
+ # 2. If input is str type (path, URL, base64)
1118
+ if isinstance(image_src, str):
1119
+ # 2a. Check if it's a Base64 data URI format ('data:image/...')
1120
+ if image_src.startswith("data:image"):
1121
+ try:
1122
+ # Remove the 'data:image/...;base64,' part and decode
1123
+ header, encoded = image_src.split(",", 1)
1124
+ image_bytes = base64.b64decode(encoded)
1125
+ return Image.open(io.BytesIO(image_bytes))
1126
+ except (ValueError, base64.binascii.Error) as e:
1127
+ raise ValueError(f"Invalid base64 data URI format: {e}") from e
1128
+
1129
+ # 2b. Check if it's a URL format ('http://' or 'https://')
1130
+ elif image_src.startswith("http://") or image_src.startswith("https://"):
1131
+ try:
1132
+ response = requests.get(image_src, stream=True, timeout=10)
1133
+ response.raise_for_status() # Raise an exception for HTTP errors
1134
+ image_bytes = response.content
1135
+ return Image.open(io.BytesIO(image_bytes))
1136
+ except requests.exceptions.RequestException as e:
1137
+ raise ValueError(f"Error loading image from URL '{image_src}': {e}") from e
1138
+
1139
+ # 2c. Assume it's a local file path
1140
  else:
1141
+ return Image.open(image_src)
 
1142
 
1143
+ else:
1144
+ raise TypeError(f"Unsupported image_src type: {type(image_src)}")
1145
 
1146
+ # Common exception handling
1147
+ except FileNotFoundError:
1148
+ raise ValueError(f"Image loading error: File not found '{image_src}'")
1149
+ except UnidentifiedImageError:
1150
+ raise ValueError("Image loading error: Cannot identify image file format.")
1151
+ except IOError as e:
1152
+ raise ValueError(f"Image loading error (I/O): {e}") from e
1153
+ except Exception as e:
1154
+ raise ValueError(f"Unexpected error during image loading: {e}") from e
1155
 
1156
+
1157
+ def load_video_to_bytesio(video_src):
 
 
 
 
1158
  """
1159
+ Loads video data from various sources (file path, URL, base64 string, or raw bytes)
1160
+ and returns an `io.BytesIO` object containing the raw video content.
1161
 
1162
  Args:
1163
+ video_src (str or bytes): The video source. Supported formats include:
1164
+ - Local file path
1165
+ - URL
1166
+ - Base64-encoded data URI string
1167
+ - Raw video bytes
1168
 
1169
  Returns:
1170
+ io.BytesIO: A `BytesIO` object containing the loaded video data.
1171
+
1172
+ Raises:
1173
+ ValueError: If the video cannot be loaded due to issues such as an invalid path,
1174
+ URL failure, malformed base64 string, or unsupported format.
1175
+ TypeError: If the input is not a `str` or `bytes` object.
1176
  """
1177
+ video_bytes = None
1178
+ try:
1179
+ # 1. If input is bytes type
1180
+ if isinstance(video_src, bytes):
1181
+ video_bytes = video_src
1182
+
1183
+ # 2. If input is str type (path, URL, base64)
1184
+ elif isinstance(video_src, str):
1185
+ # 2a. Check if it's a Base64 data URI format ('data:video/...')
1186
+ if video_src.startswith("data:video"):
1187
+ try:
1188
+ # Remove the 'data:video/...;base64,' part and decode
1189
+ header, encoded = video_src.split(",", 1)
1190
+ video_bytes = base64.b64decode(encoded)
1191
+ except (ValueError, base64.binascii.Error) as e:
1192
+ raise ValueError(f"Invalid base64 data URI format: {e}") from e
1193
+
1194
+ # 2b. Check if it looks like a URL
1195
+ elif urlparse(video_src).scheme in ("http", "https"):
1196
+ try:
1197
+ response = requests.get(
1198
+ video_src, stream=True, timeout=30
1199
+ ) # Increased timeout for potentially large videos
1200
+ response.raise_for_status() # Raise an exception for HTTP errors (4xx or 5xx)
1201
+ # Read all content from the stream into bytes
1202
+ video_bytes = response.content
1203
+ except requests.exceptions.MissingSchema:
1204
+ # If urlparse thinks it's a scheme but requests disagrees (e.g., "http:/example.com")
1205
+ # Treat it as a potential file path below.
1206
+ pass
1207
+ except requests.exceptions.RequestException as e:
1208
+ raise ValueError(f"Error loading video from URL '{video_src}': {e}") from e
1209
+
1210
+ # 2c. Assume it's a local file path if not base64 or confirmed URL
1211
+ if video_bytes is None: # Only attempt file read if not already loaded as base64 or URL failed gracefully
1212
+ # Check if it could potentially be a file path
1213
+ # Note: This check is basic. A string like "http:/path/file" might incorrectly be treated as a path here
1214
+ # if the requests call failed due to MissingSchema. More robust path validation could be added.
1215
+ if (
1216
+ os.path.exists(video_src) or "/" in video_src or "\\" in video_src
1217
+ ): # Basic check if it resembles a path
1218
+ try:
1219
+ with open(video_src, "rb") as f:
1220
+ video_bytes = f.read()
1221
+ except FileNotFoundError:
1222
+ raise ValueError(f"Video loading error: File not found at path '{video_src}'")
1223
+ except IsADirectoryError:
1224
+ raise ValueError(f"Video loading error: Path '{video_src}' is a directory, not a file.")
1225
+ except IOError as e:
1226
+ raise ValueError(f"Video loading error (I/O) for path '{video_src}': {e}") from e
1227
+ else:
1228
+ # If it's not base64, not a valid downloadable URL, and doesn't look like a path/doesn't exist
1229
+ raise ValueError(f"Unsupported string input format or resource not found: '{video_src}'")
1230
+
1231
+ # 3. If the type is unsupported
1232
+ else:
1233
+ raise TypeError(f"Unsupported video_src type: {type(video_src)}")
1234
 
1235
+ # Final check if video_bytes was successfully obtained
1236
+ if video_bytes is None:
1237
+ raise ValueError(f"Could not load video data from the provided source: {video_src}")
 
1238
 
1239
+ # Return the bytes wrapped in BytesIO
1240
+ return io.BytesIO(video_bytes)
1241
 
1242
+ # Catch specific exceptions first for better error reporting
1243
+ except FileNotFoundError as e: # Should be caught above, but as a safeguard
1244
+ raise ValueError(f"Video loading error: File not found '{video_src}'") from e
1245
+ except requests.exceptions.RequestException as e: # Already handled, but for clarity
1246
+ raise ValueError(f"Video loading error (Network): {e}") from e
1247
+ except (ValueError, TypeError) as e: # Re-raise ValueErrors/TypeErrors raised intentionally within the try block
1248
+ raise e
1249
+ except Exception as e:
1250
+ # Catch any other unexpected errors during processing
1251
+ raise ValueError(f"Unexpected error during video loading from source '{video_src}': {e}") from e
1252
 
 
1253
 
1254
+ def video_decoder(video_bytesio, max_num_grids, max_image_cnt, default_interval=0.4):
1255
+ """
1256
+ Decodes video data from a BytesIO object and returns a list of extracted frames.
1257
 
1258
+ Args:
1259
+ video_bytesio (io.BytesIO): A BytesIO object containing the raw video data.
1260
+ max_num_grids (int): Maximum number of grids allowed per image. Used to determine how many frames to extract.
1261
+ max_image_cnt (int): Maximum number of frames to extract from the video.
1262
+ default_interval (float, optional): Target sampling interval in seconds between extracted frames. Defaults to 0.4.
1263
+
1264
+ Returns:
1265
+ Tuple:
1266
+ frames (List[PIL.Image.Image]): A list of extracted frames as PIL Images.
1267
+ time_interval (float): Time interval (in seconds) between selected frames.
1268
  """
1269
+ error_messages = []
1270
+ frames = []
1271
+
1272
+ # 1. Try decoding the video using Decord.
1273
+ try:
1274
+ vr = VideoReader(video_bytesio, ctx=cpu(0), num_threads=8)
1275
+ fps = vr.get_avg_fps()
1276
+ play_time = len(vr) / fps
1277
+ total_frames = len(vr)
1278
+ frame_indices, time_interval = extract_frame_indices(
1279
+ play_time, total_frames, fps, max_num_grids, max_image_cnt, default_interval=default_interval
1280
+ ) # Sample every 0.4 seconds; if the video is too long, apply uniform sampling instead.
1281
+ if frame_indices is None:
1282
+ frame_indices = range(len(vr)) # Convert all frames.
1283
+ batch_frames = vr.get_batch(frame_indices).asnumpy()
1284
+ frames = [Image.fromarray(frame).convert("RGB") for frame in batch_frames]
1285
+ return frames, time_interval
1286
+ except Exception as e:
1287
+ print("error with decord")
1288
+ error_messages.append(f"Decord 실패: {e}")
1289
+
1290
+ # 2. Fallback: Try decoding the video using PyAV.
1291
+ try:
1292
+ container = av.open(video_bytesio)
1293
+ fps = float(container.streams.video[0].average_rate)
1294
+ total_frames = container.streams.video[0].frames
1295
+ play_time = total_frames / fps
1296
+ frame_indices, time_interval = extract_frame_indices(
1297
+ play_time, total_frames, fps, max_num_grids, max_image_cnt, default_interval=default_interval
1298
+ ) # Sample frames every 0.4 seconds. If the video is long, use uniform sampling to limit the number of frames.
1299
+ # Even if frame_indices were assigned using Decord, reprocess them to be compatible with PyAV.
1300
+ target_indices = None if frame_indices is None else set(frame_indices)
1301
+ frames = []
1302
+ for i, frame in enumerate(container.decode(video=0)):
1303
+ if target_indices is not None and i not in target_indices:
1304
+ continue # Skip frames that are not in the required indices.
1305
+ pil_frame = Image.fromarray(frame.to_ndarray(format="rgb24")).convert("RGB")
1306
+ frames.append(pil_frame)
1307
+ if frames:
1308
+ return frames, time_interval
1309
+ else:
1310
+ raise Exception("Decoding with PyAV succeeded, but no frames were extracted.")
1311
+ except Exception as e:
1312
+ error_messages.append(f"PyAV failed: {e}")
1313
+
1314
+ # 3. Fallback: Try decoding the video using OpenCV.
1315
+ try:
1316
+ # cv2.VideoCapture cannot read from an in-memory buffer, so persist the bytes to a temporary file first.
1317
+ import tempfile
1318
+ with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as temp_video:
+ temp_video.write(video_bytesio.getvalue())
1319
+ cap = cv2.VideoCapture(temp_video.name)
1320
+ total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
1321
+ fps = cap.get(cv2.CAP_PROP_FPS)
1322
+ play_time = total_frames / fps
1323
+ frame_indices, time_interval = extract_frame_indices(
1324
+ play_time, total_frames, fps, max_num_grids, max_image_cnt, default_interval=default_interval
1325
+ ) # Sample frames every 0.4 seconds; if the video is too long, apply uniform sampling to limit the total number of frames.
1326
+ if frame_indices is None:
1327
+ frame_indices = range(total_frames) # Convert all frames.
1328
+
1329
+ index_set = set(frame_indices) # Convert to a set for faster lookup.
1330
+ current_index = 0
1331
+
1332
+ while cap.isOpened():
1333
+ ret, frame = cap.read()
1334
+ if not ret:
1335
+ break
1336
+ if current_index in index_set:
1337
+ frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).convert("RGB"))
1338
+ current_index += 1
1339
+ if current_index > max(index_set): # Stop processing once all required indices have been handled.
1340
+ break
1341
+
1342
+ cap.release()
1343
+ if frames:
1344
+ return frames, time_interval
1345
+ except Exception as e:
1346
+ error_messages.append(f"OpenCV failed: {e}")
1347
+
1348
+ if error_messages:
1349
+ raise Exception(f"All decoding attempts have failed.: {error_messages}")
1350
+
1351
+
1352
+ def convert_format_for_multi_image(img, json, convert_key_list=["words", "text", "objects", "entities"]):
1353
+ """
1354
+ Converts the format of image and annotation data from a single-image dataset to a multi-image dataset format.
1355
 
1356
+ Single-image datasets typically return a single image and its associated annotation as individual objects.
1357
+ This function wraps them in a dictionary format used by multi-image datasets.
1358
 
1359
  Args:
1360
+ img: The input image (e.g., a PIL Image or NumPy array).
1361
+ json: The annotation data associated with the image.
1362
+ convert_key_list (List[str], optional): A list of keys to extract and convert from the original JSON.
1363
+ Defaults to ["words", "text", "objects", "entities"].
 
 
1364
 
1365
  Returns:
1366
+ Tuple[bool, Dict, Dict]:
1367
+ - A flag indicating whether the input was already in multi-image format.
+ - A dictionary mapping image IDs to images (e.g., {"00": img}).
1368
+ - A dictionary mapping image IDs to the corresponding annotation entries (wrapped under the same IDs).
1369
  """
1370
+ is_multi_image_dataset = isinstance(img, dict)
1371
+ if not is_multi_image_dataset:
1372
+ img = {"00": img}
 
 
 
 
 
1373
 
1374
+ for convert_key in convert_key_list:
1375
+ if convert_key in json:
1376
+ json[convert_key] = {"00": json[convert_key]}
 
 
 
 
 
 
 
1377
 
1378
+ for json_key in json:
1379
+ if "region" in json_key:
1380
+ json[json_key] = {"00": json[json_key]}
 
 
1381
 
1382
+ return is_multi_image_dataset, img, json
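An illustrative call with a made-up annotation dict, showing how a single-image sample is wrapped into the multi-image layout:

```python
from PIL import Image

# Hypothetical single-image sample: one PIL image plus a flat annotation dict.
img = Image.new("RGB", (64, 64))
ann = {"words": ["hello", "world"], "region_boxes": [[0, 0, 10, 10]]}

is_multi, img_dict, ann_dict = convert_format_for_multi_image(img, ann)
# is_multi -> False (the input was a single image)
# img_dict -> {"00": <PIL.Image.Image>}
# ann_dict -> {"words": {"00": ["hello", "world"]}, "region_boxes": {"00": [[0, 0, 10, 10]]}}
```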
1383
 
1384
+
1385
+ def convert_tags_for_video(img, json):
 
 
 
 
 
1386
  """
1387
+ Converts <video_00> tags to <image_xx> tags based on the number of video frames.
1388
+
1389
+ In video datasets, annotations often use a generic <video_00> tag. This function replaces that tag
1390
+ with frame-specific tags such as <image_00>, <image_01>, ..., <image_NN> based on the number of frames in `img`.
1391
 
1392
  Args:
1393
+ img: A list of video frames (e.g., list of PIL Images or NumPy arrays).
1394
+ json: The annotation data containing <video_00> tags to be replaced.
 
 
 
1395
 
1396
  Returns:
1397
+ Tuple: The (unchanged) list of frames and the updated annotation JSON with frame-specific <image_xx> tags.
1398
  """
1399
+ image_tag = "".join([f"<image_{idx:02d}>" for idx in range(len(img))])
1400
+ # image_tag = "<image_00>" # Use this format to construct and insert image-specific tags.
1401
+ for json_key in json:
1402
+ if "qa_pairs" in json_key:
1403
+ new_qa_pairs = []
1404
+ for qa_pair in json[json_key]:
1405
+ question = qa_pair[0]
1406
+ # Replace <video_00> tags with corresponding <image_xx> tags.
1407
+ question = question.replace("<video_00>", image_tag)
1408
+ new_qa_pairs.append([question, qa_pair[1]])
1409
+ json[json_key] = new_qa_pairs
1410
+
1411
+ return img, json
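For example, with a hypothetical three-frame clip and a `qa_pairs`-style annotation:

```python
from PIL import Image

# Hypothetical three-frame clip and a qa_pairs-style annotation.
frames = [Image.new("RGB", (64, 64)) for _ in range(3)]
ann = {"qa_pairs": [["<video_00> What happens in this clip?", "A cat jumps."]]}

frames, ann = convert_tags_for_video(frames, ann)
# ann["qa_pairs"][0][0] -> "<image_00><image_01><image_02> What happens in this clip?"
```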
1412
+
1413
+
1414
+ def split_list(input_list, split_value):
1415
+ """
1416
+ Splits a list into sublists using a specified delimiter value.
1417
 
1418
+ Each time `split_value` is encountered in `input_list`, a new sublist is started.
1419
+ The delimiter itself is not included in the output.
 
 
 
 
 
 
1420
 
1421
+ Args:
1422
+ input_list (List[Any]): The input list to split.
1423
+ split_value (Any): The value used as the delimiter for splitting.
1424
+
1425
+ Returns:
1426
+ List[List[Any]]: A list of sublists, split by the specified delimiter.
 
1427
 
1428
+ Example:
1429
+ >>> split_list(["a", "b", "|", "c", "d", "|", "e"], "|")
1430
+ [['a', 'b'], ['c', 'd'], ['e']]
1431
+ """
1432
+ temp_list = []
1433
+ result = []
1434
 
1435
+ for value in input_list:
1436
+ if value == split_value:
1437
+ result.append(temp_list)
1438
+ temp_list = []
1439
+ else:
1440
+ temp_list.append(value)
1441
+ result.append(temp_list)
1442
+
1443
+ return result
1444
+
1445
+
1446
+ def combine_frames_into_images(frames, time_interval, max_grid_shape=(3, 3), vit_input_size=378):
1447
  """
1448
+ Combines a sequence of video frames into grid-based images and generates corresponding time range labels.
1449
+
1450
+ Frames are grouped and arranged into a grid (e.g., 3x3) such that each combined image contains up to
1451
+ `max_grid_shape[0] * max_grid_shape[1]` frames. Each combined image is resized to the given ViT input size.
1452
 
1453
  Args:
1454
+ frames (List[PIL.Image.Image]): A list of frames extracted from a video.
1455
+ time_interval (float): Time interval (in seconds) between consecutive frames.
1456
+ max_grid_shape (Tuple[int, int], optional): The maximum grid shape as (rows, cols). Defaults to (3, 3).
1457
+ vit_input_size (int, optional): The target size (height and width) for the Vision Transformer input. Defaults to 378.
1458
 
1459
  Returns:
1460
+ Tuple:
1461
+ image_list (List[PIL.Image.Image]): A list of grid-combined images.
1462
+ image_time_stamps (List[str]): A list of time span labels for each combined image,
1463
+ e.g., ["0.00s~1.50s", "1.50s~3.00s", ...].
1464
  """
1465
+ # grid_size = int(np.sqrt(max_num_grids))
1466
+ # assert grid_size**2 == max_num_grids, "max_num_grids must be a perfect square."
1467
+ max_num_grids = max_grid_shape[0] * max_grid_shape[1]
1468
+ assert (
1469
+ max_grid_shape[1] == 1
1470
+ ), f"For video processing, decided to concatenate frames horizontally into a wide image."
1471
+
1472
+ # List to store the resulting combined images.
1473
+ image_list = []
1474
+
1475
+ # Calculate the number of canvases needed.
1476
+ num_frames = len(frames)
1477
+ num_canvases = num_frames // max_num_grids
1478
+ leftover_frames = num_frames % max_num_grids
1479
+
1480
+ time_stamp = 0 # second
1481
+ image_time_stamps = []
1482
+
1483
+ for canvas_idx in range(num_canvases):
1484
+ # Initialize the current canvas.
1485
+ combined_image = Image.new(
1486
+ "RGB", (vit_input_size * max_grid_shape[0], vit_input_size * max_grid_shape[1]), color=(0, 0, 0)
1487
+ )
1488
 
1489
+ # Determine the frames to fill in the current canvas.
1490
+ start_idx = canvas_idx * max_num_grids
1491
+ end_idx = min(start_idx + max_num_grids, num_frames)
1492
 
1493
+ for idx in range(start_idx, end_idx):
1494
+ img = frames[idx]
 
 
 
 
1495
 
1496
+ # Resize each frame to a square shape.
1497
+ img_resized = img.resize((vit_input_size, vit_input_size))
1498
 
1499
+ # Calculate the (row, column) position to place the frame within the grid layout.
1500
+ local_idx = idx - start_idx
1501
+ x_offset = (local_idx % max_grid_shape[0]) * vit_input_size
1502
+ y_offset = (local_idx // max_grid_shape[0]) * vit_input_size
1503
 
1504
+ # Calculate the position to place the frame in the grid.
1505
+ combined_image.paste(img_resized, (x_offset, y_offset))
1506
+
1507
+ # Append the current canvas to the result list.
1508
+ image_list.append(combined_image)
1509
+ frame_cnt = end_idx - start_idx
1510
+ image_time_stamps.append(f"{time_stamp:.2f}s~{time_stamp + frame_cnt * time_interval:.2f}s")
1511
+ time_stamp += frame_cnt * time_interval
1512
+
1513
+ if leftover_frames > 0:
1514
+ # Index of the extra canvas that holds the leftover frames (the one right after the last full canvas).
1515
+ canvas_idx = num_canvases
1516
+ # Add the remaining frames to the final canvas.
1517
+ combined_image = Image.new("RGB", (vit_input_size * leftover_frames, vit_input_size * 1), color=(0, 0, 0))
1518
+
1519
+ for idx in range(leftover_frames):
1520
+ img = frames[num_canvases * max_num_grids + idx]
1521
+
1522
+ # Resize the frame to a square (equal width and height).
1523
+ img_resized = img.resize((vit_input_size, vit_input_size))
1524
+
1525
+ # Calculate the (row, column) position to place the frame within the grid layout.
1526
+ x_offset = (idx % leftover_frames) * vit_input_size
1527
+ y_offset = (idx // leftover_frames) * vit_input_size
1528
+
1529
+ # Calculate the position to place the frame within the grid layout.
1530
+ combined_image.paste(img_resized, (x_offset, y_offset))
1531
+
1532
+ # Add the current canvas to the list of combined images.
1533
+ image_list.append(combined_image)
1534
+ frame_cnt = leftover_frames
1535
+ image_time_stamps.append(f"{time_stamp:.2f}s~{time_stamp + frame_cnt * time_interval:.2f}s")
1536
+ time_stamp += frame_cnt * time_interval
1537
+
1538
+ return image_list, image_time_stamps
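As a worked example with made-up inputs, 21 frames tiled with `max_grid_shape=(9, 1)` yield two full canvases plus one leftover canvas:

```python
from PIL import Image

# 21 dummy frames, assumed 0.4 s apart.
frames = [Image.new("RGB", (320, 240)) for _ in range(21)]
images, stamps = combine_frames_into_images(
    frames, time_interval=0.4, max_grid_shape=(9, 1), vit_input_size=378
)
# len(images) -> 3: two 9-frame canvases (3402x378 px) and one 3-frame leftover canvas (1134x378 px)
# stamps     -> ["0.00s~3.60s", "3.60s~7.20s", "7.20s~8.40s"]
```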
1539
+
1540
+
1541
+ def extract_frame_indices(play_time, total_frames, fps, max_num_grids, max_image_cnt, default_interval=0.4):
1542
  """
1543
+ Extracts specific frame indices from a video based on duration, frame count, and sampling strategy.
1544
 
1545
+ The function determines which frames to extract given the video duration (`play_time`),
1546
+ total frame count, and frame rate. It samples frames at regular intervals (default: 0.4s),
1547
+ but if the number of frames exceeds the limit defined by `max_num_grids * max_image_cnt`,
1548
+ it performs uniform sampling to stay within that limit.
1549
 
1550
  Args:
1551
+ play_time (float): Total play time of the video in seconds.
1552
+ total_frames (int): Total number of frames in the video.
1553
+ fps (float): Frames per second of the video.
1554
+ max_num_grids (int): Maximum number of grids to display.
1555
+ max_image_cnt (int): Maximum number of images per grid.
1556
+ default_interval (float, optional): Interval in seconds between frame samples. Defaults to 0.4.
1557
 
1558
  Returns:
1559
+ Tuple:
1560
+ frame_indices (List[int]): A list of selected frame indices.
1561
+ time_interval (float): Time interval between selected frames (in seconds).
1562
  """
 
 
 
 
1563
 
1564
+ # Calculate how many frames to extract with the default interval
1565
+ default_frame_count = max(int(play_time / default_interval), 1)  # at least one frame, even for very short clips
 
 
 
1566
 
1567
+ # Maximum frames allowed based on max_num_grids and max_image_cnt
1568
+ max_frames_allowed = max_num_grids * max_image_cnt
 
 
 
 
1569
 
1570
+ # Determine whether we can use the default interval or need uniform sampling
1571
+ if default_frame_count <= max_frames_allowed:
1572
+ # Default interval is sufficient, extract frames every 0.4 seconds
1573
+ frame_interval = max(int(total_frames / default_frame_count), 1)
1574
+ else:
1575
+ # Use uniform sampling to fit within max_frames_allowed
1576
+ frame_interval = max(int(total_frames / max_frames_allowed), 1)
1577
+
1578
+ # Extract frame indices at the calculated interval
1579
+ selected_indices = list(range(0, total_frames, frame_interval))
1580
+
1581
+ time_interval = frame_interval / fps
1582
+
1583
+ # Ensure the number of selected indices does not exceed max_frames_allowed
1584
+ return selected_indices[:max_frames_allowed], time_interval
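A quick numeric check with illustrative inputs: a 2-minute clip at 30 fps exceeds the 0.4 s sampling budget of 9 × 12 = 108 frames, so uniform sampling kicks in:

```python
indices, interval = extract_frame_indices(
    play_time=120.0, total_frames=3600, fps=30.0, max_num_grids=9, max_image_cnt=12
)
# int(120 / 0.4) = 300 candidate frames > 108 allowed, so frame_interval = 3600 // 108 = 33.
# len(indices) -> 108 (capped), indices[:3] -> [0, 33, 66], interval -> 33 / 30 = 1.1 s
```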
preprocessor_config.json CHANGED
@@ -1,9 +1,10 @@
1
  {
2
- "anyres": true,
3
  "auto_map": {
4
- "AutoImageProcessor": "image_processing_hyperclovax.HCXImageProcessor",
5
- "AutoProcessor": "processing_hyperclovax.HCXProcessor"
6
  },
 
7
  "crop_size": {
8
  "height": 378,
9
  "width": 378
@@ -13,22 +14,23 @@
13
  "do_normalize": true,
14
  "do_rescale": true,
15
  "do_resize": true,
 
 
 
 
 
 
16
  "image_mean": [
17
  0.5,
18
  0.5,
19
  0.5
20
  ],
21
- "image_processor_class": "AutoImageProcessor",
22
- "image_processor_type": "HCXImageProcessor",
23
  "image_std": [
24
  0.5,
25
  0.5,
26
  0.5
27
  ],
28
- "num_queries_vis_abstractor_image": 81,
29
- "num_queries_vis_abstractor_video_slow": 81,
30
- "num_queries_vis_abstractor_video_fast": 9,
31
- "first_last_frames_slow_video": false,
32
  "pad_to_square": true,
33
  "patch_size": 14,
34
  "possible_resolutions": [
@@ -125,7 +127,6 @@
125
  378
126
  ]
127
  ],
128
- "processor_class": "HCXProcessor",
129
  "resample": 2,
130
  "rescale_factor": 0.00392156862745098,
131
  "size": {
 
1
  {
2
+ "processor_class": "HCXVisionProcessor",
3
  "auto_map": {
4
+ "AutoProcessor": "preprocessor.HCXVisionProcessor",
5
+ "AutoImageProcessor": "preprocessor.HCXVisionProcessor"
6
  },
7
+ "anyres": true,
8
  "crop_size": {
9
  "height": 378,
10
  "width": 378
 
14
  "do_normalize": true,
15
  "do_rescale": true,
16
  "do_resize": true,
17
+ "max_num_grids": 9,
18
+ "max_image_cnt": 12,
19
+ "num_queries_vis_abstractor": 81,
20
+ "num_queries_vis_abstractor_video_fast": 9,
21
+ "num_queries_vis_abstractor_video_slow": 81,
22
+ "first_last_frames_slow": false,
23
  "image_mean": [
24
  0.5,
25
  0.5,
26
  0.5
27
  ],
28
+ "image_processor_type": "HCXVisionProcessor",
 
29
  "image_std": [
30
  0.5,
31
  0.5,
32
  0.5
33
  ],
 
 
 
 
34
  "pad_to_square": true,
35
  "patch_size": 14,
36
  "possible_resolutions": [
 
127
  378
128
  ]
129
  ],
 
130
  "resample": 2,
131
  "rescale_factor": 0.00392156862745098,
132
  "size": {
processing_hyperclovax.py DELETED
@@ -1,912 +0,0 @@
1
- import copy
2
- import os
3
- import re
4
- import uuid
5
- from typing import Dict, List, Optional, Union
6
-
7
- import numpy as np
8
- import PIL
9
- from PIL import Image
10
- import torch
11
- from transformers.feature_extraction_utils import BatchFeature
12
- from transformers.image_utils import ImageInput, load_image
13
- from transformers.processing_utils import (
14
- AllKwargsForChatTemplate,
15
- ChatTemplateLoadKwargs,
16
- ProcessingKwargs,
17
- ProcessorMixin,
18
- Unpack,
19
- )
20
- from transformers.tokenization_utils_base import AudioInput, TextInput
21
- from transformers.utils import (
22
- is_torch_device,
23
- is_torch_dtype,
24
- logging,
25
- requires_backends,
26
- )
27
- from transformers.utils.chat_template_utils import render_jinja_template
28
- from transformers.video_utils import VideoInput, VideoMetadata, load_video
29
-
30
- logger = logging.get_logger(__name__)
31
-
32
-
33
- class HCXBatchFeature(BatchFeature):
34
- def to(self, *args, **kwargs) -> "BatchFeature":
35
- """
36
- Send all values to device by calling `v.to(*args, **kwargs)` (PyTorch only). This should support casting in
37
- different `dtypes` and sending the `BatchFeature` to a different `device`.
38
-
39
- Args:
40
- args (`Tuple`):
41
- Will be passed to the `to(...)` function of the tensors.
42
- kwargs (`Dict`, *optional*):
43
- Will be passed to the `to(...)` function of the tensors.
44
- To enable asynchronous data transfer, set the `non_blocking` flag in `kwargs` (defaults to `False`).
45
-
46
- Returns:
47
- [`BatchFeature`]: The same instance after modification.
48
- """
49
- requires_backends(self, ["torch"])
50
- import torch # noqa
51
-
52
- new_data = {}
53
- device = kwargs.get("device")
54
- non_blocking = kwargs.get("non_blocking", False)
55
- # Check if the args are a device or a dtype
56
- if device is None and len(args) > 0:
57
- # device should be always the first argument
58
- arg = args[0]
59
- if is_torch_dtype(arg):
60
- # The first argument is a dtype
61
- pass
62
- elif isinstance(arg, str) or is_torch_device(arg) or isinstance(arg, int):
63
- device = arg
64
- else:
65
- # it's something else
66
- raise ValueError(f"Attempting to cast a BatchFeature to type {str(arg)}. This is not supported.")
67
- # We cast only floating point tensors to avoid issues with tokenizers casting `LongTensor` to `FloatTensor`
68
- for k, v in self.items():
69
- # check if v is a floating point
70
- if isinstance(v, torch.Tensor) and torch.is_floating_point(v):
71
- # cast and send to device
72
- new_data[k] = v.to(*args, **kwargs)
73
- elif isinstance(v, torch.Tensor) and device is not None:
74
- new_data[k] = v.to(device=device, non_blocking=non_blocking)
75
- elif "pixel_values" in k:
76
- new_pixel_values_batch = []
77
- for _v in v:
78
- pixel_values = [pixel_value.to(device=device, non_blocking=non_blocking) for pixel_value in _v]
79
- new_pixel_values_batch.append(pixel_values)
80
- new_data[k] = new_pixel_values_batch
81
- else:
82
- new_data[k] = v
83
- self.data = new_data
84
- return self
85
-
86
-
87
- class HCXProcessorKwargs(ProcessingKwargs, total=False):
88
- _defaults = {
89
- "text_kwargs": {
90
- "return_tensors": "pt",
91
- "calc_non_vision_query_lengths": False,
92
- },
93
- "images_kwargs": {},
94
- "audio_kwargs": {},
95
- "videos_kwargs": {
96
- "max_image_cnt": 12,
97
- "max_num_grids": 9,
98
- },
99
- }
100
-
101
-
102
- class HCXProcessor(ProcessorMixin):
103
- attributes = ["image_processor", "tokenizer"]
104
- valid_kwargs = ["chat_template"]
105
-
106
- image_processor_class = "AutoImageProcessor"
107
- tokenizer_class = ("GPT2Tokenizer", "GPT2TokenizerFast")
108
-
109
- def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
110
- self.image_token = "<|dummy3|>"
111
- self.video_token = "<|_unuse_missing_100270|>"
112
- self.image_token_pattern = re.compile(r"<\|dummy3\|>")
113
- self.video_token_pattern = re.compile(r"<\|_unuse_missing_100270\|>")
114
- self.image_video_token_pattern = re.compile(r"<\|dummy3\|>|<\|_unuse_missing_100270\|>")
115
- self.image_token_id = (
116
- tokenizer.image_token_id
117
- if getattr(tokenizer, "image_token_id", None)
118
- else tokenizer.convert_tokens_to_ids(self.image_token)
119
- )
120
- self.video_token_id = (
121
- tokenizer.video_token_id
122
- if getattr(tokenizer, "video_token_id", None)
123
- else tokenizer.convert_tokens_to_ids(self.video_token)
124
- )
125
- super().__init__(image_processor, tokenizer, chat_template=chat_template)
126
-
127
- def apply_chat_template(
128
- self,
129
- conversation: Union[list[dict[str, str]], list[list[dict[str, str]]]],
130
- chat_template: Optional[str] = None,
131
- **kwargs: Unpack[AllKwargsForChatTemplate],
132
- ) -> str:
133
- """
134
- Similar to the `apply_chat_template` method on tokenizers, this method applies a Jinja template to input
135
- conversations to turn them into a single tokenizable string.
136
-
137
- The input is expected to be in the following format, where each message content is a list consisting of text and
138
- optionally image or video inputs. One can also provide an image, video, URL or local path which will be used to form
139
- `pixel_values` when `return_dict=True`. If not provided, one will get only the formatted text, optionally tokenized text.
140
-
141
- conversation = [
142
- {
143
- "role": "user",
144
- "content": [
145
- {"type": "image", "image": "https://www.ilankelman.org/stopsigns/australia.jpg"},
146
- {"type": "text", "text": "Please describe this image in detail."},
147
- ],
148
- },
149
- ]
150
-
151
- Args:
152
- conversation (`Union[List[Dict, [str, str]], List[List[Dict[str, str]]]]`):
153
- The conversation to format.
154
- chat_template (`Optional[str]`, *optional*):
155
- The Jinja template to use for formatting the conversation. If not provided, the tokenizer's
156
- chat template is used.
157
- """
158
-
159
- if chat_template is None:
160
- if isinstance(self.chat_template, dict) and "default" in self.chat_template:
161
- chat_template = self.chat_template["default"]
162
- elif isinstance(self.chat_template, dict):
163
- raise ValueError(
164
- 'The processor has multiple chat templates but none of them are named "default". You need to specify'
165
- " which one to use by passing the `chat_template` argument. Available templates are: "
166
- f"{', '.join(self.chat_template.keys())}"
167
- )
168
- elif self.chat_template is not None:
169
- chat_template = self.chat_template
170
- else:
171
- raise ValueError(
172
- "Cannot use apply_chat_template because this processor does not have a chat template."
173
- )
174
- else:
175
- if isinstance(self.chat_template, dict) and chat_template in self.chat_template:
176
- # It's the name of a template, not a full template string
177
- chat_template = self.chat_template[chat_template]
178
- else:
179
- # It's a template string, render it directly
180
- chat_template = chat_template
181
-
182
- if kwargs.get("continue_final_message", False):
183
- if kwargs.get("add_generation_prompt", False):
184
- raise ValueError(
185
- "continue_final_message and add_generation_prompt are not compatible. Use continue_final_message when you want the model to continue the final message, and add_generation_prompt when you want to add a header that will prompt it to start a new assistant message instead."
186
- )
187
- if kwargs.get("return_assistant_tokens_mask", False):
188
- raise ValueError("continue_final_message is not compatible with return_assistant_tokens_mask.")
189
-
190
- # Fill sets of kwargs that should be used by different parts of template
191
- processed_kwargs = {
192
- "mm_load_kwargs": {},
193
- "template_kwargs": {},
194
- }
195
-
196
- for kwarg_type in processed_kwargs:
197
- for key in AllKwargsForChatTemplate.__annotations__[kwarg_type].__annotations__.keys():
198
- kwarg_type_defaults = AllKwargsForChatTemplate.__annotations__[kwarg_type]
199
- default_value = getattr(kwarg_type_defaults, key, None)
200
- value = kwargs.pop(key, default_value)
201
- if value is not None and not isinstance(value, dict):
202
- processed_kwargs[kwarg_type][key] = value
203
-
204
- # Pass unprocessed custom kwargs
205
- processed_kwargs["template_kwargs"].update(kwargs)
206
-
207
- if isinstance(conversation, (list, tuple)) and (
208
- isinstance(conversation[0], (list, tuple)) or hasattr(conversation[0], "content")
209
- ):
210
- is_batched = True
211
- conversations = conversation
212
- else:
213
- is_batched = False
214
- conversations = [conversation]
215
-
216
- tokenize = processed_kwargs["template_kwargs"].pop("tokenize", False)
217
- return_dict = processed_kwargs["template_kwargs"].pop("return_dict", False)
218
- mm_load_kwargs = processed_kwargs["mm_load_kwargs"]
219
-
220
- if tokenize:
221
- batch_images, batch_videos = [], []
222
- batch_audios = []
223
- batch_video_metadata = []
224
- for conversation in conversations:
225
- images, videos = [], []
226
- video_metadata = []
227
- for message in conversation:
228
- visuals = [content for content in message["content"] if content["type"] in ["image", "video"]]
229
- audio_fnames = [
230
- content[key]
231
- for content in message["content"]
232
- for key in ["audio", "url", "path"]
233
- if key in content and content["type"] == "audio"
234
- ]
235
- image_fnames = [
236
- vision_info[key]
237
- for vision_info in visuals
238
- for key in ["image", "url", "path", "base64"]
239
- if key in vision_info and vision_info["type"] == "image"
240
- ]
241
- video_fnames = [
242
- vision_info[key]
243
- for vision_info in visuals
244
- for key in ["video", "url", "path"]
245
- if key in vision_info and vision_info["type"] == "video"
246
- ]
247
-
248
- for fname in image_fnames:
249
- images.append(load_image(fname))
250
-
251
- # Audio models do not accept nested list of audios (yet!) so we construct a flat input audio list
252
- if not mm_load_kwargs["load_audio_from_video"]:
253
- for fname in audio_fnames:
254
- batch_audios.append(load_audio(fname, sampling_rate=mm_load_kwargs["sampling_rate"]))
255
- else:
256
- for fname in video_fnames:
257
- batch_audios.append(load_audio(fname, sampling_rate=mm_load_kwargs["sampling_rate"]))
258
-
259
- for fname in video_fnames:
260
- if isinstance(fname, (list, tuple)) and isinstance(fname[0], str):
261
- video = [np.array(load_image(image_fname)) for image_fname in fname]
262
- # create a 4D video because `load_video` always returns a 4D array
263
- video = np.stack(video)
264
- metadata = None
265
- logger.warning(
266
- "When loading the video from list of images, we cannot infer metadata such as `fps` or `duration`. "
267
- "If your model uses this metadata during processing, please load the whole video and let the model sample frames instead."
268
- )
269
- else:
270
- # TODO: raushan, should be `self.video_processor.load_video_for_model` when API is added
271
- video, metadata = self._load_video_for_model(
272
- fname,
273
- num_frames=mm_load_kwargs.get("num_frames", None),
274
- fps=mm_load_kwargs.get("video_fps", None),
275
- backend=mm_load_kwargs["video_load_backend"],
276
- **kwargs,
277
- )
278
- videos.append(video)
279
- video_metadata.append(metadata)
280
-
281
- # Currently all processors can accept nested list of batches, but not flat list of visuals
282
- # So we'll make a batched list of images and let the processor handle it
283
- if images:
284
- batch_images.append(images)
285
- if videos:
286
- batch_videos.append(videos)
287
- batch_video_metadata.append(video_metadata)
288
-
289
- # Process conversation with video/image information if needed. Then convert into a prompt using Jinja template
290
- conversations = self._process_messages_for_chat_template(
291
- conversations,
292
- batch_images=batch_images,
293
- batch_videos=batch_videos,
294
- batch_video_metadata=batch_video_metadata,
295
- **processed_kwargs["mm_load_kwargs"],
296
- )
297
-
298
- prompt, generation_indices = render_jinja_template(
299
- conversations=conversations,
300
- chat_template=chat_template,
301
- **processed_kwargs["template_kwargs"], # different flags such as `return_assistant_mask`
302
- **self.tokenizer.special_tokens_map, # tokenizer special tokens are used by some templates
303
- )
304
-
305
- if not is_batched:
306
- prompt = prompt[0]
307
-
308
- if tokenize:
309
- # Tokenizer's `apply_chat_template` never adds special tokens when tokenizing
310
- # But processor's `apply_chat_template` didn't have an option to tokenize, so users had to format the prompt
311
- # and pass it to the processor. Users thus never worried about special tokens relying on processor handling
312
- # everything internally. The below line is to keep BC for that and be able to work with model that have
313
- # special tokens in the template (consistent with tokenizers). We dont want to raise warning, it will flood command line
314
- # without actionable solution for users
315
- single_prompt = prompt[0] if is_batched else prompt
316
- if self.tokenizer.bos_token is not None and single_prompt.startswith(self.tokenizer.bos_token):
317
- kwargs["add_special_tokens"] = False
318
-
319
- out = self(
320
- text=prompt,
321
- images=batch_images if batch_images else None,
322
- videos=batch_videos if batch_videos else None,
323
- audio=batch_audios if batch_audios else None,
324
- **kwargs,
325
- )
326
- if return_dict:
327
- if processed_kwargs["template_kwargs"].get("return_assistant_tokens_mask", False):
328
- assistant_masks = []
329
- input_ids = out["input_ids"]
330
- for i in range(len(input_ids)):
331
- current_mask = [0] * len(input_ids[i])
332
- for assistant_start_char, assistant_end_char in generation_indices[i]:
333
- start_token = out.char_to_token(i, assistant_start_char)
334
- end_token = out.char_to_token(i, assistant_end_char - 1)
335
- if start_token is None:
336
- # start_token is out of bounds maybe due to truncation.
337
- break
338
- for token_id in range(start_token, end_token + 1 if end_token else len(input_ids[i])):
339
- current_mask[token_id] = 1
340
- assistant_masks.append(current_mask)
341
- out["assistant_masks"] = assistant_masks
342
- out.convert_to_tensors(tensor_type=kwargs.get("return_tensors", None))
343
-
344
- # vllm needs vision_query_lengths, but hf model doesn't need it
345
- del out["vision_query_lengths_images"]
346
- del out["vision_query_lengths_videos"]
347
- return out
348
- else:
349
- return out["input_ids"]
350
-
351
- def repeat_dummy_tokens(self, input_ids, target_token_id, vision_query_lengths):
352
- input_ids = input_ids.clone().detach()
353
- batch_indices, target_indices = torch.where(input_ids == target_token_id)
354
- batch_size = input_ids.shape[0]
355
-
356
- new_input_ids = [[] for _ in range(batch_size)]
357
- start_indices = [0 for _ in range(batch_size)]
358
- counter = [0 for _ in range(batch_size)]
359
- for batch_idx, target_idx in zip(batch_indices, target_indices):
360
- start_idx = start_indices[batch_idx]
361
- new_input_ids[batch_idx].append(input_ids[batch_idx][start_idx:target_idx])
362
- query_length = vision_query_lengths[batch_idx][counter[batch_idx]]
363
- new_input_ids[batch_idx].append(input_ids[batch_idx][target_idx].repeat(query_length))
364
- start_indices[batch_idx] = target_idx + 1
365
- counter[batch_idx] += 1
366
-
367
- for batch_idx in range(batch_size):
368
- start_idx = start_indices[batch_idx]
369
- new_input_ids[batch_idx].append(input_ids[batch_idx][start_idx:]) # append remaining tokens
370
- new_input_ids[batch_idx] = torch.cat(new_input_ids[batch_idx], dim=0)
371
-
372
- new_input_ids = torch.stack(new_input_ids)
373
- return new_input_ids
374
-
375
- def _load_video_for_model(
376
- self,
377
- video: str,
378
- num_frames: Optional[int] = None,
379
- fps: Optional[int] = None,
380
- backend: str = "opencv",
381
- **kwargs: Unpack[HCXProcessorKwargs],
382
- ) -> List[ImageInput]:
383
- """
384
- Overrided function.
385
-
386
- Loads `video` to a List[PIL.Image] (llava style)
387
-
388
- Args:
389
- video (`str`):
390
- The video to convert to the numpy array format. Can be a link to video or local path.
391
- num_frames (`int`, *optional*):
392
- Number of frames to sample uniformly. If not passed, the whole video is loaded.
393
- fps (`int`, *optional*):
394
- Number of frames to sample per second. Should be passed only when `num_frames=None`.
395
- If not specified and `num_frames==None`, all frames are sampled.
396
- backend (`str`, *optional*, defaults to `"opencv"`):
397
- The backend to use when loading the video. Can be any of ["decord", "pyav", "opencv", "torchvision"]. Defaults to "opencv".
398
-
399
- Returns:
400
- Tuple[`np.array`, Dict]: A tuple containing:
401
- - List[PIL.Image] of frames in RGB.
402
- - Metadata dictionary.
403
- """
404
- output_kwargs = self._merge_kwargs(
405
- HCXProcessorKwargs,
406
- tokenizer_init_kwargs=self.tokenizer.init_kwargs,
407
- **kwargs,
408
- )
409
-
410
- logger.warning_once(f"num_frames control via argument is not supported yet. Ignored num_frames: {num_frames}.")
411
- logger.warning_once(f"fps control via argument is not supported yet. Ignored fps: {fps}.")
412
- logger.warning_once(f"backend control via argument is not supported yet. Ignored backend: {backend}.")
413
-
414
- # video_loaded, video_metadata = load_video(
415
- # video, backend="decord", num_frames=32
416
- # )
417
- # frame_interval = int(video_metadata.total_num_frames / 32)
418
- # time_interval = frame_interval / video_metadata.fps
419
- # video_metadata.time_interval = time_interval
420
-
421
- def _hcx_sample_indices_fn(metadata: VideoMetadata, num_frames=None, fps=None, **kwargs):
422
- max_num_grids = output_kwargs["videos_kwargs"]["max_num_grids"]
423
- max_image_cnt = output_kwargs["videos_kwargs"]["max_image_cnt"]
424
- frame_indices, time_interval = extract_frame_indices(
425
- metadata.duration,
426
- metadata.total_num_frames,
427
- metadata.fps,
428
- max_num_grids,
429
- max_image_cnt,
430
- default_interval=0.4,
431
- )
432
- metadata.time_interval = time_interval
433
- return np.array(frame_indices)
434
-
435
- video_loaded, video_metadata = None, None
436
- for backend in ["decord", "pyav", "opencv", "torchvision"]:
437
- try:
438
- video_loaded, video_metadata = load_video(
439
- video, sample_indices_fn=_hcx_sample_indices_fn, backend=backend
440
- )
441
- break
442
- except Exception as e:
443
- logger.error(f"Error loading video with {backend} backend: {e}")
444
- continue
445
-
446
- assert video_loaded is not None, "Failed to load video with any backend"
447
-
448
- return video_loaded, video_metadata
449
-
450
- def _process_messages_for_chat_template(
451
- self,
452
- conversation: List[List[Dict[str, str]]],
453
- batch_images: List[List[ImageInput]],
454
- batch_videos: List[List[VideoInput]],
455
- batch_video_metadata: List[List[Dict[str, any]]],
456
- **mm_load_kwargs: Unpack[ChatTemplateLoadKwargs],
457
- ):
458
- """
459
- Overrided function.
460
- Used within `apply_chat_template` when a model has a special way to process conversation history. For example,
461
- video models might want to specify in the prompt the duration of video or which frame indices at which timestamps
462
- were sampled. This information cannot be accessed before the video is loaded.
463
-
464
- For most models it is a no-op, and must be overridden by model processors which require special processing.
465
-
466
- Args:
467
- conversation (`List[Dict, str, str]`):
468
- The conversation to process. Always comes in batched format.
469
- batch_images (`List[List[ImageInput]]`):
470
- Batch of images that were loaded from url/path defined in the conversation. The images
471
- are ordered in the same way as in the conversation. Comes in nested list format, one list of `PIL` images
472
- per batch.
473
- batch_videos (`List[List[ImageInput]]`):
474
- Batch of videos that were loaded from url/path defined in the conversation. The videos
475
- are ordered in the same way as in the conversation. Comes in nested list format, one list of `PIL.Image`
476
- per batch.
477
- batch_video_metadata (`List[List[Dict[[str, any]]]]`):
478
- Batch of metadata returned from loading videos. That includes video fps, duration and total number of framer in original video.
479
- Metadata are ordered in the same way as `batch_videos`. Comes in nested list format, one list of `Dict`
480
- per batch.
481
- """
482
-
483
- is_video_in_conversation = False
484
- for batch_idx, messages in enumerate(conversation):
485
- is_video_in_messages = False
486
- is_image_in_messages = False
487
- for message in messages:
488
- for content in message["content"]:
489
- if content["type"] == "video":
490
- is_video_in_messages = True
491
- elif content["type"] == "image":
492
- is_image_in_messages = True
493
- if not is_video_in_messages:
494
- batch_videos.insert(batch_idx, [])
495
- batch_video_metadata.insert(batch_idx, [])
496
- if not is_image_in_messages:
497
- batch_images.insert(batch_idx, [])
498
-
499
- is_video_in_conversation = is_video_in_conversation or is_video_in_messages
500
-
501
- if not is_video_in_conversation:
502
- return conversation
503
-
504
- # conversation processing
505
- new_conversation = []
506
- for batch_idx, messages in enumerate(conversation):
507
- video_counter = 0
508
- new_messages = []
509
-
510
- for message in messages:
511
- new_message = {
512
- "role": message["role"],
513
- "content": [],
514
- }
515
- for content in message["content"]:
516
- if content["type"] == "video":
517
- video = batch_videos[batch_idx][video_counter]
518
- video_meta = batch_video_metadata[batch_idx][video_counter]
519
-
520
- time_stamps = calc_timestamp_video_grids(video, video_meta.time_interval, max_grid_shape=(3, 3))
521
- video_counter += 1
522
-
523
- if "filename" in content:
524
- filename = content["filename"]
525
- else:
526
- filename = content["video"].split("/")[-1]
527
- if len(filename) > 50:
528
- filename = f"{uuid.uuid4().hex}.mp4"
529
- basename, ext = os.path.splitext(filename)
530
- if ext == "":
531
- ext = ".mp4"
532
-
533
- for frame_idx, time_stamp in enumerate(time_stamps):
534
- if frame_idx == len(video) - 1:
535
- # final_grid
536
- new_content = {
537
- "filename": f"{basename}-{frame_idx}{ext}",
538
- "video": content["video"],
539
- "type": "video",
540
- "video_time_stamp": time_stamp,
541
- "lens_keywords": content["lens_keywords"],
542
- "lens_local_keywords": content["lens_local_keywords"],
543
- "speech_to_text": content["speech_to_text"],
544
- "is_final_grid": True,
545
- }
546
- new_message["content"].append(new_content)
547
- else:
548
- new_content = {
549
- "filename": f"{basename}-{frame_idx}{ext}",
550
- "video": content["video"],
551
- "type": "video",
552
- "video_time_stamp": time_stamp,
553
- }
554
- new_message["content"].append(new_content)
555
- else:
556
- new_message["content"].append(copy.deepcopy(content))
557
- new_messages.append(new_message)
558
- new_conversation.append(new_messages)
559
-
560
- return new_conversation
561
-
562
- def __call__(
563
- self,
564
- text: TextInput = None,
565
- images: List[List[ImageInput]] = None,
566
- videos: List[List[VideoInput]] = None,
567
- audio: AudioInput = None,
568
- **kwargs: Unpack[HCXProcessorKwargs],
569
- ):
570
- output_kwargs = self._merge_kwargs(
571
- HCXProcessorKwargs,
572
- tokenizer_init_kwargs=self.tokenizer.init_kwargs,
573
- **kwargs,
574
- )
575
-
576
- # prepare model inputs
577
- mm_inputs = {
578
- "pixel_values_images": [],
579
- "image_sizes_images": [],
580
- "vision_query_lengths_images": [],
581
- "pixel_values_videos": [],
582
- # "image_sizes_videos": [],
583
- "vision_query_lengths_videos": [],
584
- }
585
- calc_non_vision_query_lengths = output_kwargs["text_kwargs"].pop("calc_non_vision_query_lengths")
586
- if calc_non_vision_query_lengths:
587
- mm_inputs["non_vision_query_lengths"] = []
588
-
589
- # video processing
590
- if videos is not None:
591
- vit_input_size = self.image_processor.crop_size["width"]
592
-
593
- video_kwargs = copy.deepcopy(output_kwargs["videos_kwargs"])
594
-
595
- for videos_in_single_conversation in videos:
596
- pixel_values_videos = []
597
- vision_query_lengths_videos = []
598
-
599
- for video_frames in videos_in_single_conversation:
600
- if len(video_frames) == 0:
601
- mm_inputs["pixel_values_videos"].append([])
602
- mm_inputs["vision_query_lengths_videos"].append([])
603
- continue
604
- video_frames_combined = combine_frames_into_images(
605
- video_frames, max_grid_shape=(3, 3), vit_input_size=vit_input_size
606
- )
607
- video_kwargs["is_video"] = True
608
- video_kwargs["return_tensors"] = None
609
-
610
- frames_processed = self.image_processor(images=video_frames_combined, **video_kwargs)
611
- sizes = [(size["width"], size["height"]) for size in frames_processed["image_sizes"]]
612
-
613
- pixel_values_videos.extend(frames_processed["pixel_values"])
614
- vision_query_lengths_videos.extend(frames_processed["vision_query_lengths"])
615
-
616
- mm_inputs["pixel_values_videos"].append(pixel_values_videos)
617
- mm_inputs["vision_query_lengths_videos"].append(vision_query_lengths_videos)
618
-
619
- # image processing
620
- if images is not None:
621
- image_kwargs = copy.deepcopy(output_kwargs["images_kwargs"])
622
- image_kwargs["is_video"] = False
623
- image_kwargs["return_tensors"] = None
624
-
625
- for images_in_single_conversation in images:
626
- if isinstance(images_in_single_conversation, PIL.Image.Image): # single item to batch
627
- images_in_single_conversation = [images_in_single_conversation, ]
628
- if len(images_in_single_conversation) == 0:
629
- mm_inputs["pixel_values_images"].append([])
630
- mm_inputs["image_sizes_images"].append([])
631
- mm_inputs["vision_query_lengths_images"].append([])
632
- continue
633
- images_processed = self.image_processor(images=images_in_single_conversation, **image_kwargs)
634
- sizes = [(size["width"], size["height"]) for size in images_processed["image_sizes"]]
635
-
636
- mm_inputs["pixel_values_images"].append(images_processed["pixel_values"])
637
- mm_inputs["image_sizes_images"].append(sizes)
638
- mm_inputs["vision_query_lengths_images"].append(images_processed["vision_query_lengths"])
639
-
640
- # text processing
641
- def _create_replacer(_target_token, _replacements):
642
- _iterator = iter(_replacements)
643
-
644
- def _replacer(match_obj):
645
- # return self.image_token
646
- num_query_tokens = next(_iterator)
647
- return "".join([_target_token for _ in range(num_query_tokens)])
648
- return _replacer
649
-
650
- text_inputs = {}
651
- if text is not None:
652
- if not isinstance(text, list):
653
- text = [text]
654
-
655
- if images is not None:
656
- new_texts = []
657
- for batch_idx, text_in_single_conversation in enumerate(text):
658
- new_text = self.image_token_pattern.sub(
659
- _create_replacer(self.image_token, mm_inputs["vision_query_lengths_images"][batch_idx]),
660
- text_in_single_conversation,
661
- )
662
- new_texts.append(new_text)
663
- text = new_texts
664
-
665
- if videos is not None:
666
- new_texts = []
667
- for batch_idx, text_in_single_conversation in enumerate(text):
668
- new_text = self.video_token_pattern.sub(
669
- _create_replacer(self.video_token, mm_inputs["vision_query_lengths_videos"][batch_idx]),
670
- text_in_single_conversation,
671
- )
672
- new_texts.append(new_text)
673
- text = new_texts
674
-
675
- text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
676
-
677
- # audio processing
678
- if audio is not None:
679
- raise NotImplementedError("Audio processing is not supported yet.")
680
-
681
- return HCXBatchFeature(data={**text_inputs, **mm_inputs})
682
-
683
- def decode(self, *args, **kwargs):
684
- """
685
- This method forwards all its arguments to Siglip2Tokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to
686
- the docstring of this method for more information.
687
- """
688
- return self.tokenizer.decode(*args, **kwargs)
689
-
690
- def batch_decode(self, *args, **kwargs):
691
- """
692
- This method forwards all its arguments to Siglip2Tokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please
693
- refer to the docstring of this method for more information.
694
- """
695
- return self.tokenizer.batch_decode(*args, **kwargs)
696
-
697
- def post_process_image_text_to_text(
698
- self, generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False, **kwargs
699
- ):
700
- """
701
- Post-process the output of the model to decode the text.
702
-
703
- Args:
704
- generated_outputs (`torch.Tensor` or `np.ndarray`):
705
- The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
706
- or `(sequence_length,)`.
707
- skip_special_tokens (`bool`, *optional*, defaults to `True`):
708
- Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.
709
- Clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
710
- Whether or not to clean up the tokenization spaces. Argument passed to the tokenizer's `batch_decode` method.
711
- **kwargs:
712
- Additional arguments to be passed to the tokenizer's `batch_decode method`.
713
-
714
- Returns:
715
- `List[str]`: The decoded text.
716
- """
717
- return self.tokenizer.batch_decode(
718
- generated_outputs,
719
- skip_special_tokens=skip_special_tokens,
720
- clean_up_tokenization_spaces=clean_up_tokenization_spaces,
721
- **kwargs,
722
- )
723
-
724
- @property
725
- def model_input_names(self):
726
- tokenizer_input_names = self.tokenizer.model_input_names
727
- image_processor_input_names = self.image_processor.model_input_names
728
- names_from_processor = list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
729
- return names_from_processor + []
730
-
731
-
732
- def extract_frame_indices(play_time, total_frames, fps, max_num_grids, max_image_cnt, default_interval=0.4):
733
- """
734
- Extracts specific frame indices from a video based on duration, frame count, and sampling strategy.
735
-
736
- The function determines which frames to extract given the video duration (`play_time`),
737
- total frame count, and frame rate. It samples frames at regular intervals (default: 0.4s),
738
- but if the number of frames exceeds the limit defined by `max_num_grids * max_image_cnt`,
739
- it performs uniform sampling to stay within that limit.
740
-
741
- Args:
742
- play_time (float): Total play time of the video in seconds.
743
- total_frames (int): Total number of frames in the video.
744
- fps (float): Frames per second of the video.
745
- max_num_grids (int): Maximum number of grids to display.
746
- max_image_cnt (int): Maximum number of images per grid.
747
- default_interval (float, optional): Interval in seconds between frame samples. Defaults to 0.4.
748
-
749
- Returns:
750
- Tuple:
751
- frame_indices (List[int]): A list of selected frame indices.
752
- time_interval (float): Time interval between selected frames (in seconds).
753
- """
754
-
755
- # Calculate how many frames to extract with the default interval
756
- default_frame_count = int(play_time / default_interval)
757
-
758
- # Maximum frames allowed based on max_num_grids and max_image_cnt
759
- max_frames_allowed = max_num_grids * max_image_cnt
760
-
761
- # Determine whether we can use the default interval or need uniform sampling
762
- if default_frame_count <= max_frames_allowed:
763
-        # Default interval is sufficient, extract frames every 0.4 seconds
-        frame_interval = int(total_frames / default_frame_count)
-    else:
-        # Use uniform sampling to fit within max_frames_allowed
-        frame_interval = int(total_frames / max_frames_allowed)
-
-    # Extract frame indices at the calculated interval
-    selected_indices = list(range(0, total_frames, frame_interval))
-
-    time_interval = frame_interval / fps
-
-    # Ensure the number of selected indices does not exceed max_frames_allowed
-    return selected_indices[:max_frames_allowed], time_interval
-
-
-def calc_timestamp_video_grids(frames, time_interval, max_grid_shape=(3, 3)):
-    """
-    Calculates the time range labels for each grid canvas of a video.
-
-    Args:
-        frames (List[PIL.Image.Image]): A list of frames extracted from a video.
-        time_interval (float): Time interval (in seconds) between consecutive frames.
-        max_grid_shape (Tuple[int, int], optional): The maximum grid shape as (rows, cols). Defaults to (3, 3).
-
-    Returns:
-        image_time_stamps (List[str]): A list of time span labels, one per combined image,
-            e.g., ["0.00s~1.50s", "1.50s~3.00s", ...].
-    """
-    max_num_grids = max_grid_shape[0] * max_grid_shape[1]
-
-    # Calculate the number of canvases needed.
-    num_frames = len(frames)
-    num_canvases = num_frames // max_num_grids
-    leftover_frames = num_frames % max_num_grids
-
-    time_stamp = 0  # seconds
-    image_time_stamps = []
-
-    for canvas_idx in range(num_canvases):
-        # Determine the frames that fill the current canvas.
-        start_idx = canvas_idx * max_num_grids
-        end_idx = min(start_idx + max_num_grids, num_frames)
-
-        # Append the time span of the current canvas to the result list.
-        frame_cnt = end_idx - start_idx
-        image_time_stamps.append(f"{time_stamp:.2f}s~{time_stamp + frame_cnt * time_interval:.2f}s")
-        time_stamp += frame_cnt * time_interval
-
-    if leftover_frames > 0:
-        # Append the time span of the final, partially filled canvas.
-        frame_cnt = leftover_frames
-        image_time_stamps.append(f"{time_stamp:.2f}s~{time_stamp + frame_cnt * time_interval:.2f}s")
-        time_stamp += frame_cnt * time_interval
-
-    return image_time_stamps
-
-
-def combine_frames_into_images(frames, max_grid_shape=(3, 3), vit_input_size=378):
-    """
-    Combines a sequence of video frames into grid-based images.
-
-    Frames are grouped into grids (e.g., 3x3) such that each combined image contains up to
-    `max_grid_shape[0] * max_grid_shape[1]` frames, each resized to the given ViT input size.
-
-    Args:
-        frames (NDArray): Video frames with shape (num_frames, H, W, C).
-        max_grid_shape (Tuple[int, int], optional): The maximum grid shape as (rows, cols). Defaults to (3, 3).
-        vit_input_size (int, optional): The target size (height and width) for the Vision Transformer input. Defaults to 378.
-
-    Returns:
-        image_list (List[PIL.Image.Image]): A list of grid-combined images.
-    """
-    max_num_grids = max_grid_shape[0] * max_grid_shape[1]
-
-    # List to store the resulting combined images.
-    image_list = []
-
-    # Calculate the number of canvases needed.
-    num_frames = len(frames)
-    num_canvases = num_frames // max_num_grids
-    leftover_frames = num_frames % max_num_grids
-
-    # Convert frames (a 4D numpy array) into a List[PIL.Image.Image].
-    frames = [Image.fromarray(frame) for frame in frames]
-
-    for canvas_idx in range(num_canvases):
-        # Initialize the current canvas.
-        combined_image = Image.new(
-            "RGB", (vit_input_size * max_grid_shape[0], vit_input_size * max_grid_shape[1]), color=(0, 0, 0)
-        )
-
-        # Determine the frames that fill the current canvas.
-        start_idx = canvas_idx * max_num_grids
-        end_idx = min(start_idx + max_num_grids, num_frames)
-
-        for idx in range(start_idx, end_idx):
-            img = frames[idx]
-
-            # Resize each frame to a square shape.
-            img_resized = img.resize((vit_input_size, vit_input_size))
-
-            # Calculate the (x, y) pixel offset of this frame within the grid layout.
-            local_idx = idx - start_idx
-            x_offset = (local_idx % max_grid_shape[0]) * vit_input_size
-            y_offset = (local_idx // max_grid_shape[0]) * vit_input_size
-
-            # Paste the resized frame at the calculated position.
-            combined_image.paste(img_resized, (x_offset, y_offset))
-
-        # Append the current canvas to the result list.
-        image_list.append(combined_image)
-
-    if leftover_frames > 0:
-        # Fill one final canvas with the remaining frames.
-        combined_image = Image.new(
-            "RGB", (vit_input_size * max_grid_shape[0], vit_input_size * max_grid_shape[1]), color=(0, 0, 0)
-        )
-
-        for idx in range(leftover_frames):
-            img = frames[num_canvases * max_num_grids + idx]
-
-            # Resize the frame to a square (equal width and height).
-            img_resized = img.resize((vit_input_size, vit_input_size))
-
-            # Calculate the (x, y) pixel offset of this frame within the grid layout.
-            x_offset = (idx % max_grid_shape[0]) * vit_input_size
-            y_offset = (idx // max_grid_shape[0]) * vit_input_size
-
-            # Paste the resized frame at the calculated position.
-            combined_image.paste(img_resized, (x_offset, y_offset))
-
-        # Add the final canvas to the list of combined images.
-        image_list.append(combined_image)
-
-    return image_list
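
For reference, here is a minimal sketch of how these removed helpers fit together when preprocessing a video: frames decoded with `decord` are tiled into 3x3 canvases and paired with per-canvas time-range labels. The video path, the one-frame-per-second sampling, and the assumption that the two helpers are in scope are illustrative choices, not taken from this repository's code.

```python
from decord import VideoReader

# Assumes combine_frames_into_images and calc_timestamp_video_grids from the
# removed code above are available in the current scope.

vr = VideoReader("sample_video.mp4")  # illustrative path
fps = vr.get_avg_fps()
total_frames = len(vr)

# Sample roughly one frame per second for this sketch; the removed sampling code
# instead derives the interval from max_num_grids and max_image_cnt.
frame_interval = max(1, int(round(fps)))
selected_indices = list(range(0, total_frames, frame_interval))
time_interval = frame_interval / fps

frames = vr.get_batch(selected_indices).asnumpy()  # shape: (num_frames, H, W, C)

# Tile the frames into 3x3 canvases and label each canvas with its time span.
grid_images = combine_frames_into_images(frames, max_grid_shape=(3, 3), vit_input_size=378)
time_stamps = calc_timestamp_video_grids(frames, time_interval, max_grid_shape=(3, 3))

for img, span in zip(grid_images, time_stamps):
    print(img.size, span)  # e.g. (1134, 1134) 0.00s~9.00s
```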
 
processor_config.json DELETED
@@ -1,6 +0,0 @@
- {
-   "auto_map": {
-     "AutoProcessor": "processing_hyperclovax.HCXProcessor"
-   },
-   "processor_class": "HCXProcessor"
- }
 
special_tokens_map.json CHANGED
@@ -62,13 +62,7 @@
     "rstrip": false,
     "single_word": false
   },
-  "eos_token": {
-    "content": "<|endofturn|>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
+  "eos_token": "<|endofturn|>",
   "pad_token": {
     "content": "<|endoftext|>",
     "lstrip": false,
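
After this change, `eos_token` is stored as a plain string instead of a full special-token definition object. A minimal sketch to check how the token resolves once the tokenizer is loaded; the printed values in the comments are expectations, not captured output:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B")

# The simplified special_tokens_map.json should still resolve the end-of-turn token.
print(tokenizer.eos_token)                                   # expected: <|endofturn|>
print(tokenizer.convert_tokens_to_ids(tokenizer.eos_token))  # id depends on the vocabulary
```
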
tokenizer_config.json CHANGED
@@ -1,5 +1,4 @@
 {
-  "add_bos_token": false,
   "add_prefix_space": false,
   "added_tokens_decoder": {
     "100256": {
@@ -491,17 +490,18 @@
     "<KEY>",
     "<PASSWORD>"
   ],
-  "auto_map": {
-    "AutoProcessor": "processing_hyperclovax.HCXProcessor"
-  },
   "bos_token": "<|endoftext|>",
+  "chat_template": [
+    {
+      "name": "default",
+      "template": "<|im_start|>tool_list\n<|im_end|>\n{% for message in messages %}\n{% set content = message['content'] %}\n{% set role = message['role'] %}\n{% if loop.first and role != 'system' %}\n<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}\n{% if message['content'] is string %}\n<|im_start|>{{ role }}\n{{ message['content'] }}<|im_end|>\n{% else %}\n{% if content['type'] == 'image' %}\n<|im_start|>{{ role }} (mime)\n{\"type\": \"image/jpeg\", \"filename\": \"{{ content['filename'] }}\"}<|im_end|>\n<|im_start|>{{ role }} (vector)\n<|dummy3|><|im_end|>\n<|im_start|>image/aux\n다음 중 ocr은 사진에서 검출된 글자이고, lens_keyword는 사진에서 추출된 keyword와 bbox 위치입니다. bbox는 0~1 사이로 정규화된 [x1, y1, x2, y2]의 형태입니다. 참고하여 답변하세요. {\"ocr\": \"{{ content['ocr'] or '' }}\", \"lens_keywords\": \"{{ content['lens_keywords'] or '' }}\", \"lens_local_keywords\": \"{{ content['lens_local_keywords'] or '' }}\"}<|im_end|>\n{% elif content['type'] == 'video' %}\n<|im_start|>{{ role }} (mime)\n{\"type\": \"video/mp4\", \"filename\": \"{{ content['filename'] }}\"}<|im_end|>\n<|im_start|>{{ role }} (vector)\n<|dummy3|><|im_end|>\n<|im_start|>image/aux\n{% if content.get('is_final_grid') %}\n다음 중 lens_keyword는 사진에서 추출된 keyword와 bbox 위치입니다. bbox는 0~1 사이로 정규화된 [x1, y1, x2, y2]의 형태입니다. video_time_stamp는 비디오에서 해당 구간의 시간 정보입니다. speech_to_text는 비디오 속에서의 대화, 음성, 소리, 대사, 그리고 말을 전부 글로 받아 적은 것 입니다. 참고하여 답변하세요. {\"video_time_stamp\": \"{{ content['video_time_stamp'] }}\", \"lens_keywords\": \"{{ content.get('lens_keywords', '') }}\", \"lens_local_keywords\": \"{{ content.get('lens_local_keywords', '') }}\", \"speech_to_text\": \"{{ content.get('speech_to_text', '') }}\"}\n{% else %}\n다음 중 video_time_stamp는 비디오에서 해당 구간의 시간 정보입니다. 참고하여 답변하세요. {\"video_time_stamp\": \"{{ content['video_time_stamp'] }}\"}\n{% endif %}<|im_end|>\n{% elif content['type'] == 'text' %}\n<|im_start|>{{ role }}\n{{ content['text'] }}<|im_end|>\n{% endif %}\n{% endif %}\n{% endfor %}\n{% if add_generation_prompt %}\n<|im_start|>assistant\n{% endif %}\n"
+    }
+  ],
   "clean_up_tokenization_spaces": true,
   "eos_token": "<|endofturn|>",
-  "errors": "replace",
   "extra_special_tokens": {},
   "model_max_length": 1000000000000000019884624838656,
   "pad_token": "<|endoftext|>",
-  "processor_class": "HCXProcessor",
   "tokenizer_class": "GPT2Tokenizer",
   "unk_token": "<|endoftext|>"
 }
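
With the default chat template now embedded in `tokenizer_config.json`, ChatML-style prompts can be rendered from the tokenizer alone. A minimal text-only sketch with illustrative messages; image and video turns still require the processor's multimodal handling:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B")

chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# Renders the tool_list header, the system turn, the user turn, and the trailing
# assistant header added by add_generation_prompt.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(prompt)
```
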
vocab.json DELETED
The diff for this file is too large to render.