nielsr (HF Staff) committed
Commit 5869b18 · verified · 1 Parent(s): d51fe15

Improve model card: add paper info, update license metadata, clean up content


This PR enhances the model card for InternVL2_5-78B-MPO by:

- Updating the `license` metadata tag from `other` to `mit` to reflect the project's primary license as declared in the README content. The specific Qwen license for the language-model component is still documented via `license_name` and `license_link` (a minimal sketch of applying such a metadata update follows this list).
- Removing the `

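For reference, here is a minimal sketch of how a metadata update like this can be applied with the `huggingface_hub` library. The repo id and the `create_pr` flow are assumptions; the values simply mirror the frontmatter shown in the diff below.

```python
# Sketch only: apply the license metadata change via huggingface_hub
# (assumes write access to the repo, or create_pr=True to propose it as a PR instead).
from huggingface_hub import metadata_update

metadata_update(
    repo_id="OpenGVLab/InternVL2_5-78B-MPO",
    metadata={
        "license": "mit",
        "license_name": "qwen",
        "license_link": "https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE",
    },
    overwrite=True,  # the `license` key already exists (as `other`), so overwrite is required
    create_pr=True,  # open a pull request rather than committing directly
)
```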
Files changed (1)
  1. README.md +1052 -32
README.md CHANGED
@@ -1,23 +1,31 @@
1
  ---
2
- license: other
3
- license_name: qwen
4
- license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
5
- pipeline_tag: image-text-to-text
6
- library_name: transformers
7
  base_model:
8
- - OpenGVLab/InternVL2_5-78B
9
- base_model_relation: finetune
10
  datasets:
11
- - OpenGVLab/MMPR-v1.1
12
  language:
13
- - multilingual
14
  tags:
15
- - internvl
16
- - custom_code
 
17
  ---
18
 
19
  # InternVL2_5-78B-MPO
20
21
  [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442)
22
 
23
  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
@@ -238,6 +246,7 @@ device_map = split_model('InternVL2_5-78B')
238
  model = AutoModel.from_pretrained(
239
  path,
240
  torch_dtype=torch.bfloat16,
 
241
  low_cpu_mem_usage=True,
242
  use_flash_attn=True,
243
  trust_remote_code=True,
@@ -378,40 +387,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
378
  # pure-text conversation (纯文本对话)
379
  question = 'Hello, who are you?'
380
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
381
- print(f'User: {question}\nAssistant: {response}')
 
382
 
383
  question = 'Can you tell me a story?'
384
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
385
- print(f'User: {question}\nAssistant: {response}')
 
386
 
387
  # single-image single-round conversation (单图单轮对话)
388
- question = '<image>\nPlease describe the image shortly.'
 
389
  response = model.chat(tokenizer, pixel_values, question, generation_config)
390
- print(f'User: {question}\nAssistant: {response}')
 
391
 
392
  # single-image multi-round conversation (单图多轮对话)
393
- question = '<image>\nPlease describe the image in detail.'
 
394
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
395
- print(f'User: {question}\nAssistant: {response}')
 
396
 
397
  question = 'Please write a poem according to the image.'
398
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
399
- print(f'User: {question}\nAssistant: {response}')
 
400
 
401
  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
402
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
403
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
404
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
405
 
406
- question = '<image>\nDescribe the two images in detail.'
 
407
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
408
  history=None, return_history=True)
409
- print(f'User: {question}\nAssistant: {response}')
 
410
 
411
  question = 'What are the similarities and differences between these two images.'
412
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
413
  history=history, return_history=True)
414
- print(f'User: {question}\nAssistant: {response}')
 
415
 
416
  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
417
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -419,17 +438,21 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
419
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
420
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
421
 
422
- question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
 
 
423
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
424
  num_patches_list=num_patches_list,
425
  history=None, return_history=True)
426
- print(f'User: {question}\nAssistant: {response}')
 
427
 
428
  question = 'What are the similarities and differences between these two images.'
429
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
430
  num_patches_list=num_patches_list,
431
  history=history, return_history=True)
432
- print(f'User: {question}\nAssistant: {response}')
 
433
 
434
  # batch inference, single image per sample (单图批处理)
435
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -437,13 +460,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
437
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
438
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
439
 
440
- questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
 
441
  responses = model.batch_chat(tokenizer, pixel_values,
442
  num_patches_list=num_patches_list,
443
  questions=questions,
444
  generation_config=generation_config)
445
  for question, response in zip(questions, responses):
446
- print(f'User: {question}\nAssistant: {response}')
 
447
 
448
  # video multi-round conversation (视频多轮对话)
449
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -481,17 +506,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
481
  video_path = './examples/red-panda.mp4'
482
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
483
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
484
- video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
 
485
  question = video_prefix + 'What is the red panda doing?'
486
- # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
 
 
 
 
487
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
488
  num_patches_list=num_patches_list, history=None, return_history=True)
489
- print(f'User: {question}\nAssistant: {response}')
 
490
 
491
  question = 'Describe this video in detail.'
492
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
493
  num_patches_list=num_patches_list, history=history, return_history=True)
494
- print(f'User: {question}\nAssistant: {response}')
 
495
  ```
496
 
497
  #### Streaming Output
@@ -573,7 +605,9 @@ image_urls=[
573
 
574
  images = [load_image(img_url) for img_url in image_urls]
575
  # Numbering images improves multi-image conversations
576
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
 
 
577
  print(response.text)
578
  ```
579
 
@@ -682,9 +716,699 @@ If you find this project useful in your research, please consider citing:
682
  @article{chen2024far,
683
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
684
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
685
- journal={arXiv preprint arXiv:2404.16821},
686
  year={2024}
687
  }
688
  @inproceedings{chen2024internvl,
689
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
690
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
@@ -693,3 +1417,299 @@ If you find this project useful in your research, please consider citing:
693
  year={2024}
694
  }
695
  ```
1
  ---
2
  base_model:
3
+ - OpenGVLab/InternVL2_5-78B
 
4
  datasets:
5
+ - OpenGVLab/MMPR-v1.1
6
  language:
7
+ - multilingual
8
+ library_name: transformers
9
+ license: mit
10
+ license_name: qwen
11
+ license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
12
+ pipeline_tag: image-text-to-text
13
  tags:
14
+ - internvl
15
+ - custom_code
16
+ base_model_relation: finetune
17
  ---
18
 
19
  # InternVL2_5-78B-MPO
20
 
21
+ ## Paper
22
+ The model was presented in the paper [Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling](https://huggingface.co/papers/2412.05271).
23
+
24
+ ### Abstract
25
+ We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image/video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. A Hugging Face demo is available at https://huggingface.co/spaces/OpenGVLab/InternVL.
26
+
27
+ ---
28
+
29
  [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442)
30
 
31
  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
 
246
  model = AutoModel.from_pretrained(
247
  path,
248
  torch_dtype=torch.bfloat16,
249
+ load_in_8bit=False,
250
  low_cpu_mem_usage=True,
251
  use_flash_attn=True,
252
  trust_remote_code=True,
 
387
  # pure-text conversation (纯文本对话)
388
  question = 'Hello, who are you?'
389
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
390
+ print(f'User: {question}\nAssistant: {response}')
392
 
393
  question = 'Can you tell me a story?'
394
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
395
+ print(f'User: {question}\nAssistant: {response}')
397
 
398
  # single-image single-round conversation (单图单轮对话)
399
+ question = '<image>\nPlease describe the image shortly.'
401
  response = model.chat(tokenizer, pixel_values, question, generation_config)
402
+ print(f'User: {question}\nAssistant: {response}')
404
 
405
  # single-image multi-round conversation (单图多轮对话)
406
+ question = '<image>\nPlease describe the image in detail.'
408
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
409
+ print(f'User: {question}\nAssistant: {response}')
411
 
412
  question = 'Please write a poem according to the image.'
413
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
414
+ print(f'User: {question}\nAssistant: {response}')
416
 
417
  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
418
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
419
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
420
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
421
 
422
+ question = '<image>\nDescribe the two images in detail.'
424
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
425
  history=None, return_history=True)
426
+ print(f'User: {question}\nAssistant: {response}')
428
 
429
  question = 'What are the similarities and differences between these two images.'
430
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
431
  history=history, return_history=True)
432
+ print(f'User: {question}\nAssistant: {response}')
434
 
435
  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
436
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
 
438
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
439
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
440
 
441
+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
444
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
445
  num_patches_list=num_patches_list,
446
  history=None, return_history=True)
447
+ print(f'User: {question}\nAssistant: {response}')
449
 
450
  question = 'What are the similarities and differences between these two images.'
451
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
452
  num_patches_list=num_patches_list,
453
  history=history, return_history=True)
454
+ print(f'User: {question}\nAssistant: {response}')
456
 
457
  # batch inference, single image per sample (单图批处理)
458
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
 
460
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
461
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
462
 
463
+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
465
  responses = model.batch_chat(tokenizer, pixel_values,
466
  num_patches_list=num_patches_list,
467
  questions=questions,
468
  generation_config=generation_config)
469
  for question, response in zip(questions, responses):
470
+ print(f'User: {question}\nAssistant: {response}')
472
 
473
  # video multi-round conversation (视频多轮对话)
474
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
 
506
  video_path = './examples/red-panda.mp4'
507
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
508
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
509
+ video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
511
  question = video_prefix + 'What is the red panda doing?'
512
+ # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
517
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
518
  num_patches_list=num_patches_list, history=None, return_history=True)
519
+ print(f'User: {question}\nAssistant: {response}')
521
 
522
  question = 'Describe this video in detail.'
523
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
524
  num_patches_list=num_patches_list, history=history, return_history=True)
525
+ print(f'User: {question}\nAssistant: {response}')
527
  ```
528
 
529
  #### Streaming Output
 
605
 
606
  images = [load_image(img_url) for img_url in image_urls]
607
  # Numbering images improves multi-image conversations
608
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
611
  print(response.text)
612
  ```
613
 
 
716
  @article{chen2024far,
717
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
718
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
719
+ journal={Science China Information Sciences},
720
+ volume={67},
721
+ number={12},
722
+ pages={220101},
723
+ year={2024},
724
+ publisher={Springer}
725
+ }
726
+ @inproceedings{chen2024internvl,
727
+ title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
728
+ author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
729
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
730
+ pages={24185--24198},
731
  year={2024}
732
  }
733
+ ```
734
+
735
+ ## What can InternVL do?
736
+
737
+ <details>
738
+ <summary>Visual Perception (click to expand)</summary>
739
+
740
+ - Linear-Probe Image Classification [[see details]](./classification#-evaluation)
741
+
742
+ ViT-22B uses the private JFT-3B dataset.
743
+
744
+ | method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
745
+ | ------------------- | :----: | :---: | :-----: | :---: | :---: | :---: | :-------: |
746
+ | OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
747
+ | DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
748
+ | EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
749
+ | MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
750
+ | ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
751
+ | InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
752
+
753
+ - Semantic Segmentation [[see details]](./segmentation#-evaluation)
754
+
755
+ | method | decoder | #param (train/total) | crop size | mIoU |
756
+ | --------------------- | :-----: | :------------------: | :-------: | ------------ |
757
+ | OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
758
+ | ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
759
+ | InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
760
+ | ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
761
+ | InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
762
+ | ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
763
+ | InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
764
+
765
+ - Zero-Shot Image Classification [[see details]](./clip_benchmark#imagenet-variants-and-objectnet)
766
+
767
+ | method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
768
+ | ----------------- | :---: | :---: | :---: | :---: | :-------: | :-------: |
769
+ | OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
770
+ | EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
771
+ | ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
772
+ | InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
773
+
774
+ - Multilingual Zero-Shot Image Classification [[see details]](./clip_benchmark#multilingual-imagenet-1k)
775
+
776
+ EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian
777
+
778
+ | method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
779
+ | ----------------- | :--------: | :--------: | :--------: | :--------: | :--------: |
780
+ | Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
781
+ | WuKong-ViT-L-G | - | 57.5 | - | - | - |
782
+ | CN-CLIP-ViT-H | - | 59.6 | - | - | - |
783
+ | AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
784
+ | EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
785
+ | OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
786
+ | InternVL-C (ours) | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
787
+
788
+ - Zero-Shot Video Classification
789
+
790
+ | method | #frame | K400 | K600 | K700 |
791
+ | ----------------- | :----: | :---: | :---: | :---: |
792
+ | OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
793
+ | EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
794
+ | InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
795
+ | ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
796
+ | InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
797
+
798
+ </details>
799
+
800
+ <details>
801
+ <summary>Cross-Modal Retrieval (click to expand)</summary>
802
+
803
+ - English Zero-Shot Image-Text Retrieval [[see details]](./clip_benchmark#flickr30k--coco)
804
+
805
+ <table>
806
+ <tr align=center>
807
+ <td rowspan="3" align=left><b>model</b></td>
808
+ <td colspan="6" align=center><b>Flickr30K</b></td>
809
+ <td colspan="6" align=center><b>COCO</b></td>
810
+ <td rowspan="3" align=center><b>avg</b></td>
811
+ </tr>
812
+ <tr align=center>
813
+ <td colspan="3" align=center><b>image-to-text</b></td>
814
+ <td colspan="3" align=center><b>text-to-image</b></td>
815
+ <td colspan="3" align=center><b>image-to-text</b></td>
816
+ <td colspan="3" align=center><b>text-to-image</b></td>
817
+ </tr>
818
+ <tr>
819
+ <td>R@1</td>
820
+ <td>R@5</td>
821
+ <td>R@10</td>
822
+ <td>R@1</td>
823
+ <td>R@5</td>
824
+ <td>R@10</td>
825
+ <td>R@1</td>
826
+ <td>R@5</td>
827
+ <td>R@10</td>
828
+ <td>R@1</td>
829
+ <td>R@5</td>
830
+ <td>R@10</td>
831
+ </tr>
832
+ <tr align=center>
833
+ <td align=left>OpenCLIP-G</td>
834
+ <td>92.9</td>
835
+ <td>99.3</td>
836
+ <td>99.8</td>
837
+ <td>79.5</td>
838
+ <td>95.0</td>
839
+ <td>97.1</td>
840
+ <td>67.3</td>
841
+ <td>86.9</td>
842
+ <td>92.6</td>
843
+ <td>51.4</td>
844
+ <td>74.9</td>
845
+ <td>83.0</td>
846
+ <td>85.0</td>
847
+ </tr>
848
+ <tr align=center>
849
+ <td align=left>EVA-02-CLIP-E+</td>
850
+ <td>93.9</td>
851
+ <td>99.4</td>
852
+ <td>99.8</td>
853
+ <td>78.8</td>
854
+ <td>94.2</td>
855
+ <td>96.8</td>
856
+ <td>68.8</td>
857
+ <td>87.8</td>
858
+ <td>92.8</td>
859
+ <td>51.1</td>
860
+ <td>75.0</td>
861
+ <td>82.7</td>
862
+ <td>85.1</td>
863
+ </tr>
864
+ <tr align=center>
865
+ <td align=left>EVA-CLIP-8B</td>
866
+ <td>95.6</td>
867
+ <td>99.6</td>
868
+ <td>99.9</td>
869
+ <td>80.8</td>
870
+ <td>95.5</td>
871
+ <td>97.6</td>
872
+ <td>70.3</td>
873
+ <td>89.3</td>
874
+ <td>93.9</td>
875
+ <td>53.0</td>
876
+ <td>76.0</td>
877
+ <td>83.4</td>
878
+ <td>86.2</td>
879
+ </tr>
880
+ <tr align=center>
881
+ <td align=left>InternVL-C (ours)</td>
882
+ <td>94.7</td>
883
+ <td>99.6</td>
884
+ <td>99.9</td>
885
+ <td>81.7</td>
886
+ <td>96.0</td>
887
+ <td>98.2</td>
888
+ <td>70.6</td>
889
+ <td>89.0</td>
890
+ <td>93.5</td>
891
+ <td>54.1</td>
892
+ <td>77.3</td>
893
+ <td>84.6</td>
894
+ <td>86.6</td>
895
+ </tr>
896
+ <tr align=center>
897
+ <td align=left>InternVL-G (ours)</td>
898
+ <td>95.7</td>
899
+ <td>99.7</td>
900
+ <td>99.9</td>
901
+ <td>85.0</td>
902
+ <td>97.0</td>
903
+ <td>98.6</td>
904
+ <td>74.9</td>
905
+ <td>91.3</td>
906
+ <td>95.2</td>
907
+ <td>58.6</td>
908
+ <td>81.3</td>
909
+ <td>88.0</td>
910
+ <td>88.8</td>
911
+ </tr>
912
+
913
+ </table>
914
+
915
+ - Chinese Zero-Shot Image-Text Retrieval [[see details]](./clip_benchmark#flickr30k-cn--coco-cn)
916
+
917
+ <table>
918
+ <tr align=center>
919
+ <td rowspan="3" align=left><b>model</b></td>
920
+ <td colspan="6" align=center><b>Flickr30K-CN</b></td>
921
+ <td colspan="6" align=center><b>COCO-CN</b></td>
922
+ <td rowspan="3" align=center><b>avg</b></td>
923
+
924
+ </tr>
925
+ <tr align=center>
926
+ <td colspan="3" align=center><b>image-to-text</b></td>
927
+ <td colspan="3" align=center><b>text-to-image</b></td>
928
+ <td colspan="3" align=center><b>image-to-text</b></td>
929
+ <td colspan="3" align=center><b>text-to-image</b></td>
930
+ </tr>
931
+ <tr>
932
+ <td>R@1</td>
933
+ <td>R@5</td>
934
+ <td>R@10</td>
935
+ <td>R@1</td>
936
+ <td>R@5</td>
937
+ <td>R@10</td>
938
+ <td>R@1</td>
939
+ <td>R@5</td>
940
+ <td>R@10</td>
941
+ <td>R@1</td>
942
+ <td>R@5</td>
943
+ <td>R@10</td>
944
+ </tr>
945
+
946
+ <tr align=center>
947
+ <td align=left>CN-CLIP-ViT-H</td>
948
+ <td>81.6</td>
949
+ <td>97.5</td>
950
+ <td>98.8</td>
951
+ <td>71.2</td>
952
+ <td>91.4</td>
953
+ <td>95.5</td>
954
+ <td>63.0</td>
955
+ <td>86.6</td>
956
+ <td>92.9</td>
957
+ <td>69.2</td>
958
+ <td>89.9</td>
959
+ <td>96.1</td>
960
+ <td>86.1</td>
961
+ </tr>
962
+
963
+ <tr align=center>
964
+ <td align=left>OpenCLIP-XLM-R-H</td>
965
+ <td>86.1</td>
966
+ <td>97.5</td>
967
+ <td>99.2</td>
968
+ <td>71.0</td>
969
+ <td>90.5</td>
970
+ <td>94.9</td>
971
+ <td>70.0</td>
972
+ <td>91.5</td>
973
+ <td>97.0</td>
974
+ <td>66.1</td>
975
+ <td>90.8</td>
976
+ <td>96.0</td>
977
+ <td>87.6</td>
978
+ </tr>
979
+
980
+ <tr align=center>
981
+ <td align=left>InternVL-C (ours)</td>
982
+ <td>90.3</td>
983
+ <td>98.8</td>
984
+ <td>99.7</td>
985
+ <td>75.1</td>
986
+ <td>92.9</td>
987
+ <td>96.4</td>
988
+ <td>68.8</td>
989
+ <td>92.0</td>
990
+ <td>96.7</td>
991
+ <td>68.9</td>
992
+ <td>91.9</td>
993
+ <td>96.5</td>
994
+ <td>89.0</td>
995
+ </tr>
996
+ <tr align=center>
997
+ <td align=left>InternVL-G (ours)</td>
998
+ <td>92.9</td>
999
+ <td>99.4</td>
1000
+ <td>99.8</td>
1001
+ <td>77.7</td>
1002
+ <td>94.8</td>
1003
+ <td>97.3</td>
1004
+ <td>71.4</td>
1005
+ <td>93.9</td>
1006
+ <td>97.7</td>
1007
+ <td>73.8</td>
1008
+ <td>94.4</td>
1009
+ <td>98.1</td>
1010
+ <td>90.9</td>
1011
+ </tr>
1012
+
1013
+ </table>
1014
+
1015
+ - Multilingual Zero-Shot Image-Text Retrieval on XTD [[see details]](./clip_benchmark#xtd)
1016
+
1017
+ | method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
1018
+ | ----------------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-----: |
1019
+ | AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
1020
+ | OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
1021
+ | InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
1022
+ | InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
1023
+
1024
+ </details>
1025
+
1026
+ <details>
1027
+ <summary>Multimodal Dialogue</summary>
1028
+
1029
+ </details>
1030
+
1031
+ ## Quick Start with HuggingFace
1032
+
1033
+ <details>
1034
+ <summary>using InternViT-6B for visual feature extraction (click to expand)</summary>
1035
+
1036
+ ```python
1037
+ import torch
1038
+ from PIL import Image
1039
+ from transformers import AutoModel, CLIPImageProcessor
1040
+
1041
+ model = AutoModel.from_pretrained(
1042
+ 'OpenGVLab/InternViT-6B-448px-V2_5',
1043
+ torch_dtype=torch.bfloat16,
1044
+ low_cpu_mem_usage=True,
1045
+ trust_remote_code=True).cuda().eval()
1046
+
1047
+ image = Image.open('./examples/image1.jpg').convert('RGB')
1048
+
1049
+ image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-5')
1050
+
1051
+ pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
1052
+ pixel_values = pixel_values.to(torch.bfloat16).cuda()
1053
+
1054
+ outputs = model(pixel_values)
1055
+ ```
1056
+
1057
+ </details>
1058
+
1059
+ <details>
1060
+ <summary>using InternVL-C(ontrastive) and InternVL-G(enerative) for cross-modal retrieval (click to expand)</summary>
1061
+
1062
+ ```python
1063
+ import torch
1064
+ from PIL import Image
1065
+ from transformers import AutoModel, CLIPImageProcessor
1066
+ from transformers import AutoTokenizer
1067
+
1068
+
1069
+ model = AutoModel.from_pretrained(
1070
+ 'OpenGVLab/InternVL-14B-224px',
1071
+ torch_dtype=torch.bfloat16,
1072
+ low_cpu_mem_usage=True,
1073
+ trust_remote_code=True).cuda().eval()
1074
+
1075
+ image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')
1076
+
1077
+ tokenizer = AutoTokenizer.from_pretrained(
1078
+ 'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
1079
+ tokenizer.pad_token_id = 0 # set pad_token_id to 0
1080
+
1081
+ images = [
1082
+ Image.open('./examples/image1.jpg').convert('RGB'),
1083
+ Image.open('./examples/image2.jpg').convert('RGB'),
1084
+ Image.open('./examples/image3.jpg').convert('RGB')
1085
+ ]
1086
+ prefix = 'summarize:'
1087
+ texts = [
1088
+ prefix + 'a photo of a red panda', # English
1089
+ prefix + '一张熊猫的照片', # Chinese
1090
+ prefix + '二匹の猫の写真' # Japanese
1091
+ ]
1092
+
1093
+ pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
1094
+ pixel_values = pixel_values.to(torch.bfloat16).cuda()
1095
+ input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
1096
+ truncation=True, padding='max_length').input_ids.cuda()
1097
+
1098
+ # InternVL-C
1099
+ logits_per_image, logits_per_text = model(
1100
+ image=pixel_values, text=input_ids, mode='InternVL-C')
1101
+ probs = logits_per_image.softmax(dim=-1)
1102
+ # tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
1103
+ # [2.2949e-02, 9.7656e-01, 5.9903e-06],
1104
+ # [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
1105
+ # dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)
1106
+
1107
+ # InternVL-G
1108
+ logits_per_image, logits_per_text = model(
1109
+ image=pixel_values, text=input_ids, mode='InternVL-G')
1110
+ probs = logits_per_image.softmax(dim=-1)
1111
+ # tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
1112
+ # [8.6060e-03, 9.9219e-01, 2.8759e-06],
1113
+ # [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
1114
+ # dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)
1115
+
1116
+ # please set add_eos_token to False for generation
1117
+ tokenizer.add_eos_token = False
1118
+ image = Image.open('./examples/image1.jpg').convert('RGB')
1119
+ pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
1120
+ pixel_values = pixel_values.to(torch.bfloat16).cuda()
1121
+
1122
+ tokenized = tokenizer("English caption:", return_tensors='pt')
1123
+ pred = model.generate(
1124
+ pixel_values=pixel_values,
1125
+ input_ids=tokenized.input_ids.cuda(),
1126
+ attention_mask=tokenized.attention_mask.cuda(),
1127
+ num_beams=5,
1128
+ min_new_tokens=8,
1129
+ )
1130
+ caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
1131
+ # English caption: a red panda sitting on top of a wooden platform
1132
+ ```
1133
+
1134
+ </details>
1135
+
1136
+ <details>
1137
+ <summary>using InternVL 2.5 for multimodal chat (click to expand)</summary>
1138
+
1139
+ Here, we take the smaller `OpenGVLab/InternVL2_5-8B` as an example:
1140
+
1141
+ ```python
1142
+ import numpy as np
1143
+ import torch
1144
+ import torchvision.transforms as T
1145
+ from decord import VideoReader, cpu
1146
+ from PIL import Image
1147
+ from torchvision.transforms.functional import InterpolationMode
1148
+ from transformers import AutoModel, AutoTokenizer
1149
+
1150
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
1151
+ IMAGENET_STD = (0.229, 0.224, 0.225)
1152
+
1153
+ def build_transform(input_size):
1154
+ MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
1155
+ transform = T.Compose([
1156
+ T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
1157
+ T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
1158
+ T.ToTensor(),
1159
+ T.Normalize(mean=MEAN, std=STD)
1160
+ ])
1161
+ return transform
1162
+
1163
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
1164
+ best_ratio_diff = float('inf')
1165
+ best_ratio = (1, 1)
1166
+ area = width * height
1167
+ for ratio in target_ratios:
1168
+ target_aspect_ratio = ratio[0] / ratio[1]
1169
+ ratio_diff = abs(aspect_ratio - target_aspect_ratio)
1170
+ if ratio_diff < best_ratio_diff:
1171
+ best_ratio_diff = ratio_diff
1172
+ best_ratio = ratio
1173
+ elif ratio_diff == best_ratio_diff:
1174
+ if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
1175
+ best_ratio = ratio
1176
+ return best_ratio
1177
+
1178
+ def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
1179
+ orig_width, orig_height = image.size
1180
+ aspect_ratio = orig_width / orig_height
1181
+
1182
+ # calculate the existing image aspect ratio
1183
+ target_ratios = set(
1184
+ (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
1185
+ i * j <= max_num and i * j >= min_num)
1186
+ target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
1187
+
1188
+ # find the closest aspect ratio to the target
1189
+ target_aspect_ratio = find_closest_aspect_ratio(
1190
+ aspect_ratio, target_ratios, orig_width, orig_height, image_size)
1191
+
1192
+ # calculate the target width and height
1193
+ target_width = image_size * target_aspect_ratio[0]
1194
+ target_height = image_size * target_aspect_ratio[1]
1195
+ blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
1196
+
1197
+ # resize the image
1198
+ resized_img = image.resize((target_width, target_height))
1199
+ processed_images = []
1200
+ for i in range(blocks):
1201
+ box = (
1202
+ (i % (target_width // image_size)) * image_size,
1203
+ (i // (target_width // image_size)) * image_size,
1204
+ ((i % (target_width // image_size)) + 1) * image_size,
1205
+ ((i // (target_width // image_size)) + 1) * image_size
1206
+ )
1207
+ # split the image
1208
+ split_img = resized_img.crop(box)
1209
+ processed_images.append(split_img)
1210
+ assert len(processed_images) == blocks
1211
+ if use_thumbnail and len(processed_images) != 1:
1212
+ thumbnail_img = image.resize((image_size, image_size))
1213
+ processed_images.append(thumbnail_img)
1214
+ return processed_images
1215
+
1216
+ def load_image(image_file, input_size=448, max_num=12):
1217
+ image = Image.open(image_file).convert('RGB')
1218
+ transform = build_transform(input_size=input_size)
1219
+ images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
1220
+ pixel_values = [transform(image) for image in images]
1221
+ pixel_values = torch.stack(pixel_values)
1222
+ return pixel_values
1223
+
1224
+ # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
1225
+ # Otherwise, you need to load the model across multiple GPUs; please refer to the `Multiple GPUs` section (a minimal sketch follows this code block).
1226
+ path = 'OpenGVLab/InternVL2_5-8B'
1227
+ model = AutoModel.from_pretrained(
1228
+ path,
1229
+ torch_dtype=torch.bfloat16,
1230
+ low_cpu_mem_usage=True,
1231
+ trust_remote_code=True).eval().cuda()
1232
+ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
1233
+
1234
+ # set the max number of tiles in `max_num`
1235
+ pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
1236
+ generation_config = dict(max_new_tokens=1024, do_sample=False)
1237
+
1238
+ # pure-text conversation (纯文本对话)
1239
+ question = 'Hello, who are you?'
1240
+ response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
1241
+ print(f'User: {question}\nAssistant: {response}')
1243
+
1244
+ question = 'Can you tell me a story?'
1245
+ response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
1246
+ print(f'User: {question}\nAssistant: {response}')
1248
+
1249
+ # single-image single-round conversation (单图单轮对话)
1250
+ question = '<image>\nPlease describe the image shortly.'
1252
+ response = model.chat(tokenizer, pixel_values, question, generation_config)
1253
+ print(f'User: {question}\nAssistant: {response}')
1255
+
1256
+ # single-image multi-round conversation (单图多轮对话)
1257
+ question = '<image>\nPlease describe the image in detail.'
1259
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
1260
+ print(f'User: {question}\nAssistant: {response}')
1262
+
1263
+ question = 'Please write a poem according to the image.'
1264
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
1265
+ print(f'User: {question}\nAssistant: {response}')
1267
+
1268
+ # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
1269
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
1270
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
1271
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
1272
+
1273
+ question = '<image>\nDescribe the two images in detail.'
1275
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
1276
+ history=None, return_history=True)
1277
+ print(f'User: {question}\nAssistant: {response}')
1279
+
1280
+ question = 'What are the similarities and differences between these two images.'
1281
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
1282
+ history=history, return_history=True)
1283
+ print(f'User: {question}\nAssistant: {response}')
1285
+
1286
+ # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
1287
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
1288
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
1289
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
1290
+ num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
1291
+
1292
+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
1295
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
1296
+ num_patches_list=num_patches_list,
1297
+ history=None, return_history=True)
1298
+ print(f'User: {question}\nAssistant: {response}')
1300
+
1301
+ question = 'What are the similarities and differences between these two images.'
1302
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
1303
+ num_patches_list=num_patches_list, history=history, return_history=True)
1304
+ print(f'User: {question}\nAssistant: {response}')
1306
+
1307
+ # batch inference, single image per sample (单图批处理)
1308
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
1309
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
1310
+ num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
1311
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
1312
+
1313
+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
1315
+ responses = model.batch_chat(tokenizer, pixel_values,
1316
+ num_patches_list=num_patches_list,
1317
+ questions=questions,
1318
+ generation_config=generation_config)
1319
+ for question, response in zip(questions, responses):
1320
+ print(f'User: {question}\nAssistant: {response}')
1322
+
1323
+ # video multi-round conversation (视频多轮对话)
1324
+ def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
1325
+ if bound:
1326
+ start, end = bound[0], bound[1]
1327
+ else:
1328
+ start, end = -100000, 100000
1329
+ start_idx = max(first_idx, round(start * fps))
1330
+ end_idx = min(round(end * fps), max_frame)
1331
+ seg_size = float(end_idx - start_idx) / num_segments
1332
+ frame_indices = np.array([
1333
+ int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
1334
+ for idx in range(num_segments)
1335
+ ])
1336
+ return frame_indices
1337
+
1338
+ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
1339
+ vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
1340
+ max_frame = len(vr) - 1
1341
+ fps = float(vr.get_avg_fps())
1342
+
1343
+ pixel_values_list, num_patches_list = [], []
1344
+ transform = build_transform(input_size=input_size)
1345
+ frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
1346
+ for frame_index in frame_indices:
1347
+ img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
1348
+ img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
1349
+ pixel_values = [transform(tile) for tile in img]
1350
+ pixel_values = torch.stack(pixel_values)
1351
+ num_patches_list.append(pixel_values.shape[0])
1352
+ pixel_values_list.append(pixel_values)
1353
+ pixel_values = torch.cat(pixel_values_list)
1354
+ return pixel_values, num_patches_list
1355
+
1356
+ video_path = './examples/red-panda.mp4'
1357
+ pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
1358
+ pixel_values = pixel_values.to(torch.bfloat16).cuda()
1359
+ video_prefix = ''.join([f'Frame-{i+1}: <image>\n' for i in range(len(num_patches_list))])
1361
+ question = video_prefix + 'What is the red panda doing?'
1362
+ # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
1367
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
1368
+ num_patches_list=num_patches_list, history=None, return_history=True)
1369
+ print(f'User: {question}\nAssistant: {response}')
1371
+
1372
+ question = 'Describe this video in detail.'
1373
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
1374
+ num_patches_list=num_patches_list, history=history, return_history=True)
1375
+ print(f'User: {question}\nAssistant: {response}')
1377
+ ```
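As noted in the comment near the top of the snippet above, the example assumes the whole model fits on a single large GPU. Below is a minimal multi-GPU loading sketch; `device_map='auto'` (which requires the `accelerate` package) is an assumption used here in place of the card's custom `split_model` helper shown for the 78B checkpoint, so layer placement may differ from the official recipe.

```python
# Sketch only: spread the model across all visible GPUs with accelerate's auto device map.
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL2_5-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map='auto').eval()  # no explicit .cuda(); weights are already placed by the device map
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```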
1378
+
1379
+ </details>
1380
+
1381
+ ## License
1382
+
1383
+ This project is released under the MIT License. It uses the pre-trained Qwen2.5-72B-Instruct as a component, which is licensed under the Qwen License.
1384
+
1385
+ ## Citation
1386
+
1387
+ If you find this project useful in your research, please consider citing:
1388
+
1389
+ ```BibTeX
1390
+ @article{wang2024mpo,
1391
+ title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
1392
+ author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
1393
+ journal={arXiv preprint arXiv:2411.10442},
1394
+ year={2024}
1395
+ }
1396
+ @article{chen2024expanding,
1397
+ title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
1398
+ author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
1399
+ journal={arXiv preprint arXiv:2412.05271},
1400
+ year={2024}
1401
+ }
1402
+ @article{chen2024far,
1403
+ title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
1404
+ author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
1405
+ journal={Science China Information Sciences},
1406
+ volume={67},
1407
+ number={12},
1408
+ pages={220101},
1409
+ year={2024},
1410
+ publisher={Springer}
1411
+ }
1412
  @inproceedings{chen2024internvl,
1413
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
1414
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
 
1417
  year={2024}
1418
  }
1419
  ```