Commit 75e738e by czczup
1 Parent(s): 4dfa205

Update README.md

Files changed (1)
  1. README.md +210 -3
@@ -1,3 +1,210 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ pipeline_tag: visual-question-answering
+ ---
+
+ # Model Card for InternVL2-4B
+
+ [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)
+
+ [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#model-usage) [\[🌐 Community-hosted API\]](https://rapidapi.com/adushar1320/api/internvl-chat) [\[📖 Chinese Explainer\]](https://zhuanlan.zhihu.com/p/675877376)
+
+ ## Model Usage
+
+ We provide example code to run InternVL2-4B using `transformers`.
+
+ > Please use transformers==4.37.2 to ensure the model works as expected.
+
+ ```python
+ import torch
+ import torchvision.transforms as T
+ from PIL import Image
+ from torchvision.transforms.functional import InterpolationMode
+ from transformers import AutoModel, AutoTokenizer
+
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
+ IMAGENET_STD = (0.229, 0.224, 0.225)
+
+
+ def build_transform(input_size):
+     MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
+     transform = T.Compose([
+         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+         T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
+         T.ToTensor(),
+         T.Normalize(mean=MEAN, std=STD)
+     ])
+     return transform
+
+
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+     best_ratio_diff = float('inf')
+     best_ratio = (1, 1)
+     area = width * height
+     for ratio in target_ratios:
+         target_aspect_ratio = ratio[0] / ratio[1]
+         ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+         if ratio_diff < best_ratio_diff:
+             best_ratio_diff = ratio_diff
+             best_ratio = ratio
+         elif ratio_diff == best_ratio_diff:
+             if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                 best_ratio = ratio
+     return best_ratio
+
+
+ def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
+     orig_width, orig_height = image.size
+     aspect_ratio = orig_width / orig_height
+
+     # enumerate candidate tile grids (columns, rows) whose tile count lies in [min_num, max_num]
+     target_ratios = set(
+         (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+         i * j <= max_num and i * j >= min_num)
+     target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+
+     # find the candidate grid whose aspect ratio is closest to the image's
+     target_aspect_ratio = find_closest_aspect_ratio(
+         aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+
+     # calculate the target width and height
+     target_width = image_size * target_aspect_ratio[0]
+     target_height = image_size * target_aspect_ratio[1]
+     blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+
+     # resize the image
+     resized_img = image.resize((target_width, target_height))
+     processed_images = []
+     for i in range(blocks):
+         box = (
+             (i % (target_width // image_size)) * image_size,
+             (i // (target_width // image_size)) * image_size,
+             ((i % (target_width // image_size)) + 1) * image_size,
+             ((i // (target_width // image_size)) + 1) * image_size
+         )
+         # split the image
+         split_img = resized_img.crop(box)
+         processed_images.append(split_img)
+     assert len(processed_images) == blocks
+     if use_thumbnail and len(processed_images) != 1:
+         thumbnail_img = image.resize((image_size, image_size))
+         processed_images.append(thumbnail_img)
+     return processed_images
+
+
+ def load_image(image_file, input_size=448, max_num=6):
+     image = Image.open(image_file).convert('RGB')
+     transform = build_transform(input_size=input_size)
+     images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
+     pixel_values = [transform(image) for image in images]
+     pixel_values = torch.stack(pixel_values)
+     return pixel_values
+
+
+ path = 'OpenGVLab/InternVL2-4B'
+ model = AutoModel.from_pretrained(
+     path,
+     torch_dtype=torch.bfloat16,
+     low_cpu_mem_usage=True,
+     trust_remote_code=True).eval().cuda()
+
+ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
+ # set the max number of tiles in `max_num`
+ pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
+
+ generation_config = dict(
+     num_beams=1,
+     max_new_tokens=1024,
+     do_sample=False,
+ )
+
+ # pure-text conversation
+ question = 'Hello, who are you?'
+ response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ question = 'Can you tell me a story?'
+ response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ # single-image single-round conversation
+ question = '<image>\nPlease describe the image briefly.'
+ response = model.chat(tokenizer, pixel_values, question, generation_config)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ # single-image multi-round conversation
+ question = '<image>\nPlease describe the image in detail.'
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ question = 'Please write a poem according to the image.'
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ # multi-image multi-round conversation
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
+ num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
+
+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
+                                num_patches_list=num_patches_list,
+                                history=None, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ question = 'What are the similarities and differences between these two images?'
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
+                                num_patches_list=num_patches_list,
+                                history=history, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ # batch inference, single image per sample
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
+ num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
+
+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
+ responses = model.batch_chat(tokenizer, pixel_values,
+                              num_patches_list=num_patches_list,
+                              questions=questions,
+                              generation_config=generation_config)
+ for question, response in zip(questions, responses):
+     print(f'User: {question}')
+     print(f'Assistant: {response}')
+ ```
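
Reviewer note (not part of the commit): the tiling step in `dynamic_preprocess` can be hard to follow from the example alone, so here is a minimal, dependency-free sketch of the same grid-selection arithmetic. The helper names `candidate_grids`, `choose_grid`, and `tile_boxes` are hypothetical and do not exist in the README or in `transformers`; the 800x600 input size is also just an assumed example. Candidate grids are additionally sorted by the grid tuple to make tie order deterministic, which the README code does not do.

```python
def candidate_grids(min_num=1, max_num=6):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Smallest tile counts first; the tuple itself breaks ties deterministically.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def choose_grid(width, height, image_size=448, min_num=1, max_num=6):
    """Pick the grid whose aspect ratio is closest to the image's."""
    aspect_ratio = width / height
    area = width * height
    best_diff, best = float('inf'), (1, 1)
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # Same aspect ratio: prefer the larger grid only for large images.
            best = (cols, rows)
    return best


def tile_boxes(grid, image_size=448):
    """Crop boxes (left, upper, right, lower) in row-major order."""
    cols, rows = grid
    return [
        (c * image_size, r * image_size, (c + 1) * image_size, (r + 1) * image_size)
        for r in range(rows)
        for c in range(cols)
    ]


grid = choose_grid(800, 600)  # assumed example input size
boxes = tile_boxes(grid)
print(grid, len(boxes))  # (3, 2) 6
```

Under these assumptions an 800x600 image maps to a 3x2 grid, so it is resized to 1344x896 and cut into six 448x448 tiles. Since `load_image` calls `dynamic_preprocess` with `use_thumbnail=True`, one thumbnail is appended, giving 7 patches for that image; this is the per-image count that `num_patches_list` records.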
+
+ ## Citation
+
+ If you find this project useful in your research, please consider citing:
+
+ ```BibTeX
+ @article{chen2023internvl,
+   title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
+   author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
+   journal={arXiv preprint arXiv:2312.14238},
+   year={2023}
+ }
+ @article{chen2024far,
+   title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
+   author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
+   journal={arXiv preprint arXiv:2404.16821},
+   year={2024}
+ }
+ ```
+
+ ## License
+
+ This project is released under the MIT license.
+
+ ## Acknowledgement
+
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!