minpeter and soldni committed (verified)
Commit 1c12cc0 · 0 Parent(s)

Duplicate from allenai/Molmo-72B-0924


Co-authored-by: Luca Soldaini <[email protected]>

This view is limited to 50 files because the commit contains too many changes. See the raw diff for the complete change set.
Files changed (50)
  1. .gitattributes +35 -0
  2. Notice.txt +3 -0
  3. README.md +210 -0
  4. added_tokens.json +428 -0
  5. config.json +32 -0
  6. config_molmo.py +60 -0
  7. generation_config.json +4 -0
  8. image_preprocessing_molmo.py +546 -0
  9. merges.txt +0 -0
  10. model-00001-of-00083.safetensors +3 -0
  11. model-00002-of-00083.safetensors +3 -0
  12. model-00003-of-00083.safetensors +3 -0
  13. model-00004-of-00083.safetensors +3 -0
  14. model-00005-of-00083.safetensors +3 -0
  15. model-00006-of-00083.safetensors +3 -0
  16. model-00007-of-00083.safetensors +3 -0
  17. model-00008-of-00083.safetensors +3 -0
  18. model-00009-of-00083.safetensors +3 -0
  19. model-00010-of-00083.safetensors +3 -0
  20. model-00011-of-00083.safetensors +3 -0
  21. model-00012-of-00083.safetensors +3 -0
  22. model-00013-of-00083.safetensors +3 -0
  23. model-00014-of-00083.safetensors +3 -0
  24. model-00015-of-00083.safetensors +3 -0
  25. model-00016-of-00083.safetensors +3 -0
  26. model-00017-of-00083.safetensors +3 -0
  27. model-00018-of-00083.safetensors +3 -0
  28. model-00019-of-00083.safetensors +3 -0
  29. model-00020-of-00083.safetensors +3 -0
  30. model-00021-of-00083.safetensors +3 -0
  31. model-00022-of-00083.safetensors +3 -0
  32. model-00023-of-00083.safetensors +3 -0
  33. model-00024-of-00083.safetensors +3 -0
  34. model-00025-of-00083.safetensors +3 -0
  35. model-00026-of-00083.safetensors +3 -0
  36. model-00027-of-00083.safetensors +3 -0
  37. model-00028-of-00083.safetensors +3 -0
  38. model-00029-of-00083.safetensors +3 -0
  39. model-00030-of-00083.safetensors +3 -0
  40. model-00031-of-00083.safetensors +3 -0
  41. model-00032-of-00083.safetensors +3 -0
  42. model-00033-of-00083.safetensors +3 -0
  43. model-00034-of-00083.safetensors +3 -0
  44. model-00035-of-00083.safetensors +3 -0
  45. model-00036-of-00083.safetensors +3 -0
  46. model-00037-of-00083.safetensors +3 -0
  47. model-00038-of-00083.safetensors +3 -0
  48. model-00039-of-00083.safetensors +3 -0
  49. model-00040-of-00083.safetensors +3 -0
  50. model-00041-of-00083.safetensors +3 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
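The attributes above route all matching files, including the `*.safetensors` shards listed later in this commit, through Git LFS, so a plain `git clone` without git-lfs fetches only pointer files. A minimal sketch of fetching the resolved files with the `huggingface_hub` client instead (assuming the upstream repo id `allenai/Molmo-72B-0924` from the README; the duplicated repo's id may differ):

```python
# Minimal sketch (assumption: upstream repo id from the README, not this duplicate).
# snapshot_download resolves LFS pointer files to the actual shard contents.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="allenai/Molmo-72B-0924")
print(local_dir)  # local cache directory containing the downloaded files
```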
Notice.txt ADDED
@@ -0,0 +1,3 @@
+ Molmo-72B is trained on Qwen2-72B as the base model. Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
+
+ A copy of the license for Qwen2-72B can be found at https://huggingface.co/Qwen/Qwen2-72B/blob/main/LICENSE.
README.md ADDED
@@ -0,0 +1,210 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ base_model:
+ - openai/clip-vit-large-patch14-336
+ - Qwen/Qwen2-72B
+ pipeline_tag: image-text-to-text
+ tags:
+ - multimodal
+ - olmo
+ - molmo
+ - pixmo
+ library_name: transformers
+ ---
+
+ <img src="molmo_logo.png" alt="Logo for the Molmo Project" style="width: auto; height: 50px;">
+
+ # Molmo 72B
+
+ Molmo is a family of open vision-language models developed by the Allen Institute for AI. Molmo models are trained on PixMo, a dataset of 1 million highly curated image-text pairs, and achieve state-of-the-art performance among multimodal models of a similar size while being fully open-source. You can find all models in the Molmo family [here](https://huggingface.co/collections/allenai/molmo-66f379e6fe3b8ef090a8ca19).
+ **Learn more** about the Molmo family [in our announcement blog post](https://molmo.allenai.org/blog) or the [paper](https://huggingface.co/papers/2409.17146).
+
+ Molmo 72B is based on [Qwen2-72B](https://huggingface.co/Qwen/Qwen2-72B) and uses [OpenAI CLIP](https://huggingface.co/openai/clip-vit-large-patch14-336) as its vision backbone.
+ Molmo-72B achieves the highest academic benchmark score and ranks second on human evaluation, just slightly behind GPT-4o.
+
+ This checkpoint is a **preview** of the Molmo release. All artifacts used in creating Molmo (the PixMo dataset, training code, evaluations, and intermediate checkpoints) will be made available at a later date, furthering our commitment to open-source AI development and reproducibility.
+
+ [**Sign up here**](https://docs.google.com/forms/d/e/1FAIpQLSdML1MhNNBDsCHpgWG65Oydg2SjZzVasyqlP08nBrWjZp_c7A/viewform) to be the first to know when artifacts are released.
+
+ Quick links:
+ - 💬 [Demo](https://molmo.allenai.org/)
+ - 📂 [All Models](https://huggingface.co/collections/allenai/molmo-66f379e6fe3b8ef090a8ca19)
+ - 📃 [Paper](https://molmo.allenai.org/paper.pdf)
+ - 🎥 [Blog with Videos](https://molmo.allenai.org/blog)
+
+ ## Quick Start
+
+ To run Molmo, first install dependencies:
+
+ ```bash
+ pip install einops torchvision
+ ```
+
+ Then, follow these steps:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
+ from PIL import Image
+ import requests
+ import torch
+
+ # load the processor
+ processor = AutoProcessor.from_pretrained(
+     'allenai/Molmo-72B-0924',
+     trust_remote_code=True,
+     torch_dtype='auto',
+     device_map='auto'
+ )
+
+ # load the model
+ model = AutoModelForCausalLM.from_pretrained(
+     'allenai/Molmo-72B-0924',
+     trust_remote_code=True,
+     torch_dtype='auto',
+     device_map='auto'
+ )
+
+ # process the image and text
+ inputs = processor.process(
+     images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
+     text="Describe this image."
+ )
+
+ # move inputs to the correct device and make a batch of size 1
+ inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
+
+ # generate output; maximum 200 new tokens; stop generation when <|endoftext|> is generated
+ output = model.generate_from_batch(
+     inputs,
+     GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
+     tokenizer=processor.tokenizer
+ )
+
+ # only get generated tokens; decode them to text
+ generated_tokens = output[0, inputs['input_ids'].size(1):]
+ generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
+
+ # print the generated text
+ print(generated_text)
+
+ # >>> This image features an adorable black Labrador puppy sitting on a wooden deck.
+ # The puppy is positioned in the center of the frame, looking up at the camera...
+ ```
+
+ To make inference more efficient, run with autocast:
+
+ ```python
+ with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
+     output = model.generate_from_batch(
+         inputs,
+         GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
+         tokenizer=processor.tokenizer
+     )
+ ```
+
+ We did most of our evaluation in this setting (autocast on, but float32 weights).
+
+ To further reduce the memory requirements, the model can be run with bfloat16 weights:
+
+ ```python
+ model.to(dtype=torch.bfloat16)
+ inputs["images"] = inputs["images"].to(torch.bfloat16)
+ output = model.generate_from_batch(
+     inputs,
+     GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
+     tokenizer=processor.tokenizer
+ )
+ ```
+
+ Note that we have observed that this can change the output of the model compared to running with float32 weights.
+
+ ## Evaluations
+
+ | Model | Average Score on 11 Academic Benchmarks | Human Preference Elo Rating |
+ |-----------------------------|-----------------------------------------|-----------------------------|
+ | **Molmo 72B (this model)** | **81.2** | **1077** |
+ | Molmo 7B-D | 77.3 | 1056 |
+ | Molmo 7B-O | 74.6 | 1051 |
+ | MolmoE 1B | 68.6 | 1032 |
+ | GPT-4o | 78.5 | 1079 |
+ | GPT-4V | 71.1 | 1041 |
+ | Gemini 1.5 Pro | 78.3 | 1074 |
+ | Gemini 1.5 Flash | 75.1 | 1054 |
+ | Claude 3.5 Sonnet | 76.7 | 1069 |
+ | Claude 3 Opus | 66.4 | 971 |
+ | Claude 3 Haiku | 65.3 | 999 |
+ | Qwen VL2 72B | 79.4 | 1037 |
+ | Qwen VL2 7B | 73.7 | 1025 |
+ | Intern VL2 LLAMA 76B | 77.1 | 1018 |
+ | Intern VL2 8B | 69.4 | 953 |
+ | Pixtral 12B | 69.5 | 1016 |
+ | Phi3.5-Vision 4B | 59.7 | 982 |
+ | PaliGemma 3B | 50.0 | 937 |
+ | LLAVA OneVision 72B | 76.6 | 1051 |
+ | LLAVA OneVision 7B | 72.0 | 1024 |
+ | Cambrian-1 34B | 66.8 | 953 |
+ | Cambrian-1 8B | 63.4 | 952 |
+ | xGen - MM - Interleave 4B | 59.5 | 979 |
+ | LLAVA-1.5 13B | 43.9 | 960 |
+ | LLAVA-1.5 7B | 40.7 | 951 |
+
+ *Benchmarks: AI2D test, ChartQA test, VQA v2.0 test, DocQA test, InfographicVQA test, TextVQA val, RealWorldQA, MMMU val, MathVista testmini, CountBenchQA, Flickr Count (we collected this new dataset that is significantly harder than CountBenchQA).*
+
+ ## FAQs
+
+ ### I'm getting a broadcast error when processing images!
+
+ Your image might not be in RGB format. You can convert it using the following code snippet:
+
+ ```python
+ from PIL import Image
+
+ image = Image.open(...)
+
+ if image.mode != "RGB":
+     image = image.convert("RGB")
+ ```
+
+ ### Molmo doesn't work well with transparent images!
+
+ We have received reports that Molmo models might struggle with transparent images.
+ For the time being, we recommend adding a white or dark background to your images before passing them to the model. The code snippet below shows how to do this using the Python Imaging Library (PIL):
+
+ ```python
+ import requests
+ from PIL import Image, ImageStat
+
+ # Load the image
+ url = "..."
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ # Convert the image to grayscale to calculate brightness
+ gray_image = image.convert('L')  # Convert to grayscale
+
+ # Calculate the average brightness
+ stat = ImageStat.Stat(gray_image)
+ average_brightness = stat.mean[0]  # Get the average value
+
+ # Define background color based on brightness (threshold can be adjusted)
+ bg_color = (0, 0, 0) if average_brightness > 127 else (255, 255, 255)
+
+ # Create a new image with the same size as the original, filled with the background color
+ new_image = Image.new('RGB', image.size, bg_color)
+
+ # Paste the original image on top of the background (use image as a mask if needed)
+ new_image.paste(image, (0, 0), image if image.mode == 'RGBA' else None)
+
+ # Now you can pass the new_image to Molmo
+ processor = AutoProcessor.from_pretrained(
+     'allenai/Molmo-72B-0924',
+     trust_remote_code=True,
+     torch_dtype='auto',
+     device_map='auto'
+ )
+ ```
+
+ ## License and Use
+
+ This model is licensed under Apache 2.0. It is intended for research and educational use.
+ For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).
+ The base model used is Qwen2-72B, whose license (the Tongyi Qianwen license) you can find [here](https://huggingface.co/Qwen/Qwen2-72B/blob/main/LICENSE).
added_tokens.json ADDED
@@ -0,0 +1,428 @@
1
+ {
2
+ "<im_col>": 152067,
3
+ "<im_end>": 152065,
4
+ "<im_patch>": 152066,
5
+ "<im_start>": 152064,
6
+ "<|endoftext|>": 151643,
7
+ "<|im_end|>": 151645,
8
+ "<|im_start|>": 151644,
9
+ "<|image|>": 152068,
10
+ "|<EXTRA_TOKENS_0>|": 151646,
11
+ "|<EXTRA_TOKENS_100>|": 151746,
12
+ "|<EXTRA_TOKENS_101>|": 151747,
13
+ "|<EXTRA_TOKENS_102>|": 151748,
14
+ "|<EXTRA_TOKENS_103>|": 151749,
15
+ "|<EXTRA_TOKENS_104>|": 151750,
16
+ "|<EXTRA_TOKENS_105>|": 151751,
17
+ "|<EXTRA_TOKENS_106>|": 151752,
18
+ "|<EXTRA_TOKENS_107>|": 151753,
19
+ "|<EXTRA_TOKENS_108>|": 151754,
20
+ "|<EXTRA_TOKENS_109>|": 151755,
21
+ "|<EXTRA_TOKENS_10>|": 151656,
22
+ "|<EXTRA_TOKENS_110>|": 151756,
23
+ "|<EXTRA_TOKENS_111>|": 151757,
24
+ "|<EXTRA_TOKENS_112>|": 151758,
25
+ "|<EXTRA_TOKENS_113>|": 151759,
26
+ "|<EXTRA_TOKENS_114>|": 151760,
27
+ "|<EXTRA_TOKENS_115>|": 151761,
28
+ "|<EXTRA_TOKENS_116>|": 151762,
29
+ "|<EXTRA_TOKENS_117>|": 151763,
30
+ "|<EXTRA_TOKENS_118>|": 151764,
31
+ "|<EXTRA_TOKENS_119>|": 151765,
32
+ "|<EXTRA_TOKENS_11>|": 151657,
33
+ "|<EXTRA_TOKENS_120>|": 151766,
34
+ "|<EXTRA_TOKENS_121>|": 151767,
35
+ "|<EXTRA_TOKENS_122>|": 151768,
36
+ "|<EXTRA_TOKENS_123>|": 151769,
37
+ "|<EXTRA_TOKENS_124>|": 151770,
38
+ "|<EXTRA_TOKENS_125>|": 151771,
39
+ "|<EXTRA_TOKENS_126>|": 151772,
40
+ "|<EXTRA_TOKENS_127>|": 151773,
41
+ "|<EXTRA_TOKENS_128>|": 151774,
42
+ "|<EXTRA_TOKENS_129>|": 151775,
43
+ "|<EXTRA_TOKENS_12>|": 151658,
44
+ "|<EXTRA_TOKENS_130>|": 151776,
45
+ "|<EXTRA_TOKENS_131>|": 151777,
46
+ "|<EXTRA_TOKENS_132>|": 151778,
47
+ "|<EXTRA_TOKENS_133>|": 151779,
48
+ "|<EXTRA_TOKENS_134>|": 151780,
49
+ "|<EXTRA_TOKENS_135>|": 151781,
50
+ "|<EXTRA_TOKENS_136>|": 151782,
51
+ "|<EXTRA_TOKENS_137>|": 151783,
52
+ "|<EXTRA_TOKENS_138>|": 151784,
53
+ "|<EXTRA_TOKENS_139>|": 151785,
54
+ "|<EXTRA_TOKENS_13>|": 151659,
55
+ "|<EXTRA_TOKENS_140>|": 151786,
56
+ "|<EXTRA_TOKENS_141>|": 151787,
57
+ "|<EXTRA_TOKENS_142>|": 151788,
58
+ "|<EXTRA_TOKENS_143>|": 151789,
59
+ "|<EXTRA_TOKENS_144>|": 151790,
60
+ "|<EXTRA_TOKENS_145>|": 151791,
61
+ "|<EXTRA_TOKENS_146>|": 151792,
62
+ "|<EXTRA_TOKENS_147>|": 151793,
63
+ "|<EXTRA_TOKENS_148>|": 151794,
64
+ "|<EXTRA_TOKENS_149>|": 151795,
65
+ "|<EXTRA_TOKENS_14>|": 151660,
66
+ "|<EXTRA_TOKENS_150>|": 151796,
67
+ "|<EXTRA_TOKENS_151>|": 151797,
68
+ "|<EXTRA_TOKENS_152>|": 151798,
69
+ "|<EXTRA_TOKENS_153>|": 151799,
70
+ "|<EXTRA_TOKENS_154>|": 151800,
71
+ "|<EXTRA_TOKENS_155>|": 151801,
72
+ "|<EXTRA_TOKENS_156>|": 151802,
73
+ "|<EXTRA_TOKENS_157>|": 151803,
74
+ "|<EXTRA_TOKENS_158>|": 151804,
75
+ "|<EXTRA_TOKENS_159>|": 151805,
76
+ "|<EXTRA_TOKENS_15>|": 151661,
77
+ "|<EXTRA_TOKENS_160>|": 151806,
78
+ "|<EXTRA_TOKENS_161>|": 151807,
79
+ "|<EXTRA_TOKENS_162>|": 151808,
80
+ "|<EXTRA_TOKENS_163>|": 151809,
81
+ "|<EXTRA_TOKENS_164>|": 151810,
82
+ "|<EXTRA_TOKENS_165>|": 151811,
83
+ "|<EXTRA_TOKENS_166>|": 151812,
84
+ "|<EXTRA_TOKENS_167>|": 151813,
85
+ "|<EXTRA_TOKENS_168>|": 151814,
86
+ "|<EXTRA_TOKENS_169>|": 151815,
87
+ "|<EXTRA_TOKENS_16>|": 151662,
88
+ "|<EXTRA_TOKENS_170>|": 151816,
89
+ "|<EXTRA_TOKENS_171>|": 151817,
90
+ "|<EXTRA_TOKENS_172>|": 151818,
91
+ "|<EXTRA_TOKENS_173>|": 151819,
92
+ "|<EXTRA_TOKENS_174>|": 151820,
93
+ "|<EXTRA_TOKENS_175>|": 151821,
94
+ "|<EXTRA_TOKENS_176>|": 151822,
95
+ "|<EXTRA_TOKENS_177>|": 151823,
96
+ "|<EXTRA_TOKENS_178>|": 151824,
97
+ "|<EXTRA_TOKENS_179>|": 151825,
98
+ "|<EXTRA_TOKENS_17>|": 151663,
99
+ "|<EXTRA_TOKENS_180>|": 151826,
100
+ "|<EXTRA_TOKENS_181>|": 151827,
101
+ "|<EXTRA_TOKENS_182>|": 151828,
102
+ "|<EXTRA_TOKENS_183>|": 151829,
103
+ "|<EXTRA_TOKENS_184>|": 151830,
104
+ "|<EXTRA_TOKENS_185>|": 151831,
105
+ "|<EXTRA_TOKENS_186>|": 151832,
106
+ "|<EXTRA_TOKENS_187>|": 151833,
107
+ "|<EXTRA_TOKENS_188>|": 151834,
108
+ "|<EXTRA_TOKENS_189>|": 151835,
109
+ "|<EXTRA_TOKENS_18>|": 151664,
110
+ "|<EXTRA_TOKENS_190>|": 151836,
111
+ "|<EXTRA_TOKENS_191>|": 151837,
112
+ "|<EXTRA_TOKENS_192>|": 151838,
113
+ "|<EXTRA_TOKENS_193>|": 151839,
114
+ "|<EXTRA_TOKENS_194>|": 151840,
115
+ "|<EXTRA_TOKENS_195>|": 151841,
116
+ "|<EXTRA_TOKENS_196>|": 151842,
117
+ "|<EXTRA_TOKENS_197>|": 151843,
118
+ "|<EXTRA_TOKENS_198>|": 151844,
119
+ "|<EXTRA_TOKENS_199>|": 151845,
120
+ "|<EXTRA_TOKENS_19>|": 151665,
121
+ "|<EXTRA_TOKENS_1>|": 151647,
122
+ "|<EXTRA_TOKENS_200>|": 151846,
123
+ "|<EXTRA_TOKENS_201>|": 151847,
124
+ "|<EXTRA_TOKENS_202>|": 151848,
125
+ "|<EXTRA_TOKENS_203>|": 151849,
126
+ "|<EXTRA_TOKENS_204>|": 151850,
127
+ "|<EXTRA_TOKENS_205>|": 151851,
128
+ "|<EXTRA_TOKENS_206>|": 151852,
129
+ "|<EXTRA_TOKENS_207>|": 151853,
130
+ "|<EXTRA_TOKENS_208>|": 151854,
131
+ "|<EXTRA_TOKENS_209>|": 151855,
132
+ "|<EXTRA_TOKENS_20>|": 151666,
133
+ "|<EXTRA_TOKENS_210>|": 151856,
134
+ "|<EXTRA_TOKENS_211>|": 151857,
135
+ "|<EXTRA_TOKENS_212>|": 151858,
136
+ "|<EXTRA_TOKENS_213>|": 151859,
137
+ "|<EXTRA_TOKENS_214>|": 151860,
138
+ "|<EXTRA_TOKENS_215>|": 151861,
139
+ "|<EXTRA_TOKENS_216>|": 151862,
140
+ "|<EXTRA_TOKENS_217>|": 151863,
141
+ "|<EXTRA_TOKENS_218>|": 151864,
142
+ "|<EXTRA_TOKENS_219>|": 151865,
143
+ "|<EXTRA_TOKENS_21>|": 151667,
144
+ "|<EXTRA_TOKENS_220>|": 151866,
145
+ "|<EXTRA_TOKENS_221>|": 151867,
146
+ "|<EXTRA_TOKENS_222>|": 151868,
147
+ "|<EXTRA_TOKENS_223>|": 151869,
148
+ "|<EXTRA_TOKENS_224>|": 151870,
149
+ "|<EXTRA_TOKENS_225>|": 151871,
150
+ "|<EXTRA_TOKENS_226>|": 151872,
151
+ "|<EXTRA_TOKENS_227>|": 151873,
152
+ "|<EXTRA_TOKENS_228>|": 151874,
153
+ "|<EXTRA_TOKENS_229>|": 151875,
154
+ "|<EXTRA_TOKENS_22>|": 151668,
155
+ "|<EXTRA_TOKENS_230>|": 151876,
156
+ "|<EXTRA_TOKENS_231>|": 151877,
157
+ "|<EXTRA_TOKENS_232>|": 151878,
158
+ "|<EXTRA_TOKENS_233>|": 151879,
159
+ "|<EXTRA_TOKENS_234>|": 151880,
160
+ "|<EXTRA_TOKENS_235>|": 151881,
161
+ "|<EXTRA_TOKENS_236>|": 151882,
162
+ "|<EXTRA_TOKENS_237>|": 151883,
163
+ "|<EXTRA_TOKENS_238>|": 151884,
164
+ "|<EXTRA_TOKENS_239>|": 151885,
165
+ "|<EXTRA_TOKENS_23>|": 151669,
166
+ "|<EXTRA_TOKENS_240>|": 151886,
167
+ "|<EXTRA_TOKENS_241>|": 151887,
168
+ "|<EXTRA_TOKENS_242>|": 151888,
169
+ "|<EXTRA_TOKENS_243>|": 151889,
170
+ "|<EXTRA_TOKENS_244>|": 151890,
171
+ "|<EXTRA_TOKENS_245>|": 151891,
172
+ "|<EXTRA_TOKENS_246>|": 151892,
173
+ "|<EXTRA_TOKENS_247>|": 151893,
174
+ "|<EXTRA_TOKENS_248>|": 151894,
175
+ "|<EXTRA_TOKENS_249>|": 151895,
176
+ "|<EXTRA_TOKENS_24>|": 151670,
177
+ "|<EXTRA_TOKENS_250>|": 151896,
178
+ "|<EXTRA_TOKENS_251>|": 151897,
179
+ "|<EXTRA_TOKENS_252>|": 151898,
180
+ "|<EXTRA_TOKENS_253>|": 151899,
181
+ "|<EXTRA_TOKENS_254>|": 151900,
182
+ "|<EXTRA_TOKENS_255>|": 151901,
183
+ "|<EXTRA_TOKENS_256>|": 151902,
184
+ "|<EXTRA_TOKENS_257>|": 151903,
185
+ "|<EXTRA_TOKENS_258>|": 151904,
186
+ "|<EXTRA_TOKENS_259>|": 151905,
187
+ "|<EXTRA_TOKENS_25>|": 151671,
188
+ "|<EXTRA_TOKENS_260>|": 151906,
189
+ "|<EXTRA_TOKENS_261>|": 151907,
190
+ "|<EXTRA_TOKENS_262>|": 151908,
191
+ "|<EXTRA_TOKENS_263>|": 151909,
192
+ "|<EXTRA_TOKENS_264>|": 151910,
193
+ "|<EXTRA_TOKENS_265>|": 151911,
194
+ "|<EXTRA_TOKENS_266>|": 151912,
195
+ "|<EXTRA_TOKENS_267>|": 151913,
196
+ "|<EXTRA_TOKENS_268>|": 151914,
197
+ "|<EXTRA_TOKENS_269>|": 151915,
198
+ "|<EXTRA_TOKENS_26>|": 151672,
199
+ "|<EXTRA_TOKENS_270>|": 151916,
200
+ "|<EXTRA_TOKENS_271>|": 151917,
201
+ "|<EXTRA_TOKENS_272>|": 151918,
202
+ "|<EXTRA_TOKENS_273>|": 151919,
203
+ "|<EXTRA_TOKENS_274>|": 151920,
204
+ "|<EXTRA_TOKENS_275>|": 151921,
205
+ "|<EXTRA_TOKENS_276>|": 151922,
206
+ "|<EXTRA_TOKENS_277>|": 151923,
207
+ "|<EXTRA_TOKENS_278>|": 151924,
208
+ "|<EXTRA_TOKENS_279>|": 151925,
209
+ "|<EXTRA_TOKENS_27>|": 151673,
210
+ "|<EXTRA_TOKENS_280>|": 151926,
211
+ "|<EXTRA_TOKENS_281>|": 151927,
212
+ "|<EXTRA_TOKENS_282>|": 151928,
213
+ "|<EXTRA_TOKENS_283>|": 151929,
214
+ "|<EXTRA_TOKENS_284>|": 151930,
215
+ "|<EXTRA_TOKENS_285>|": 151931,
216
+ "|<EXTRA_TOKENS_286>|": 151932,
217
+ "|<EXTRA_TOKENS_287>|": 151933,
218
+ "|<EXTRA_TOKENS_288>|": 151934,
219
+ "|<EXTRA_TOKENS_289>|": 151935,
220
+ "|<EXTRA_TOKENS_28>|": 151674,
221
+ "|<EXTRA_TOKENS_290>|": 151936,
222
+ "|<EXTRA_TOKENS_291>|": 151937,
223
+ "|<EXTRA_TOKENS_292>|": 151938,
224
+ "|<EXTRA_TOKENS_293>|": 151939,
225
+ "|<EXTRA_TOKENS_294>|": 151940,
226
+ "|<EXTRA_TOKENS_295>|": 151941,
227
+ "|<EXTRA_TOKENS_296>|": 151942,
228
+ "|<EXTRA_TOKENS_297>|": 151943,
229
+ "|<EXTRA_TOKENS_298>|": 151944,
230
+ "|<EXTRA_TOKENS_299>|": 151945,
231
+ "|<EXTRA_TOKENS_29>|": 151675,
232
+ "|<EXTRA_TOKENS_2>|": 151648,
233
+ "|<EXTRA_TOKENS_300>|": 151946,
234
+ "|<EXTRA_TOKENS_301>|": 151947,
235
+ "|<EXTRA_TOKENS_302>|": 151948,
236
+ "|<EXTRA_TOKENS_303>|": 151949,
237
+ "|<EXTRA_TOKENS_304>|": 151950,
238
+ "|<EXTRA_TOKENS_305>|": 151951,
239
+ "|<EXTRA_TOKENS_306>|": 151952,
240
+ "|<EXTRA_TOKENS_307>|": 151953,
241
+ "|<EXTRA_TOKENS_308>|": 151954,
242
+ "|<EXTRA_TOKENS_309>|": 151955,
243
+ "|<EXTRA_TOKENS_30>|": 151676,
244
+ "|<EXTRA_TOKENS_310>|": 151956,
245
+ "|<EXTRA_TOKENS_311>|": 151957,
246
+ "|<EXTRA_TOKENS_312>|": 151958,
247
+ "|<EXTRA_TOKENS_313>|": 151959,
248
+ "|<EXTRA_TOKENS_314>|": 151960,
249
+ "|<EXTRA_TOKENS_315>|": 151961,
250
+ "|<EXTRA_TOKENS_316>|": 151962,
251
+ "|<EXTRA_TOKENS_317>|": 151963,
252
+ "|<EXTRA_TOKENS_318>|": 151964,
253
+ "|<EXTRA_TOKENS_319>|": 151965,
254
+ "|<EXTRA_TOKENS_31>|": 151677,
255
+ "|<EXTRA_TOKENS_320>|": 151966,
256
+ "|<EXTRA_TOKENS_321>|": 151967,
257
+ "|<EXTRA_TOKENS_322>|": 151968,
258
+ "|<EXTRA_TOKENS_323>|": 151969,
259
+ "|<EXTRA_TOKENS_324>|": 151970,
260
+ "|<EXTRA_TOKENS_325>|": 151971,
261
+ "|<EXTRA_TOKENS_326>|": 151972,
262
+ "|<EXTRA_TOKENS_327>|": 151973,
263
+ "|<EXTRA_TOKENS_328>|": 151974,
264
+ "|<EXTRA_TOKENS_329>|": 151975,
265
+ "|<EXTRA_TOKENS_32>|": 151678,
266
+ "|<EXTRA_TOKENS_330>|": 151976,
267
+ "|<EXTRA_TOKENS_331>|": 151977,
268
+ "|<EXTRA_TOKENS_332>|": 151978,
269
+ "|<EXTRA_TOKENS_333>|": 151979,
270
+ "|<EXTRA_TOKENS_334>|": 151980,
271
+ "|<EXTRA_TOKENS_335>|": 151981,
272
+ "|<EXTRA_TOKENS_336>|": 151982,
273
+ "|<EXTRA_TOKENS_337>|": 151983,
274
+ "|<EXTRA_TOKENS_338>|": 151984,
275
+ "|<EXTRA_TOKENS_339>|": 151985,
276
+ "|<EXTRA_TOKENS_33>|": 151679,
277
+ "|<EXTRA_TOKENS_340>|": 151986,
278
+ "|<EXTRA_TOKENS_341>|": 151987,
279
+ "|<EXTRA_TOKENS_342>|": 151988,
280
+ "|<EXTRA_TOKENS_343>|": 151989,
281
+ "|<EXTRA_TOKENS_344>|": 151990,
282
+ "|<EXTRA_TOKENS_345>|": 151991,
283
+ "|<EXTRA_TOKENS_346>|": 151992,
284
+ "|<EXTRA_TOKENS_347>|": 151993,
285
+ "|<EXTRA_TOKENS_348>|": 151994,
286
+ "|<EXTRA_TOKENS_349>|": 151995,
287
+ "|<EXTRA_TOKENS_34>|": 151680,
288
+ "|<EXTRA_TOKENS_350>|": 151996,
289
+ "|<EXTRA_TOKENS_351>|": 151997,
290
+ "|<EXTRA_TOKENS_352>|": 151998,
291
+ "|<EXTRA_TOKENS_353>|": 151999,
292
+ "|<EXTRA_TOKENS_354>|": 152000,
293
+ "|<EXTRA_TOKENS_355>|": 152001,
294
+ "|<EXTRA_TOKENS_356>|": 152002,
295
+ "|<EXTRA_TOKENS_357>|": 152003,
296
+ "|<EXTRA_TOKENS_358>|": 152004,
297
+ "|<EXTRA_TOKENS_359>|": 152005,
298
+ "|<EXTRA_TOKENS_35>|": 151681,
299
+ "|<EXTRA_TOKENS_360>|": 152006,
300
+ "|<EXTRA_TOKENS_361>|": 152007,
301
+ "|<EXTRA_TOKENS_362>|": 152008,
302
+ "|<EXTRA_TOKENS_363>|": 152009,
303
+ "|<EXTRA_TOKENS_364>|": 152010,
304
+ "|<EXTRA_TOKENS_365>|": 152011,
305
+ "|<EXTRA_TOKENS_366>|": 152012,
306
+ "|<EXTRA_TOKENS_367>|": 152013,
307
+ "|<EXTRA_TOKENS_368>|": 152014,
308
+ "|<EXTRA_TOKENS_369>|": 152015,
309
+ "|<EXTRA_TOKENS_36>|": 151682,
310
+ "|<EXTRA_TOKENS_370>|": 152016,
311
+ "|<EXTRA_TOKENS_371>|": 152017,
312
+ "|<EXTRA_TOKENS_372>|": 152018,
313
+ "|<EXTRA_TOKENS_373>|": 152019,
314
+ "|<EXTRA_TOKENS_374>|": 152020,
315
+ "|<EXTRA_TOKENS_375>|": 152021,
316
+ "|<EXTRA_TOKENS_376>|": 152022,
317
+ "|<EXTRA_TOKENS_377>|": 152023,
318
+ "|<EXTRA_TOKENS_378>|": 152024,
319
+ "|<EXTRA_TOKENS_379>|": 152025,
320
+ "|<EXTRA_TOKENS_37>|": 151683,
321
+ "|<EXTRA_TOKENS_380>|": 152026,
322
+ "|<EXTRA_TOKENS_381>|": 152027,
323
+ "|<EXTRA_TOKENS_382>|": 152028,
324
+ "|<EXTRA_TOKENS_383>|": 152029,
325
+ "|<EXTRA_TOKENS_384>|": 152030,
326
+ "|<EXTRA_TOKENS_385>|": 152031,
327
+ "|<EXTRA_TOKENS_386>|": 152032,
328
+ "|<EXTRA_TOKENS_387>|": 152033,
329
+ "|<EXTRA_TOKENS_388>|": 152034,
330
+ "|<EXTRA_TOKENS_389>|": 152035,
331
+ "|<EXTRA_TOKENS_38>|": 151684,
332
+ "|<EXTRA_TOKENS_390>|": 152036,
333
+ "|<EXTRA_TOKENS_391>|": 152037,
334
+ "|<EXTRA_TOKENS_392>|": 152038,
335
+ "|<EXTRA_TOKENS_393>|": 152039,
336
+ "|<EXTRA_TOKENS_394>|": 152040,
337
+ "|<EXTRA_TOKENS_395>|": 152041,
338
+ "|<EXTRA_TOKENS_396>|": 152042,
339
+ "|<EXTRA_TOKENS_397>|": 152043,
340
+ "|<EXTRA_TOKENS_398>|": 152044,
341
+ "|<EXTRA_TOKENS_399>|": 152045,
342
+ "|<EXTRA_TOKENS_39>|": 151685,
343
+ "|<EXTRA_TOKENS_3>|": 151649,
344
+ "|<EXTRA_TOKENS_400>|": 152046,
345
+ "|<EXTRA_TOKENS_401>|": 152047,
346
+ "|<EXTRA_TOKENS_402>|": 152048,
347
+ "|<EXTRA_TOKENS_403>|": 152049,
348
+ "|<EXTRA_TOKENS_404>|": 152050,
349
+ "|<EXTRA_TOKENS_405>|": 152051,
350
+ "|<EXTRA_TOKENS_406>|": 152052,
351
+ "|<EXTRA_TOKENS_407>|": 152053,
352
+ "|<EXTRA_TOKENS_408>|": 152054,
353
+ "|<EXTRA_TOKENS_409>|": 152055,
354
+ "|<EXTRA_TOKENS_40>|": 151686,
355
+ "|<EXTRA_TOKENS_410>|": 152056,
356
+ "|<EXTRA_TOKENS_411>|": 152057,
357
+ "|<EXTRA_TOKENS_412>|": 152058,
358
+ "|<EXTRA_TOKENS_413>|": 152059,
359
+ "|<EXTRA_TOKENS_414>|": 152060,
360
+ "|<EXTRA_TOKENS_415>|": 152061,
361
+ "|<EXTRA_TOKENS_416>|": 152062,
362
+ "|<EXTRA_TOKENS_417>|": 152063,
363
+ "|<EXTRA_TOKENS_41>|": 151687,
364
+ "|<EXTRA_TOKENS_42>|": 151688,
365
+ "|<EXTRA_TOKENS_43>|": 151689,
366
+ "|<EXTRA_TOKENS_44>|": 151690,
367
+ "|<EXTRA_TOKENS_45>|": 151691,
368
+ "|<EXTRA_TOKENS_46>|": 151692,
369
+ "|<EXTRA_TOKENS_47>|": 151693,
370
+ "|<EXTRA_TOKENS_48>|": 151694,
371
+ "|<EXTRA_TOKENS_49>|": 151695,
372
+ "|<EXTRA_TOKENS_4>|": 151650,
373
+ "|<EXTRA_TOKENS_50>|": 151696,
374
+ "|<EXTRA_TOKENS_51>|": 151697,
375
+ "|<EXTRA_TOKENS_52>|": 151698,
376
+ "|<EXTRA_TOKENS_53>|": 151699,
377
+ "|<EXTRA_TOKENS_54>|": 151700,
378
+ "|<EXTRA_TOKENS_55>|": 151701,
379
+ "|<EXTRA_TOKENS_56>|": 151702,
380
+ "|<EXTRA_TOKENS_57>|": 151703,
381
+ "|<EXTRA_TOKENS_58>|": 151704,
382
+ "|<EXTRA_TOKENS_59>|": 151705,
383
+ "|<EXTRA_TOKENS_5>|": 151651,
384
+ "|<EXTRA_TOKENS_60>|": 151706,
385
+ "|<EXTRA_TOKENS_61>|": 151707,
386
+ "|<EXTRA_TOKENS_62>|": 151708,
387
+ "|<EXTRA_TOKENS_63>|": 151709,
388
+ "|<EXTRA_TOKENS_64>|": 151710,
389
+ "|<EXTRA_TOKENS_65>|": 151711,
390
+ "|<EXTRA_TOKENS_66>|": 151712,
391
+ "|<EXTRA_TOKENS_67>|": 151713,
392
+ "|<EXTRA_TOKENS_68>|": 151714,
393
+ "|<EXTRA_TOKENS_69>|": 151715,
394
+ "|<EXTRA_TOKENS_6>|": 151652,
395
+ "|<EXTRA_TOKENS_70>|": 151716,
396
+ "|<EXTRA_TOKENS_71>|": 151717,
397
+ "|<EXTRA_TOKENS_72>|": 151718,
398
+ "|<EXTRA_TOKENS_73>|": 151719,
399
+ "|<EXTRA_TOKENS_74>|": 151720,
400
+ "|<EXTRA_TOKENS_75>|": 151721,
401
+ "|<EXTRA_TOKENS_76>|": 151722,
402
+ "|<EXTRA_TOKENS_77>|": 151723,
403
+ "|<EXTRA_TOKENS_78>|": 151724,
404
+ "|<EXTRA_TOKENS_79>|": 151725,
405
+ "|<EXTRA_TOKENS_7>|": 151653,
406
+ "|<EXTRA_TOKENS_80>|": 151726,
407
+ "|<EXTRA_TOKENS_81>|": 151727,
408
+ "|<EXTRA_TOKENS_82>|": 151728,
409
+ "|<EXTRA_TOKENS_83>|": 151729,
410
+ "|<EXTRA_TOKENS_84>|": 151730,
411
+ "|<EXTRA_TOKENS_85>|": 151731,
412
+ "|<EXTRA_TOKENS_86>|": 151732,
413
+ "|<EXTRA_TOKENS_87>|": 151733,
414
+ "|<EXTRA_TOKENS_88>|": 151734,
415
+ "|<EXTRA_TOKENS_89>|": 151735,
416
+ "|<EXTRA_TOKENS_8>|": 151654,
417
+ "|<EXTRA_TOKENS_90>|": 151736,
418
+ "|<EXTRA_TOKENS_91>|": 151737,
419
+ "|<EXTRA_TOKENS_92>|": 151738,
420
+ "|<EXTRA_TOKENS_93>|": 151739,
421
+ "|<EXTRA_TOKENS_94>|": 151740,
422
+ "|<EXTRA_TOKENS_95>|": 151741,
423
+ "|<EXTRA_TOKENS_96>|": 151742,
424
+ "|<EXTRA_TOKENS_97>|": 151743,
425
+ "|<EXTRA_TOKENS_98>|": 151744,
426
+ "|<EXTRA_TOKENS_99>|": 151745,
427
+ "|<EXTRA_TOKENS_9>|": 151655
428
+ }
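The map above extends the Qwen2 tokenizer with Molmo's image special tokens (`<im_start>`, `<im_end>`, `<im_patch>`, `<im_col>`, `<|image|>`) at ids 152064-152068. A minimal sketch, assuming the processor from the README Quick Start, of confirming that the tokenizer resolves them to the same ids:

```python
# Minimal sketch: check the image special-token ids against added_tokens.json.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    'allenai/Molmo-72B-0924',
    trust_remote_code=True,
)
for tok in ["<im_start>", "<im_end>", "<im_patch>", "<im_col>", "<|image|>"]:
    print(tok, processor.tokenizer.convert_tokens_to_ids(tok))
# expected (per added_tokens.json): 152064, 152065, 152066, 152067, 152068
```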
config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "architectures": [
+     "MolmoForCausalLM"
+   ],
+   "attention_layer_norm": false,
+   "auto_map": {
+     "AutoConfig": "config_molmo.MolmoConfig",
+     "AutoModelForCausalLM": "modeling_molmo.MolmoForCausalLM"
+   },
+   "clip_qkv": null,
+   "embedding_size": 152064,
+   "hidden_size": 8192,
+   "initializer_range": 0.02,
+   "intermediate_size": 59136,
+   "layer_norm_eps": 1e-05,
+   "layer_norm_type": "rms",
+   "max_position_embeddings": 4096,
+   "model_type": "molmo",
+   "norm_after": false,
+   "num_attention_heads": 64,
+   "num_hidden_layers": 80,
+   "num_key_value_heads": 8,
+   "qkv_bias": true,
+   "rope_theta": 1000000.0,
+   "tie_word_embeddings": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.43.3",
+   "use_cache": true,
+   "use_position_ids": true,
+   "vocab_size": 152064,
+   "weight_tying": false
+ }
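A quick sanity check of the attention shapes implied by this config, assuming the grouped-query attention layout of the Qwen2-style decoder it mirrors:

```python
# Derived attention dimensions from config.json (arithmetic only).
hidden_size = 8192
num_attention_heads = 64
num_key_value_heads = 8

head_dim = hidden_size // num_attention_heads                      # 8192 / 64 = 128
queries_per_kv_head = num_attention_heads // num_key_value_heads   # 64 / 8 = 8

print(head_dim, queries_per_kv_head)  # 128 8
```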
config_molmo.py ADDED
@@ -0,0 +1,60 @@
+ from typing import List
+
+ from transformers import PretrainedConfig, AutoTokenizer
+
+
+ class MolmoConfig(PretrainedConfig):
+     model_type = "molmo"
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         vocab_size=50304,
+         embedding_size=50304,
+         hidden_size=4096,
+         intermediate_size=11008,
+         num_hidden_layers=32,
+         num_attention_heads=32,
+         num_key_value_heads=None,
+         max_position_embeddings=2048,
+         initializer_range=0.02,
+         use_cache=True,
+         layer_norm_eps: float = 1e-5,
+         rope_theta=10000.0,
+         clip_qkv=None,
+         qkv_bias: bool = False,
+         weight_tying: bool = False,
+         use_position_ids: bool = True,
+         tie_word_embeddings: bool = True,
+         attention_layer_norm: bool = False,
+         norm_after: bool = False,
+         layer_norm_type: str = "rms",
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.embedding_size = embedding_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.layer_norm_eps = layer_norm_eps
+         self.weight_tying = weight_tying
+         self.use_position_ids = use_position_ids
+         self.attention_layer_norm = attention_layer_norm
+         self.num_key_value_heads = num_key_value_heads
+         self.initializer_range = initializer_range
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+         self.clip_qkv = clip_qkv
+         self.qkv_bias = qkv_bias
+         self.norm_after = norm_after
+         self.tie_word_embeddings = tie_word_embeddings
+         self.layer_norm_type = layer_norm_type
+
+         super().__init__(
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+
+ MolmoConfig.register_for_auto_class()
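As a usage sketch (assumption: `config_molmo.py` is importable from the working directory; in practice the `auto_map` entry in `config.json` lets `AutoConfig.from_pretrained(..., trust_remote_code=True)` do this for you), the class can be instantiated directly with the values from `config.json` above:

```python
# Illustrative only: build a MolmoConfig with the hyperparameters from config.json.
from config_molmo import MolmoConfig

config = MolmoConfig(
    vocab_size=152064,
    embedding_size=152064,
    hidden_size=8192,
    intermediate_size=59136,
    num_hidden_layers=80,
    num_attention_heads=64,
    num_key_value_heads=8,
    max_position_embeddings=4096,
    layer_norm_eps=1e-05,
    rope_theta=1000000.0,
    qkv_bias=True,
    tie_word_embeddings=False,
    layer_norm_type="rms",
)
print(config.model_type)  # "molmo"
```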
generation_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "_from_model_config": true,
+   "transformers_version": "4.43.3"
+ }
image_preprocessing_molmo.py ADDED
@@ -0,0 +1,546 @@
1
+ """Image processor class for Molmo"""
2
+ from typing import List, Optional, Union, Mapping
3
+
4
+ import numpy as np
5
+ import einops
6
+ import torch
7
+ import torchvision.transforms
8
+ from torchvision.transforms import InterpolationMode
9
+ from torchvision.transforms.functional import convert_image_dtype
10
+
11
+ from transformers.image_utils import (
12
+ OPENAI_CLIP_MEAN,
13
+ OPENAI_CLIP_STD,
14
+ ImageInput,
15
+ is_valid_image,
16
+ )
17
+ from transformers.processing_utils import ImagesKwargs
18
+ from transformers.image_processing_utils import BaseImageProcessor
19
+ from transformers.utils import logging
20
+
21
+
22
+ logger = logging.get_logger(__name__)
23
+
24
+
25
+ def pad_to_bounding_box(
26
+ image, offset_height, offset_width, target_height,
27
+ target_width, value=0
28
+ ):
29
+ height, width = image.shape[:2]
30
+ after_padding_width = target_width - offset_width - width
31
+ after_padding_height = target_height - offset_height - height
32
+ return np.pad(image, [
33
+ [offset_height, after_padding_height],
34
+ [offset_width, after_padding_width],
35
+ [0, 0]
36
+ ], constant_values=value)
37
+
38
+
39
+ def normalize_image(image, offset, scale):
40
+ image -= np.array(offset, dtype=np.float32)[None, None, :]
41
+ image /= np.array(scale, dtype=np.float32)[None, None, :]
42
+ return image
43
+
44
+
45
+ def resize_and_pad(
46
+ image,
47
+ desired_output_size,
48
+ resize_method="torch-bilinear",
49
+ pad_value=0,
50
+ normalize=True,
51
+ image_mean=OPENAI_CLIP_MEAN,
52
+ image_std=OPENAI_CLIP_STD,
53
+ ):
54
+ desired_height, desired_width = desired_output_size
55
+ height, width = image.shape[:2]
56
+
57
+ # Cast into float32 since the training code did this in float32 and it (very rarely) effects
58
+ # the results after rounding.
59
+ image_scale_y = np.array(desired_height, np.float32) / np.array(height, np.float32)
60
+ image_scale_x = np.array(desired_width, np.float32) / np.array(width, np.float32)
61
+ image_scale = min(image_scale_x, image_scale_y)
62
+ scaled_height = int(np.array(height, np.float32) * image_scale)
63
+ scaled_width = int(np.array(width, np.float32) * image_scale)
64
+
65
+ if resize_method == "tensorflow":
66
+ # This how the original training code did resizing, it can produce slightly different
67
+ # results then using torch resize so we keep it just in case
68
+ import tensorflow as tf
69
+ image = tf.image.convert_image_dtype(tf.constant(image), dtype=tf.float32)
70
+ image = tf.image.resize(
71
+ image,
72
+ [scaled_height, scaled_width],
73
+ method=tf.image.ResizeMethod.BILINEAR,
74
+ antialias=True,
75
+ )
76
+ image = tf.clip_by_value(image, 0.0, 1.0)
77
+ image = image.numpy()
78
+ elif resize_method == "torch-bilinear":
79
+ image = torch.permute(torch.from_numpy(image), [2, 0, 1])
80
+ image = convert_image_dtype(image) # resize in float32 to match the training code
81
+ image = torchvision.transforms.Resize(
82
+ [scaled_height, scaled_width], InterpolationMode.BILINEAR, antialias=True
83
+ )(image)
84
+ image = torch.clip(image, 0.0, 1.0)
85
+ image = torch.permute(image, [1, 2, 0]).numpy()
86
+ else:
87
+ raise NotImplementedError(resize_method)
88
+
89
+ top_pad = (desired_height - scaled_height) // 2
90
+ left_pad = (desired_width - scaled_width) // 2
91
+ padding = [
92
+ [top_pad, desired_height - scaled_height - top_pad],
93
+ [left_pad, desired_width - scaled_width - left_pad],
94
+ [0, 0]
95
+ ]
96
+ image_mask = np.pad(np.ones_like(image[:, :, 0], dtype=bool), padding[:2])
97
+ image = np.pad(image, padding, constant_values=pad_value)
98
+ if normalize:
99
+ image = normalize_image(image, offset=image_mean, scale=image_std)
100
+ return image, image_mask
101
+
102
+
103
+ def select_tiling(h, w, patch_size, max_num_patches):
104
+ """Decide how best to divide in image of size [w, h] in up to max_num_patches of size patch_size"""
105
+ original_size = np.stack([h, w]) # [1, 2]
106
+ original_res = h * w
107
+ tilings = []
108
+ for i in range(1, max_num_patches+1):
109
+ for j in range(1, max_num_patches+1):
110
+ if i*j <= max_num_patches:
111
+ tilings.append((i, j))
112
+ # sort so argmin and argmax favour smaller tilings in the event of a tie
113
+ tilings.sort(key=lambda x: (x[0]*x[1], x[0]))
114
+ candidate_tilings = np.array(tilings, dtype=np.int32) # [n_resolutions, 2]
115
+ candidate_resolutions = candidate_tilings * patch_size # [n_resolutions, 2]
116
+
117
+ # How much we would need to scale the image to fit exactly in each tiling
118
+ original_size = np.stack([h, w], dtype=np.float32) # [1, 2]
119
+ required_scale_d = candidate_resolutions.astype(np.float32) / original_size
120
+ required_scale = np.min(required_scale_d, axis=-1, keepdims=True) # [n_resolutions, 1]
121
+ if np.all(required_scale < 1):
122
+ # We are forced to downscale, so try to minimize the amount of downscaling
123
+ ix = np.argmax(required_scale)
124
+ else:
125
+ # Pick the resolution that required the least upscaling so that it most closely fits the image
126
+ required_scale = np.where(required_scale < 1.0, 10e9, required_scale)
127
+ ix = np.argmin(required_scale)
128
+ return candidate_tilings[ix]
129
+
130
+
131
+ class MolmoImagesKwargs(ImagesKwargs, total=False):
132
+ max_crops: Optional[int]
133
+ overlap_margins: Optional[List[int]]
134
+ base_image_input_size: Optional[List[int]]
135
+ image_token_length_w: Optional[int]
136
+ image_token_length_h: Optional[int]
137
+ image_patch_size: Optional[int]
138
+ image_padding_mask: Optional[bool]
139
+
140
+
141
+ class MolmoImageProcessor(BaseImageProcessor):
142
+ """Preprocess images and multi-model inputs"""
143
+
144
+ def __init__(
145
+ self,
146
+ max_crops: int = 12,
147
+ overlap_margins: List[int] = (4, 4),
148
+ base_image_input_size: List[int] = (336, 336),
149
+ image_token_length_w: int = 12,
150
+ image_token_length_h: int = 12,
151
+ image_patch_size: int = 14,
152
+ image_padding_mask: bool = True,
153
+ do_normalize: bool = True,
154
+ image_mean: Optional[Union[float, List[float]]] = None,
155
+ image_std: Optional[Union[float, List[float]]] = None,
156
+ **kwargs,
157
+ ):
158
+ super().__init__(**kwargs)
159
+ self.max_crops = max_crops
160
+ self.overlap_margins = overlap_margins
161
+ self.base_image_input_size = base_image_input_size
162
+ self.image_token_length_w = image_token_length_w
163
+ self.image_token_length_h = image_token_length_h
164
+ self.image_patch_size = image_patch_size
165
+ self.image_padding_mask = image_padding_mask
166
+ self.do_normalize = do_normalize
167
+ self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
168
+ self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
169
+
170
+ def image_to_patches_and_tokens(
171
+ self,
172
+ image: ImageInput,
173
+ image_patch_token_id: int,
174
+ image_col_token_id: int,
175
+ image_start_token_id: int,
176
+ image_end_token_id: int,
177
+ max_crops: Optional[int] = None,
178
+ overlap_margins: Optional[List[int]] = None,
179
+ base_image_input_size: Optional[Union[int, List[int]]] = None,
180
+ image_token_length_w: Optional[int] = None,
181
+ image_token_length_h: Optional[int] = None,
182
+ image_patch_size: Optional[int] = None,
183
+ ):
184
+ if isinstance(base_image_input_size, int):
185
+ base_image_input_size = (base_image_input_size, base_image_input_size)
186
+
187
+ base_image_input_d = image_patch_size
188
+ tokens_per_image = image_token_length_w * image_token_length_h
189
+ image_base_patch_w = base_image_input_size[1] // base_image_input_d
190
+ image_base_patch_h = base_image_input_size[0] // base_image_input_d
191
+
192
+ original_image_h, original_image_w = image.shape[:2]
193
+ crop_size = base_image_input_size[0]
194
+
195
+ # Discard this many patches from the (left/top, right/bottom) of crops
196
+ left_margin, right_margin = overlap_margins
197
+ # left_margin, right_margin = 2, 2
198
+ assert left_margin % 2 == 0 # Required for compatibility with 2x2 pooling
199
+ total_margin_pixels = base_image_input_d*(right_margin + left_margin) # pixels removed per dim
200
+ crop_patches = base_image_input_size[0] // base_image_input_d # patches per crop dim
201
+ crop_window_patches = crop_patches - (right_margin + left_margin) # usable patches
202
+ crop_window_size = crop_window_patches * base_image_input_d
203
+ tiling = select_tiling(
204
+ original_image_h - total_margin_pixels,
205
+ original_image_w - total_margin_pixels,
206
+ crop_window_size,
207
+ max_crops
208
+ )
209
+ src, img_mask = resize_and_pad(
210
+ image,
211
+ [tiling[0]*crop_window_size+total_margin_pixels, tiling[1]*crop_window_size+total_margin_pixels]
212
+ )
213
+
214
+ # Now we have to split the image into crops, while keeping track of how each patch in the
215
+ # each crop should be ordered in the global image, this require a lot of tricky booking
216
+ n_crops = tiling[0] * tiling[1]
217
+ patches_arr = []
218
+ mask_arr = []
219
+ patch_ordering_arr = []
220
+
221
+ # We assume 2x2 pooling, but can allow padding the right/bottom with extra
222
+ # patches if the number of patches per side is not even
223
+ assert (crop_patches+1)//2 == image_token_length_h
224
+ assert (crop_patches+1)//2 == image_token_length_w
225
+ on = 0
226
+ on_patch = 0
227
+ for i in range(tiling[0]):
228
+ y0 = i*crop_window_size
229
+ if i == 0:
230
+ crop_y0 = 0
231
+ else:
232
+ crop_y0 = left_margin // 2
233
+
234
+ crop_h = image_base_patch_h - (right_margin + left_margin)
235
+ if i == 0:
236
+ crop_h += left_margin
237
+ if i == (tiling[0]-1):
238
+ crop_h += right_margin
239
+ for j in range(tiling[1]):
240
+ x0 = j*crop_window_size
241
+ if j == 0:
242
+ crop_x0 = 0
243
+ else:
244
+ crop_x0 = left_margin // 2
245
+
246
+ crop_w = image_base_patch_w - (right_margin + left_margin)
247
+ if j == 0:
248
+ crop_w += left_margin
249
+ if j == (tiling[1]-1):
250
+ crop_w += right_margin
251
+
252
+ pooled_w = (crop_w + 1) // 2
253
+ pooled_h = (crop_h + 1) // 2
254
+ patch_ordering_arr.append(
255
+ pad_to_bounding_box(
256
+ np.reshape(np.arange(on, on+pooled_h*pooled_w, dtype=np.int32), (pooled_h, pooled_w, 1)),
257
+ crop_y0, crop_x0, image_token_length_h, image_token_length_w, value=-1
258
+ )[:, :, 0]
259
+ )
260
+ patches_arr.append(src[y0:y0+crop_size, x0:x0+crop_size])
261
+ mask_arr.append(img_mask[y0:y0+crop_size, x0:x0+crop_size])
262
+
263
+ on += pooled_h*pooled_w
264
+ on_patch += 1
265
+ patches = np.stack(patches_arr)
266
+ patch_ordering = np.stack(patch_ordering_arr)
267
+ img_mask = np.stack(mask_arr)
268
+
269
+ # Switch to [n_crops, n_patches, pixels_per_patch] format
270
+ image_layout_impatch_w, image_layout_impatch_h = tiling[0], tiling[1]
271
+ patches = einops.rearrange(
272
+ patches, 'p (h dh) (w dw) c -> p (h w) (dh dw c)',
273
+ dh=base_image_input_d,
274
+ dw=base_image_input_d,
275
+ h=image_base_patch_h,
276
+ w=image_base_patch_w
277
+ )
278
+ img_mask = einops.rearrange(
279
+ img_mask, 'p (h dh) (w dw) -> p (h w) (dh dw)',
280
+ dh=base_image_input_d,
281
+ dw=base_image_input_d,
282
+ h=image_base_patch_h,
283
+ w=image_base_patch_w
284
+ )
285
+
286
+ img_mask = img_mask.astype(np.float32).mean(axis=-1)
287
+ patch_ordering = np.reshape(patch_ordering, [-1])
288
+ valid = patch_ordering >= 0
289
+
290
+ # Transpose order, to get left-to-right order instead of crop-by-crop order
291
+ patch_ordering_rh = np.reshape(
292
+ patch_ordering,
293
+ [tiling[0], tiling[1], image_token_length_h, image_token_length_w]
294
+ )
295
+ patch_ordering_rh = np.transpose(patch_ordering_rh, [0, 2, 1, 3])
296
+ patch_ordering_rh = np.reshape(patch_ordering_rh, [-1])
297
+
298
+ # The transpose will screw up which patches are masked, project the
299
+ # new order into sparse structure of `patch_ordering` to fix this
300
+ patch_ordering[valid] = patch_ordering_rh[patch_ordering_rh >= 0]
301
+
302
+ # Now build the output tokens
303
+ h = tiling[0] * crop_window_patches + (right_margin+left_margin)
304
+ w = tiling[1] * crop_window_patches + (right_margin+left_margin)
305
+ per_row = np.full(
306
+ ((w+1)//2,),
307
+ image_patch_token_id,
308
+ )
309
+ per_row = np.concatenate([per_row, [image_col_token_id]], 0)
310
+
311
+ joint = np.tile(per_row, [(h+1)//2])
312
+ joint = [
313
+ [image_start_token_id],
314
+ joint,
315
+ [image_end_token_id]
316
+ ]
317
+
318
+ # Finally do the same for the global image
319
+ resized, _ = resize_and_pad(image, base_image_input_size)
320
+ resized = einops.rearrange(
321
+ resized, '(h dh) (w dw) c -> (h w) (dh dw c)',
322
+ dh=base_image_input_d,
323
+ dw=base_image_input_d,
324
+ h=image_base_patch_h,
325
+ w=image_base_patch_w
326
+ )
327
+ patches = np.concatenate([np.expand_dims(resized, 0), patches], 0)
328
+
329
+ # Global image goes first, so the order of patches in previous crops gets increased
330
+ patch_ordering = np.where(
331
+ patch_ordering >= 0,
332
+ patch_ordering + tokens_per_image,
333
+ -1
334
+ )
335
+ patch_ordering = np.concatenate([np.arange(0, tokens_per_image), patch_ordering], 0)
336
+ per_row = np.full(
337
+ (image_token_length_w,),
338
+ image_patch_token_id,
339
+ )
340
+ per_row = np.concatenate([per_row, [image_col_token_id]], 0)
341
+ extra_tokens = np.tile(per_row, [image_token_length_h])
342
+ joint = [
343
+ [image_start_token_id],
344
+ extra_tokens,
345
+ [image_end_token_id],
346
+ ] + joint
347
+
348
+ joint = np.concatenate(joint, 0)
349
+ img_mask = np.pad(img_mask, [[0, 1], [0, 0]], constant_values=-1)
350
+ return patches, joint, patch_ordering, img_mask
351
+
352
+ def build_image_input_idx(
353
+ self,
354
+ image_tokens: np.ndarray,
355
+ patch_order: np.ndarray,
356
+ image_patch_token_id: int,
357
+ no_image: Optional[bool] = None,
358
+ image_token_length_w: Optional[int] = None,
359
+ image_token_length_h: Optional[int] = None,
360
+ ):
361
+ """Converts `patch_order` into a mapping of token_id -> patch_id"""
362
+
363
+ tokens_per_image = image_token_length_w * image_token_length_h
364
+ if no_image is not None and no_image:
365
+ return np.zeros((0, tokens_per_image), np.int32)
366
+
367
+ # Indices to insert the patches
368
+ image_input_idx = image_tokens == image_patch_token_id
369
+ image_input_idx = np.nonzero(image_input_idx)[0].astype(np.int32)
370
+
371
+ if patch_order is not None:
372
+ n_tokens = image_input_idx.shape[0]
373
+ patch_order = np.reshape(patch_order, [-1])
374
+ n_patches = patch_order.shape[0]
375
+
376
+ valid = patch_order >= 0
377
+ n_valid_patches = valid.sum()
378
+ assert len(image_input_idx) == n_valid_patches
379
+
380
+ sorted_patch_ixs = np.zeros([n_tokens], np.int32)
381
+ sorted_patch_ixs[patch_order[valid]] = np.arange(n_valid_patches, dtype=np.int32)
382
+
383
+ # Project the inverted mapping into same sparse structure
384
+ sorted_patch_ixs_ex = np.full(np.shape(patch_order), -1)
385
+ sorted_patch_ixs_ex[valid] = sorted_patch_ixs
386
+
387
+ # Do the gather and then re-masked outputs that were masked in `sorted_patch_ixs`
388
+ valid = (sorted_patch_ixs_ex >= 0).astype(np.int32)
389
+ image_input_idx = image_input_idx[sorted_patch_ixs_ex*valid]
390
+ image_input_idx = image_input_idx*valid - 100*(1 - valid)
391
+ image_input_idx = np.reshape(image_input_idx, [-1, tokens_per_image])
392
+ return image_input_idx
393
+
394
+ def preprocess(
395
+ self,
396
+ image: np.ndarray,
397
+ image_patch_token_id: int,
398
+ image_col_token_id: int,
399
+ image_start_token_id: int,
400
+ image_end_token_id: int,
401
+ max_crops: Optional[int] = None,
402
+ overlap_margins: Optional[List[int]] = None,
403
+ base_image_input_size: Optional[Union[int, List[int]]] = None,
404
+ image_token_length_w: Optional[int] = None,
405
+ image_token_length_h: Optional[int] = None,
406
+ image_patch_size: Optional[int] = None,
407
+ **kwargs,
408
+ ):
409
+ """Preprocesses an image
410
+
411
+ Returns:
412
+ crops: (n_crops, n_patches, patch_dim) individual crops, `n_crops` might
413
+ change between images but the other dimension are fixed
414
+ tokens: (n_tokens,) int32 tokens, pad tokens indicate where to insert the
415
+ patch features, might include other special tokens as well
416
+ image_idx: (n_crops, n_patches) index in `tokens` to put the patch features from the
417
+ crops after pooling, negative values indicates patches features to exclude
418
+ padding_mask: (n_crops, n_patches) what percent of each crop is padding, can be None
419
+ if the image mask is not being used.
420
+ """
421
+
422
+ max_crops = max_crops or self.max_crops
423
+ overlap_margins = overlap_margins or self.overlap_margins
424
+ base_image_input_size = base_image_input_size or self.base_image_input_size
425
+ image_token_length_w = image_token_length_w or self.image_token_length_w
426
+ image_token_length_h = image_token_length_h or self.image_token_length_h
427
+ image_patch_size = image_patch_size or self.image_patch_size
428
+
429
+ crops, image_tokens, patch_ordering, img_mask = self.image_to_patches_and_tokens(
430
+ image,
431
+ image_patch_token_id,
432
+ image_col_token_id,
433
+ image_start_token_id,
434
+ image_end_token_id,
435
+ max_crops,
436
+ overlap_margins,
437
+ base_image_input_size,
438
+ image_token_length_w,
439
+ image_token_length_h,
440
+ image_patch_size,
441
+ )
442
+ patch_idx = self.build_image_input_idx(
443
+ image_tokens,
444
+ patch_ordering,
445
+ image_patch_token_id,
446
+ image_token_length_w=image_token_length_w,
447
+ image_token_length_h=image_token_length_h,
448
+ )
449
+ return crops, image_tokens, patch_idx, img_mask
450
+
451
+ def multimodal_preprocess(
452
+ self,
453
+ images: np.ndarray,
454
+ tokens: List[int],
455
+ image_idx: np.ndarray,
456
+ sequence_length: int,
457
+ image_patch_token_id: int,
458
+ image_col_token_id: int,
459
+ image_start_token_id: int,
460
+ image_end_token_id: int,
461
+ **kwargs,
462
+ ):
463
+ """Merge images and text tokens into multi-modal features for the model
464
+
465
+ :param images: images to use as input
466
+ :param tokens: input text tokens
467
+ :param image_idx: where to insert the images into `tokens`
468
+ :params image_patch_token_id: id to use of tokens that will contain image features
469
+ :params image_col_token_id: token id for image column special tokens
470
+ :params image_start_token_id: token id for image start special tokens
471
+ :params image_end_token_id: token id for image end special tokens
472
+ :params kwargs: override preprocessor default args
473
+ """
474
+ max_total_crops = kwargs.get("max_crops") or self.max_crops
475
+ image_token_length_w = kwargs.get("image_token_length_w") or self.image_token_length_w
476
+ image_token_length_h = kwargs.get("image_token_length_h") or self.image_token_length_h
477
+ image_patch_size = kwargs.get("image_patch_size") or self.image_patch_size
478
+ base_image_input_size = kwargs.get("base_image_input_size") or self.base_image_input_size
479
+ image_num_patch = (
480
+ base_image_input_size[0] // image_patch_size,
481
+ base_image_input_size[1] // image_patch_size,
482
+ )
483
+ image_padding_mask = kwargs.get("image_padding_mask") or self.image_padding_mask
484
+
485
+ tokens_per_image = image_token_length_w * image_token_length_h
486
+ n_pixels = image_patch_size * image_patch_size * 3
487
+ n_patches = image_num_patch[0] * image_num_patch[1]
488
+
489
+ if images is None:
490
+ return {
491
+ "input_ids": tokens,
492
+ }
493
+ else:
494
+ n = len(images)
495
+ all_crops = []
496
+ all_image_idx = []
497
+ out_tokens = []
498
+ all_crop_masks = []
499
+
500
+ for ix in range(n):
501
+ token_ix = image_idx[ix]
502
+ crops, image_tokens, patch_idx, img_mask = self.preprocess(
503
+ images[ix],
504
+ image_patch_token_id,
505
+ image_col_token_id,
506
+ image_start_token_id,
507
+ image_end_token_id,
508
+ **kwargs,
509
+ )
510
+
511
+ if token_ix == -1: # -1 is an image inserted at the very start
512
+ start = 0
513
+ token_ix = 0
514
+ end = 0
515
+ else:
516
+ start = 0 if ix == 0 else image_idx[ix-1] + 1
517
+ end = token_ix + 1
518
+
519
+ all_image_idx.append(patch_idx + token_ix)
520
+ all_crops.append(crops)
521
+ out_tokens.append(tokens[start:token_ix])
522
+ out_tokens.append(image_tokens)
523
+ if ix == (n - 1):
524
+ out_tokens.append(tokens[end:])
525
+ if image_padding_mask:
526
+ all_crop_masks.append(img_mask)
527
+
528
+ input_ids = np.concatenate(out_tokens, 0)
529
+ images = np.concatenate(all_crops, 0)
530
+ image_input_idx = np.concatenate(all_image_idx, 0)
531
+ if image_padding_mask:
532
+ image_masks = np.concatenate(all_crop_masks, 0)
533
+ else:
534
+ image_masks = None
535
+
536
+ out = {
537
+ "input_ids": input_ids,
538
+ "images": images,
539
+ "image_input_idx": image_input_idx
540
+ }
541
+ if image_masks is not None:
542
+ out["image_masks"] = image_masks
543
+ return out
544
+
545
+
546
+ MolmoImageProcessor.register_for_auto_class()
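The heart of this preprocessor is `select_tiling`, which picks how many overlapping crops to cut from an input image before each crop is split into 14x14-pixel patches. A minimal sketch of calling it directly (illustrative sizes; the processor itself first subtracts the overlap margins from the image dimensions):

```python
# Minimal sketch: pick a tiling for an 800x1200 image with 336-pixel crops,
# allowing at most 12 crops. select_tiling prefers the tiling that needs the
# least upscaling while still covering the image.
from image_preprocessing_molmo import select_tiling

tiling = select_tiling(h=800, w=1200, patch_size=336, max_num_patches=12)
print(tiling)  # expected [3 4]: 3 rows of crops by 4 columns of crops
```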
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5027bfc3649b4bb4fb35e5be030dcfeb84f0d38880e56454d818092f4989e339
+ size 4987060576
model-00002-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a1cc3aa5377ed4604299078f9106c228d7e08da1cb173b680274eee7cd440dae
+ size 4748125472
model-00003-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b2e5c5b0cf708950900cddde59fa3b56b0f5f89cd307f96c7634a3d810652f5a
+ size 3846325304
model-00004-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:439ac00ade686bf9af4aac0c03d41da02e4c93a56769fe6ffa48472bf6c94368
+ size 3510739800
model-00005-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:89861d9233c358a6db60efab583c77784ebdee4d4e30e860f3f55bcae4c9cdc2
+ size 3510739800
model-00006-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:494de8dd6ff50adeac7e92a53ab057e04c19748beee53601345731814b830240
+ size 3510739800
model-00007-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e196c810473673bc7ccc015970ae72a45a2ed43e364e691d40105bd088b1c240
+ size 3510739800
model-00008-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5ef4081b8fbf67b7a5ec34ed9167e4cd8aa0950cb039300a1c9caf5adc56d2aa
+ size 3510739800
model-00009-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2cbda5d6ff2afd461826b7feab07dbfea21cf8f8a9ce122c61ae73c5f51edaac
+ size 3510739800
model-00010-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0469e47cb78e5685fcd90d7ab8871311ea7c8eaadd3564fb7d0498db0eeb5090
+ size 3510739800
model-00011-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc0aea52fdc716d72e6ae9457a21788324eeb9581984409d4531ad4b26ea55e
+ size 3510739792
model-00012-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bda2bf30ef4903b8ca8cc03ed917a93ac25bc981496a583578328fd7f899e736
+ size 3510739808
model-00013-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4bb1a0125fcc9ed788d9251a02867a9900d411643839fdff4a1cbdc5291545e1
+ size 3510739808
model-00014-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7dfd129c7c4310b2dfa2586552b927b95eb28c5b8de84775a744a8be870f19dd
+ size 3510739808
model-00015-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0046b0763b8cf9cd100ee51fb8440a8b90174764e1e620e6dc44f8f1bb60956e
+ size 3510739808
model-00016-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ca6205a2ba7dd0a0f4648f17ba83345a747ac00087308ff8994b5c408b8d42b9
+ size 3510739808
model-00017-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:476c85010940c0dcaf22b91b9e354deb44aea785e1c24d647fe8683682070762
+ size 3510739808
model-00018-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c50af8499746df17e7f67ceec754313fad39389de2fc0c2e7165f2f288a9c1e3
+ size 3510739808
model-00019-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:193de920a1d37b44c478bdef9b10bd2da721a1efa0550d530de05abe6e05ccbf
+ size 3510739808
model-00020-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0a2448617f3927b8a4e75f9953c1dbd398e68282e662b4fe03cc3430dfcb14df
+ size 3510739808
model-00021-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:81736cc90cfc2d5f96786eb864ba5c9499a54256209a2dbaab4bc53bba2ca8f7
+ size 3510739808
model-00022-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:121d0a02ee908944e8477f89b61972387224c6944e5ff8d0d89dca00fb15e4df
+ size 3510739808
model-00023-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ba960e96b20901a97d8ab4dd63577c57b4115836f4faf46cfd9e9ef01186f7dd
+ size 3510739808
model-00024-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ba41eb0b90b1f5ae7f72e88e32429bd2a3f8ff8073d99efbf2c048cf66b9ad78
+ size 3510739808
model-00025-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ad847636f627f05d37ab7c96bf8b39b37da677d290ec758425e9f84aab2420ee
+ size 3510739808
model-00026-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:239e7422ffeea9561dabd94be05cbc99489219f556ccce468caa9fc601682fe9
+ size 3510739808
model-00027-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e85365a207fe405f7062b12389c8f26a0e9a9cb0be3bbb039b2ecdadb18cb00c
+ size 3510739808
model-00028-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2ea35ccd0a85cf2e3ef4f28f355e8fabbf1a7ff20a5dd8437549803841758961
+ size 3510739808
model-00029-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e44a12e4589f4c20a4065fabad5c8fe3ddc0e373b640b4a869d17f595fbf6aee
+ size 3510739808
model-00030-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7ac210af50eddf60604c7a8c793bcfcb3b26144c42231fd8485ccb1a3a6bf444
+ size 3510739808
model-00031-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b8b61c5215ac70a00a50c0e3aa3a2efe635683a10cee509983b194cdd5434bb2
+ size 3510739808
model-00032-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:64a3ff0e6a48b91a6c36de5568cec3975530377df14b275f463cd33e1f5f600e
+ size 3510739808
model-00033-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8e0950d7c66ac9fe3fdbe5a6fbb8af46ae30d5da0ab75fb37cce25ed18849a77
+ size 3510739808
model-00034-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b35d01cd2551ea952a66f0dc99994da13b0585991914def92744f49d079471b0
+ size 3510739808
model-00035-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fbc5e17860c5cb786b97e843e61489c806ba4b3f81805efd517272c4cdef4912
+ size 3510739808
model-00036-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:04e473ed195a967e28ce494561ae17bc1b0b1b8ba9bce8e916a000477f294d1f
+ size 3510739808
model-00037-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d9cb9dbeb0d9b9fb1414421c40fb77887ae2c5a72ae2114538322b5468340d44
+ size 3510739808
model-00038-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3f04c5bcdce77116fa013fe8bec74149986c7e181b831d5cbf2336746ed5f789
+ size 3510739808
model-00039-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d494a0cdb3d81d12b57452317fcb6c22e2f27d4d1e5d1d57d14e212461ec6d85
+ size 3510739808
model-00040-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b93a4a502934019d5c9854f89ed555067975262d22fce5622ca320f1581ee498
+ size 3510739808
model-00041-of-00083.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d01aab48a7a5fcf935597a460bc6683f1865c56495feded7ee2d8d005a39af80
+ size 3510739808