Commit ecd1674 by zhaoyue-zephyrus (1 parent: df1f2c8)

first commit
README.md CHANGED
@@ -1,3 +1,53 @@
---
license: cc-by-nc-4.0
---

# QLIP

[\[📂 GitHub\]](https://github.com/NVlabs/QLIP)
[\[📃 QLIP Tech Report\]](http://arxiv.org/abs/2502.yyyyy)
[\[🔗 Project Page\]](http://nvlabs.github.io/QLIP/)
[\[🤗 HF Model\]](https://huggingface.co/NVIDIA/QLIP-B-16-256)

## Introduction
We introduce Quantized Language-Image Pretraining (**QLIP**), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding.
QLIP trains a binary-spherical-quantization-based autoencoder with both reconstruction and language-image alignment objectives.
We are the first to show that the two objectives do not need to be at odds.
We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively reconciles the large-batch requirement of image-language pre-training with the memory bottleneck imposed by the reconstruction objective.
We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model.
Specifically, QLIP serves as a drop-in replacement for the visual encoder in LLaVA and the image tokenizer in LlamaGen, with comparable or even better performance.
Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.

## Model Zoo
We provide the following models:

| model name | #bits | CR<sub>&uarr;</sub> | 0-shot<sub>&uarr;</sub> | rFID<sub>&darr;</sub> | HF Link |
| ------------- | ------ | ----- | ------ | ---- | ------- |
| QLIP-B-16-256 | 28 | 219.4 | 74.3 | 3.21 | [🤗 link](https://huggingface.co/NVIDIA/QLIP-B-16-256) |
| QLIP-B-8-256 | 28 | 54.8 | 75.6 | 0.70 | [🤗 link](https://huggingface.co/NVIDIA/QLIP-B-8-256) |
| QLIP-L-14-392 | 28 | 168 | 79.1 | 1.46 | [🤗 link](https://huggingface.co/NVIDIA/QLIP-L-14-392) |

Note:
- **CR**: compression ratio, computed as 24 * patch_size^2 / (#bits); e.g., 24 * 16^2 / 28 ≈ 219.4 for QLIP-B-16-256;
- **0-shot**: zero-shot classification accuracy on IN-1k-val;
- **rFID**: reconstruction FID on IN-1k-val.

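The checkpoints above ship as Hugging Face repos with custom modeling code (see `config.json` and `modeling_qlip.py` below). The following is a minimal, hypothetical loading sketch, not taken from the original README; it assumes the `auto_map` entries in `config.json` resolve to the bundled configuration and modeling files when `trust_remote_code=True` is passed:

```python
# Hypothetical loading sketch (illustrative only; not from the original README).
from transformers import AutoConfig, AutoModel

model_id = "NVIDIA/QLIP-B-16-256"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# The BSQ settings live under the vision config, mirroring config.json below.
print(config.vision_config.quantizer_cfg)   # e.g. {"embed_dim": 28, "group_size": 1, ...}
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```
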
## Citing QLIP

```bibtex
@article{zhao2025qlip,
  title={QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation},
  author={Zhao, Yue and Xue, Fuzhao and Reed, Scott and Fan, Linxi and Zhu, Yuke and Kautz, Jan and Yu, Zhiding and Krähenbühl, Philipp and Huang, De-An},
  journal={arXiv preprint arXiv:2502.yyyyy},
  year={2025}
}
```

## Acknowledgement
The project builds upon the following open-source efforts:
- [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP/rei): We use EVA-CLIP for initialization, which significantly speeds up training convergence.
- [LLaVA](https://github.com/haotian-liu/LLaVA): We use LLaVA to evaluate multimodal understanding performance.
- [LlamaGen](https://github.com/FoundationVision/LlamaGen): We build the text-to-image generation evaluation on top of LlamaGen.
- [Lingua](https://github.com/facebookresearch/lingua): We build the unified multimodal model on top of Lingua.
bsq.py ADDED
@@ -0,0 +1,227 @@
# Copyright (c) 2024, NVIDIA Corporation & Affiliates. All rights reserved.
#
# This work is made available under the Nvidia Source Code License-NC.
# To view a copy of this license, visit
# https://github.com/NVlabs/QLIP/blob/main/LICENSE

# MIT License
# Based on https://github.com/zhaoyue-zephyrus/bsq-vit/blob/main/transcoder/models/quantizer/bsq.py

import torch
import torch.nn as nn
from einops import rearrange, reduce

_EPS = 1e-8


class DifferentiableEntropyFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, zq, basis, K, eps):
        zb = (zq + 1) / 2
        zi = ((zb * basis).sum(-1)).to(torch.int64)
        cnt = torch.scatter_reduce(
            torch.zeros(2**K, device=zq.device, dtype=zq.dtype),
            0,
            zi.flatten(),
            torch.ones_like(zi.flatten()).to(zq.dtype),
            "sum",
        )
        prob = (cnt + eps) / (cnt + eps).sum()
        H = torch.special.entr(prob).sum()
        ctx.save_for_backward(zq, zi, prob)
        ctx.K = K
        return H

    @staticmethod
    def backward(ctx, grad_output):
        zq, zi, prob = ctx.saved_tensors
        grad_array = -grad_output * (torch.log(prob) + 1) / zi.numel() / ctx.K
        reord_grad = grad_array[zi.flatten()].reshape(zi.shape)
        grad_input = reord_grad.unsqueeze(-1) * zq
        return grad_input, None, None, None, None


def codebook_entropy(zq, basis, K, eps=1e-8):
    return DifferentiableEntropyFunction.apply(zq, basis, K, eps)


class BinarySphericalQuantizer(nn.Module):
    def __init__(
        self,
        embed_dim: int = 18,
        group_size: int = 9,
        soft_entropy: bool = True,
        beta: float = 0.0,  # commit loss
        gamma_0: float = 1.0,  # entropy loss (E[H(q)])
        gamma_1: float = 1.0,  # entropy loss (H[E[q]])
        input_format: str = "bchw",
        persample_entropy_compute: str = "group",
        l2_norm: bool = True,
        inv_temperature: float = 100.0,
    ):
        super().__init__()
        self.embed_dim = embed_dim
        self.group_size = group_size
        assert embed_dim % group_size == 0, "embed_dim must be divisible by group_size"
        self.soft_entropy = soft_entropy
        self.beta = beta
        self.gamma_0 = gamma_0
        self.gamma_1 = gamma_1
        assert input_format in ["bchw", "blc"]
        self.input_format = input_format
        assert persample_entropy_compute in [
            "group",
            "analytical",
        ], "persample_entropy_compute must be either 'group' or 'analytical'"
        self.persample_entropy_compute = persample_entropy_compute
        self.l2_norm = l2_norm
        self.inv_temperature = inv_temperature

        self.register_buffer("basis", 2 ** torch.arange(embed_dim - 1, -1, -1), persistent=False)
        self.register_buffer(
            "group_basis", 2 ** torch.arange(group_size - 1, -1, -1), persistent=False
        )

        group_codes = torch.arange(2**self.group_size)
        group_codebook = self.indexes_to_codes(group_codes).float()[:, -group_size:]
        self.register_buffer("group_codebook", group_codebook, persistent=False)

    def quantize(self, z):
        assert (
            z.shape[-1] == self.embed_dim
        ), f"Expected {self.embed_dim} dimensions, got {z.shape[-1]}"
        zhat = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
        return z + (zhat - z).detach()  # straight-through estimator

    def forward(self, z):
        if self.input_format == "bchw":
            z = rearrange(z, "b c h w -> b h w c")
        zq = self.quantize(z)

        indices = self.codes_to_indexes(zq.detach())
        group_indices = self.codes_to_group_indexes(zq.detach())

        if not self.training:
            used_codes = torch.unique(indices, return_counts=False)
        else:
            used_codes = None

        if self.soft_entropy:
            persample_entropy, cb_entropy = self.soft_entropy_loss(z)
        else:
            persample_entropy, cb_entropy = self.hard_entropy_loss(z)
        entropy_penalty = self.gamma_0 * persample_entropy - self.gamma_1 * cb_entropy

        q_scale = 1.0 / (self.embed_dim**0.5) if self.l2_norm else 1.0
        zq = zq * q_scale
        commit_loss = self.beta * torch.mean(((zq.detach() - z) ** 2).sum(dim=-1))

        if self.input_format == "bchw":
            zq = rearrange(zq, "b h w c -> b c h w")

        return (
            zq,
            commit_loss + entropy_penalty / self.inv_temperature,
            {
                "H": cb_entropy,
                "used_codes": used_codes,
                "indices": indices,
                "group_indices": group_indices,
            },
        )

    def soft_entropy_loss(self, z):
        group_codebook = self.group_codebook / (self.embed_dim**0.5 if self.l2_norm else 1)
        divided_z = rearrange(z, "... (g c) -> ... g c", c=self.group_size)

        if self.persample_entropy_compute == "group":
            distance = -2 * torch.einsum("... g c, d c -> ... g d", divided_z, group_codebook)
            prob = (-distance * self.inv_temperature).softmax(dim=-1)
            persample_entropy = torch.special.entr(prob + _EPS).sum((-1, -2)).mean()
        else:
            p = torch.sigmoid(
                -4 * z / (self.embed_dim**0.5 if self.l2_norm else 1) * self.inv_temperature
            )
            prob = torch.stack([p, 1 - p], dim=-1)
            persample_entropy = torch.special.entr(prob + _EPS).sum((-1, -2)).mean()

        # macro average of the probability of each subgroup
        avg_prob = reduce(prob, "... g d -> g d", "mean")
        cb_entropy = torch.special.entr(avg_prob + _EPS).sum()

        return persample_entropy, cb_entropy

    def hard_entropy_loss(self, z):
        zb = ((z + 1) / 2).reshape(z.shape[0], -1, z.shape[-1]).to(torch.float32)
        prob_per_dim = zb.sum(1) / zb.shape[1]
        prob = torch.stack([prob_per_dim, 1 - prob_per_dim], dim=-1)
        persample_entropy = torch.special.entr(prob + _EPS).sum((-1, -2)).mean()
        cb_entropy = codebook_entropy(z, self.basis, self.embed_dim)

        return persample_entropy, cb_entropy

    def codes_to_indexes(self, zhat):
        """Converts a `code` to an index in the codebook.

        Args:
            zhat: A tensor of shape (B, ..., C) containing the codes; must be in {-1, 1}.
        """
        assert (
            zhat.shape[-1] == self.embed_dim
        ), f"Expected {self.embed_dim} dimensions, got {zhat.shape[-1]}"
        return ((zhat.int() + 1) / 2 * self.basis).sum(axis=-1).to(torch.int64)

    def codes_to_group_indexes(self, zhat):
        """Converts a `code` to a list of indexes (in groups) in the codebook.

        Args:
            zhat: A tensor of shape (B, ..., C) containing the codes; must be in {-1, 1}.
        """
        zhat_in_group = rearrange(zhat, "b ... (g c) -> b ... g c", c=self.group_size)
        return ((zhat_in_group.int() + 1) / 2 * self.group_basis).sum(axis=-1).to(torch.int64)

    def indexes_to_codes(self, indices):
        """Inverse of `codes_to_indexes`."""
        indices = indices.unsqueeze(-1)
        codes_non_centered = torch.remainder(torch.floor_divide(indices, self.basis), 2)
        return codes_non_centered * 2 - 1

    def group_indexes_to_codes(self, group_indices):
        """Inverse of `codes_to_group_indexes`."""
        group_indices = group_indices.unsqueeze(-1)
        codes_non_centered = torch.remainder(torch.floor_divide(group_indices, self.group_basis), 2)
        codes_non_centered = rearrange(codes_non_centered, "b ... g c -> b ... (g c)")
        return codes_non_centered * 2 - 1

    def get_group_codebook_entry(self, group_indices, one_hot=False):
        """
        Args:
            group_indices: A tensor of shape (B, L, G, C) containing the group indices.
        """
        if one_hot:
            z_q = group_indices @ self.group_codebook
        else:
            z_q = self.group_indexes_to_codes(group_indices)
        q_scale = 1.0 / (self.embed_dim**0.5) if self.l2_norm else 1.0
        z_q = z_q * q_scale
        if self.input_format == "bchw":
            h = w = int(z_q.shape[1] ** 0.5)  # assume a square grid of tokens
            assert h * w == z_q.shape[1], "Invalid sequence length"
            z_q = rearrange(z_q, "b (h w) c -> b c h w", h=h)
        return z_q

    def get_codebook_entry(self, indices, one_hot=False):
        """
        Args:
            indices: A tensor of shape (B, L) containing the indices
                (or (B, L, C) one-hot codes when `one_hot=True`).
        """
        if one_hot:
            assert self.embed_dim == self.group_size, "one_hot is only supported for group_size == embed_dim"
            z_q = indices @ self.group_codebook
        else:
            z_q = self.indexes_to_codes(indices)
        q_scale = 1.0 / (self.embed_dim**0.5) if self.l2_norm else 1.0
        z_q = z_q * q_scale
        if self.input_format == "bchw":
            h = w = int(z_q.shape[1] ** 0.5)  # assume a square grid of tokens
            assert h * w == z_q.shape[1], "Invalid sequence length"
            z_q = rearrange(z_q, "b (h w) c -> b c h w", h=h)
        return z_q
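A quick, hypothetical sanity check of the quantizer above (not part of the repository): quantize a random feature map, read back the per-position code indices, and decode them again with `get_codebook_entry`. All shapes and hyper-parameters are illustrative.

```python
# Hypothetical round-trip through BinarySphericalQuantizer (illustrative only).
import torch

from bsq import BinarySphericalQuantizer

quantizer = BinarySphericalQuantizer(embed_dim=18, group_size=9, input_format="bchw")
quantizer.eval()  # in eval mode the info dict also reports `used_codes`

z = torch.randn(2, 18, 16, 16)         # (batch, channels, height, width)
zq, aux_loss, info = quantizer(z)      # quantized features, auxiliary loss, stats
codes = info["indices"]                # (2, 16, 16) integer code per spatial position

# Decode the flat (B, L) index tensor back to the scaled binary codes.
z_dec = quantizer.get_codebook_entry(codes.flatten(1))
assert z_dec.shape == zq.shape
assert torch.allclose(z_dec, zq, atol=1e-5)
```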
config.json ADDED
@@ -0,0 +1,66 @@
{
  "_name_or_path": "EVA-BSQCLIP",
  "architectures": [
    "QLIPModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_evaclip.QLIPConfig",
    "AutoModel": "modeling_evaclip.QLIPModel"
  },
  "decoder_config": {
    "dropout": 0.0,
    "image_size": 256,
    "intermediate_size": 2048,
    "k_bias": false,
    "layer_norm_eps": 1e-06,
    "model_type": "clip_decoder_model",
    "patch_size": 16,
    "rope": true,
    "rope_shift": 0,
    "subln": true,
    "swiglu": true,
    "use_bfloat16": true,
    "use_rms_norm": true
  },
  "initializer_factor": 1.0,
  "logit_scale_init_value": 2.6592,
  "model_type": "clip",
  "projection_dim": 512,
  "text_config": {
    "bos_token_id": 0,
    "dropout": 0.0,
    "eos_token_id": 2,
    "model_type": "clip_text_model",
    "use_bfloat16": true,
    "use_rms_norm": false
  },
  "text_projection_bias": false,
  "torch_dtype": "float32",
  "transformers_version": "4.37.2",
  "vision_config": {
    "dropout": 0.0,
    "image_size": 256,
    "intermediate_size": 2048,
    "k_bias": false,
    "layer_norm_eps": 1e-06,
    "model_type": "clip_vision_model",
    "patch_size": 16,
    "quantizer": "bsq",
    "quantizer_cfg": {
      "embed_dim": 28,
      "group_size": 1,
      "input_format": "blc",
      "inv_temperature": 1.0,
      "l2_norm": true
    },
    "quantizer_embed_type": "mlp",
    "quantizer_l2_norm": true,
    "rope": true,
    "rope_shift": 1,
    "subln": true,
    "swiglu": true,
    "use_bfloat16": true,
    "use_rms_norm": true
  },
  "vision_projection_bias": true
}
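For reference, the `quantizer_cfg` block inside `vision_config` above maps one-to-one onto the constructor of `BinarySphericalQuantizer` in `bsq.py`. A hypothetical sketch of that wiring (the field names come from this config; the loading path is assumed):

```python
# Hypothetical: build the BSQ quantizer straight from the vision_config above.
import json

from bsq import BinarySphericalQuantizer

with open("config.json") as f:
    cfg = json.load(f)

# 28-bit codes quantized per dimension (group_size=1), matching the "#bits"
# column in the README's Model Zoo table.
quantizer = BinarySphericalQuantizer(**cfg["vision_config"]["quantizer_cfg"])
```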
configuration_qlip.py ADDED
@@ -0,0 +1,566 @@
1
+ # Copyright (c) 2024, NVIDIA Corporation & Affiliates. All rights reserved.
2
+ #
3
+ # This work is made available under the Nvidia Source Code License-NC.
4
+ # To view a copy of this license, visit
5
+ # https://github.com/NVlabs/QLIP/blob/main/LICENSE
6
+
7
+ # Copyright 2021 The HuggingFace Inc. team. All rights reserved.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ CLIP model configuration"""
21
+
22
+ import os
23
+ from collections import OrderedDict
24
+ from typing import TYPE_CHECKING, Any, Mapping, Optional, Union
25
+
26
+
27
+ if TYPE_CHECKING:
28
+ from transformers.processing_utils import ProcessorMixin
29
+ from transformers.utils import TensorType
30
+
31
+ from transformers.configuration_utils import PretrainedConfig
32
+ from transformers.onnx import OnnxConfig
33
+ from transformers.utils import logging
34
+
35
+
36
+ logger = logging.get_logger(__name__)
37
+
38
+
39
+ class QLIPTextConfig(PretrainedConfig):
40
+ r"""
41
+ This is the configuration class to store the configuration of a [`CLIPTextModel`]. It is used to instantiate a CLIP
42
+ text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration
43
+ with the defaults will yield a similar configuration to that of the text encoder of the CLIP
44
+ [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
45
+
46
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
47
+ documentation from [`PretrainedConfig`] for more information.
48
+
49
+ Args:
50
+ vocab_size (`int`, *optional*, defaults to 49408):
51
+ Vocabulary size of the CLIP text model. Defines the number of different tokens that can be represented by
52
+ the `inputs_ids` passed when calling [`CLIPModel`].
53
+ hidden_size (`int`, *optional*, defaults to 512):
54
+ Dimensionality of the encoder layers and the pooler layer.
55
+ intermediate_size (`int`, *optional*, defaults to 2048):
56
+ Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
57
+ projection_dim (`int`, *optional*, defaults to 512):
58
+ Dimentionality of text and vision projection layers.
59
+ num_hidden_layers (`int`, *optional*, defaults to 12):
60
+ Number of hidden layers in the Transformer encoder.
61
+ num_attention_heads (`int`, *optional*, defaults to 8):
62
+ Number of attention heads for each attention layer in the Transformer encoder.
63
+ max_position_embeddings (`int`, *optional*, defaults to 77):
64
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
65
+ just in case (e.g., 512 or 1024 or 2048).
66
+ hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
67
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
68
+ `"relu"`, `"selu"` and `"gelu_new"` `"quick_gelu"` are supported.
69
+ layer_norm_eps (`float`, *optional*, defaults to 1e-05):
70
+ The epsilon used by the layer normalization layers.
71
+ attention_dropout (`float`, *optional*, defaults to 0.0):
72
+ The dropout ratio for the attention probabilities.
73
+ initializer_range (`float`, *optional*, defaults to 0.02):
74
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
75
+ initializer_factor (`float`, *optional*, defaults to 1.0):
76
+ A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
77
+ testing).
78
+ pad_token_id (`int`, *optional*, defaults to 1):
79
+ Padding token id.
80
+ bos_token_id (`int`, *optional*, defaults to 49406):
81
+ Beginning of stream token id.
82
+ eos_token_id (`int`, *optional*, defaults to 49407):
83
+ End of stream token id.
84
+
85
+ Example:
86
+
87
+ ```python
88
+ >>> from transformers import CLIPTextConfig, CLIPTextModel
89
+
90
+ >>> # Initializing a CLIPTextConfig with openai/clip-vit-base-patch32 style configuration
91
+ >>> configuration = CLIPTextConfig()
92
+
93
+ >>> # Initializing a CLIPTextModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
94
+ >>> model = CLIPTextModel(configuration)
95
+
96
+ >>> # Accessing the model configuration
97
+ >>> configuration = model.config
98
+ ```"""
99
+
100
+ model_type = "clip_text_model"
101
+
102
+ def __init__(
103
+ self,
104
+ vocab_size=49408,
105
+ hidden_size=512,
106
+ intermediate_size=2048,
107
+ projection_dim=512,
108
+ num_hidden_layers=12,
109
+ num_attention_heads=8,
110
+ max_position_embeddings=77,
111
+ hidden_act="gelu",
112
+ layer_norm_eps=1e-5,
113
+ attention_dropout=0.0,
114
+ initializer_range=0.02,
115
+ initializer_factor=1.0,
116
+ # This differs from `CLIPTokenizer`'s default and from openai/clip
117
+ # See https://github.com/huggingface/transformers/pull/24773#issuecomment-1632287538
118
+ q_bias=True,
119
+ k_bias=True,
120
+ v_bias=True,
121
+ subln=False,
122
+ swiglu=False,
123
+ rope=False,
124
+ post_layernorm=False,
125
+ pad_token_id=1,
126
+ bos_token_id=49406,
127
+ eos_token_id=49407,
128
+ **kwargs,
129
+ ):
130
+ super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
131
+
132
+ self.vocab_size = vocab_size
133
+ self.hidden_size = hidden_size
134
+ self.intermediate_size = intermediate_size
135
+ self.projection_dim = projection_dim
136
+ self.num_hidden_layers = num_hidden_layers
137
+ self.num_attention_heads = num_attention_heads
138
+ self.max_position_embeddings = max_position_embeddings
139
+ self.layer_norm_eps = layer_norm_eps
140
+ self.hidden_act = hidden_act
141
+ self.initializer_range = initializer_range
142
+ self.initializer_factor = initializer_factor
143
+ self.q_bias=q_bias
144
+ self.k_bias=k_bias
145
+ self.v_bias=v_bias
146
+ self.subln = subln
147
+ self.swiglu = swiglu
148
+ self.rope = rope
149
+ self.post_layernorm = post_layernorm
150
+ self.attention_dropout = attention_dropout
151
+
152
+ @classmethod
153
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
154
+ cls._set_token_in_kwargs(kwargs)
155
+
156
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
157
+
158
+ # get the text config dict if we are loading from CLIPConfig
159
+ if config_dict.get("model_type") == "clip":
160
+ config_dict = config_dict["text_config"]
161
+
162
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
163
+ logger.warning(
164
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
165
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
166
+ )
167
+
168
+ return cls.from_dict(config_dict, **kwargs)
169
+
170
+
171
+ class QLIPVisionConfig(PretrainedConfig):
172
+ r"""
173
+ This is the configuration class to store the configuration of a [`CLIPVisionModel`]. It is used to instantiate a
174
+ CLIP vision encoder according to the specified arguments, defining the model architecture. Instantiating a
175
+ configuration with the defaults will yield a similar configuration to that of the vision encoder of the CLIP
176
+ [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
177
+
178
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
179
+ documentation from [`PretrainedConfig`] for more information.
180
+
181
+ Args:
182
+ hidden_size (`int`, *optional*, defaults to 768):
183
+ Dimensionality of the encoder layers and the pooler layer.
184
+ intermediate_size (`int`, *optional*, defaults to 3072):
185
+ Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
186
+ projection_dim (`int`, *optional*, defaults to 512):
187
+ Dimentionality of text and vision projection layers.
188
+ num_hidden_layers (`int`, *optional*, defaults to 12):
189
+ Number of hidden layers in the Transformer encoder.
190
+ num_attention_heads (`int`, *optional*, defaults to 12):
191
+ Number of attention heads for each attention layer in the Transformer encoder.
192
+ num_channels (`int`, *optional*, defaults to 3):
193
+ The number of input channels.
194
+ image_size (`int`, *optional*, defaults to 224):
195
+ The size (resolution) of each image.
196
+ patch_size (`int`, *optional*, defaults to 32):
197
+ The size (resolution) of each patch.
198
+ hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
199
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
200
+ `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported.
201
+ layer_norm_eps (`float`, *optional*, defaults to 1e-05):
202
+ The epsilon used by the layer normalization layers.
203
+ attention_dropout (`float`, *optional*, defaults to 0.0):
204
+ The dropout ratio for the attention probabilities.
205
+ initializer_range (`float`, *optional*, defaults to 0.02):
206
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
207
+ initializer_factor (`float`, *optional*, defaults to 1.0):
208
+ A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
209
+ testing).
210
+
211
+ Example:
212
+
213
+ ```python
214
+ >>> from transformers import CLIPVisionConfig, CLIPVisionModel
215
+
216
+ >>> # Initializing a CLIPVisionConfig with openai/clip-vit-base-patch32 style configuration
217
+ >>> configuration = CLIPVisionConfig()
218
+
219
+ >>> # Initializing a CLIPVisionModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
220
+ >>> model = CLIPVisionModel(configuration)
221
+
222
+ >>> # Accessing the model configuration
223
+ >>> configuration = model.config
224
+ ```"""
225
+
226
+ model_type = "clip_vision_model"
227
+
228
+ def __init__(
229
+ self,
230
+ hidden_size=768,
231
+ intermediate_size=3072,
232
+ projection_dim=512,
233
+ num_hidden_layers=12,
234
+ num_attention_heads=12,
235
+ num_channels=3,
236
+ image_size=224,
237
+ patch_size=32,
238
+ hidden_act="gelu",
239
+ layer_norm_eps=1e-5,
240
+ attention_dropout=0.0,
241
+ initializer_range=0.02,
242
+ initializer_factor=1.0,
243
+ q_bias=True,
244
+ k_bias=True,
245
+ v_bias=True,
246
+ subln=False,
247
+ swiglu=False,
248
+ rope=False,
249
+ post_layernorm=False,
250
+ # quantizer specs
251
+ quantizer="none",
252
+ quantizer_l2_norm=False,
253
+ quantizer_embed_type="identity",
254
+ hidden_size_post_q=None,
255
+ quantizer_cfg=dict(),
256
+ **kwargs,
257
+ ):
258
+ super().__init__(**kwargs)
259
+
260
+ self.hidden_size = hidden_size
261
+ self.intermediate_size = intermediate_size
262
+ self.projection_dim = projection_dim
263
+ self.num_hidden_layers = num_hidden_layers
264
+ self.num_attention_heads = num_attention_heads
265
+ self.num_channels = num_channels
266
+ self.patch_size = patch_size
267
+ self.image_size = image_size
268
+ self.initializer_range = initializer_range
269
+ self.initializer_factor = initializer_factor
270
+ self.q_bias=q_bias
271
+ self.k_bias=k_bias
272
+ self.v_bias=v_bias
273
+ self.subln = subln
274
+ self.swiglu = swiglu
275
+ self.rope = rope
276
+ self.post_layernorm = post_layernorm
277
+ self.attention_dropout = attention_dropout
278
+ self.layer_norm_eps = layer_norm_eps
279
+ self.hidden_act = hidden_act
280
+
281
+ self.quantizer = quantizer
282
+ self.quantizer_l2_norm = quantizer_l2_norm
283
+ self.quantizer_embed_type = quantizer_embed_type
284
+ self.hidden_size_post_q = self.hidden_size if hidden_size_post_q is None else hidden_size_post_q
285
+ self.quantizer_cfg = quantizer_cfg
286
+
287
+ @classmethod
288
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
289
+ cls._set_token_in_kwargs(kwargs)
290
+
291
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
292
+
293
+ # get the vision config dict if we are loading from CLIPConfig
294
+ if config_dict.get("model_type") == "clip":
295
+ config_dict = config_dict["vision_config"]
296
+
297
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
298
+ logger.warning(
299
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
300
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
301
+ )
302
+
303
+ return cls.from_dict(config_dict, **kwargs)
304
+
305
+
306
+ class QLIPDecoderConfig(PretrainedConfig):
307
+ model_type = "clip_decoder_model"
308
+
309
+ def __init__(
310
+ self,
311
+ hidden_size=768,
312
+ intermediate_size=3072,
313
+ projection_dim=512,
314
+ num_hidden_layers=12,
315
+ num_attention_heads=12,
316
+ num_channels=3,
317
+ image_size=224,
318
+ patch_size=32,
319
+ hidden_act="gelu",
320
+ layer_norm_eps=1e-5,
321
+ attention_dropout=0.0,
322
+ initializer_range=0.02,
323
+ initializer_factor=1.0,
324
+ q_bias=True,
325
+ k_bias=True,
326
+ v_bias=True,
327
+ subln=False,
328
+ swiglu=False,
329
+ rope=False,
330
+ post_layernorm=False,
331
+ # quantizer specs
332
+ quantizer="none",
333
+ quantizer_l2_norm=False,
334
+ quantizer_embed_type="identity",
335
+ hidden_size_post_q=None,
336
+ quantizer_cfg=dict(),
337
+ **kwargs,
338
+ ):
339
+ super().__init__(**kwargs)
340
+
341
+ self.hidden_size = hidden_size
342
+ self.intermediate_size = intermediate_size
343
+ self.projection_dim = projection_dim
344
+ self.num_hidden_layers = num_hidden_layers
345
+ self.num_attention_heads = num_attention_heads
346
+ self.num_channels = num_channels
347
+ self.patch_size = patch_size
348
+ self.image_size = image_size
349
+ self.initializer_range = initializer_range
350
+ self.initializer_factor = initializer_factor
351
+ self.q_bias=q_bias
352
+ self.k_bias=k_bias
353
+ self.v_bias=v_bias
354
+ self.subln = subln
355
+ self.swiglu = swiglu
356
+ self.rope = rope
357
+ self.post_layernorm = post_layernorm
358
+ self.attention_dropout = attention_dropout
359
+ self.layer_norm_eps = layer_norm_eps
360
+ self.hidden_act = hidden_act
361
+
362
+ self.quantizer = quantizer
363
+ self.quantizer_l2_norm = quantizer_l2_norm
364
+ self.quantizer_embed_type = quantizer_embed_type
365
+ self.hidden_size_post_q = self.hidden_size if hidden_size_post_q is None else hidden_size_post_q
366
+ self.quantizer_cfg = quantizer_cfg
367
+
368
+ @classmethod
369
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
370
+ cls._set_token_in_kwargs(kwargs)
371
+
372
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
373
+
374
+ # get the vision config dict if we are loading from CLIPConfig
375
+ if config_dict.get("model_type") == "clip":
376
+ config_dict = config_dict["vision_config"]
377
+
378
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
379
+ logger.warning(
380
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
381
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
382
+ )
383
+
384
+ return cls.from_dict(config_dict, **kwargs)
385
+
386
+
387
+ class QLIPConfig(PretrainedConfig):
388
+ r"""
389
+ [`CLIPConfig`] is the configuration class to store the configuration of a [`CLIPModel`]. It is used to instantiate
390
+ a CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating
391
+ a configuration with the defaults will yield a similar configuration to that of the CLIP
392
+ [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
393
+
394
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
395
+ documentation from [`PretrainedConfig`] for more information.
396
+
397
+ Args:
398
+ text_config (`dict`, *optional*):
399
+ Dictionary of configuration options used to initialize [`CLIPTextConfig`].
400
+ vision_config (`dict`, *optional*):
401
+ Dictionary of configuration options used to initialize [`CLIPVisionConfig`].
402
+ projection_dim (`int`, *optional*, defaults to 512):
403
+ Dimentionality of text and vision projection layers.
404
+ logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
405
+ The inital value of the *logit_scale* paramter. Default is used as per the original CLIP implementation.
406
+ kwargs (*optional*):
407
+ Dictionary of keyword arguments.
408
+
409
+ Example:
410
+
411
+ ```python
412
+ >>> from transformers import CLIPConfig, CLIPModel
413
+
414
+ >>> # Initializing a CLIPConfig with openai/clip-vit-base-patch32 style configuration
415
+ >>> configuration = CLIPConfig()
416
+
417
+ >>> # Initializing a CLIPModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
418
+ >>> model = CLIPModel(configuration)
419
+
420
+ >>> # Accessing the model configuration
421
+ >>> configuration = model.config
422
+
423
+ >>> # We can also initialize a CLIPConfig from a CLIPTextConfig and a CLIPVisionConfig
424
+ >>> from transformers import CLIPTextConfig, CLIPVisionConfig
425
+
426
+ >>> # Initializing a CLIPText and CLIPVision configuration
427
+ >>> config_text = CLIPTextConfig()
428
+ >>> config_vision = CLIPVisionConfig()
429
+
430
+ >>> config = CLIPConfig.from_text_vision_configs(config_text, config_vision)
431
+ ```"""
432
+
433
+ model_type = "clip"
434
+
435
+ def __init__(
436
+ self, text_config=None, vision_config=None, decoder_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs
437
+ ):
438
+ # If `_config_dict` exist, we use them for the backward compatibility.
439
+ # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot
440
+ # of confusion!).
441
+ text_config_dict = kwargs.pop("text_config_dict", None)
442
+ vision_config_dict = kwargs.pop("vision_config_dict", None)
443
+ decoder_config_dict = kwargs.pop("decoder_config_dict", None)
444
+
445
+ super().__init__(**kwargs)
446
+
447
+ # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in
448
+ # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be same in most
449
+ # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`.
450
+ if text_config_dict is not None:
451
+ if text_config is None:
452
+ text_config = {}
453
+
454
+ # This is the complete result when using `text_config_dict`.
455
+ _text_config_dict = QLIPTextConfig(**text_config_dict).to_dict()
456
+
457
+ # Give a warning if the values exist in both `_text_config_dict` and `text_config` but being different.
458
+ for key, value in _text_config_dict.items():
459
+ if key in text_config and value != text_config[key] and key not in ["transformers_version"]:
460
+ # If specified in `text_config_dict`
461
+ if key in text_config_dict:
462
+ message = (
463
+ f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. "
464
+ f'The value `text_config_dict["{key}"]` will be used instead.'
465
+ )
466
+ # If inferred from default argument values (just to be super careful)
467
+ else:
468
+ message = (
469
+ f"`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The "
470
+ f'value `text_config["{key}"]` will be overriden.'
471
+ )
472
+ logger.info(message)
473
+
474
+ # Update all values in `text_config` with the ones in `_text_config_dict`.
475
+ text_config.update(_text_config_dict)
476
+
477
+ if vision_config_dict is not None:
478
+ if vision_config is None:
479
+ vision_config = {}
480
+
481
+ # This is the complete result when using `vision_config_dict`.
482
+ _vision_config_dict = QLIPVisionConfig(**vision_config_dict).to_dict()
483
+ # convert keys to string instead of integer
484
+ if "id2label" in _vision_config_dict:
485
+ _vision_config_dict["id2label"] = {
486
+ str(key): value for key, value in _vision_config_dict["id2label"].items()
487
+ }
488
+
489
+ # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but being different.
490
+ for key, value in _vision_config_dict.items():
491
+ if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]:
492
+ # If specified in `vision_config_dict`
493
+ if key in vision_config_dict:
494
+ message = (
495
+ f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different "
496
+ f'values. The value `vision_config_dict["{key}"]` will be used instead.'
497
+ )
498
+ # If inferred from default argument values (just to be super careful)
499
+ else:
500
+ message = (
501
+ f"`vision_config_dict` is provided which will be used to initialize `CLIPVisionConfig`. "
502
+ f'The value `vision_config["{key}"]` will be overriden.'
503
+ )
504
+ logger.info(message)
505
+
506
+ # Update all values in `vision_config` with the ones in `_vision_config_dict`.
507
+ vision_config.update(_vision_config_dict)
508
+
509
+ if decoder_config_dict is not None:
510
+ if decoder_config is None:
511
+ decoder_config = {}
512
+
513
+ # This is the complete result when using `decoder_config_dict`.
514
+ _decoder_config_dict = QLIPDecoderConfig(**decoder_config_dict).to_dict()
515
+
516
+ # Give a warning if the values exist in both `_decoder_config_dict` and `decoder_config` but being different.
517
+ for key, value in _decoder_config_dict.items():
518
+ if key in decoder_config and value != decoder_config[key] and key not in ["transformers_version"]:
519
+ # If specified in `decoder_config_dict`
520
+ if key in decoder_config_dict:
521
+ message = (
522
+ f"`{key}` is found in both `decoder_config_dict` and `decoder_config` but with different values. "
523
+ f'The value `decoder_config_dict["{key}"]` will be used instead.'
524
+ )
525
+ # If inferred from default argument values (just to be super careful)
526
+ else:
527
+ message = (
528
+ f"`decoder_config_dict` is provided which will be used to initialize `QLIPDecoderConfig`. The "
529
+ f'value `decoder_config["{key}"]` will be overriden.'
530
+ )
531
+ logger.info(message)
532
+
533
+ # Update all values in `decoder_config` with the ones in `_decoder_config_dict`.
534
+ decoder_config.update(_decoder_config_dict)
535
+
536
+ if text_config is None:
537
+ text_config = {}
538
+ logger.info("`text_config` is `None`. Initializing the `CLIPTextConfig` with default values.")
539
+
540
+ if vision_config is None:
541
+ vision_config = {}
542
+ logger.info("`vision_config` is `None`. initializing the `CLIPVisionConfig` with default values.")
543
+
544
+ if decoder_config is None:
545
+ decoder_config = {}
546
+ logger.info("`decoder_config` is `None`. initializing the `CLIPDecoderConfig` with default values.")
547
+
548
+ self.text_config = QLIPTextConfig(**text_config)
549
+ self.vision_config = QLIPVisionConfig(**vision_config)
550
+ self.decoder_config = QLIPDecoderConfig(**decoder_config)
551
+
552
+ self.projection_dim = projection_dim
553
+ self.logit_scale_init_value = logit_scale_init_value
554
+ self.initializer_factor = 1.0
555
+
556
+ @classmethod
557
+ def from_text_vision_configs(cls, text_config: QLIPTextConfig, vision_config: QLIPVisionConfig, decoder_config: QLIPDecoderConfig, **kwargs):
558
+ r"""
559
+ Instantiate a [`CLIPConfig`] (or a derived class) from clip text model configuration and clip vision model
560
+ configuration.
561
+
562
+ Returns:
563
+ [`CLIPConfig`]: An instance of a configuration object
564
+ """
565
+
566
+ return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), decoder_config=decoder_config.to_dict(), **kwargs)
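To close the loop on the configuration classes above, here is a small, hypothetical sketch of assembling a `QLIPConfig` from its three sub-configs via the `from_text_vision_configs` classmethod; the values are illustrative and not those of the released checkpoints:

```python
# Hypothetical: compose a QLIPConfig from sub-configs (illustrative values only).
from configuration_qlip import (
    QLIPConfig,
    QLIPDecoderConfig,
    QLIPTextConfig,
    QLIPVisionConfig,
)

text_cfg = QLIPTextConfig(hidden_size=512, num_hidden_layers=12)
vision_cfg = QLIPVisionConfig(
    image_size=256,
    patch_size=16,
    quantizer="bsq",
    quantizer_cfg={"embed_dim": 28, "group_size": 1, "input_format": "blc", "l2_norm": True},
)
decoder_cfg = QLIPDecoderConfig(image_size=256, patch_size=16)

config = QLIPConfig.from_text_vision_configs(text_cfg, vision_cfg, decoder_cfg)
config.save_pretrained("./qlip-config")  # writes a config.json similar to the one above
```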
merges.txt ADDED
The diff for this file is too large to render.
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fadc513e54e22fa7e1f8b3195e5202a5b36f6dcb4f7ae8b00af6b792b337da52
size 958085620
modeling_qlip.py ADDED
@@ -0,0 +1,1481 @@
1
+ # Copyright (c) 2024, NVIDIA Corporation & Affiliates. All rights reserved.
2
+ #
3
+ # This work is made available under the Nvidia Source Code License-NC.
4
+ # To view a copy of this license, visit
5
+ # https://github.com/NVlabs/QLIP/blob/main/LICENSE
6
+
7
+ # Copyright 2021 The OpenAI Team Authors and The HuggingFace Team. All rights reserved.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ PyTorch CLIP model."""
21
+
22
+
23
+ from collections import OrderedDict
24
+ from dataclasses import dataclass
25
+ from typing import Any, Optional, Tuple, Union
26
+
27
+ from einops import rearrange
28
+ import torch
29
+ import torch.utils.checkpoint
30
+ from torch import nn
31
+ import torch.nn.functional as F
32
+
33
+ from transformers.activations import ACT2FN
34
+ from transformers.modeling_attn_mask_utils import _create_4d_causal_attention_mask, _prepare_4d_attention_mask
35
+ from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling
36
+ from transformers.modeling_utils import PreTrainedModel
37
+ from transformers.utils import (
38
+ ModelOutput,
39
+ add_start_docstrings,
40
+ add_start_docstrings_to_model_forward,
41
+ logging,
42
+ replace_return_docstrings,
43
+ )
44
+
45
+ from configuration_qlip import QLIPConfig, QLIPTextConfig, QLIPVisionConfig, QLIPDecoderConfig
46
+ from bsq import BinarySphericalQuantizer
47
+ from rope import VisionRotaryEmbeddingFast
48
+
49
+
50
+ logger = logging.get_logger(__name__)
51
+
52
+ _CHECKPOINT_FOR_DOC = "openai/clip-vit-base-patch32"
53
+
54
+ CLIP_PRETRAINED_MODEL_ARCHIVE_LIST = [
55
+ "openai/clip-vit-base-patch32",
56
+ # See all CLIP models at https://huggingface.co/models?filter=clip
57
+ ]
58
+
59
+
60
+ # contrastive loss function, adapted from
61
+ # https://sachinruk.github.io/blog/2021-03-07-clip.html
62
+ def contrastive_loss(logits: torch.Tensor) -> torch.Tensor:
63
+ return nn.functional.cross_entropy(logits, torch.arange(len(logits), device=logits.device))
64
+
65
+
66
+ def clip_loss(similarity: torch.Tensor) -> torch.Tensor:
67
+ caption_loss = contrastive_loss(similarity)
68
+ image_loss = contrastive_loss(similarity.t())
69
+ return (caption_loss + image_loss) / 2.0
70
+
71
+
72
+ @dataclass
73
+ class QLIPVisionModelOutput(ModelOutput):
74
+ """
75
+ Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
76
+
77
+ Args:
78
+ image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
79
+ The image embeddings obtained by applying the projection layer to the pooler_output.
80
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
81
+ Sequence of hidden-states at the output of the last layer of the model.
82
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
83
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
84
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
85
+
86
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
87
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
88
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
89
+ sequence_length)`.
90
+
91
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
92
+ heads.
93
+ """
94
+
95
+ image_embeds: Optional[torch.FloatTensor] = None
96
+ last_hidden_state: torch.FloatTensor = None
97
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
98
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
99
+
100
+
101
+ @dataclass
102
+ class QLIPTextModelOutput(ModelOutput):
103
+ """
104
+ Base class for text model's outputs that also contains a pooling of the last hidden states.
105
+
106
+ Args:
107
+ text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
108
+ The text embeddings obtained by applying the projection layer to the pooler_output.
109
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
110
+ Sequence of hidden-states at the output of the last layer of the model.
111
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
112
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
113
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
114
+
115
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
116
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
117
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
118
+ sequence_length)`.
119
+
120
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
121
+ heads.
122
+ """
123
+
124
+ text_embeds: Optional[torch.FloatTensor] = None
125
+ last_hidden_state: torch.FloatTensor = None
126
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
127
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
128
+
129
+
130
+ @dataclass
131
+ class QLIPOutput(ModelOutput):
132
+ """
133
+ Args:
134
+ loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
135
+ Contrastive loss for image-text similarity.
136
+ logits_per_image:(`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
137
+ The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
138
+ similarity scores.
139
+ logits_per_text:(`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
140
+ The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
141
+ similarity scores.
142
+ text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`):
143
+ The text embeddings obtained by applying the projection layer to the pooled output of [`CLIPTextModel`].
144
+ image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`):
145
+ The image embeddings obtained by applying the projection layer to the pooled output of [`CLIPVisionModel`].
146
+ text_model_output(`BaseModelOutputWithPooling`):
147
+ The output of the [`CLIPTextModel`].
148
+ vision_model_output(`BaseModelOutputWithPooling`):
149
+ The output of the [`CLIPVisionModel`].
150
+ """
151
+
152
+ loss: Optional[torch.FloatTensor] = None
153
+ logits_per_image: torch.FloatTensor = None
154
+ logits_per_text: torch.FloatTensor = None
155
+ text_embeds: torch.FloatTensor = None
156
+ image_embeds: torch.FloatTensor = None
157
+ text_model_output: BaseModelOutputWithPooling = None
158
+ vision_model_output: BaseModelOutputWithPooling = None
159
+ reconstructions: torch.FloatTensor = None
160
+
161
+ def to_tuple(self) -> Tuple[Any]:
162
+ return tuple(
163
+ self[k] if k not in ["text_model_output", "vision_model_output"] else getattr(self, k).to_tuple()
164
+ for k in self.keys()
165
+ )
166
+
167
+
168
+ class QLIPVisionEmbeddings(nn.Module):
169
+ def __init__(self, config: QLIPVisionConfig):
170
+ super().__init__()
171
+ self.config = config
172
+ self.embed_dim = config.hidden_size
173
+ self.image_size = config.image_size
174
+ self.patch_size = config.patch_size
175
+
176
+ self.class_embedding = nn.Parameter(torch.randn(self.embed_dim))
177
+
178
+ self.patch_embedding = nn.Conv2d(
179
+ in_channels=config.num_channels,
180
+ out_channels=self.embed_dim,
181
+ kernel_size=self.patch_size,
182
+ stride=self.patch_size,
183
+ bias=True,
184
+ )
185
+
186
+ self.num_patches = (self.image_size // self.patch_size) ** 2
187
+ self.num_positions = self.num_patches + 1
188
+ self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
189
+ self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)
190
+
191
+ def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
192
+ batch_size = pixel_values.shape[0]
193
+ target_dtype = self.patch_embedding.weight.dtype
194
+ patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid]
195
+ patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
196
+
197
+ class_embeds = self.class_embedding.expand(batch_size, 1, -1)
198
+ embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
199
+ embeddings = embeddings + self.position_embedding(self.position_ids)
200
+ return embeddings
201
+
202
+
203
+ class QLIPTextEmbeddings(nn.Module):
204
+ def __init__(self, config: QLIPTextConfig):
205
+ super().__init__()
206
+ embed_dim = config.hidden_size
207
+
208
+ self.token_embedding = nn.Embedding(config.vocab_size, embed_dim)
209
+ self.position_embedding = nn.Embedding(config.max_position_embeddings, embed_dim)
210
+
211
+ # position_ids (1, len position emb) is contiguous in memory and exported when serialized
212
+ self.register_buffer(
213
+ "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
214
+ )
215
+
216
+ def forward(
217
+ self,
218
+ input_ids: Optional[torch.LongTensor] = None,
219
+ position_ids: Optional[torch.LongTensor] = None,
220
+ inputs_embeds: Optional[torch.FloatTensor] = None,
221
+ ) -> torch.Tensor:
222
+ seq_length = input_ids.shape[-1] if input_ids is not None else inputs_embeds.shape[-2]
223
+
224
+ if position_ids is None:
225
+ position_ids = self.position_ids[:, :seq_length]
226
+
227
+ if inputs_embeds is None:
228
+ inputs_embeds = self.token_embedding(input_ids)
229
+
230
+ position_embeddings = self.position_embedding(position_ids)
231
+ embeddings = inputs_embeds + position_embeddings
232
+
233
+ return embeddings
234
+
235
+
236
+ class QLIPAttention(nn.Module):
237
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
238
+
239
+ def __init__(self, config, rope=None, rope_shift=1):
240
+ super().__init__()
241
+ self.config = config
242
+ self.embed_dim = config.hidden_size
243
+ self.num_heads = config.num_attention_heads
244
+ self.head_dim = self.embed_dim // self.num_heads
245
+ if self.head_dim * self.num_heads != self.embed_dim:
246
+ raise ValueError(
247
+ f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
248
+ f" {self.num_heads})."
249
+ )
250
+ self.scale = self.head_dim**-0.5
251
+ self.dropout = config.attention_dropout
252
+
253
+ self.subln = config.subln
254
+ self.k_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.k_bias)
255
+ self.v_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.v_bias)
256
+ self.q_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.q_bias)
257
+ self.inner_attn_ln = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps) if config.subln else nn.Identity()
258
+ self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=True)
259
+
260
+ self.rope = rope
261
+ self.rope_shift = rope_shift
262
+
263
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
264
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
265
+
266
+ def forward(
267
+ self,
268
+ hidden_states: torch.Tensor,
269
+ attention_mask: Optional[torch.Tensor] = None,
270
+ causal_attention_mask: Optional[torch.Tensor] = None,
271
+ output_attentions: Optional[bool] = False,
272
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
273
+ """Input shape: Batch x Time x Channel"""
274
+
275
+ bsz, tgt_len, embed_dim = hidden_states.size()
276
+
277
+ # get query proj
278
+ query_states = self.q_proj(hidden_states) * self.scale
279
+ key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
280
+ value_states = self._shape(self.v_proj(hidden_states), -1, bsz)
281
+
282
+ proj_shape = (bsz * self.num_heads, -1, self.head_dim)
283
+ query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
284
+ key_states = key_states.view(*proj_shape)
285
+ value_states = value_states.view(*proj_shape)
286
+
287
+ if self.rope:
288
+ q_t = query_states[:, self.rope_shift:, :]
289
+ ro_q_t = self.rope(q_t)
290
+ query_states = torch.cat([query_states[:, :self.rope_shift, :], ro_q_t], dim=-2).type_as(value_states)
291
+
292
+ k_t = key_states[:, self.rope_shift:, :]
293
+ ro_k_t = self.rope(k_t)
294
+ key_states = torch.cat([key_states[:, :self.rope_shift, :], ro_k_t], dim=-2).type_as(value_states)
295
+
296
+ src_len = key_states.size(1)
297
+ attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
298
+
299
+ if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
300
+ raise ValueError(
301
+ f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is"
302
+ f" {attn_weights.size()}"
303
+ )
304
+
305
+ # apply the causal_attention_mask first
306
+ if causal_attention_mask is not None:
307
+ if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
308
+ raise ValueError(
309
+ f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is"
310
+ f" {causal_attention_mask.size()}"
311
+ )
312
+ attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + causal_attention_mask
313
+ attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
314
+
315
+ if attention_mask is not None:
316
+ if attention_mask.size() != (bsz, 1, tgt_len, src_len):
317
+ raise ValueError(
318
+ f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
319
+ )
320
+ attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask
321
+ attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
322
+
323
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1)
324
+
325
+ if output_attentions:
326
+ # this operation is a bit awkward, but it's required to
327
+ # make sure that attn_weights keeps its gradient.
328
+ # In order to do so, attn_weights have to be reshaped
329
+ # twice and have to be reused in the following
330
+ attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
331
+ attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len)
332
+ else:
333
+ attn_weights_reshaped = None
334
+
335
+ attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
336
+
337
+ attn_output = torch.bmm(attn_probs, value_states)
338
+
339
+ if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
340
+ raise ValueError(
341
+ f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is"
342
+ f" {attn_output.size()}"
343
+ )
344
+
345
+ attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
346
+ attn_output = attn_output.transpose(1, 2)
347
+ attn_output = attn_output.reshape(bsz, tgt_len, embed_dim)
348
+
349
+ attn_output = self.inner_attn_ln(attn_output)
350
+ attn_output = self.out_proj(attn_output)
351
+
352
+ return attn_output, attn_weights_reshaped
353
+
354
+
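> Note: the `rope_shift` logic above applies rotary position embedding only to the patch tokens and passes the first `rope_shift` positions (the class token in the vision tower) through unchanged. A minimal sketch of that split-and-concat pattern on dummy tensors; the shapes and the identity `fake_rope` are placeholders, not code from this repo:

```python
import torch

# hypothetical: (batch * heads, 1 CLS + 4 patch tokens, head_dim)
q = torch.randn(2, 5, 8)
rope_shift = 1  # number of leading tokens left un-rotated

def fake_rope(x):
    # placeholder for VisionRotaryEmbeddingFast (see rope.py below);
    # a real RoPE mixes channel pairs with position-dependent sin/cos terms
    return x

q_rot = torch.cat([q[:, :rope_shift, :], fake_rope(q[:, rope_shift:, :])], dim=-2)
assert torch.equal(q_rot[:, 0], q[:, 0])  # the CLS slot is untouched
```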
355
+ class QLIPSwiGLU(nn.Module):
356
+ def __init__(self, config):
357
+ super().__init__()
358
+ self.config = config
359
+ self.hidden_size = config.hidden_size
360
+ self.intermediate_size = config.intermediate_size
361
+ self.w1 = nn.Linear(self.hidden_size, self.intermediate_size)
362
+ self.w2 = nn.Linear(self.hidden_size, self.intermediate_size)
363
+ self.w3 = nn.Linear(self.intermediate_size, self.hidden_size)
364
+ self.act_fn = nn.SiLU()
365
+ self.ffn_ln = nn.LayerNorm(self.intermediate_size, eps=config.layer_norm_eps) if config.subln else nn.Identity()
366
+
367
+ def forward(self, x):
368
+ x1 = self.w1(x)
369
+ x2 = self.w2(x)
370
+ hidden = self.act_fn(x1) * x2
371
+ x = self.ffn_ln(hidden)
372
+ x = self.w3(x)
373
+ return x
374
+
375
+
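> Note: `QLIPSwiGLU` above computes `w3(LN(SiLU(w1(x)) * w2(x)))`, a SwiGLU feed-forward with the sub-LayerNorm branch enabled by `config.subln`. A self-contained sketch of the same dataflow with made-up sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, m = 16, 32                      # hypothetical hidden / intermediate sizes
w1, w2, w3 = nn.Linear(d, m), nn.Linear(d, m), nn.Linear(m, d)
ffn_ln = nn.LayerNorm(m)           # the sub-LN branch (config.subln = True)

x = torch.randn(4, d)
y = w3(ffn_ln(F.silu(w1(x)) * w2(x)))  # same order of operations as QLIPSwiGLU.forward
print(y.shape)                          # torch.Size([4, 16])
```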
376
+ class QLIPMLP(nn.Module):
377
+ def __init__(self, config):
378
+ super().__init__()
379
+ self.config = config
380
+ self.activation_fn = ACT2FN[config.hidden_act]
381
+ self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
382
+ self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
383
+ self.ffn_ln = nn.LayerNorm(config.intermediate_size, eps=config.layer_norm_eps) if config.subln else nn.Identity()
384
+
385
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
386
+ hidden_states = self.fc1(hidden_states)
387
+ hidden_states = self.activation_fn(hidden_states)
388
+ hidden_states = self.ffn_ln(hidden_states)
389
+ hidden_states = self.fc2(hidden_states)
390
+ return hidden_states
391
+
392
+
393
+ class QLIPEncoderLayer(nn.Module):
394
+ def __init__(self, config: QLIPConfig, rope=None, rope_shift=1):
395
+ super().__init__()
396
+ self.embed_dim = config.hidden_size
397
+ self.self_attn = QLIPAttention(config, rope=rope, rope_shift=rope_shift)
398
+ self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
399
+ self.mlp = QLIPSwiGLU(config) if config.swiglu else QLIPMLP(config)
400
+ self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
401
+
402
+ def forward(
403
+ self,
404
+ hidden_states: torch.Tensor,
405
+ attention_mask: torch.Tensor,
406
+ causal_attention_mask: torch.Tensor,
407
+ output_attentions: Optional[bool] = False,
408
+ ) -> Tuple[torch.FloatTensor]:
409
+ """
410
+ Args:
411
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
412
+ attention_mask (`torch.FloatTensor`): attention mask of size
413
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
414
+ `(config.encoder_attention_heads,)`.
415
+ output_attentions (`bool`, *optional*):
416
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
417
+ returned tensors for more detail.
418
+ """
419
+ residual = hidden_states
420
+
421
+ hidden_states = self.layer_norm1(hidden_states)
422
+ hidden_states, attn_weights = self.self_attn(
423
+ hidden_states=hidden_states,
424
+ attention_mask=attention_mask,
425
+ causal_attention_mask=causal_attention_mask,
426
+ output_attentions=output_attentions,
427
+ )
428
+ hidden_states = residual + hidden_states
429
+
430
+ residual = hidden_states
431
+ hidden_states = self.layer_norm2(hidden_states)
432
+ hidden_states = self.mlp(hidden_states)
433
+ hidden_states = residual + hidden_states
434
+
435
+ outputs = (hidden_states,)
436
+
437
+ if output_attentions:
438
+ outputs += (attn_weights,)
439
+
440
+ return outputs
441
+
442
+
443
+ class QLIPPreTrainedModel(PreTrainedModel):
444
+ """
445
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
446
+ models.
447
+ """
448
+
449
+ config_class = QLIPConfig
450
+ base_model_prefix = "clip"
451
+ supports_gradient_checkpointing = True
452
+
453
+ def _init_weights(self, module):
454
+ """Initialize the weights"""
455
+ factor = self.config.initializer_factor
456
+ if isinstance(module, QLIPTextEmbeddings):
457
+ module.token_embedding.weight.data.normal_(mean=0.0, std=factor * 0.02)
458
+ module.position_embedding.weight.data.normal_(mean=0.0, std=factor * 0.02)
459
+ elif isinstance(module, QLIPVisionEmbeddings):
460
+ factor = self.config.initializer_factor
461
+ nn.init.normal_(module.class_embedding, mean=0.0, std=module.embed_dim**-0.5 * factor)
462
+ nn.init.normal_(module.patch_embedding.weight, std=module.config.initializer_range * factor)
463
+ nn.init.normal_(module.position_embedding.weight, std=module.config.initializer_range * factor)
464
+ elif isinstance(module, QLIPAttention):
465
+ factor = self.config.initializer_factor
466
+ in_proj_std = (module.embed_dim**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor
467
+ out_proj_std = (module.embed_dim**-0.5) * factor
468
+ nn.init.normal_(module.q_proj.weight, std=in_proj_std)
469
+ nn.init.normal_(module.k_proj.weight, std=in_proj_std)
470
+ nn.init.normal_(module.v_proj.weight, std=in_proj_std)
471
+ nn.init.normal_(module.out_proj.weight, std=out_proj_std)
472
+ elif isinstance(module, QLIPMLP):
473
+ factor = self.config.initializer_factor
474
+ in_proj_std = (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor
475
+ fc_std = (2 * module.config.hidden_size) ** -0.5 * factor
476
+ nn.init.normal_(module.fc1.weight, std=fc_std)
477
+ nn.init.normal_(module.fc2.weight, std=in_proj_std)
478
+ elif isinstance(module, QLIPModel):
479
+ nn.init.normal_(
480
+ module.text_projection.weight,
481
+ std=module.text_embed_dim**-0.5 * self.config.initializer_factor,
482
+ )
483
+ nn.init.normal_(
484
+ module.visual_projection.weight,
485
+ std=module.vision_embed_dim**-0.5 * self.config.initializer_factor,
486
+ )
487
+ elif isinstance(module, QLIPVisionModelWithProjection):
488
+ nn.init.normal_(
489
+ module.visual_projection.weight,
490
+ std=self.config.hidden_size**-0.5 * self.config.initializer_factor,
491
+ )
492
+ elif isinstance(module, QLIPTextModelWithProjection):
493
+ nn.init.normal_(
494
+ module.text_projection.weight,
495
+ std=self.config.hidden_size**-0.5 * self.config.initializer_factor,
496
+ )
497
+
498
+ if isinstance(module, nn.LayerNorm):
499
+ module.bias.data.zero_()
500
+ module.weight.data.fill_(1.0)
501
+ if isinstance(module, nn.Linear) and module.bias is not None:
502
+ module.bias.data.zero_()
503
+
504
+
505
+ CLIP_START_DOCSTRING = r"""
506
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
507
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
508
+ etc.)
509
+
510
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
511
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
512
+ and behavior.
513
+
514
+ Parameters:
515
+ config ([`QLIPConfig`]): Model configuration class with all the parameters of the model.
516
+ Initializing with a config file does not load the weights associated with the model, only the
517
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
518
+ """
519
+
520
+ CLIP_TEXT_INPUTS_DOCSTRING = r"""
521
+ Args:
522
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
523
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
524
+ it.
525
+
526
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
527
+ [`PreTrainedTokenizer.__call__`] for details.
528
+
529
+ [What are input IDs?](../glossary#input-ids)
530
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
531
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
532
+
533
+ - 1 for tokens that are **not masked**,
534
+ - 0 for tokens that are **masked**.
535
+
536
+ [What are attention masks?](../glossary#attention-mask)
537
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
538
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
539
+ config.max_position_embeddings - 1]`.
540
+
541
+ [What are position IDs?](../glossary#position-ids)
542
+ output_attentions (`bool`, *optional*):
543
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
544
+ tensors for more detail.
545
+ output_hidden_states (`bool`, *optional*):
546
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
547
+ more detail.
548
+ return_dict (`bool`, *optional*):
549
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
550
+ """
551
+
552
+ CLIP_VISION_INPUTS_DOCSTRING = r"""
553
+ Args:
554
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
555
+ Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
556
+ [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
557
+ output_attentions (`bool`, *optional*):
558
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
559
+ tensors for more detail.
560
+ output_hidden_states (`bool`, *optional*):
561
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
562
+ more detail.
563
+ return_dict (`bool`, *optional*):
564
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
565
+ """
566
+
567
+ CLIP_INPUTS_DOCSTRING = r"""
568
+ Args:
569
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
570
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
571
+ it.
572
+
573
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
574
+ [`PreTrainedTokenizer.__call__`] for details.
575
+
576
+ [What are input IDs?](../glossary#input-ids)
577
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
578
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
579
+
580
+ - 1 for tokens that are **not masked**,
581
+ - 0 for tokens that are **masked**.
582
+
583
+ [What are attention masks?](../glossary#attention-mask)
584
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
585
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
586
+ config.max_position_embeddings - 1]`.
587
+
588
+ [What are position IDs?](../glossary#position-ids)
589
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
590
+ Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
591
+ [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
592
+ return_loss (`bool`, *optional*):
593
+ Whether or not to return the contrastive loss.
594
+ output_attentions (`bool`, *optional*):
595
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
596
+ tensors for more detail.
597
+ output_hidden_states (`bool`, *optional*):
598
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
599
+ more detail.
600
+ return_dict (`bool`, *optional*):
601
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
602
+ """
603
+
604
+
605
+ class QLIPEncoder(nn.Module):
606
+ """
607
+ Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
608
+ [`QLIPEncoderLayer`].
609
+
610
+ Args:
611
+ config: QLIPConfig
612
+ """
613
+
614
+ def __init__(self, config: QLIPConfig, rope=None, rope_shift=1):
615
+ super().__init__()
616
+ self.config = config
617
+ self.layers = nn.ModuleList([
618
+ QLIPEncoderLayer(config, rope=rope, rope_shift=rope_shift)
619
+ for _ in range(config.num_hidden_layers)
620
+ ])
621
+ self.gradient_checkpointing = False
622
+
623
+ def forward(
624
+ self,
625
+ inputs_embeds,
626
+ attention_mask: Optional[torch.Tensor] = None,
627
+ causal_attention_mask: Optional[torch.Tensor] = None,
628
+ output_attentions: Optional[bool] = None,
629
+ output_hidden_states: Optional[bool] = None,
630
+ return_dict: Optional[bool] = None,
631
+ ) -> Union[Tuple, BaseModelOutput]:
632
+ r"""
633
+ Args:
634
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
635
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
636
+ This is useful if you want more control over how to convert `input_ids` indices into associated vectors
637
+ than the model's internal embedding lookup matrix.
638
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
639
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
640
+
641
+ - 1 for tokens that are **not masked**,
642
+ - 0 for tokens that are **masked**.
643
+
644
+ [What are attention masks?](../glossary#attention-mask)
645
+ causal_attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
646
+ Causal mask for the text model. Mask values selected in `[0, 1]`:
647
+
648
+ - 1 for tokens that are **not masked**,
649
+ - 0 for tokens that are **masked**.
650
+
651
+ [What are attention masks?](../glossary#attention-mask)
652
+ output_attentions (`bool`, *optional*):
653
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
654
+ returned tensors for more detail.
655
+ output_hidden_states (`bool`, *optional*):
656
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
657
+ for more detail.
658
+ return_dict (`bool`, *optional*):
659
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
660
+ """
661
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
662
+ output_hidden_states = (
663
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
664
+ )
665
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
666
+
667
+ encoder_states = () if output_hidden_states else None
668
+ all_attentions = () if output_attentions else None
669
+
670
+ hidden_states = inputs_embeds
671
+ for idx, encoder_layer in enumerate(self.layers):
672
+ if output_hidden_states:
673
+ encoder_states = encoder_states + (hidden_states,)
674
+ if self.gradient_checkpointing and self.training:
675
+ layer_outputs = self._gradient_checkpointing_func(
676
+ encoder_layer.__call__,
677
+ hidden_states,
678
+ attention_mask,
679
+ causal_attention_mask,
680
+ output_attentions,
681
+ )
682
+ else:
683
+ layer_outputs = encoder_layer(
684
+ hidden_states,
685
+ attention_mask,
686
+ causal_attention_mask,
687
+ output_attentions=output_attentions,
688
+ )
689
+
690
+ hidden_states = layer_outputs[0]
691
+
692
+ if output_attentions:
693
+ all_attentions = all_attentions + (layer_outputs[1],)
694
+
695
+ if output_hidden_states:
696
+ encoder_states = encoder_states + (hidden_states,)
697
+
698
+ if not return_dict:
699
+ return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
700
+ return BaseModelOutput(
701
+ last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions
702
+ )
703
+
704
+
705
+ class QLIPTextTransformer(nn.Module):
706
+ def __init__(self, config: QLIPTextConfig):
707
+ super().__init__()
708
+ self.config = config
709
+ embed_dim = config.hidden_size
710
+ self.embeddings = QLIPTextEmbeddings(config)
711
+ self.encoder = QLIPEncoder(config)
712
+ self.final_layer_norm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
713
+
714
+ # For `pooled_output` computation
715
+ self.eos_token_id = config.eos_token_id
716
+
717
+ @add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING)
718
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=QLIPTextConfig)
719
+ def forward(
720
+ self,
721
+ input_ids: Optional[torch.Tensor] = None,
722
+ attention_mask: Optional[torch.Tensor] = None,
723
+ position_ids: Optional[torch.Tensor] = None,
724
+ output_attentions: Optional[bool] = None,
725
+ output_hidden_states: Optional[bool] = None,
726
+ return_dict: Optional[bool] = None,
727
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
728
+ r"""
729
+ Returns:
730
+
731
+ """
732
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
733
+ output_hidden_states = (
734
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
735
+ )
736
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
737
+
738
+ if input_ids is None:
739
+ raise ValueError("You have to specify input_ids")
740
+
741
+ input_shape = input_ids.size()
742
+ input_ids = input_ids.view(-1, input_shape[-1])
743
+
744
+ hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
745
+
746
+ # CLIP's text model uses causal mask, prepare it here.
747
+ # https://github.com/openai/CLIP/blob/cfcffb90e69f37bf2ff1e988237a0fbe41f33c04/clip/model.py#L324
748
+ causal_attention_mask = _create_4d_causal_attention_mask(
749
+ input_shape, hidden_states.dtype, device=hidden_states.device
750
+ )
751
+ # expand attention_mask
752
+ if attention_mask is not None:
753
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
754
+ attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
755
+
756
+ encoder_outputs = self.encoder(
757
+ inputs_embeds=hidden_states,
758
+ attention_mask=attention_mask,
759
+ causal_attention_mask=causal_attention_mask,
760
+ output_attentions=output_attentions,
761
+ output_hidden_states=output_hidden_states,
762
+ return_dict=return_dict,
763
+ )
764
+
765
+ last_hidden_state = encoder_outputs[0]
766
+ last_hidden_state = self.final_layer_norm(last_hidden_state)
767
+
768
+ if self.eos_token_id == 2:
769
+ # The `eos_token_id` was incorrect before PR #24773: let's keep what has been done here.
770
+ # A CLIP model with such `eos_token_id` in the config can't work correctly with extra new tokens added
771
+ # ------------------------------------------------------------
772
+ # text_embeds.shape = [batch_size, sequence_length, transformer.width]
773
+ # take features from the eot embedding (eot_token is the highest number in each sequence)
774
+ # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14
775
+ pooled_output = last_hidden_state[
776
+ torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device),
777
+ input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1),
778
+ ]
779
+ else:
780
+ # The config gets updated `eos_token_id` from PR #24773 (so the use of extra new tokens is possible)
781
+ pooled_output = last_hidden_state[
782
+ torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device),
783
+ # We need to get the first position of the `eos_token_id` value (`pad_token_id` might be equal to `eos_token_id`)
784
+ (input_ids.to(dtype=torch.int, device=last_hidden_state.device) == self.eos_token_id)
785
+ .int()
786
+ .argmax(dim=-1),
787
+ ]
788
+
789
+ if not return_dict:
790
+ return (last_hidden_state, pooled_output) + encoder_outputs[1:]
791
+
792
+ return BaseModelOutputWithPooling(
793
+ last_hidden_state=last_hidden_state,
794
+ pooler_output=pooled_output,
795
+ hidden_states=encoder_outputs.hidden_states,
796
+ attentions=encoder_outputs.attentions,
797
+ )
798
+
799
+
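> Note: the pooling above selects, for each sequence, the hidden state at the end-of-text position: either the `argmax` over token ids (legacy `eos_token_id == 2` path) or the first position equal to `eos_token_id`. A small standalone illustration of the second branch; the token ids are made up, only the EOS id matters here:

```python
import torch

eos_token_id = 49407                       # CLIP's usual <|endoftext|> id
input_ids = torch.tensor([[49406, 11, 22, 33, 49407, 49407, 49407]])  # dummy ids, padded with EOS
last_hidden_state = torch.randn(1, 7, 4)

# first position where the token equals eos_token_id (argmax returns the first maximum)
eos_pos = (input_ids == eos_token_id).int().argmax(dim=-1)
pooled = last_hidden_state[torch.arange(input_ids.shape[0]), eos_pos]
print(eos_pos.item(), pooled.shape)        # 4 torch.Size([1, 4])
```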
800
+ @add_start_docstrings(
801
+ """The text model from CLIP without any head or projection on top.""",
802
+ CLIP_START_DOCSTRING,
803
+ )
804
+ class QLIPTextModel(QLIPPreTrainedModel):
805
+ config_class = QLIPTextConfig
806
+
807
+ _no_split_modules = ["QLIPTextEmbeddings", "QLIPEncoderLayer"]
808
+
809
+ def __init__(self, config: QLIPTextConfig):
810
+ super().__init__(config)
811
+ self.text_model = QLIPTextTransformer(config)
812
+ # Initialize weights and apply final processing
813
+ self.post_init()
814
+
815
+ def get_input_embeddings(self) -> nn.Module:
816
+ return self.text_model.embeddings.token_embedding
817
+
818
+ def set_input_embeddings(self, value):
819
+ self.text_model.embeddings.token_embedding = value
820
+
821
+ @add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING)
822
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=QLIPTextConfig)
823
+ def forward(
824
+ self,
825
+ input_ids: Optional[torch.Tensor] = None,
826
+ attention_mask: Optional[torch.Tensor] = None,
827
+ position_ids: Optional[torch.Tensor] = None,
828
+ output_attentions: Optional[bool] = None,
829
+ output_hidden_states: Optional[bool] = None,
830
+ return_dict: Optional[bool] = None,
831
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
832
+ r"""
833
+ Returns:
834
+
835
+ Examples:
836
+
837
+ ```python
838
+ >>> from transformers import AutoTokenizer, CLIPTextModel
839
+
840
+ >>> model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
841
+ >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
842
+
843
+ >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
844
+
845
+ >>> outputs = model(**inputs)
846
+ >>> last_hidden_state = outputs.last_hidden_state
847
+ >>> pooled_output = outputs.pooler_output # pooled (EOS token) states
848
+ ```"""
849
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
850
+
851
+ return self.text_model(
852
+ input_ids=input_ids,
853
+ attention_mask=attention_mask,
854
+ position_ids=position_ids,
855
+ output_attentions=output_attentions,
856
+ output_hidden_states=output_hidden_states,
857
+ return_dict=return_dict,
858
+ )
859
+
860
+
861
+ class QLIPVisionTransformer(nn.Module):
862
+ def __init__(self, config: QLIPVisionConfig):
863
+ super().__init__()
864
+ self.config = config
865
+ embed_dim = config.hidden_size
866
+
867
+ self.embeddings = QLIPVisionEmbeddings(config)
868
+ if config.rope:
869
+ half_head_dim = config.hidden_size // config.num_attention_heads // 2
870
+ hw_seq_len = config.image_size // config.patch_size
871
+ self.rope = VisionRotaryEmbeddingFast(
872
+ dim=half_head_dim,
873
+ pt_seq_len=16,
874
+ ft_seq_len=hw_seq_len,
875
+ )
876
+ else:
877
+ self.rope = None
878
+ self.encoder = QLIPEncoder(config, rope=self.rope, rope_shift=1)
879
+ self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
880
+
881
+ if config.quantizer == "bsq":
882
+ self.quantizer = BinarySphericalQuantizer(**config.quantizer_cfg)
883
+ self.quantizer_l2_norm = config.quantizer_l2_norm
884
+ if config.quantizer_embed_type == "mlp":
885
+ self.quant_embed = nn.Sequential(
886
+ OrderedDict(
887
+ [
888
+ ("c_fc", nn.Linear(config.hidden_size, config.hidden_size)),
889
+ ("gelu", nn.GELU()),
890
+ ("c_proj", nn.Linear(config.hidden_size, config.quantizer_cfg["embed_dim"])),
891
+ ]
892
+ )
893
+ )
894
+ self.quant_embed_post = nn.Sequential(
895
+ OrderedDict(
896
+ [
897
+ ("c_fc", nn.Linear(config.quantizer_cfg["embed_dim"], config.hidden_size_post_q)),
898
+ ("gelu", nn.GELU()),
899
+ ("c_proj", nn.Linear(config.hidden_size_post_q, config.hidden_size_post_q)),
900
+ ]
901
+ )
902
+ )
903
+ else:
904
+ self.quant_embed = nn.Identity()
905
+ self.quant_embed_post = nn.Identity()
906
+
907
+ @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING)
908
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=QLIPVisionConfig)
909
+ def forward(
910
+ self,
911
+ pixel_values: Optional[torch.FloatTensor] = None,
912
+ output_attentions: Optional[bool] = None,
913
+ output_hidden_states: Optional[bool] = None,
914
+ return_dict: Optional[bool] = None,
915
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
916
+ r"""
917
+ Returns:
918
+
919
+ """
920
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
921
+ output_hidden_states = (
922
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
923
+ )
924
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
925
+
926
+ if pixel_values is None:
927
+ raise ValueError("You have to specify pixel_values")
928
+
929
+ hidden_states = self.embeddings(pixel_values)
930
+
931
+ encoder_outputs = self.encoder(
932
+ inputs_embeds=hidden_states,
933
+ output_attentions=output_attentions,
934
+ output_hidden_states=output_hidden_states,
935
+ return_dict=return_dict,
936
+ )
937
+
938
+ last_hidden_state = encoder_outputs[0]
939
+ pooled_output = last_hidden_state[:, 0, :]
940
+ z = last_hidden_state[:, 1:, :]
941
+ h = self.quant_embed(z)
942
+ if self.quantizer_l2_norm:
943
+ h = F.normalize(h, dim=-1)
944
+ if self.quantizer is not None:
945
+ quant, _, _ = self.quantizer(h)
946
+ else:
947
+ quant = h
948
+ zhat = self.quant_embed_post(quant)
949
+ last_hidden_state = zhat
950
+ pooled_output = self.post_layernorm(pooled_output)
951
+
952
+ if not return_dict:
953
+ return (last_hidden_state, pooled_output) + encoder_outputs[1:]
954
+
955
+ return BaseModelOutputWithPooling(
956
+ last_hidden_state=last_hidden_state,
957
+ pooler_output=pooled_output,
958
+ hidden_states=encoder_outputs.hidden_states,
959
+ attentions=encoder_outputs.attentions,
960
+ )
961
+
962
+
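> Note: the vision tower above keeps the CLS token as the (layer-normalized) pooled output for the alignment objective, while the patch tokens go through `quant_embed`, optional L2 normalization, the binary spherical quantizer, and `quant_embed_post`, and are returned as `last_hidden_state` for the decoder. The quantizer itself is implemented in `bsq.py` in this commit; conceptually, each L2-normalized latent dimension is snapped to ±1/√K. The sketch below only imitates that idea and does not use the repo's `BinarySphericalQuantizer`; the code dimension `K` is a hypothetical choice:

```python
import torch
import torch.nn.functional as F

K = 28                                           # hypothetical code dimension
h = F.normalize(torch.randn(2, 4, K), dim=-1)    # L2-normalized patch latents
quant = torch.sign(h) / K ** 0.5                 # one of 2^K points on the unit sphere
codes = (h > 0).long()                           # K-bit code per token
print(quant.norm(dim=-1))                        # ~1.0 everywhere: still on the sphere
```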
963
+ class QLIPVisionTransformerDecoder(nn.Module):
964
+ def __init__(self, config: QLIPDecoderConfig):
965
+ super().__init__()
966
+ self.config = config
967
+ embed_dim = config.hidden_size
968
+
969
+ num_patches = (config.image_size // config.patch_size) ** 2
970
+ self.patch_shape = (config.image_size // config.patch_size, config.image_size // config.patch_size)
971
+ self.position_embeddings = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
972
+ if config.rope:
973
+ half_head_dim = config.hidden_size // config.num_attention_heads // 2
974
+ hw_seq_len = config.image_size // config.patch_size
975
+ self.rope = VisionRotaryEmbeddingFast(
976
+ dim=half_head_dim,
977
+ pt_seq_len=16,
978
+ ft_seq_len=hw_seq_len,
979
+ )
980
+ else:
981
+ self.rope = None
982
+ self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
983
+ self.encoder = QLIPEncoder(config, rope=self.rope, rope_shift=0)
984
+ self.ffn = nn.Sequential(
985
+ nn.Linear(config.hidden_size, config.intermediate_size),
986
+ nn.Tanh(),
987
+ )
988
+ self.conv_out = nn.Linear(
989
+ in_features=config.intermediate_size,
990
+ out_features=3 * config.patch_size * config.patch_size,
991
+ )
992
+
993
+ @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING)
994
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=QLIPVisionConfig)
995
+ def forward(
996
+ self,
997
+ latents: Optional[torch.FloatTensor] = None,
998
+ output_attentions: Optional[bool] = None,
999
+ output_hidden_states: Optional[bool] = None,
1000
+ return_dict: Optional[bool] = None,
1001
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
1002
+ r"""
1003
+ Returns:
1004
+
1005
+ """
1006
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1007
+ output_hidden_states = (
1008
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1009
+ )
1010
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1011
+
1012
+ if latents is None:
1013
+ raise ValueError("You have to specify latents")
1014
+
1015
+ hidden_states = self.position_embeddings + latents
1016
+
1017
+ decoder_outputs = self.encoder(
1018
+ inputs_embeds=hidden_states,
1019
+ output_attentions=output_attentions,
1020
+ output_hidden_states=output_hidden_states,
1021
+ return_dict=return_dict,
1022
+ )
1023
+
1024
+ last_hidden_state = decoder_outputs[0]
1025
+ recon = self.conv_out(self.ffn(self.norm(last_hidden_state)))
1026
+ recon_reshaped = rearrange(
1027
+ recon, "b (hh ww) (c sh sw) -> b c (hh sh) (ww sw)",
1028
+ hh=self.patch_shape[0], ww=self.patch_shape[1],
1029
+ sh=self.config.patch_size, sw=self.config.patch_size,
1030
+ )
1031
+ return recon_reshaped
1032
+
1033
+
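> Note: the decoder above maps each token back to a `3 × patch_size × patch_size` pixel block, and the final `rearrange` folds the patch grid back into an image. A quick shape check of that pattern with made-up sizes (a 4×4 grid of 16×16 patches):

```python
import torch
from einops import rearrange

hh = ww = 4      # patch grid
p = 16           # patch size
recon = torch.randn(2, hh * ww, 3 * p * p)   # (batch, num_patches, c*p*p), as produced by conv_out
img = rearrange(
    recon, "b (hh ww) (c sh sw) -> b c (hh sh) (ww sw)",
    hh=hh, ww=ww, sh=p, sw=p,
)
print(img.shape)  # torch.Size([2, 3, 64, 64])
```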
1034
+ @add_start_docstrings(
1035
+ """The vision model from CLIP without any head or projection on top.""",
1036
+ CLIP_START_DOCSTRING,
1037
+ )
1038
+ class QLIPVisionModel(QLIPPreTrainedModel):
1039
+ config_class = QLIPVisionConfig
1040
+ main_input_name = "pixel_values"
1041
+ _no_split_modules = ["QLIPEncoderLayer"]
1042
+
1043
+ def __init__(self, config: QLIPVisionConfig):
1044
+ super().__init__(config)
1045
+ self.vision_model = QLIPVisionTransformer(config)
1046
+ # Initialize weights and apply final processing
1047
+ self.post_init()
1048
+
1049
+ def get_input_embeddings(self) -> nn.Module:
1050
+ return self.vision_model.embeddings.patch_embedding
1051
+
1052
+ @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING)
1053
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=QLIPVisionConfig)
1054
+ def forward(
1055
+ self,
1056
+ pixel_values: Optional[torch.FloatTensor] = None,
1057
+ output_attentions: Optional[bool] = None,
1058
+ output_hidden_states: Optional[bool] = None,
1059
+ return_dict: Optional[bool] = None,
1060
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
1061
+ r"""
1062
+ Returns:
1063
+
1064
+ Examples:
1065
+
1066
+ ```python
1067
+ >>> from PIL import Image
1068
+ >>> import requests
1069
+ >>> from transformers import AutoProcessor, CLIPVisionModel
1070
+
1071
+ >>> model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
1072
+ >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
1073
+
1074
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
1075
+ >>> image = Image.open(requests.get(url, stream=True).raw)
1076
+
1077
+ >>> inputs = processor(images=image, return_tensors="pt")
1078
+
1079
+ >>> outputs = model(**inputs)
1080
+ >>> last_hidden_state = outputs.last_hidden_state
1081
+ >>> pooled_output = outputs.pooler_output # pooled CLS states
1082
+ ```"""
1083
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1084
+
1085
+ return self.vision_model(
1086
+ pixel_values=pixel_values,
1087
+ output_attentions=output_attentions,
1088
+ output_hidden_states=output_hidden_states,
1089
+ return_dict=return_dict,
1090
+ )
1091
+
1092
+
1093
+ @add_start_docstrings(CLIP_START_DOCSTRING)
1094
+ class QLIPModel(QLIPPreTrainedModel):
1095
+ config_class = QLIPConfig
1096
+
1097
+ def __init__(self, config: QLIPConfig):
1098
+ super().__init__(config)
1099
+
1100
+ if not isinstance(config.text_config, QLIPTextConfig):
1101
+ raise ValueError(
1102
+ "config.text_config is expected to be of type CLIPTextConfig but is of type"
1103
+ f" {type(config.text_config)}."
1104
+ )
1105
+
1106
+ if not isinstance(config.vision_config, QLIPVisionConfig):
1107
+ raise ValueError(
1108
+ "config.vision_config is expected to be of type CLIPVisionConfig but is of type"
1109
+ f" {type(config.vision_config)}."
1110
+ )
1111
+
1112
+ text_config = config.text_config
1113
+ vision_config = config.vision_config
1114
+ decoder_config = config.decoder_config
1115
+
1116
+ self.projection_dim = config.projection_dim
1117
+ self.text_embed_dim = text_config.hidden_size
1118
+ self.vision_embed_dim = vision_config.hidden_size
1119
+
1120
+ self.text_model = QLIPTextTransformer(text_config)
1121
+ self.vision_model = QLIPVisionTransformer(vision_config)
1122
+ self.vision_decoder = QLIPVisionTransformerDecoder(decoder_config)
1123
+
1124
+ self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=config.vision_projection_bias)
1125
+ self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=config.text_projection_bias)
1126
+ self.logit_scale = nn.Parameter(torch.tensor(self.config.logit_scale_init_value))
1127
+
1128
+ # Initialize weights and apply final processing
1129
+ self.post_init()
1130
+
1131
+ @add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING)
1132
+ def get_text_features(
1133
+ self,
1134
+ input_ids: Optional[torch.Tensor] = None,
1135
+ attention_mask: Optional[torch.Tensor] = None,
1136
+ position_ids: Optional[torch.Tensor] = None,
1137
+ output_attentions: Optional[bool] = None,
1138
+ output_hidden_states: Optional[bool] = None,
1139
+ return_dict: Optional[bool] = None,
1140
+ ) -> torch.FloatTensor:
1141
+ r"""
1142
+ Returns:
1143
+ text_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
1144
+ applying the projection layer to the pooled output of [`CLIPTextModel`].
1145
+
1146
+ Examples:
1147
+
1148
+ ```python
1149
+ >>> from transformers import AutoTokenizer, CLIPModel
1150
+
1151
+ >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
1152
+ >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
1153
+
1154
+ >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
1155
+ >>> text_features = model.get_text_features(**inputs)
1156
+ ```"""
1157
+ # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
1158
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1159
+ output_hidden_states = (
1160
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1161
+ )
1162
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1163
+
1164
+ text_outputs = self.text_model(
1165
+ input_ids=input_ids,
1166
+ attention_mask=attention_mask,
1167
+ position_ids=position_ids,
1168
+ output_attentions=output_attentions,
1169
+ output_hidden_states=output_hidden_states,
1170
+ return_dict=return_dict,
1171
+ )
1172
+
1173
+ pooled_output = text_outputs[1]
1174
+ text_features = self.text_projection(pooled_output)
1175
+
1176
+ return text_features
1177
+
1178
+ @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING)
1179
+ def get_image_features(
1180
+ self,
1181
+ pixel_values: Optional[torch.FloatTensor] = None,
1182
+ output_attentions: Optional[bool] = None,
1183
+ output_hidden_states: Optional[bool] = None,
1184
+ return_dict: Optional[bool] = None,
1185
+ ) -> torch.FloatTensor:
1186
+ r"""
1187
+ Returns:
1188
+ image_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
1189
+ applying the projection layer to the pooled output of [`CLIPVisionModel`].
1190
+
1191
+ Examples:
1192
+
1193
+ ```python
1194
+ >>> from PIL import Image
1195
+ >>> import requests
1196
+ >>> from transformers import AutoProcessor, CLIPModel
1197
+
1198
+ >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
1199
+ >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
1200
+
1201
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
1202
+ >>> image = Image.open(requests.get(url, stream=True).raw)
1203
+
1204
+ >>> inputs = processor(images=image, return_tensors="pt")
1205
+
1206
+ >>> image_features = model.get_image_features(**inputs)
1207
+ ```"""
1208
+ # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
1209
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1210
+ output_hidden_states = (
1211
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1212
+ )
1213
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1214
+
1215
+ vision_outputs = self.vision_model(
1216
+ pixel_values=pixel_values,
1217
+ output_attentions=output_attentions,
1218
+ output_hidden_states=output_hidden_states,
1219
+ return_dict=return_dict,
1220
+ )
1221
+
1222
+ pooled_output = vision_outputs[1] # pooled_output
1223
+ image_features = self.visual_projection(pooled_output)
1224
+
1225
+ return image_features
1226
+
1227
+ @add_start_docstrings_to_model_forward(CLIP_INPUTS_DOCSTRING)
1228
+ @replace_return_docstrings(output_type=QLIPOutput, config_class=QLIPConfig)
1229
+ def forward(
1230
+ self,
1231
+ input_ids: Optional[torch.LongTensor] = None,
1232
+ pixel_values: Optional[torch.FloatTensor] = None,
1233
+ attention_mask: Optional[torch.Tensor] = None,
1234
+ position_ids: Optional[torch.LongTensor] = None,
1235
+ return_loss: Optional[bool] = None,
1236
+ output_attentions: Optional[bool] = None,
1237
+ output_hidden_states: Optional[bool] = None,
1238
+ return_dict: Optional[bool] = None,
1239
+ ) -> Union[Tuple, QLIPOutput]:
1240
+ r"""
1241
+ Returns:
1242
+
1243
+ Examples:
1244
+
1245
+ ```python
1246
+ >>> from PIL import Image
1247
+ >>> import requests
1248
+ >>> from transformers import AutoProcessor, CLIPModel
1249
+
1250
+ >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
1251
+ >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
1252
+
1253
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
1254
+ >>> image = Image.open(requests.get(url, stream=True).raw)
1255
+
1256
+ >>> inputs = processor(
1257
+ ... text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
1258
+ ... )
1259
+
1260
+ >>> outputs = model(**inputs)
1261
+ >>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
1262
+ >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
1263
+ ```"""
1264
+ # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
1265
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1266
+ output_hidden_states = (
1267
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1268
+ )
1269
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1270
+
1271
+ vision_outputs = self.vision_model(
1272
+ pixel_values=pixel_values,
1273
+ output_attentions=output_attentions,
1274
+ output_hidden_states=output_hidden_states,
1275
+ return_dict=return_dict,
1276
+ )
1277
+
1278
+ text_outputs = self.text_model(
1279
+ input_ids=input_ids,
1280
+ attention_mask=attention_mask,
1281
+ position_ids=position_ids,
1282
+ output_attentions=output_attentions,
1283
+ output_hidden_states=output_hidden_states,
1284
+ return_dict=return_dict,
1285
+ )
1286
+
1287
+ image_embeds = vision_outputs[1]
1288
+ image_embeds = self.visual_projection(image_embeds)
1289
+
1290
+ text_embeds = text_outputs[1]
1291
+ text_embeds = self.text_projection(text_embeds)
1292
+
1293
+ last_hidden_state = vision_outputs[0]
1294
+ recon = self.vision_decoder(last_hidden_state)
1295
+
1296
+ # normalized features
1297
+ image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
1298
+ text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
1299
+
1300
+ # cosine similarity as logits
1301
+ logit_scale = self.logit_scale.exp()
1302
+ logits_per_text = torch.matmul(text_embeds, image_embeds.t()) * logit_scale
1303
+ logits_per_image = logits_per_text.t()
1304
+
1305
+ loss = None
1306
+ if return_loss:
1307
+ loss = clip_loss(logits_per_text)
1308
+
1309
+ if not return_dict:
1310
+ output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
1311
+ return ((loss,) + output) if loss is not None else output
1312
+
1313
+ return QLIPOutput(
1314
+ loss=loss,
1315
+ logits_per_image=logits_per_image,
1316
+ logits_per_text=logits_per_text,
1317
+ text_embeds=text_embeds,
1318
+ image_embeds=image_embeds,
1319
+ text_model_output=text_outputs,
1320
+ vision_model_output=vision_outputs,
1321
+ reconstructions=recon,
1322
+ )
1323
+
1324
+
1325
+ @add_start_docstrings(
1326
+ """
1327
+ CLIP Text Model with a projection layer on top (a linear layer on top of the pooled output).
1328
+ """,
1329
+ CLIP_START_DOCSTRING,
1330
+ )
1331
+ class QLIPTextModelWithProjection(QLIPPreTrainedModel):
1332
+ config_class = QLIPTextConfig
1333
+
1334
+ _no_split_modules = ["QLIPTextEmbeddings", "QLIPEncoderLayer"]
1335
+
1336
+ def __init__(self, config: QLIPTextConfig):
1337
+ super().__init__(config)
1338
+
1339
+ self.text_model = QLIPTextTransformer(config)
1340
+
1341
+ self.text_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)
1342
+
1343
+ # Initialize weights and apply final processing
1344
+ self.post_init()
1345
+
1346
+ def get_input_embeddings(self) -> nn.Module:
1347
+ return self.text_model.embeddings.token_embedding
1348
+
1349
+ def set_input_embeddings(self, value):
1350
+ self.text_model.embeddings.token_embedding = value
1351
+
1352
+ @add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING)
1353
+ @replace_return_docstrings(output_type=QLIPTextModelOutput, config_class=QLIPTextConfig)
1354
+ def forward(
1355
+ self,
1356
+ input_ids: Optional[torch.Tensor] = None,
1357
+ attention_mask: Optional[torch.Tensor] = None,
1358
+ position_ids: Optional[torch.Tensor] = None,
1359
+ output_attentions: Optional[bool] = None,
1360
+ output_hidden_states: Optional[bool] = None,
1361
+ return_dict: Optional[bool] = None,
1362
+ ) -> Union[Tuple, QLIPTextModelOutput]:
1363
+ r"""
1364
+ Returns:
1365
+
1366
+ Examples:
1367
+
1368
+ ```python
1369
+ >>> from transformers import AutoTokenizer, CLIPTextModelWithProjection
1370
+
1371
+ >>> model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
1372
+ >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
1373
+
1374
+ >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
1375
+
1376
+ >>> outputs = model(**inputs)
1377
+ >>> text_embeds = outputs.text_embeds
1378
+ ```"""
1379
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1380
+
1381
+ text_outputs = self.text_model(
1382
+ input_ids=input_ids,
1383
+ attention_mask=attention_mask,
1384
+ position_ids=position_ids,
1385
+ output_attentions=output_attentions,
1386
+ output_hidden_states=output_hidden_states,
1387
+ return_dict=return_dict,
1388
+ )
1389
+
1390
+ pooled_output = text_outputs[1]
1391
+
1392
+ text_embeds = self.text_projection(pooled_output)
1393
+
1394
+ if not return_dict:
1395
+ outputs = (text_embeds, text_outputs[0]) + text_outputs[2:]
1396
+ return tuple(output for output in outputs if output is not None)
1397
+
1398
+ return QLIPTextModelOutput(
1399
+ text_embeds=text_embeds,
1400
+ last_hidden_state=text_outputs.last_hidden_state,
1401
+ hidden_states=text_outputs.hidden_states,
1402
+ attentions=text_outputs.attentions,
1403
+ )
1404
+
1405
+
1406
+ @add_start_docstrings(
1407
+ """
1408
+ CLIP Vision Model with a projection layer on top (a linear layer on top of the pooled output).
1409
+ """,
1410
+ CLIP_START_DOCSTRING,
1411
+ )
1412
+ class QLIPVisionModelWithProjection(QLIPPreTrainedModel):
1413
+ config_class = QLIPVisionConfig
1414
+ main_input_name = "pixel_values"
1415
+
1416
+ def __init__(self, config: QLIPVisionConfig):
1417
+ super().__init__(config)
1418
+
1419
+ self.vision_model = QLIPVisionTransformer(config)
1420
+
1421
+ self.visual_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)
1422
+
1423
+ # Initialize weights and apply final processing
1424
+ self.post_init()
1425
+
1426
+ def get_input_embeddings(self) -> nn.Module:
1427
+ return self.vision_model.embeddings.patch_embedding
1428
+
1429
+ @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING)
1430
+ @replace_return_docstrings(output_type=QLIPVisionModelOutput, config_class=QLIPVisionConfig)
1431
+ def forward(
1432
+ self,
1433
+ pixel_values: Optional[torch.FloatTensor] = None,
1434
+ output_attentions: Optional[bool] = None,
1435
+ output_hidden_states: Optional[bool] = None,
1436
+ return_dict: Optional[bool] = None,
1437
+ ) -> Union[Tuple, QLIPVisionModelOutput]:
1438
+ r"""
1439
+ Returns:
1440
+
1441
+ Examples:
1442
+
1443
+ ```python
1444
+ >>> from PIL import Image
1445
+ >>> import requests
1446
+ >>> from transformers import AutoProcessor, CLIPVisionModelWithProjection
1447
+
1448
+ >>> model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
1449
+ >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
1450
+
1451
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
1452
+ >>> image = Image.open(requests.get(url, stream=True).raw)
1453
+
1454
+ >>> inputs = processor(images=image, return_tensors="pt")
1455
+
1456
+ >>> outputs = model(**inputs)
1457
+ >>> image_embeds = outputs.image_embeds
1458
+ ```"""
1459
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1460
+
1461
+ vision_outputs = self.vision_model(
1462
+ pixel_values=pixel_values,
1463
+ output_attentions=output_attentions,
1464
+ output_hidden_states=output_hidden_states,
1465
+ return_dict=return_dict,
1466
+ )
1467
+
1468
+ pooled_output = vision_outputs[1] # pooled_output
1469
+
1470
+ image_embeds = self.visual_projection(pooled_output)
1471
+
1472
+ if not return_dict:
1473
+ outputs = (image_embeds, vision_outputs[0]) + vision_outputs[2:]
1474
+ return tuple(output for output in outputs if output is not None)
1475
+
1476
+ return QLIPVisionModelOutput(
1477
+ image_embeds=image_embeds,
1478
+ last_hidden_state=vision_outputs.last_hidden_state,
1479
+ hidden_states=vision_outputs.hidden_states,
1480
+ attentions=vision_outputs.attentions,
1481
+ )
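> Note: taken together, the classes above can be used much like their CLIP counterparts, with the extra `reconstructions` output coming from the vision decoder. A minimal usage sketch, assuming this modeling file is importable as `modeling_qlip` (hypothetical module name), that the checkpoint's config maps onto `QLIPConfig`, and that the tokenizer/preprocessor files in this commit are sufficient for `from_pretrained`:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor

from modeling_qlip import QLIPModel   # hypothetical module name for the file above

repo = "NVIDIA/QLIP-B-16-256"
model = QLIPModel.from_pretrained(repo).eval()
tokenizer = AutoTokenizer.from_pretrained(repo)
processor = CLIPImageProcessor.from_pretrained(repo)

image = Image.open("cat.png")          # any local image
text_inputs = tokenizer(["a photo of a cat", "a photo of a dog"],
                        padding=True, return_tensors="pt")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = model(input_ids=text_inputs.input_ids,
                attention_mask=text_inputs.attention_mask,
                pixel_values=pixel_values)

probs = out.logits_per_image.softmax(dim=-1)   # image-text similarity
recon = out.reconstructions                    # image decoded from the quantized tokens
```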
preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "crop_size": 256,
3
+ "do_center_crop": true,
4
+ "do_normalize": true,
5
+ "do_resize": true,
6
+ "feature_extractor_type": "CLIPFeatureExtractor",
7
+ "image_mean": [
8
+ 0.48145466,
9
+ 0.4578275,
10
+ 0.40821073
11
+ ],
12
+ "image_std": [
13
+ 0.26862954,
14
+ 0.26130258,
15
+ 0.27577711
16
+ ],
17
+ "resample": 3,
18
+ "size": 392
19
+ }
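> Note: a rough torchvision equivalent of this preprocessor config, reading it as: resize the shorter side to `size` with bicubic resampling (`"resample": 3`), center-crop to `crop_size`, then normalize with the listed mean/std. This is a sketch of the config's intent, not code from the repo:

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(392, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])
```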
rope.py ADDED
@@ -0,0 +1,118 @@
1
+ # Copyright (c) 2024, NVIDIA Corporation & Affiliates. All rights reserved.
2
+ #
3
+ # This work is made available under the Nvidia Source Code License-NC.
4
+ # To view a copy of this license, visit
5
+ # https://github.com/NVlabs/QLIP/blob/main/LICENSE
6
+
7
+ # MIT License
8
+
9
+ # Copyright (c) 2022 BAAI-Vision
10
+
11
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
12
+ # of this software and associated documentation files (the "Software"), to deal
13
+ # in the Software without restriction, including without limitation the rights
14
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
15
+ # copies of the Software, and to permit persons to whom the Software is
16
+ # furnished to do so, subject to the following conditions:
17
+
18
+ # The above copyright notice and this permission notice shall be included in all
19
+ # copies or substantial portions of the Software.
20
+
21
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
22
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
23
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
24
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
25
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
26
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
27
+ # SOFTWARE.
28
+
29
+
30
+ from math import pi
31
+ import torch
32
+ from torch import nn
33
+ from einops import rearrange, repeat
34
+ import logging
35
+
36
+
37
+ def broadcat(tensors, dim = -1):
38
+ num_tensors = len(tensors)
39
+ shape_lens = set(list(map(lambda t: len(t.shape), tensors)))
40
+ assert len(shape_lens) == 1, 'tensors must all have the same number of dimensions'
41
+ shape_len = list(shape_lens)[0]
42
+ dim = (dim + shape_len) if dim < 0 else dim
43
+ dims = list(zip(*map(lambda t: list(t.shape), tensors)))
44
+ expandable_dims = [(i, val) for i, val in enumerate(dims) if i != dim]
45
+ assert all([*map(lambda t: len(set(t[1])) <= 2, expandable_dims)]), 'invalid dimensions for broadcastable concatentation'
46
+ max_dims = list(map(lambda t: (t[0], max(t[1])), expandable_dims))
47
+ expanded_dims = list(map(lambda t: (t[0], (t[1],) * num_tensors), max_dims))
48
+ expanded_dims.insert(dim, (dim, dims[dim]))
49
+ expandable_shapes = list(zip(*map(lambda t: t[1], expanded_dims)))
50
+ tensors = list(map(lambda t: t[0].expand(*t[1]), zip(tensors, expandable_shapes)))
51
+ return torch.cat(tensors, dim = dim)
52
+
53
+ def rotate_half(x):
54
+ x = rearrange(x, '... (d r) -> ... d r', r = 2)
55
+ x1, x2 = x.unbind(dim = -1)
56
+ x = torch.stack((-x2, x1), dim = -1)
57
+ return rearrange(x, '... d r -> ... (d r)')
58
+
59
+
60
+ class VisionRotaryEmbeddingFast(nn.Module):
61
+ def __init__(
62
+ self,
63
+ dim,
64
+ pt_seq_len,
65
+ ft_seq_len=None,
66
+ custom_freqs = None,
67
+ freqs_for = 'lang',
68
+ theta = 10000,
69
+ max_freq = 10,
70
+ num_freqs = 1,
71
+ patch_dropout = 0.
72
+ ):
73
+ super().__init__()
74
+ if custom_freqs:
75
+ freqs = custom_freqs
76
+ elif freqs_for == 'lang':
77
+ freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim))
78
+ elif freqs_for == 'pixel':
79
+ freqs = torch.linspace(1., max_freq / 2, dim // 2) * pi
80
+ elif freqs_for == 'constant':
81
+ freqs = torch.ones(num_freqs).float()
82
+ else:
83
+ raise ValueError(f'unknown modality {freqs_for}')
84
+
85
+ if ft_seq_len is None: ft_seq_len = pt_seq_len
86
+ t = torch.arange(ft_seq_len) / ft_seq_len * pt_seq_len
87
+
88
+ freqs = torch.einsum('..., f -> ... f', t, freqs)
89
+ freqs = repeat(freqs, '... n -> ... (n r)', r = 2)
90
+ freqs = broadcat((freqs[:, None, :], freqs[None, :, :]), dim = -1)
91
+
92
+ freqs_cos = freqs.cos().view(-1, freqs.shape[-1])
93
+ freqs_sin = freqs.sin().view(-1, freqs.shape[-1])
94
+
95
+ self.patch_dropout = patch_dropout
96
+
97
+ self.register_buffer("freqs_cos", freqs_cos)
98
+ self.register_buffer("freqs_sin", freqs_sin)
99
+
100
+ logging.info(f'Shape of rope freq: {self.freqs_cos.shape}')
101
+
102
+ def forward(self, t, patch_indices_keep=None):
103
+ if patch_indices_keep is not None:
104
+ batch = t.size()[0]
105
+ batch_indices = torch.arange(batch)
106
+ batch_indices = batch_indices[..., None]
107
+
108
+ freqs_cos = repeat(self.freqs_cos, 'i j -> n i m j', n=t.shape[0], m=t.shape[1])
109
+ freqs_sin = repeat(self.freqs_sin, 'i j -> n i m j', n=t.shape[0], m=t.shape[1])
110
+
111
+ freqs_cos = freqs_cos[batch_indices, patch_indices_keep]
112
+ freqs_cos = rearrange(freqs_cos, 'n i m j -> n m i j')
113
+ freqs_sin = freqs_sin[batch_indices, patch_indices_keep]
114
+ freqs_sin = rearrange(freqs_sin, 'n i m j -> n m i j')
115
+
116
+ return t * freqs_cos + rotate_half(t) * freqs_sin
117
+
118
+ return t * self.freqs_cos + rotate_half(t) * self.freqs_sin
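> Note: a small smoke test of `VisionRotaryEmbeddingFast` with hypothetical sizes (a 16×16 patch grid and `head_dim = 64`, so `dim = head_dim // 2 = 32`). Since each channel pair is rotated by a position-dependent angle, per-token norms are preserved:

```python
import torch
from rope import VisionRotaryEmbeddingFast  # the file above

rope = VisionRotaryEmbeddingFast(dim=32, pt_seq_len=16)  # freqs_cos / freqs_sin: (256, 64)
t = torch.randn(2, 256, 64)                              # (batch * heads, patches, head_dim)
t_rot = rope(t)
print(t_rot.shape)                                        # torch.Size([2, 256, 64])
print(torch.allclose(t.norm(dim=-1), t_rot.norm(dim=-1), atol=1e-4))  # True: rotation preserves norm
```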
special_tokens_map.json ADDED
@@ -0,0 +1 @@
1
+ {"bos_token": {"content": "<|startoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": "<|endoftext|>"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
1
+ {"unk_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<|startoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": "<|endoftext|>", "add_prefix_space": false, "errors": "replace", "do_lower_case": true, "name_or_path": "openai/clip-vit-base-patch32", "model_max_length": 77, "special_tokens_map_file": "/home/suraj/.cache/huggingface/transformers/18a566598f286c9139f88160c99f84eec492a26bd22738fa9cb44d5b7e0a5c76.cce1206abbad28826f000510f22f354e53e66a97f7c23745a7dfe27609cc07f5", "tokenizer_class": "CLIPTokenizer"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff