rwightman committed f643008 (verified) · 1 parent: e0779db

Files changed (4):
1. README.md +160 -0
2. config.json +33 -0
3. model.safetensors +3 -0
4. pytorch_model.bin +3 -0
README.md ADDED
---
tags:
- image-classification
- timm
- transformers
library_name: timm
license: apache-2.0
datasets:
- imagenet-22k
- text_documents-160gb
- laion-400m
- coyo-700m
- cc15m
- coco
- vg
---
# Model card for beit3_base_patch16_224.indomain_pt

A BEiT-3 image classification model. The multimodal model was pretrained with masked data modeling on ImageNet-22k images, 160GB of text documents, and web-scale image-text pairs, then training continued on in-domain image-text pairs (COCO and Visual Genome). Converted for vision-only classification tasks.
## Model Details
- **Model Type:** Image classification / feature backbone
- **Model Stats:**
  - Params (M): 85.9
  - GMACs: 17.6
  - Activations (M): 23.9
  - Image size: 224 x 224
- **Papers:**
  - BEiT-3: Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks: https://arxiv.org/abs/2208.10442
  - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929v2
- **Dataset:**
  - ImageNet-22k
  - Text documents (160GB)
  - LAION-400M
  - COYO-700M
  - CC15M
  - COCO
  - VG
- **Original:** https://github.com/microsoft/unilm/tree/master/beit3
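
The parameter count above is easy to cross-check once the model is instantiated; a minimal sketch (pretrained weights are not needed just to count):

```python
import timm

# instantiate the architecture only; random init is fine for counting
model = timm.create_model('beit3_base_patch16_224.indomain_pt', pretrained=False)
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.1f}M parameters')  # expected to print roughly 85.9M
```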

## Model Usage
### Image Classification
```python
from urllib.request import urlopen
from PIL import Image
import timm
import torch

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('beit3_base_patch16_224.indomain_pt', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
```
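
To adapt the backbone to a new labelled dataset, a fresh classifier head can be requested at creation time. A minimal sketch, where the class count of 10 is a placeholder for your dataset:

```python
import timm
import torch

# num_classes swaps in a newly initialized classifier head on top of
# the pretrained backbone; only the head weights start from random.
model = timm.create_model(
    'beit3_base_patch16_224.indomain_pt',
    pretrained=True,
    num_classes=10,  # placeholder: set to your dataset's class count
)

logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 10])
```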

### Feature Map Extraction
```python
from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'beit3_base_patch16_224.indomain_pt',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 768, 14, 14])
    #  torch.Size([1, 768, 14, 14])
    #  torch.Size([1, 768, 14, 14])
    print(o.shape)
```
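
In `features_only` mode, timm also accepts an `out_indices` argument to keep only selected feature maps. A brief sketch, assuming a recent timm where negative indices select from the end:

```python
import timm
import torch

# keep only the final feature map instead of all of them
model = timm.create_model(
    'beit3_base_patch16_224.indomain_pt',
    pretrained=True,
    features_only=True,
    out_indices=(-1,),  # assumption: negative indexing supported in recent timm
)
model = model.eval()

with torch.no_grad():
    feats = model(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])  # e.g. [torch.Size([1, 768, 14, 14])]
```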

### Image Embeddings
```python
from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'beit3_base_patch16_224.indomain_pt',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)
output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 197, 768) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
```
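
The pooled embedding can feed retrieval-style comparisons directly. A minimal sketch using plain PyTorch, with random tensors standing in for two preprocessed images:

```python
import timm
import torch
import torch.nn.functional as F

model = timm.create_model(
    'beit3_base_patch16_224.indomain_pt',
    pretrained=True,
    num_classes=0,
)
model = model.eval()

# stand-ins for two images already run through the model's transforms
x1 = torch.randn(1, 3, 224, 224)
x2 = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    e1 = model(x1)  # (1, 768) pooled embedding
    e2 = model(x2)

print(F.cosine_similarity(e1, e2).item())
```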

## Model Comparison
Explore the dataset and runtime metrics of this model in timm [model results](https://github.com/huggingface/pytorch-image-models/tree/main/results).

## Citation
```bibtex
@article{wang2022beit3,
  title={Image as a foreign language: Beit pretraining for vision and vision-language tasks},
  author={Wang, Wenhui and Bao, Hangbo and Dong, Li and Bjorck, Johan and Peng, Zhiliang and Liu, Qiang and Aggarwal, Kriti and Mohammed, Owais Khan and Singhal, Saksham and Som, Subhojit and others},
  journal={arXiv preprint arXiv:2208.10442},
  year={2022}
}
```
```bibtex
@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}
```
```bibtex
@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/huggingface/pytorch-image-models}}
}
```
config.json ADDED
{
  "architecture": "beit3_base_patch16_224",
  "num_classes": 0,
  "num_features": 768,
  "global_pool": "avg",
  "pretrained_cfg": {
    "tag": "indomain_pt",
    "custom_load": false,
    "input_size": [3, 224, 224],
    "fixed_input_size": true,
    "interpolation": "bicubic",
    "crop_pct": 1.0,
    "crop_mode": "center",
    "mean": [0.485, 0.456, 0.406],
    "std": [0.229, 0.224, 0.225],
    "num_classes": 0,
    "pool_size": null,
    "first_conv": "patch_embed.proj",
    "classifier": "head"
  }
}
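
The pretrained_cfg fields describe the eval-time preprocessing that `timm.data.create_transform` builds automatically. For reference, a hedged torchvision equivalent (with crop_pct of 1.0, the shorter side is resized straight to 224 before the center crop):

```python
from torchvision import transforms

# approximate eval transform implied by pretrained_cfg above
preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```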
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:1451b9a9871bfa636c45dd03612598754960d92fec6c0acbc802c3d97872e62b
size 343581672
pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f11577a4685a7fc4c7a3d596251d164b1a61af0799d1a17157b3a17ed47bf10f
size 343634722
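
Both weight files are git-lfs pointers recording the sha256 oid and byte size of the real payload. A minimal sketch to verify a downloaded copy against those fields (the local filename is a placeholder):

```python
import hashlib

def matches_lfs_pointer(path, expected_sha256, expected_size):
    """Stream the file, checking both its sha256 digest and byte size."""
    digest = hashlib.sha256()
    size = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest() == expected_sha256 and size == expected_size

# placeholder path; oid/size taken from the model.safetensors pointer above
print(matches_lfs_pointer(
    'model.safetensors',
    '1451b9a9871bfa636c45dd03612598754960d92fec6c0acbc802c3d97872e62b',
    343581672,
))
```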