---
license: mit
language:
- ar
- kn
- ka
- af
- kk
- am
- km
- ky
- ko
- as
- lo
- az
- ml
- mr
- be
- mk
- bn
- my
- bs
- nl
- bg
- 'no'
- ca
- cs
- ne
- ku
- pl
- cy
- pt
- da
- ro
- de
- ru
- el
- sa
- en
- si
- eo
- sk
- et
- sl
- eu
- sd
- fi
- so
- fr
- es
- gd
- sr
- ga
- su
- gl
- sv
- gu
- sw
- ha
- ta
- he
- te
- hi
- th
- hr
- tr
- hu
- ug
- hy
- uk
- id
- ur
- is
- vi
- it
- xh
- jv
- zh
- ja
---

## Model Summary

MEXMA-SigLIP combines the [MEXMA](https://huggingface.co/facebook/MEXMA) multilingual text encoder with the image encoder from the
[SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) model, which gives a high-performance CLIP-style model covering 80 languages.
MEXMA-SigLIP sets the state of the art on the [Crossmodal-3600](https://google.github.io/crossmodal-3600/) dataset among models with commercial-use-friendly licenses.
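
Conceptually, the model scores images against texts the way any CLIP/SigLIP-style dual encoder does: each encoder produces an embedding, the embeddings are L2-normalized, and a scaled dot product yields the logits. The sketch below illustrates that mechanism only; the function name, embedding size, and `logit_scale` value are placeholders, not the model's actual internals.

```python
import torch
import torch.nn.functional as F

# Illustration of CLIP-style scoring. `text_embeddings` / `image_embeddings`
# stand in for the outputs of the MEXMA text encoder and the SigLIP image
# encoder; the real model wires these together internally (see "How to use").
def clip_style_logits(text_embeddings: torch.Tensor,
                      image_embeddings: torch.Tensor,
                      logit_scale: float = 100.0) -> torch.Tensor:
    # L2-normalize so the dot product becomes cosine similarity
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    # (num_images, num_texts) similarity matrix, scaled by a temperature
    return logit_scale * image_embeddings @ text_embeddings.T

# toy shapes: 3 candidate captions, 1 image, 512-dim embeddings
probs = clip_style_logits(torch.randn(3, 512), torch.randn(1, 512)).softmax(dim=-1)
```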

## How to use

```python
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch

# load the model (with its custom remote code) and the matching tokenizer and image processor
model = AutoModel.from_pretrained("visheratin/mexma-siglip", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip")

# download an image and preprocess it into a bfloat16 tensor on the GPU
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")

with torch.inference_mode():
    # candidate captions can mix languages: "cat" (Russian), "a dog" (English), "Eiffel Tower" (Hindi)
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    probs = image_logits.softmax(dim=-1)
    print(probs)
```
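
`get_logits` returns both directions of the similarity matrix: `image_logits` scores every text against each image, and `text_logits` is the text-side view. Below is a minimal sketch of scoring several images at once, assuming the custom `get_logits` accepts a batched pixel tensor the same way it accepts a single image (the model card does not state this explicitly):

```python
# stand-in batch: in practice, stack different preprocessed images
images = torch.cat([img, img], dim=0)  # shape: (2, 3, H, W)

with torch.inference_mode():
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], images)
    print(image_logits.softmax(dim=-1))  # per image: which caption fits best
    print(text_logits.softmax(dim=-1))   # per caption: which image fits best
```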

## Acknowledgements

I thank [ML Collective](https://mlcollective.org/) and [Lambda](https://lambdalabs.com/) for providing the compute resources to train the model.