visheratin
/

mexma-siglip

Zero-Shot Image Classification


visheratin committed on Dec 4, 2024

Commit

3ac1261

·

verified ·

1 Parent(s): c2e533a

Create README.md

Files changed (1)

README.md +118 -0

README.md ADDED

	@@ -0,0 +1,118 @@

+---
+license: mit
+language:
+- ar
+- kn
+- ar
+- ka
+- af
+- kk
+- am
+- km
+- ar
+- ky
+- ar
+- ko
+- as
+- lo
+- az
+- ml
+- az
+- mr
+- be
+- mk
+- bn
+- my
+- bs
+- nl
+- bg
+- 'no'
+- ca
+- 'no'
+- cs
+- ne
+- ku
+- pl
+- cy
+- pt
+- da
+- ro
+- de
+- ru
+- el
+- sa
+- en
+- si
+- eo
+- sk
+- et
+- sl
+- eu
+- sd
+- fi
+- so
+- fr
+- es
+- gd
+- sr
+- ga
+- su
+- gl
+- sv
+- gu
+- sw
+- ha
+- ta
+- he
+- te
+- hi
+- th
+- hr
+- tr
+- hu
+- ug
+- hy
+- uk
+- id
+- ur
+- is
+- vi
+- it
+- xh
+- jv
+- zh
+- ja
+---
+## Model Summary
+MEXMA-SigLIP combines the [MEXMA](https://huggingface.co/facebook/MEXMA) multilingual text encoder with the image encoder from the
+[SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) model. The result is a high-performance CLIP-style model that supports 80 languages.
+MEXMA-SigLIP achieves state-of-the-art results on the [Crossmodal-3600](https://google.github.io/crossmodal-3600/) dataset among models whose licenses permit commercial use.
+## How to use
+```python
+from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
+from PIL import Image
+import requests
+import torch
+
+# trust_remote_code=True is needed because the model class lives in the model repository.
+model = AutoModel.from_pretrained("visheratin/mexma-siglip", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
+tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip")
+processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip")
+
+# Download the example image and preprocess it into a pixel tensor.
+img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
+img = processor(images=img, return_tensors="pt")["pixel_values"]
+# Match the model's dtype and device.
+img = img.to(torch.bfloat16).to("cuda")
+
+with torch.inference_mode():
+    # Candidate labels in Russian ("cat"), English, and Hindi ("Eiffel Tower").
+    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
+    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
+    # Normalize the image logits into probabilities along the last dimension.
+    probs = image_logits.softmax(dim=-1)
+    print(probs)
+```
+## Acknowledgements
+I thank [ML Collective](https://mlcollective.org/) and [Lambda](https://lambdalabs.com/) for providing compute resources to train the model.
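
The "How to use" snippet in the README prints a raw probability tensor. As a minimal sketch of how such scores could be mapped back to the candidate prompts, the pure-Python example below applies a softmax to one row of logits and ranks the labels. The helper names `softmax` and `rank_labels` and the values in `example_logits` are hypothetical and made up for illustration; they are not part of the model's API or real model output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(logits, labels):
    """Pair each label with its probability, highest probability first."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Prompts from the README example; the logits are invented for illustration.
labels = ["кошка", "a dog", "एफिल टॉवर"]
example_logits = [-2.0, -1.5, 6.0]

ranked = rank_labels(example_logits, labels)
for label, prob in ranked:
    print(f"{label}: {prob:.4f}")
```

With the real model, a row such as `image_logits[0].float().tolist()` could be passed in place of `example_logits`, assuming `image_logits` holds one row per image and one column per text prompt.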