Can't reproduce results even in very basic cases?

#22
by dgaff - opened

If I use the Inference API widget on the model card tab of this page for the following case (i.e., download this image, then enter cat,not_cat as the text), I get scores of 0.53/0.47 for cat/not_cat:
[Screenshot 2025-01-04 at 9.23.22 AM.png: Inference API widget output showing cat 0.53 / not_cat 0.47]
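
For what it's worth, here is a sketch of how I'd try to hit the same hosted Inference API from code, assuming the widget is backed by that API and that InferenceClient from huggingface_hub maps onto it (an API token may be needed depending on the endpoint):

import requests
from huggingface_hub import InferenceClient

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_bytes = requests.get(url).content  # same COCO image as in the snippet below

# Assumption: the widget calls the hosted Inference API for this model;
# depending on the endpoint, token="hf_..." may be required here.
client = InferenceClient(model="openai/clip-vit-base-patch32")
print(client.zero_shot_image_classification(image_bytes, ["cat", "not_cat"]))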

If I run what I assume is the identical code for computing the result, I get a wildly different score. What gives?

from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["cat", "not_cat"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
print(probs)

Yields: tensor([[0.2322, 0.7678]], grad_fn=<SoftmaxBackward0>)

This is totally unworkable if that's actually the case. I have to be missing something, right? I can't be this far off-base. This is on Transformers 4.46.2, but it seems to happen with the latest version as well.
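
For comparison, here is a sketch of running the same labels through the zero-shot-image-classification pipeline locally. I'm assuming that's roughly what the widget does behind the scenes; as far as I can tell, that pipeline wraps each candidate label in a hypothesis template (default "This is a photo of {}.") rather than scoring the bare strings passed to the processor above, which might be relevant:

from PIL import Image
import requests
from transformers import pipeline

# Assumption: the widget is backed by the standard zero-shot-image-classification
# pipeline, which formats each candidate label with a hypothesis template
# (default "This is a photo of {}.") before computing CLIP similarities.
pipe = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

print(pipe(image, candidate_labels=["cat", "not_cat"]))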
