Cross-modal similarity
When I input an image and a text to the model:
import torch
from transformers.image_utils import load_image

def get_output(url, text):
    # processor and model: assumed to be an already-loaded SigLIP processor and model (see below)
    image = load_image(url)
    inputs = processor(text=[text], images=image, padding="max_length", return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model(**inputs)
    return output

text = "a photo of 2 cats"
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
output = get_output(url, text)
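For context, processor and model are created roughly like this (the checkpoint name below is only a placeholder for whatever SigLIP checkpoint is actually loaded):

```python
from transformers import AutoModel, AutoProcessor

# Placeholder checkpoint; substitute the SigLIP checkpoint actually in use.
ckpt = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(ckpt).to("cuda")
processor = AutoProcessor.from_pretrained(ckpt)
```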
The output is:
SiglipOutput(loss=None, logits_per_image=tensor([[-15.5217]], device='cuda:0'), logits_per_text=tensor([[-15.5217]], device='cuda:0'), ...)
After applying the sigmoid, the probability is essentially 0 (sigmoid(-15.52) ≈ 1.8e-7), even though the caption should match the image.
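For reference, this is the post-processing step I mean (a minimal sketch; output is the SiglipOutput returned above):

```python
import torch

# Turn the image-text logit into a matching probability.
prob = torch.sigmoid(output.logits_per_image)
print(prob)  # ~1.8e-07 for a logit of -15.52, i.e. essentially zero
```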
How can I fix it?