The model inference results are inconsistent with those from the Python Transformers version.
Using the same image and prompt, the results from the Python Transformers version and the ONNX version are as follows:
- Python Transformers version:
prompt = MORE_DETAILED_CAPTION:
The image shows a modern gym with a curved ceiling and large windows. The floor is made of concrete and there are several treadmills and elliptical machines scattered throughout the space. There are several people working out on the machines, including a man on a treadmill, a woman on a stationary bike, and a man in a black t-shirt and shorts. The gym appears to be well-lit with natural light coming in from the windows on the right side of the image.
prompt = CAPTION:
A group of people working out on treadmills in a gym.
- ONNX version:
prompt = MORE_DETAILED_CAPTION:
A picture of a gym with a lot of equipment
prompt = CAPTION:
unanswerable
The ONNX version's result is far too short.
Could you test again using an fp32/fp16 version of the decoder?
Where is the fp32/fp16 version of the model? I only found decoder_model_merged_q4.
Same issue here. When comparing florence2 ONNX (embed_token fp16, vision_encoder fp16, encoder int8, decoder int8) on transformers.js vs. your script, the transformers.js version (https://huggingface.co/spaces/Xenova/florence2-webgpu) generates a much longer result for the same image than your script does. The model quantization level is the same (I verified it by hand).
Might the greedy decoding used in the script be the reason for this? I think the transformers library uses beam search.
EDIT: No, I think it uses beam search with num_beams = 1, so it might just be the same as greedy decoding.
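For what it's worth, this is roughly how the decoding settings look on the Python Transformers side. It is only a minimal sketch: the checkpoint name and image path are placeholders, and the exact generate() arguments the reference script uses are an assumption on my part. With do_sample=False, num_beams=1 makes beam search collapse to greedy decoding, so it should match a greedy ONNX decoding loop; if the Python side actually runs with num_beams > 1, longer captions would be expected.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Placeholder checkpoint and image path; substitute whatever your script uses.
model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(text="<MORE_DETAILED_CAPTION>", images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=1,  # a single beam is equivalent to greedy decoding
)
print(processor.batch_decode(generated_ids, skip_special_tokens=False)[0])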
I don't know if it's useful to anybody, but I got my output to match transformers.js by:
- Adding image = ImageOps.exif_transpose(image) to rotate the image coming from my phone's camera.
- Doing the resize in the processor itself (instead of a Pillow resize):
# manual 512x512 resize removed; the processor handles resizing via do_resize=True
# image = image.resize((512, 512))
# 3. prepare text
prompt = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=prompt, images=image, return_tensors="np", do_resize=True)
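Putting the two changes together, a minimal end-to-end preprocessing sketch looks like this (the checkpoint name and image path are placeholders for whatever your script already loads):

from PIL import Image, ImageOps
from transformers import AutoProcessor

# Placeholder checkpoint and image path; substitute your own.
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
image = Image.open("phone_photo.jpg").convert("RGB")

# Apply the EXIF orientation so phone-camera images are not fed in rotated.
image = ImageOps.exif_transpose(image)

# Let the processor resize the image instead of a manual PIL resize.
prompt = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=prompt, images=image, return_tensors="np", do_resize=True)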