The model inference results are inconsistent with those from the Python Transformers version.
Using the same image and prompt, the results from the Python Transformers version and the ONNX version are as follows:
- Python Transformers version:
prompt = MORE_DETAILED_CAPTION:
The image shows a modern gym with a curved ceiling and large windows. The floor is made of concrete and there are several treadmills and elliptical machines scattered throughout the space. There are several people working out on the machines, including a man on a treadmill, a woman on a stationary bike, and a man in a black t-shirt and shorts. The gym appears to be well-lit with natural light coming in from the windows on the right side of the image.
prompt = CAPTION:
A group of people working out on treadmills in a gym.
- ONNX version:
prompt = MORE_DETAILED_CAPTION:
A picture of a gym with a lot of equipment
prompt = CAPTION:
unanswerable
The ONNX version's result is far too short.
Could you test again using an fp32/fp16 version of the decoder?
Where is the fp32/fp16 version of the model? I only found decoder_model_merged_q4.
Same issue here. When comparing florence2 ONNX (embed_token fp16, vision_encoder fp16, encoder int8, decoder int8) on transformers.js vs. your script, the transformers.js version (https://huggingface.co/spaces/Xenova/florence2-webgpu) generates a much longer result for the same image than your script does. The model quantization level is the same (I verified it by hand).
Might the greedy decoding used in the script be the reason for this? I think the transformers library uses beam search.
EDIT: No, I think it uses beam search with num_beams = 1, so it might just be the same as greedy decoding.
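For what it's worth, this is roughly how the decoding settings look on the Python Transformers side. It is only a minimal sketch: the checkpoint name and image path are placeholders, and the exact generate() arguments the reference script uses are an assumption on my part. With do_sample=False, num_beams=1 makes beam search collapse to greedy decoding, so it should match a greedy ONNX decoding loop; if the Python side actually runs with num_beams > 1, longer captions would be expected.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Placeholder checkpoint and image path; substitute whatever your script uses.
model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(text="<MORE_DETAILED_CAPTION>", images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=1,  # a single beam is equivalent to greedy decoding
)
print(processor.batch_decode(generated_ids, skip_special_tokens=False)[0])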
I don't know if it's useful to anybody, but I got my output to match transformers.js by:
- Adding image = ImageOps.exif_transpose(image) to rotate the image coming from my phone's camera.
- Doing the resize in the processor itself (instead of a Pillow resize):
# manual 512x512 resize removed; the processor handles resizing via do_resize=True
# image = image.resize((512, 512))
# 3. prepare text
prompt = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=prompt, images=image, return_tensors="np", do_resize=True)
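Putting the two changes together, a minimal end-to-end preprocessing sketch looks like this (the checkpoint name and image path are placeholders for whatever your script already loads):

from PIL import Image, ImageOps
from transformers import AutoProcessor

# Placeholder checkpoint and image path; substitute your own.
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
image = Image.open("phone_photo.jpg").convert("RGB")

# Apply the EXIF orientation so phone-camera images are not fed in rotated.
image = ImageOps.exif_transpose(image)

# Let the processor resize the image instead of a manual PIL resize.
prompt = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=prompt, images=image, return_tensors="np", do_resize=True)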