Using pre-computed embeddings for images/frames and passing them as input
Is it possible to feed in pre-computed embeddings from, e.g., the pretrained model google/siglip-so400m-patch14-384, possibly even for the frames of a video? Alternatively, can you pre-compute embeddings using the SigLIP model of SmolVLM2 for use at a later stage (to speed up inference)?
If you have access to the images in advance, load the SmolVLMVisionTransformer (a wrapper class around the SigLIP encoder), use it to extract the embeddings, then pass them as image_hidden_states instead of pixel_values.
You have to use the SmolVLM2 encoder weights, as the encoder was unfrozen during training; the original SigLIP weights will not work.
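A minimal sketch of that two-stage flow is below. It assumes the HuggingFaceTB/SmolVLM2-2.2B-Instruct checkpoint, a recent transformers release, and the Idefics3-style attribute layout (model.model.vision_model is the SmolVLMVisionTransformer, model.model.connector maps its output to the states the LM consumes); the checkpoint name, attribute paths and image path are assumptions, and padded-image filtering / pixel_attention_mask handling are skipped for brevity:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"   # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

image = Image.open("frame_0001.png")                 # placeholder frame
inputs = processor(images=[image], text="<image>Describe this frame.", return_tensors="pt")

# Stage 1 (offline): run only the SmolVLM2 vision tower + connector and cache the result.
pixel_values = inputs.pop("pixel_values").to(model.dtype)
inputs.pop("pixel_attention_mask", None)             # simplified: no padded-image handling
b, n, c, h, w = pixel_values.shape
with torch.no_grad():
    vision_out = model.model.vision_model(pixel_values=pixel_values.view(b * n, c, h, w))
    image_hidden_states = model.model.connector(vision_out.last_hidden_state)

# Stage 2 (online): reuse the cached embeddings instead of pixel_values.
with torch.no_grad():
    generated = model.generate(**inputs, image_hidden_states=image_hidden_states, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

For video you would cache one image_hidden_states tensor per frame in the same way; and if generate does not forward image_hidden_states in your transformers version, a plain forward call with image_hidden_states= follows the same code path as pixel_values.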
It is not trivial to perform retrieval with encoders that are post-trained with the next-token-prediction (NTP) loss, as they are no longer aligned with their original text encoder. If you want to attempt it anyway, I would:
a) Test retrieval with the original SigLIP text encoder. Performance would probably be lower than before due to the loss of text-vision embedding alignment.
b) Test retrieval with the SmolLM2 text tokenizer.
c) Potentially re-align the SigLIP text encoder with the (frozen) vision encoder using LoRA or similar (see the sketch below).
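As a rough illustration of step c), the sketch below attaches LoRA adapters to the original SigLIP text encoder and trains only those adapters against frozen, pooled SmolVLM2 vision features with a SigLIP-style sigmoid loss. The mean pooling, the linear projection to the text width, the vision feature dimension, and the fixed temperature/bias are all assumptions made to keep the example self-contained:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, SiglipTextModel
from peft import LoraConfig, get_peft_model

siglip_id = "google/siglip-so400m-patch14-384"
tokenizer = AutoTokenizer.from_pretrained(siglip_id)
text_encoder = SiglipTextModel.from_pretrained(siglip_id)
text_dim = text_encoder.config.hidden_size            # text tower width

# LoRA on the text tower only; the vision side stays frozen.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # SiglipAttention projections
)
text_encoder = get_peft_model(text_encoder, lora_cfg)

# Assumed: vision_dim is the width of whatever frozen SmolVLM2 features you pool
# (e.g. the pre-computed image_hidden_states from the snippet above).
vision_dim = 2048
img_proj = torch.nn.Linear(vision_dim, text_dim)

def embed_text(captions):
    tok = tokenizer(captions, padding="max_length", return_tensors="pt")
    return text_encoder(**tok).pooler_output           # (batch, text_dim)

def embed_images(frozen_patch_feats):
    # Mean-pool the frozen patch features into one vector per image, then project.
    return img_proj(frozen_patch_feats.mean(dim=1))    # (batch, text_dim)

def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    # SigLIP-style pairwise sigmoid loss; t and b are fixed here, learnable in SigLIP.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b
    labels = 2 * torch.eye(logits.shape[0], device=logits.device) - 1  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).mean()

optimizer = torch.optim.AdamW(
    [p for p in text_encoder.parameters() if p.requires_grad] + list(img_proj.parameters()),
    lr=1e-4,
)
# Per batch of (frozen_patch_feats, captions) pairs:
#   loss = sigmoid_contrastive_loss(embed_images(frozen_patch_feats), embed_text(captions))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Whether this recovers useful retrieval quality would need to be checked against steps a) and b) on your own image-text pairs.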