Using pre-computed embeddings for images/frames as input

#2
by maxlun - opened

Is it possible to feed pre-computed embeddings from, e.g., the pretrained google/siglip-so400m-patch14-384 model, possibly even for frames of a video? Or, alternatively, can you pre-compute embeddings using SmolVLM2's SigLIP model for use at a later stage (to speed up inference)?

Hugging Face TB Research org

If you have access to the images in advance, load the SmolVLMVisionTransformer (a wrapper class for the SigLIP encoder), use it to extract the embeddings, and then pass them as image_hidden_states instead of pixel_values.

You have to use the SmolVLM2 encoder weights, because the encoder was unfrozen during training; the original SigLIP weights will not work.
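
A minimal sketch of that workflow, assuming the SmolVLM implementation in transformers: the `model.model.vision_model` / `model.model.connector` attributes, the `HuggingFaceTB/SmolVLM2-2.2B-Instruct` checkpoint, and the simplified prompt/padding handling are all assumptions here, not confirmed by the thread, and may differ across versions.

```python
# Sketch: cache SmolVLM2 image features once, reuse them at inference time.
# Assumptions (may differ across transformers versions): the vision tower is at
# model.model.vision_model and the pixel-shuffle connector at model.model.connector;
# tiling/padding masks are omitted; the prompt format is simplified.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # example checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id).eval()

image = Image.open("example.jpg")  # placeholder path
inputs = processor(text="<image> Describe the image.", images=[image], return_tensors="pt")

# 1) Offline: run only the SmolVLM2 vision encoder (+ connector) and cache the result.
pixel_values = inputs["pixel_values"].to(model.dtype)   # (batch, num_tiles, 3, H, W)
flat = pixel_values.view(-1, *pixel_values.shape[2:])   # (batch * num_tiles, 3, H, W)
with torch.no_grad():
    patch_features = model.model.vision_model(pixel_values=flat).last_hidden_state
    image_hidden_states = model.model.connector(patch_features)
torch.save(image_hidden_states, "image_hidden_states.pt")

# 2) Later: skip the vision tower and pass the cached features instead of pixel_values.
cached = torch.load("image_hidden_states.pt")
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_hidden_states=cached,  # no pixel_values needed
    )
```

Passing image_hidden_states to .generate() instead of a plain forward may also work, but that depends on how prepare_inputs_for_generation handles it in your transformers version.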

Thanks for the swift answer @orrzohar! It's as I expected, then. Did you try the SmolVLM2 SigLIP encoder on some standard retrieval benchmarks (MSCOCO, Flickr)?
Sorry if there is a paper/write-up covering all of this; I haven't found one yet.

Hugging Face TB Research org

It is not trivial to perform retrieval with encoders that have been post-trained with a next-token-prediction (NTP) loss, as they are no longer aligned with their original text encoder. To do this, I would:
a) Test retrieval with the original SigLIP text encoder. Performance would probably be lower than before due to the loss of text-vision embedding alignment (rough sketch below, after this list).
b) Test retrieval with the SmolLM2 text tokenizer.
c) Potentially re-align the SigLIP text encoder with the (frozen) vision encoder using LoRA or similar.
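
A rough sketch of option (a), under several assumptions not confirmed by the thread: the SmolVLM2 vision tower is reachable at `model.model.vision_model`, the original SigLIP image processor is used for a single 384px crop (no tiling), and mean pooling stands in for SigLIP's attention-pooling head, which the SmolVLM2 vision wrapper does not expose. Expect degraded scores, as noted above.

```python
# Sketch: image-text retrieval with the original SigLIP text tower and the
# post-trained SmolVLM2 vision tower. Mean pooling is a crude stand-in for
# SigLIP's attention-pooling head; treat the scores as a rough diagnostic only.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, SiglipModel, SiglipProcessor

siglip = SiglipModel.from_pretrained("google/siglip-so400m-patch14-384").eval()
siglip_processor = SiglipProcessor.from_pretrained("google/siglip-so400m-patch14-384")
smolvlm = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # example checkpoint
).eval()

texts = ["a photo of a cat", "a photo of a dog"]
images = [Image.open(p) for p in ["img1.jpg", "img2.jpg"]]  # placeholder paths

with torch.no_grad():
    # Text embeddings from the *original* SigLIP text encoder.
    text_inputs = siglip_processor(text=texts, padding="max_length", return_tensors="pt")
    text_emb = siglip.get_text_features(**text_inputs)              # (T, hidden)

    # Image embeddings from the SmolVLM2 vision encoder, mean-pooled over patches.
    image_inputs = siglip_processor(images=images, return_tensors="pt")
    patches = smolvlm.model.vision_model(
        pixel_values=image_inputs["pixel_values"]
    ).last_hidden_state                                              # (I, patches, hidden)
    img_emb = patches.mean(dim=1)                                    # (I, hidden)

    # Cosine-similarity retrieval: rank images for each text query.
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    img_emb = torch.nn.functional.normalize(img_emb, dim=-1)
    scores = text_emb @ img_emb.T                                    # (T, I)
    ranking = scores.argsort(dim=-1, descending=True)
```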
