Using pre-computed embeddings for images/frames and passing them as input
Is it possible to feed in pre-computed embeddings from, e.g., the pretrained model google/siglip-so400m-patch14-384, possibly even for the frames of a video? Alternatively, can you pre-compute embeddings using the SigLIP model of SmolVLM2 for use at a later stage (to speed up inference)?
If you have access to the images in advance, load the SmolVLMVisionTransformer (a wrapper class around the SigLIP encoder), use it to extract the embeddings, then pass them as image_hidden_states instead of pixel_values.
You have to use the SmolVLM2 encoder weights, as the encoder was unfrozen during training; the original SigLIP weights will not work.
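A minimal sketch of that two-stage flow is below. It assumes the HuggingFaceTB/SmolVLM2-2.2B-Instruct checkpoint, a recent transformers release, and the Idefics3-style attribute layout (model.model.vision_model is the SmolVLMVisionTransformer, model.model.connector maps its output to the states the LM consumes); the checkpoint name, attribute paths and image path are assumptions, and padded-image filtering / pixel_attention_mask handling are skipped for brevity:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"   # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

image = Image.open("frame_0001.png")                 # placeholder frame
inputs = processor(images=[image], text="<image>Describe this frame.", return_tensors="pt")

# Stage 1 (offline): run only the SmolVLM2 vision tower + connector and cache the result.
pixel_values = inputs.pop("pixel_values").to(model.dtype)
inputs.pop("pixel_attention_mask", None)             # simplified: no padded-image handling
b, n, c, h, w = pixel_values.shape
with torch.no_grad():
    vision_out = model.model.vision_model(pixel_values=pixel_values.view(b * n, c, h, w))
    image_hidden_states = model.model.connector(vision_out.last_hidden_state)

# Stage 2 (online): reuse the cached embeddings instead of pixel_values.
with torch.no_grad():
    generated = model.generate(**inputs, image_hidden_states=image_hidden_states, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

For video you would cache one image_hidden_states tensor per frame in the same way; and if generate does not forward image_hidden_states in your transformers version, a plain forward call with image_hidden_states= follows the same code path as pixel_values.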
It is not trivial to perform retrieval with encoders that are post-trained with the next-token-prediction (NTP) loss, as they are no longer aligned with their original text encoder. If you want to attempt it anyway, I would:
a) Test retrieval with the original SigLIP text encoder. Performance would probably be lower than before due to the loss of text-vision embedding alignment.
b) Test retrieval with the SmolLM2 text tokenizer.
c) Potentially re-align the SigLIP text encoder with the (frozen) vision encoder using LoRA or similar (see the sketch below).
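As a rough illustration of step c), the sketch below attaches LoRA adapters to the original SigLIP text encoder and trains only those adapters against frozen, pooled SmolVLM2 vision features with a SigLIP-style sigmoid loss. The mean pooling, the linear projection to the text width, the vision feature dimension, and the fixed temperature/bias are all assumptions made to keep the example self-contained:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, SiglipTextModel
from peft import LoraConfig, get_peft_model

siglip_id = "google/siglip-so400m-patch14-384"
tokenizer = AutoTokenizer.from_pretrained(siglip_id)
text_encoder = SiglipTextModel.from_pretrained(siglip_id)
text_dim = text_encoder.config.hidden_size            # text tower width

# LoRA on the text tower only; the vision side stays frozen.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # SiglipAttention projections
)
text_encoder = get_peft_model(text_encoder, lora_cfg)

# Assumed: vision_dim is the width of whatever frozen SmolVLM2 features you pool
# (e.g. the pre-computed image_hidden_states from the snippet above).
vision_dim = 2048
img_proj = torch.nn.Linear(vision_dim, text_dim)

def embed_text(captions):
    tok = tokenizer(captions, padding="max_length", return_tensors="pt")
    return text_encoder(**tok).pooler_output           # (batch, text_dim)

def embed_images(frozen_patch_feats):
    # Mean-pool the frozen patch features into one vector per image, then project.
    return img_proj(frozen_patch_feats.mean(dim=1))    # (batch, text_dim)

def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    # SigLIP-style pairwise sigmoid loss; t and b are fixed here, learnable in SigLIP.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b
    labels = 2 * torch.eye(logits.shape[0], device=logits.device) - 1  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).mean()

optimizer = torch.optim.AdamW(
    [p for p in text_encoder.parameters() if p.requires_grad] + list(img_proj.parameters()),
    lr=1e-4,
)
# Per batch of (frozen_patch_feats, captions) pairs:
#   loss = sigmoid_contrastive_loss(embed_images(frozen_patch_feats), embed_text(captions))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Whether this recovers useful retrieval quality would need to be checked against steps a) and b) on your own image-text pairs.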