VLMs consist of an image encoder block, a projection layer that maps image embeddings into the text embedding space, and a text decoder, connected sequentially 📖 This paper explores using the intermediate states of the image encoder instead of only its final output 🤩 The authors explore three ways of instantiating the dense connector: sparse token integration, sparse channel integration, and dense channel integration. (see the paper, Dense Connector for MLLMs (2405.13800), for how they do it)
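To picture one of the variants, here is a minimal PyTorch sketch of channel-wise integration of intermediate encoder states: my own illustration, not the authors' code, and the layer indices, dimensions, and class name are made-up assumptions.

```python
import torch
import torch.nn as nn

class DenseChannelConnector(nn.Module):
    """Illustrative sketch: fuse several intermediate ViT layers by concatenating
    them along the channel dimension, then project into the LLM embedding space.
    All dimensions and layer choices below are hypothetical."""

    def __init__(self, vision_dim=1024, llm_dim=4096, num_layers_used=3):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim * num_layers_used, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, hidden_states, layer_ids=(8, 16, 24)):
        # hidden_states: list of [batch, num_patches, vision_dim] tensors,
        # one per image-encoder layer
        feats = torch.cat([hidden_states[i] for i in layer_ids], dim=-1)
        # returns visual tokens of shape [batch, num_patches, llm_dim]
        # that are fed to the text decoder
        return self.proj(feats)
```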
They integrate all three into LLaVA 1.5 and find that each of the new models outperforms the original LLaVA 1.5 🥹 I tried the model and it seems to work very well. As part of the release, the authors have published various checkpoints based on different decoders (Vicuna 7/13B and Llama 3-8B) that you can find in the collection 🤗
Explaining the 👑 of zero-shot open-vocabulary object detection: OWLv2 🦉 OWLv2 is a scaled-up version of a model called OWL-ViT, so let's take a look at that first. 📝 OWL-ViT is an open-vocabulary object detector, meaning it can detect objects it didn't explicitly see during training. 👀 What's cool is that it can take both image and text queries! This is possible because the image and text features aren't fused together.
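Text-queried detection looks roughly like this with 🤗 Transformers (a minimal sketch: the checkpoint name, query strings, and threshold are just example choices):

```python
import requests
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a remote control"]]  # free-form text queries

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in the original image's coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.2, target_sizes=target_sizes
)

for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    print(f"{texts[0][label]}: {score:.2f} {box.tolist()}")
```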
Taking a look at the architecture, the authors first do contrastive pre-training of a vision encoder and a text encoder (just like CLIP). They then take that model, remove the final pooling layer, attach a lightweight classification head and a box-prediction head, and fine-tune. During fine-tuning for object detection, the loss is computed over bipartite matches: predicted objects are compared against ground-truth objects, and the goal is to find an optimal one-to-one matching between the two sets, where each prediction is matched to at most one ground-truth object.
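A minimal sketch of such a bipartite match, using the Hungarian algorithm from SciPy; the cost here (class probability plus box L1 distance) is a simplification I chose for illustration, not the exact loss from the paper:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Toy bipartite matcher: each ground-truth box is assigned exactly one prediction.
    pred_logits: [num_queries, num_classes], pred_boxes: [num_queries, 4]
    gt_labels:   [num_gt],                   gt_boxes:   [num_gt, 4]"""
    probs = pred_logits.softmax(-1)                    # class probabilities
    cost_class = -probs[:, gt_labels]                  # [num_queries, num_gt]
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # L1 distance between boxes
    cost = cost_class + cost_box                       # simplified matching cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))  # optimal one-to-one pairs
```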
OWL-ViT is very scalable. Most language models and vision-language models are easy to scale because they don't need labeled data, but this isn't the case for object detection: you still need at least weak supervision. Moreover, only scaling the encoders creates a bottleneck after a while.
The authors wanted to scale OWL-ViT with more data, so they use OWL-ViT itself to generate pseudo-box labels on a much larger set of images, "self-train" a new detector on those labels, and then fine-tune it on human-annotated data.
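Sketching that recipe as code (every function and argument below is a hypothetical placeholder I made up to illustrate the three steps, not a real API):

```python
def self_train_owlv2(owl_vit, image_text_pairs, human_annotated_data):
    """Rough sketch of the self-training recipe; all callables are placeholders."""
    # 1) Pseudo-label a large set of images with the existing OWL-ViT detector,
    #    using text associated with each image as the detection queries.
    pseudo_labels = [owl_vit.detect(img, queries_from(text))
                     for img, text in image_text_pairs]

    # 2) "Self-train" a fresh detector on the machine-generated labels.
    new_detector = train_detector(image_text_pairs, pseudo_labels)

    # 3) Fine-tune on a smaller set of human-annotated boxes.
    return finetune(new_detector, human_annotated_data)
```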
Thanks to this, OWLv2 scales very well and tops the leaderboards on open-vocabulary object detection 👑 If you'd like to try it out, I'll leave a couple of links to apps, notebooks and more in the comments! 🤗