Guillermo Ezequiel Mannsilla PRO

gmansilla

AI & ML interests

None yet

Recent Activity

updated a Space 20 days ago
gmansilla/party_theme_tool

Organizations

None yet

gmansilla's activity

replied to m-ric's post 7 days ago
reacted to m-ric's post with 🚀 7 days ago
New king of open VLMs: InternVL3 takes Qwen 2.5's crown! 👑

InternVL has been a wildly successful series of models, and the latest iteration has just taken the crown back thanks to its superior, natively multimodal vision training pipeline.

โžก๏ธ Most of the vision language models (VLMs) these days are built like Frankenstein : take a good text-only Large Language Model (LLM) backbone, stitch a specific vision transformer (ViT) on top of it. Then the training is sequential ๐Ÿ”ข : 1. Freeze the LLM weights while you train the ViT only to work with the LLM part, then 2. Unfreeze all weights to train all weights in order to work together.

💫 The Shanghai Lab decided to challenge this paradigm with an approach they call "native". For each model size, they still start from a good LLM (mostly the Qwen-2.5 series; did I tell you I'm a huge fan of Qwen? ❤️) and stitch on the ViT, but they don't freeze anything: they train all weights together on interleaved text and image understanding data in a single pre-training phase 🎨.
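
In the same toy setting, the "native" single-phase recipe would look roughly like this: nothing frozen, one optimizer over all parameters from step zero, fed with mixed image+text batches. This reuses the hypothetical `ToyVLM` class from the sketch above, with random tensors standing in for real interleaved data.

```python
# Toy illustration of the "native" single-phase recipe: everything trainable from the start.
# Reuses the ToyVLM class from the previous sketch; the data here is random, purely illustrative.
import torch

model = ToyVLM()                                   # ViT + projector + LLM, nothing frozen
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(3):                              # tiny loop just to show the flow
    image_feats = torch.randn(2, 16, 768)          # fake image patch features
    text_embeds = torch.randn(2, 32, 256)          # fake text token embeddings
    targets = torch.randint(0, 1000, (2, 48))      # next-token targets over the 48-token joint sequence

    logits = model(image_feats, text_embeds)       # (batch=2, 16+32 tokens, vocab=1000)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```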

They claim it results in more seamless interactions between modalities. And the results prove them right: they took the crown of top VLMs, at nearly all sizes, from their Qwen-2.5 parents. 👑