Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Abstract
Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at https://molmo.allenai.org.
Community
Really interesting work! However, the strong performance seems skewed toward academic datasets: results on AI2D and the document datasets are good, while scores on MathVista and MMMU, which are more reasoning-heavy, are too low compared with Qwen2-VL. I would like to see results on more challenging benchmarks such as MM-Vet, MM-Vet-v2, MathVerse, LLaVA-wilder, and so on.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (2024)
- NVLM: Open Frontier-Class Multimodal LLMs (2024)
- LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal! (2024)
- SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models (2024)
- Building and better understanding vision-language models: insights and future directions (2024)