Drishti's Collections
The Evolution of Multimodal Model Architectures • Paper • 2405.17927 • Published • 1
What matters when building vision-language models? • Paper • 2405.02246 • Published • 102
Efficient Architectures for High Resolution Vision-Language Models • Paper • 2501.02584 • Published
Building and better understanding vision-language models: insights and future directions • Paper • 2408.12637 • Published • 125
Improving Fine-grained Visual Understanding in VLMs through Text-Only Training • Paper • 2412.12940 • Published
VILA: On Pre-training for Visual Language Models • Paper • 2312.07533 • Published • 23
Renaissance: Investigating the Pretraining of Vision-Language Encoders • Paper • 2411.06657 • Published
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions • Paper • 2404.07214 • Published
NanoVLMs: How small can we go and still make coherent Vision Language Models? • Paper • 2502.07838 • Published
POINTS: Improving Your Vision-language Model with Affordable Strategies • Paper • 2409.04828 • Published • 24
Unveiling Encoder-Free Vision-Language Models • Paper • 2406.11832 • Published • 51
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers • Paper • 2410.14072 • Published
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token • Paper • 2501.03895 • Published • 49
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model • Paper • 2402.03766 • Published • 14
PaliGemma: A versatile 3B VLM for transfer • Paper • 2407.07726 • Published • 68
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices • Paper • 2411.10640 • Published • 45
Scalable Vision Language Model Training via High Quality Data Curation • Paper • 2501.05952 • Published • 1
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models • Paper • 2403.18814 • Published • 47
VisionZip: Longer is Better but Not Necessary in Vision Language Models • Paper • 2412.04467 • Published • 107
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs • Paper • 2401.06209 • Published
Model Composition for Multimodal Large Language Models • Paper • 2402.12750 • Published
A Review of Multi-Modal Large Language and Vision Models • Paper • 2404.01322 • Published
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones • Paper • 2312.16862 • Published • 31
TinyLLaVA: A Framework of Small-scale Large Multimodal Models • Paper • 2402.14289 • Published • 19
Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model • Paper • 2411.05903 • Published
Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training • Paper • 2311.14109 • Published
TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding • Paper • 2501.15513 • Published
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model • Paper • 2401.02330 • Published • 16
MM-LLMs: Recent Advances in MultiModal Large Language Models • Paper • 2401.13601 • Published • 47
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance • Paper • 2410.16261 • Published • 5
The (R)Evolution of Multimodal Large Language Models: A Survey • Paper • 2402.12451 • Published
Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy • Paper • 2412.17759 • Published
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution • Paper • 2409.12191 • Published • 76
Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities • Paper • 2403.04908 • Published
google/paligemma2-3b-mix-448 • Image-Text-to-Text • Updated • 1.43k • 26
LLaVA-o1: Let Vision Language Models Reason Step-by-Step • Paper • 2411.10440 • Published • 114
InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning • Paper • 2502.11573 • Published • 6
OpenGVLab/Mini-InternVL-Chat-2B-V1-5 • Image-Text-to-Text • Updated • 2.25k • 71
HuggingFaceTB/SmolVLM-256M-Instruct • Image-Text-to-Text • Updated • 38.2k • 155
Qwen/Qwen2.5-VL-3B-Instruct • Image-Text-to-Text • Updated • 424k • 222
MILVLG/imp-v1-3b • Text Generation • Updated • 373 • 202
marianna13/llava-phi-2-3b • Text Generation • Updated • 173 • 12
Exploring the Potential of Encoder-free Architectures in 3D LMMs • Paper • 2502.09620 • Published • 26