Drishti's Collections
The Evolution of Multimodal Model Architectures • Paper • 2405.17927 • Published • 1
What matters when building vision-language models? • Paper • 2405.02246 • Published • 102
Efficient Architectures for High Resolution Vision-Language Models • Paper • 2501.02584 • Published
Building and better understanding vision-language models: insights and future directions • Paper • 2408.12637 • Published • 125
Improving Fine-grained Visual Understanding in VLMs through Text-Only Training • Paper • 2412.12940 • Published
VILA: On Pre-training for Visual Language Models • Paper • 2312.07533 • Published • 23
Renaissance: Investigating the Pretraining of Vision-Language Encoders • Paper • 2411.06657 • Published
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions • Paper • 2404.07214 • Published
NanoVLMs: How small can we go and still make coherent Vision Language Models? • Paper • 2502.07838 • Published
POINTS: Improving Your Vision-language Model with Affordable Strategies • Paper • 2409.04828 • Published • 24
Unveiling Encoder-Free Vision-Language Models • Paper • 2406.11832 • Published • 51
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers • Paper • 2410.14072 • Published
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token • Paper • 2501.03895 • Published • 49
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model • Paper • 2402.03766 • Published • 14
PaliGemma: A versatile 3B VLM for transfer • Paper • 2407.07726 • Published • 68
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices • Paper • 2411.10640 • Published • 45
Scalable Vision Language Model Training via High Quality Data Curation • Paper • 2501.05952 • Published • 1
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models • Paper • 2403.18814 • Published • 47
VisionZip: Longer is Better but Not Necessary in Vision Language Models • Paper • 2412.04467 • Published • 107
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs • Paper • 2401.06209 • Published
Model Composition for Multimodal Large Language Models • Paper • 2402.12750 • Published
A Review of Multi-Modal Large Language and Vision Models • Paper • 2404.01322 • Published
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones • Paper • 2312.16862 • Published • 31
TinyLLaVA: A Framework of Small-scale Large Multimodal Models • Paper • 2402.14289 • Published • 19
Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model • Paper • 2411.05903 • Published
Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training • Paper • 2311.14109 • Published
TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding • Paper • 2501.15513 • Published
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model • Paper • 2401.02330 • Published • 16
MM-LLMs: Recent Advances in MultiModal Large Language Models • Paper • 2401.13601 • Published • 47
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance • Paper • 2410.16261 • Published • 5
The (R)Evolution of Multimodal Large Language Models: A Survey • Paper • 2402.12451 • Published
Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy • Paper • 2412.17759 • Published
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution • Paper • 2409.12191 • Published • 76
Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities • Paper • 2403.04908 • Published
google/paligemma2-3b-mix-448 • Image-Text-to-Text • Updated • 1.43k • 26
LLaVA-o1: Let Vision Language Models Reason Step-by-Step • Paper • 2411.10440 • Published • 114
InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning • Paper • 2502.11573 • Published • 6
OpenGVLab/Mini-InternVL-Chat-2B-V1-5 • Image-Text-to-Text • Updated • 2.25k • 71
HuggingFaceTB/SmolVLM-256M-Instruct • Image-Text-to-Text • Updated • 38.2k • 155
Qwen/Qwen2.5-VL-3B-Instruct • Image-Text-to-Text • Updated • 424k • 222
MILVLG/imp-v1-3b • Text Generation • Updated • 373 • 202
marianna13/llava-phi-2-3b • Text Generation • Updated • 173 • 12
Exploring the Potential of Encoder-free Architectures in 3D LMMs • Paper • 2502.09620 • Published • 26