Collections including paper arxiv:2409.12191

- NVLM: Open Frontier-Class Multimodal LLMs
  Paper • 2409.11402 • Published • 71
- BRAVE: Broadening the visual encoding of vision-language models
  Paper • 2404.07204 • Published • 18
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
  Paper • 2403.18814 • Published • 44
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 101

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 12
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 38
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 19

- A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
  Paper • 2312.08578 • Published • 16
- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
  Paper • 2312.08583 • Published • 9
- Vision-Language Models as a Source of Rewards
  Paper • 2312.09187 • Published • 11
- StemGen: A music generation model that listens
  Paper • 2312.08723 • Published • 47

- The Llama 3 Herd of Models
  Paper • 2407.21783 • Published • 107
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 73
- Baichuan Alignment Technical Report
  Paper • 2410.14940 • Published • 48
- A Survey of Small Language Models
  Paper • 2410.20011 • Published • 37

- Qwen2.5-Coder Technical Report
  Paper • 2409.12186 • Published • 135
- Attention Heads of Large Language Models: A Survey
  Paper • 2409.03752 • Published • 87
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
  Paper • 2409.02634 • Published • 89
- OmniGen: Unified Image Generation
  Paper • 2409.11340 • Published • 107

- Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
  Paper • 2409.08513 • Published • 11
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
  Paper • 2409.08264 • Published • 43
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
  Paper • 2409.12191 • Published • 73
- LLMs + Persona-Plug = Personalized LLMs
  Paper • 2409.11901 • Published • 30

- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 118
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
  Paper • 2408.11039 • Published • 56
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
  Paper • 2408.16725 • Published • 52
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
  Paper • 2408.15998 • Published • 83

- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
  Paper • 2408.10188 • Published • 51
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
  Paper • 2408.08872 • Published • 97
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 118
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Paper • 2408.12528 • Published • 50