- Chat with Kimi-VL-A3B-Thinking • 73
  🤔 Interact with a chatbot that understands text and images
- Chat with Kimi-VL (Image, Agent, Video, PDF) • 26
  🚀 Chat with Kimi-VL-A3B-Instruct using text, images, and videos
- moonshotai/Kimi-VL-A3B-Thinking
  Image-Text-to-Text • Updated • 10.2k • 318
- moonshotai/Kimi-VL-A3B-Instruct
  Image-Text-to-Text • Updated • 9.71k • 158
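A minimal loading sketch for the Instruct checkpoint above, assuming it follows the standard transformers remote-code AutoModel/AutoProcessor pattern; the image file name and the exact chat-message fields are placeholder assumptions, so check the model card for the supported prompt format.

```python
# Sketch (not taken from the model card): load moonshotai/Kimi-VL-A3B-Instruct
# through the generic transformers remote-code path and run one image+text turn.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Instruct"

# trust_remote_code is needed because the architecture code ships with the repo.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("demo.png")  # placeholder image path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # message schema is an assumption; see the model card
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Drop the prompt tokens so only the model's reply is decoded.
reply_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(reply_ids, skip_special_tokens=True)[0])
```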
Collections
Collections including paper arxiv:2504.07491
- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
  Paper • 2503.10615 • Published • 16
- UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
  Paper • 2503.10630 • Published • 6
- Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
  Paper • 2503.09516 • Published • 27
- LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
  Paper • 2503.07536 • Published • 84

- MLLM-as-a-Judge for Image Safety without Human Labeling
  Paper • 2501.00192 • Published • 30
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 107
- Xmodel-2 Technical Report
  Paper • 2412.19638 • Published • 27
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  Paper • 2412.18925 • Published • 101

- LinFusion: 1 GPU, 1 Minute, 16K Image
  Paper • 2409.02097 • Published • 35
- Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
  Paper • 2409.11406 • Published • 28
- Diffusion Models Are Real-Time Game Engines
  Paper • 2408.14837 • Published • 126
- Segment Anything with Multiple Modalities
  Paper • 2408.09085 • Published • 23

- Gemma 3 Technical Report
  Paper • 2503.19786 • Published • 45
- Kimi-VL Technical Report
  Paper • 2504.07491 • Published • 109
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  Paper • 2504.10479 • Published • 185
- FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
  Paper • 2504.09925 • Published • 36