LightLab: Controlling Light Sources in Images with Diffusion Models Paper • 2505.09608 • Published May 14 • 32
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning Paper • 2504.06958 • Published Apr 9 • 11
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale Paper • 2504.16030 • Published Apr 22 • 35
Vidi: Large Multimodal Models for Video Understanding and Editing Paper • 2504.15681 • Published Apr 22 • 15
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model Paper • 2504.07615 • Published Apr 10 • 32
SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published Apr 7 • 192
Article Open-Source Handwritten Signature Detection Model By samuellimabraz • Mar 14 • 114
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models Paper • 2503.06749 • Published Mar 9 • 30
Article SmolVLM2: Bringing Video Understanding to Every Device By orrzohar and 6 others • Feb 20 • 279
Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections Paper • 2409.14677 • Published Sep 23, 2024 • 16
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 132
BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion Paper • 2408.04785 • Published Aug 8, 2024 • 9
Perturbed Attention Guidance pipelines Collection Pipelines for Perturbed Attention Guidance with 🧨 library • 8 items • Updated Jun 26, 2024 • 6
Reproducibility Study of CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification Paper • 2405.11574 • Published May 19, 2024 • 1