Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents Paper • 2507.04009 • Published Jul 5 • 45
StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation Paper • 2508.08248 • Published Aug 11 • 27
LightLab: Controlling Light Sources in Images with Diffusion Models Paper • 2505.09608 • Published May 14 • 35
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning Paper • 2504.06958 • Published Apr 9 • 11
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale Paper • 2504.16030 • Published Apr 22 • 37
Vidi: Large Multimodal Models for Video Understanding and Editing Paper • 2504.15681 • Published Apr 22 • 14
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model Paper • 2504.07615 • Published Apr 10 • 33
SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published Apr 7 • 198
Article Open-Source Handwritten Signature Detection Model By samuellimabraz • Mar 14 • 119
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models Paper • 2503.06749 • Published Mar 9 • 31
Article SmolVLM2: Bringing Video Understanding to Every Device By orrzohar and 6 others • Feb 20 • 302
Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections Paper • 2409.14677 • Published Sep 23, 2024 • 16
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 133
BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion Paper • 2408.04785 • Published Aug 8, 2024 • 9
Perturbed Attention Guidance pipelines Collection Pipelines for Perturbed Attention Guidance with 🧨 library • 8 items • Updated Jun 26, 2024 • 6
Reproducibility Study of CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification Paper • 2405.11574 • Published May 19, 2024 • 1