Article Vision Language Model Alignment in TRL ⚡️ By sergiopaniego and 4 others • Aug 7 • 93
Article PP-OCRv5 on Hugging Face: A Specialized Approach to OCR By baidu and 5 others • 28 days ago • 103
PP-OCRv5 Collection PP-OCRv5 is the latest text recognition solution, supporting Simplified Chinese, Chinese Pinyin, Traditional Chinese, English, and Japanese • 13 items • Updated 23 days ago • 46
Article Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub By nvidia and 11 others • Jun 27 • 28
V-JEPA 2 Collection A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated Jun 13 • 164
Article ScreenSuite - The most comprehensive evaluation suite for GUI Agents! Jun 6 • 54
Holo1 Collection Vision-Language Action Model for use in Surfer-H web navigation agent • 6 items • Updated Jun 10 • 48
AGUVIS: Unified Pure Vision GUI Agents Collection https://aguvis-project.github.io • 3 items • Updated Dec 20, 2024 • 7
MiniCPM-o & MiniCPM-V Collection Multimodal models with leading performance. • 28 items • Updated Sep 1 • 54
Article Vision Language Models (Better, Faster, Stronger) By merve and 4 others • May 12 • 538
video-effects datasets Collection Smol datasets to emulate cool video effects like "squish", "dissolve", etc. Inspired by Pika effects. • 4 items • Updated Jan 28 • 4
AIMv2 Collection A collection of AIMv2 vision encoders covering several fixed resolutions, native resolution, and a distilled checkpoint. • 19 items • Updated Aug 25 • 82