the method is simple: find which tokens have the highest attention scores, merge the rest of the tokens based on similarity, then merge both sets
their method is training-free, and with fine-tuning the authors report a 5-point improvement on average across vision language tasks plus an 8x improvement in prefilling time for LLaVA-NeXT 7B and 13B
removing redundant tokens improves image token quality too
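A toy sketch of the token-reduction idea described above, not the paper's exact algorithm: keep the most-attended visual tokens, then fold each remaining token into its most similar kept token by averaging. The tensor shapes, keep ratio, and merging rule here are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reduce_tokens(tokens: torch.Tensor, attn: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """tokens: (N, D) visual token embeddings; attn: (N,) per-token attention scores."""
    n = tokens.shape[0]
    n_keep = max(1, int(n * keep_ratio))
    keep_idx = attn.topk(n_keep).indices                       # most-attended ("salient") tokens
    keep_set = set(keep_idx.tolist())
    rest_idx = torch.tensor([i for i in range(n) if i not in keep_set], dtype=torch.long)

    kept, rest = tokens[keep_idx], tokens[rest_idx]
    # assign every remaining token to its most similar kept token (cosine similarity)
    sim = F.normalize(rest, dim=-1) @ F.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)

    merged = kept.clone()
    for j in range(n_keep):                                    # average each kept token with its assigned group
        group = rest[assign == j]
        if len(group):
            merged[j] = torch.cat([kept[j:j + 1], group]).mean(dim=0)
    return merged                                              # (n_keep, D) reduced token set

# e.g. 576 LLaVA-style image tokens reduced to 144
tokens, attn = torch.randn(576, 1024), torch.rand(576)
print(reduce_tokens(tokens, attn).shape)                       # torch.Size([144, 1024])
```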
we have launched Kernel Hub: easy optimized kernels for all models on Hugging Face, ready to use right away! it's where the community shares optimized kernels
this release comes in three parts
> Kernel Hub: hosts the kernels themselves (14 kernels as of now)
> kernels: Python library to load kernels from Kernel Hub
> kernel-builder: Nix package to build kernels for PyTorch (built with the PyTorch C++ frontend)
when building models, your regular workflow should be pulling kernels from the Hub and building your model with them. here's a practical example with RMSNorm (sketched below):
1. pull the kernel from the Hub with get_kernel
2. decorate your layer with use_kernel_forward_from_hub
3. inject it into your model
we'd love to hear your feedback! we also welcome kernel contributions from the community
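A minimal sketch of those three steps with the kernels library. get_kernel and use_kernel_forward_from_hub are the library's documented entry points; the kernel repo id below ("kernels-community/triton-layer-norm") and the fallback RMSNorm implementation are assumptions, so browse the Kernel Hub and the kernels docs for the exact kernel and for how the layer name gets mapped to it at runtime.

```python
import torch
import torch.nn as nn
from kernels import get_kernel, use_kernel_forward_from_hub

# 1. pull a kernel straight from the Hub (downloaded and cached on first call);
#    the repo id is an assumption -- pick the RMSNorm kernel you want from the Kernel Hub
norm_kernel = get_kernel("kernels-community/triton-layer-norm")

# 2. decorate your layer so its forward can be swapped for the Hub kernel
@use_kernel_forward_from_hub("RMSNorm")
class RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # plain PyTorch fallback; the optimized Hub kernel replaces this once the mapping is applied
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.variance_epsilon)

# 3. inject it into your model like any other nn.Module
model = nn.Sequential(nn.Linear(4096, 4096), RMSNorm(4096))
```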
Dolphin: new OCR model by ByteDance with MIT license
the model first detects elements in the layout (tables, formulas, etc.) and then parses each element in parallel for generation
Model: ByteDance/Dolphin
Try the demo: ByteDance/Dolphin
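A rough sketch of trying the model through transformers, under the assumption that the checkpoint loads as a Donut-style VisionEncoderDecoderModel and is driven by prompts for the two stages (layout analysis, then per-element parsing); the prompt string below is illustrative, so follow the model card for the exact usage.

```python
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

processor = AutoProcessor.from_pretrained("ByteDance/Dolphin")
model = VisionEncoderDecoderModel.from_pretrained("ByteDance/Dolphin").to("cuda").eval()

image = Image.open("page.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values.to("cuda")

# stage 1: ask for the page layout / reading order (prompt wording is an assumption)
prompt = "Parse the reading order of this document."
prompt_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids.to("cuda")
layout = model.generate(pixel_values=pixel_values, decoder_input_ids=prompt_ids, max_new_tokens=1024)
print(processor.tokenizer.batch_decode(layout, skip_special_tokens=True)[0])
# stage 2 would crop each detected element and parse the crops in parallel with element-specific prompts
```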
stop building parser pipelines: there's a new document parser that is small, fast, Apache 2.0 licensed, and better than all the others!
echo840/MonkeyOCR is a 3B model that can parse everything in a document (charts, formulas, tables, etc.)
> the authors show in the paper that document parsing pipelines often propagate errors between stages
> single end-to-end models do better, but they're too heavy to use
this model addresses both: it's lighter, faster, and stronger
> based on ViT, available in different sizes (L/G/H) and resolutions (256/384)
> 0-day support in transformers
> comes with physical reasoning (from video) benchmarks: MVPBench, IntPhys 2, and CausalVQA, plus a leaderboard: facebook/physical_reasoning_leaderboard
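A short sketch of the 0-day transformers support; the checkpoint id ("facebook/vjepa2-vitl-fpc64-256"), the video processor call, and the dummy clip shape are assumptions to verify against the model card.

```python
import torch
from transformers import AutoModel, AutoVideoProcessor

model_id = "facebook/vjepa2-vitl-fpc64-256"        # assumed checkpoint id, check the Hub
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# dummy 64-frame clip (frames, channels, height, width); replace with real decoded video frames
video = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
inputs = processor(video, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state   # per-patch video embeddings for downstream probes
print(features.shape)
```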
Qwen2.5-Omni is so good that people build multimodal reasoning models off of it
> KE-Team/Ke-Omni-R-3B is an open-source audio reasoning model, SOTA on average across benchmarks, based on Qwen/Qwen2.5-Omni-3B
> Haoz0206/Omni-R1 is a video reasoning model with pixel-level grounding (see below), and it's super competitive, based on Qwen/Qwen2.5-Omni-7B
vision LMs are saturating benchmarks, so we built vibe eval
> compare different models on refreshed in-the-wild examples across different categories
> submit your favorite model for eval
no numbers -- just vibes!
emerging trend: models that can understand image + text and generate image + text
don't miss out
> MMaDA: single 8B diffusion model aligned with CoT (reasoning!) + UniGRPO (Gen-Verse/MMaDA)
> BAGEL: 7B MoT model based on Qwen2.5, SigLIP-so-400M, and the Flux VAE (ByteDance-Seed/BAGEL)
both by ByteDance!
multimodal
> new moondream (VLM) is out: a 4-bit quantized (with QAT) version of moondream-2b that runs in 2.5GB VRAM at 184 tps with only a 0.6% drop in accuracy (OS)
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. they also released Dolphin, a document parsing VLM (OS)
> Google DeepMind dropped MedGemma at I/O, a VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance
> MMaDA is a new 8B diffusion language model that can generate images and text
> first reasoning model for robotics
> based on Qwen2.5-VL-7B, works with Hugging Face transformers or vLLM (sketch below)
> comes with SFT & alignment datasets and a new benchmark
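The checkpoint isn't named above, so the repo id below is a hypothetical placeholder; the loading recipe simply follows the standard Qwen2.5-VL classes in transformers that the post says the model is built on.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "org/robotics-reasoning-7b"             # hypothetical id, swap in the actual checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# one image of the scene plus a reasoning-style instruction
image = Image.open("scene.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Think step by step: how should the robot grasp the red cup?"},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```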