Dolphin: new OCR model by ByteDance with MIT license 🐬
the model first detects elements in the layout (tables, formulas, etc.) and then parses each element in parallel for generation
Model: ByteDance/Dolphin
Try the demo: ByteDance/Dolphin
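As a rough sketch of that two-stage idea (not Dolphin's actual API), here is what "detect the layout once, then parse each element concurrently" can look like; `detect_layout`, `parse_element`, and `Element` are hypothetical placeholders for the model's stage-1 and stage-2 calls:

```python
# Illustrative skeleton of the "analyze layout first, parse elements in parallel" flow.
# detect_layout() and parse_element() are hypothetical stand-ins for the real model calls.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Element:
    kind: str    # "table", "formula", "text", ...
    bbox: tuple  # (x0, y0, x1, y1) region to crop from the page

def detect_layout(page_image) -> list[Element]:
    """Stage 1: one generation pass returning page elements in reading order."""
    raise NotImplementedError  # placeholder for the real layout-detection call

def parse_element(page_image, element: Element) -> str:
    """Stage 2: parse one cropped element (table -> HTML, formula -> LaTeX, ...)."""
    raise NotImplementedError  # placeholder for the real element-parsing call

def parse_page(page_image) -> list[str]:
    elements = detect_layout(page_image)
    # stage 2 is independent per element, so the crops can be parsed concurrently
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda el: parse_element(page_image, el), elements))
```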
Brand new MCP Course units are out, and now it's getting REAL! We've collaborated with Anthropic to dive deep into production-ready, autonomous agents using MCP
This is what the new material covers:
- Use Claude Code to build an autonomous PR agent
- Connect your agent with Slack and GitHub to integrate it with your team
- Get certified on your use case and share it with the community
- Build an autonomous PR cleanup agent on the Hugging Face Hub and deploy it with Spaces
The material goes deep into these problems and helps you build applications that work. We're super excited to see what you build with it.
stop building parser pipelines: there's a new document parser that's small, fast, Apache 2.0 licensed, and better than all the others! 😱
echo840/MonkeyOCR is a 3B model that can parse everything (charts, formulas, tables, etc.) in a document 🤗
> the authors show in the paper that document parsing pipelines often suffer from errors propagating through the stages
> single end-to-end models do better, but they're too heavy to use
this model addresses both: it's lighter, faster, stronger 🔥
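A minimal sketch for grabbing the weights from the Hub; the local_dir path is just an example, and for actual parsing you'd follow the instructions on the echo840/MonkeyOCR model card:

```python
# Download the MonkeyOCR checkpoint from the Hugging Face Hub.
# For running inference, follow the model card instructions.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="echo840/MonkeyOCR",      # repo id from the post
    local_dir="./MonkeyOCR-weights",  # example destination, pick any path
)
print(f"weights downloaded to {local_path}")
```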
> based on ViT, available in different sizes (L/G/H) and resolutions (286/384)
> 0-day support in 🤗 transformers
> comes with physical reasoning (from video) benchmarks (MVPBench, IntPhys 2, and CausalVQA) and a leaderboard: facebook/physical_reasoning_leaderboard
Use this MCP server with tools like Claude Desktop, Cursor, VS Code, or Continue (a config sketch follows below) to do this:
- Define an ML problem like Image Classification, LLM fine-tuning, Text Classification, etc.
- The AI can retrieve models and datasets from the Hub using the Hub MCP.
- Training happens on a Hugging Face Space, so no worries about hardware constraints.
- Models are pushed to the Hub to be used with inference tools like llama.cpp, vLLM, MLX, etc.
- Built on top of the AutoTrain library, so it has full integration with transformers and other libraries.
Everything is still under active development, but I'm super excited to hear what people build, and I'm open to contributions!
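As a rough illustration of the wiring (not this project's documented setup), here is what registering an MCP server in Claude Desktop's config usually looks like; the "autotrain" name, the launcher command, and the package name below are hypothetical placeholders, so substitute the real entry point from the README:

```python
# Sketch: add an MCP server entry to Claude Desktop's config file (macOS path shown).
# The "autotrain" name, the launcher command, and the package name are hypothetical
# placeholders -- replace them with the server's real entry point.
import json
import os
import pathlib

config_path = pathlib.Path.home() / "Library" / "Application Support" / "Claude" / "claude_desktop_config.json"
config = json.loads(config_path.read_text()) if config_path.exists() else {}

config.setdefault("mcpServers", {})["autotrain"] = {
    "command": "uvx",                  # hypothetical launcher
    "args": ["autotrain-mcp-server"],  # hypothetical package name
    "env": {"HF_TOKEN": os.environ.get("HF_TOKEN", "")},  # Hub auth so the server can pull/push models
}

config_path.parent.mkdir(parents=True, exist_ok=True)
config_path.write_text(json.dumps(config, indent=2))
print(f"wrote {config_path}")
```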
Qwen2.5-Omni is soooo good that people build multimodal reasoning models off of it 🥹
> KE-Team/Ke-Omni-R-3B is an open-source audio reasoning model, SOTA on average across benchmarks, based on Qwen/Qwen2.5-Omni-3B 🗣️
> Haoz0206/Omni-R1 is a video reasoning model with pixel-level grounding (see below) and it's super competitive ⏯️ based on Qwen/Qwen2.5-Omni-7B
vision LMs are saturating benchmarks, so we built vibe eval 💬
> compare different models with refreshed in-the-wild examples in different categories 🤗
> submit your favorite model for eval
no numbers -- just vibes!
emerging trend: models that can understand image + text and generate image + text
don't miss out ⤵️
> MMaDA: a single 8B diffusion model aligned with CoT (reasoning!) + UniGRPO Gen-Verse/MMaDA
> BAGEL: a 7B MoT model based on Qwen2.5, SigLIP-so-400M, and the Flux VAE ByteDance-Seed/BAGEL
both by ByteDance! 😱
multimodal 💬🖼️
> new moondream (VLM) is out: a 4-bit quantized (with QAT) version of moondream-2b that runs on 2.5GB VRAM at 184 tps with only a 0.6% drop in accuracy (OS)
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. they also released Dolphin, a document parsing VLM 🐬 (OS)
> Google DeepMind dropped MedGemma at I/O, a VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance
> MMaDA is a new 8B diffusion language model that can generate images and text
- It's still free!
- Video 1 walks you through onboarding to the course
- The first live session is next week!
- You can now get a certificate via the exam app
- We improved the written material with interactive quizzes
If you're studying MCP and want a live, interactive, visual, certified course, then join us on the Hub!