Multimodal OCR with ReportLab? On a Colab T4? (Nanonets OCR, Monkey OCR, OCRFlux 3B, Typhoon OCR 3B?) Yes, it's possible. I've made a dedicated Colab notebook to experiment with these models (all built on top of Qwen2.5-VL). 🤗🚀
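For anyone who wants to try the same pattern outside the notebook, here is a rough sketch: run one of the Qwen2.5-VL-based OCR checkpoints on a page image, then dump the extracted text into a PDF with ReportLab. The repo id, prompt, and page layout below are illustrative assumptions, not the notebook's exact code.

```python
# Minimal sketch: OCR an image with a Qwen2.5-VL-based model, then write the text
# into a PDF with ReportLab. The repo id and prompt are illustrative assumptions;
# the actual notebook may load and prompt the models differently.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

MODEL_ID = "nanonets/Nanonets-OCR-s"  # assumption: any of the Qwen2.5-VL OCR fine-tunes

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"  # fp16 fits a Colab T4
)

image = Image.open("page.png").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the text of this document as plain text."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)
text = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]

# Render the OCR result into a simple PDF report with ReportLab.
pdf = canvas.Canvas("ocr_report.pdf", pagesize=A4)
y = A4[1] - 50
for line in text.splitlines():
    pdf.drawString(40, y, line[:110])
    y -= 14
    if y < 50:  # start a new page when we run out of vertical space
        pdf.showPage()
        y = A4[1] - 50
pdf.save()
```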
✨ Tech giants are investing more in open source.
- Alibaba: Full-stack open ecosystem
- Tencent: Hunyuan image/video/3D
- ByteDance: Catching up fast in 2025
- Baidu: New player in open LLMs
✨ The startup list is shifting fast! Those who find a direction aligned with their strengths are the ones who endure.
- DeepSeek
- MiniMax
- StepFun
- Moonshot AI
- Zhipu AI
- OpenBMB
✨Research Lab & Community are making key contributions. -BAAI -Shanghai AI Lab -OpenMOSS -MAP
✨Baidu & MiniMax both launched open foundation models - Baidu: Ernie 4.5 ( from 0.3B -424B ) 🤯 - MiniMax: MiniMax -M1 ( Hybrid MoE reasoning model )
✨ Multimodal AI is moving from fusion to full-stack reasoning: unified any-to-any pipelines across text, vision, audio, and 3D.
- Baidu: ERNIE-4.5-VL-424B
- Moonshot AI: Kimi-VL-A3B
- Alibaba: Ovis-U1
- BAAI: Video-XL-2 / OmniGen2
- Ant Group: Ming-Lite-Omni
- Chinese Academy of Sciences: Stream-Omni
- ByteDance: SeedVR2-3B
- Tencent: Hunyuan 3D 2.1 / SongGeneration
- FishAudio: OpenAudio S1-mini
✨ Domain-specific models are rapidly emerging.
- Alibaba DAMO: Lingshu-7B (medical MLLM)
- BAAI: RoboBrain (robotics)
✨ So many small models!
- OpenBMB: MiniCPM4 (on-device)
- Qwen: Embedding / Reranker (0.6B)
- Alibaba: Ovis-U1-3B
- Moonshot AI: Kimi-VL-A3B
- ByteDance: SeedVR2-3B
✨ 9B base & Thinking - MIT license ✨ CoT + RL with Curriculum Sampling ✨ 64k context, 4K image, any aspect ratio ✨ Support English & Chinese ✨ Outperforms GPT 4O -2024/11/20
A bunch of comparable demos for multimodal VLMs (excelling in OCR, cinematography understanding, spatial reasoning, and more) is now up on the Hub 🤗, covering the most recent models through June 2025.
The demo for Camel-Doc-OCR-062825 (experimental) is optimized for document retrieval and direct Markdown (.md) generation from images and PDFs (see the sketch after this post). Additional demos include OCRFlux-3B (document OCR), VilaSR (spatial reasoning with visual drawing), and ShotVL (cinematic language understanding). 🐪
The Space runs on a community GPU grant from Hugging Face (special thanks to them). It supports image and video inference, with a result Markdown canvas plus object detection/localization. 🤗🚀
To learn more, visit the model card of the respective model.
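For context, here is a minimal, hedged sketch of what "direct Markdown generation from PDFs" can look like with a Qwen2.5-VL-style document checkpoint such as Camel-Doc-OCR-062825: rasterize each page with PyMuPDF, then prompt the model for Markdown. The repo id, prompt, and DPI are assumptions; the hosted demo may use different pre- and post-processing.

```python
# Sketch: render PDF pages with PyMuPDF, then ask a Qwen2.5-VL-style document model
# to emit Markdown for each page. Illustrative only; the hosted Space may differ.
import io
import fitz  # PyMuPDF
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "prithivMLmods/Camel-Doc-OCR-062825"  # assumption: standard Qwen2.5-VL chat interface

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def page_to_markdown(image: Image.Image) -> str:
    """Run one page image through the model and return Markdown text."""
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to clean Markdown (.md)."},
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=2048)
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]

doc = fitz.open("report.pdf")
markdown_pages = []
for page in doc:
    pix = page.get_pixmap(dpi=150)  # rasterize the page for the vision encoder
    image = Image.open(io.BytesIO(pix.tobytes("png"))).convert("RGB")
    markdown_pages.append(page_to_markdown(image))

with open("report.md", "w") as f:
    f.write("\n\n---\n\n".join(markdown_pages))
```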
✨ From 0.3B to 424B total params
✨ Includes 47B & 3B active-param MoE models plus a 0.3B dense model
✨ Apache 2.0
✨ 128K context length
✨ Text + vision co-training with ViT & UPO
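These specs describe Baidu's ERNIE 4.5 family mentioned earlier. As a small, hedged sketch of trying the 0.3B dense member with transformers (the repo id and the trust_remote_code flag are assumptions; check the model card before running):

```python
# Sketch: load the 0.3B dense ERNIE 4.5 variant and run a short generation.
# The repo id and trust_remote_code flag are assumptions about how the weights
# are published, not confirmed details from the post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "baidu/ERNIE-4.5-0.3B-PT"  # assumption: the post-trained 0.3B dense checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Summarize what a mixture-of-experts model is in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```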