Humans often solve visual problems by sketching ideas in their minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal “mental sketches”?
That’s the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.
These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.
🔧 Mirage is trained in two phases:
1) Grounding: the model learns to produce latent tokens anchored in real images.
2) Refinement: the model drops the images and learns to generate visual tokens on its own.
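The two phases can be illustrated with a toy sketch. Everything below is hypothetical (the function names, dimensions, and losses are stand-ins, not the Mirage implementation): the point is only that phase 1 regresses latent visual tokens toward real image embeddings, while phase 2 drops that anchor and lets the task loss alone shape them.

```python
import random

random.seed(0)
D = 16  # latent dimension (illustrative)
K = 4   # number of latent visual tokens per "sketch"

def image_encoder(image_path):
    # Stand-in for a vision encoder: K latent vectors for a real image.
    return [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

def predict_latents(text_prefix):
    # Stand-in for the VLM head that emits latent visual tokens
    # interleaved with its textual chain of thought.
    return [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

def mse(a, b):
    return sum((x - y) ** 2
               for ra, rb in zip(a, b)
               for x, y in zip(ra, rb)) / (K * D)

# Phase 1 (grounding): latent tokens are regressed toward embeddings
# of a real helper image, so they stay anchored in actual pixels.
target = image_encoder("block_puzzle.png")  # hypothetical input
pred = predict_latents("Plan: move the red block ...")
grounding_loss = mse(pred, target)

# Phase 2 (refinement): the helper image is dropped; the latent
# tokens are now shaped only by the downstream answer loss
# (sketched here as a simple proxy objective).
pred = predict_latents("Plan: move the red block ...")
answer_loss = mse(pred, [[0.0] * D for _ in range(K)])
```

The key design point the sketch captures: the latent tokens never need to decode back into a viewable image, so they can stay compact and purely task-oriented.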
📈 And yes, it works! On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines. Smart sketches > empty words.
✨ Tech giants are investing more in open source. - Alibaba: full-stack open ecosystem - Tencent: Hunyuan image/video/3D - Bytedance: catching up fast in 2025 - Baidu: new player in open LLMs
✨ The startup lineup is shifting fast! Those who find a direction aligned with their strengths are the ones who endure. - DeepSeek - MiniMax - StepFun - Moonshot AI - Zhipu AI - OpenBMB
✨ Research labs & communities are making key contributions. - BAAI - Shanghai AI Lab - OpenMOSS - MAP
✨ Baidu & MiniMax both launched open foundation models - Baidu: Ernie 4.5 (from 0.3B to 424B) 🤯 - MiniMax: MiniMax-M1 (hybrid MoE reasoning model)
✨Multimodal AI is moving from fusion to full-stack reasoning: unified Any-to-Any pipelines across text, vision, audio, and 3D - Baidu: ERNIE-4.5-VL-424B - Moonshot AI: Kimi-VL-A3B - Alibaba: Ovis-U1 - BAAI: Video-XL-2/OmniGen2 - AntGroup: Ming-Lite-Omni - Chinese Academy of Science: Stream-Omni - Bytedance: SeedVR2-3B - Tencent: Hunyuan 3D 2.1/ SongGeneration - FishAudio: Openaudio-s1-mini
✨Domain specific models are rapidly emerging - Alibaba DAMO: Lingshu-7B (medical MLLM) - BAAI: RoboBrain (Robotics)
✨ So many small models! - OpenBMB: MiniCPM4 (on-device) - Qwen: Embedding/Reranker (0.6B) - Alibaba: Ovis-U1-3B - Moonshot AI: Kimi-VL-A3B - Bytedance: SeedVR2-3B
✨ 9B base & Thinking - MIT license ✨ CoT + RL with curriculum sampling ✨ 64K context, 4K images, any aspect ratio ✨ Supports English & Chinese ✨ Outperforms GPT-4o (2024-11-20 snapshot)
✨ From 0.3B to 424B total params ✨ Includes 47B and 3B active-param MoE models, plus a 0.3B dense model ✨ Apache 2.0 ✨ 128K context length ✨ Text+vision co-training with ViT & UPO
Dataset Viewer for PDFs just landed on Hugging Face 📖🤗 you can now preview all your PDFs more easily than before!
on top of this, there's the PdfFolder format to load PDF datasets quicker 💨 > to use it, your dataset should follow a directory layout like folder/train/doc1.pdf, folder/train/doc2.pdf > if you want to include bounding boxes, labels etc. you can keep them in a metadata.csv file in the same folder 🤝
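A quick sketch of the expected layout (the dataset name, labels, and column names below are made up for illustration; `file_name` is the conventional key that folder-based HF formats use to link metadata rows to files, so verify against the PdfFolder docs for your version of `datasets`):

```python
from pathlib import Path

root = Path("my_pdf_dataset")  # hypothetical dataset directory
(root / "train").mkdir(parents=True, exist_ok=True)

# Place your PDFs under the split directory
# (empty placeholder files here, real PDFs in practice).
for name in ["doc1.pdf", "doc2.pdf"]:
    (root / "train" / name).touch()

# Optional per-file metadata (labels, bounding boxes, ...)
# lives in metadata.csv next to the PDFs.
(root / "train" / "metadata.csv").write_text(
    "file_name,label\n"
    "doc1.pdf,invoice\n"
    "doc2.pdf,report\n"
)

# Loading should then reduce to something like (not run here;
# requires the `datasets` library with PDF support installed):
# from datasets import load_dataset
# ds = load_dataset("pdffolder", data_dir="my_pdf_dataset")
```
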