AI & ML interests

None defined yet.

Recent Activity

andito 
posted an update 2 days ago
🧠👁️ Can AI visualize solutions?

Humans often solve visual problems by sketching ideas in their minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal “mental sketches”?

That’s the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.

These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.

🔧 Mirage is trained in two phases:

1) Grounding: It learns to produce latent tokens anchored in real images.
2) Refinement: The model drops the images and learns to generate visual tokens on its own.
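The two-phase recipe above can be sketched very loosely in plain NumPy (hypothetical shapes and losses, not the authors' code): in grounding, the latent tokens are pulled toward features from a real image; in refinement, that image supervision is dropped and the latents are trained only through the downstream task.

```python
import numpy as np

rng = np.random.default_rng(0)

def reason_with_latents(text_emb, latent_emb):
    """Interleave latent 'sketch' tokens with text embeddings so the
    model reasons over both (shapes here are purely illustrative)."""
    return np.concatenate([text_emb, latent_emb], axis=0)

# Phase 1 (grounding): latent tokens regressed toward real image features
text_emb = rng.normal(size=(8, 16))      # 8 text tokens, dim 16
latent_emb = rng.normal(size=(4, 16))    # 4 latent visual tokens
image_feats = rng.normal(size=(4, 16))   # features anchored in a real image
grounding_loss = np.mean((latent_emb - image_feats) ** 2)

# Phase 2 (refinement): the image is dropped; the model generates latent
# tokens on its own and is trained only via the task loss on the mixed sequence
seq = reason_with_latents(text_emb, latent_emb)
print(seq.shape)  # (12, 16)
```

The point of the sketch: the latent tokens live in embedding space and join the text stream directly, so no pixel-level image ever needs to be decoded.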

📈 And yes, it works!
On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines.
Smart sketches > empty words.

By mimicking the way humans visualize solutions, Mirage gives AI a new kind of imagination, one that’s faster, more efficient, and more human-like.
Kudos to the teams at UMass Amherst and MIT behind this exciting work.
Check the paper: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (2506.17218)
AdinaY 
posted an update 2 days ago
The Chinese Open Source Heatmap is live 🔥
You can now track the companies, research labs, and communities powering China’s open source AI movement.

zh-ai-community/model-release-heatmap-zh

Some highlights:

✨Tech giants are investing more in open source.
-Alibaba: Full stack open ecosystem
-Tencent: Hunyuan image/video/3D
-Bytedance: Catching up fast in 2025
-Baidu: New player in open LLMs

✨New players emerging after the DeepSeek moment.
-Xiaomi
-Red Note
-Bilibili
-MiniMax
-Moonshot AI

✨The startup list is shifting fast! Those who find a direction aligned with their strengths are the ones who endure.
-DeepSeek
-MiniMax
-StepFun
-Moonshot AI
-Zhipu AI
-OpenBMB

✨Research Lab & Community are making key contributions.
-BAAI
-Shanghai AI Lab
-OpenMOSS
-MAP
sergiopaniego 
posted an update 2 days ago
Updated my HF Space for vibe testing smol VLMs on object detection, visual grounding, keypoint detection & counting! 👓

🆕 Compare Qwen2.5 VL 3B vs Moondream 2B side-by-side with annotated images & text outputs.

Try examples or test your own images! 🏃

📱Space: sergiopaniego/vlm_object_understanding
AdinaY 
posted an update 3 days ago
🔥 June highlights from China’s open source ecosystem.

zh-ai-community/june-2025-open-works-from-the-chinese-community-683d66c188f782dc5570ba15

✨Baidu & MiniMax both launched open foundation models
- Baidu: Ernie 4.5 (from 0.3B to 424B) 🤯
- MiniMax: MiniMax-M1 (hybrid MoE reasoning model)

✨Multimodal AI is moving from fusion to full-stack reasoning: unified Any-to-Any pipelines across text, vision, audio, and 3D
- Baidu: ERNIE-4.5-VL-424B
- Moonshot AI: Kimi-VL-A3B
- Alibaba: Ovis-U1
- BAAI: Video-XL-2/OmniGen2
- AntGroup: Ming-Lite-Omni
- Chinese Academy of Sciences: Stream-Omni
- Bytedance: SeedVR2-3B
- Tencent: Hunyuan 3D 2.1 / SongGeneration
- FishAudio: Openaudio-s1-mini

✨Domain specific models are rapidly emerging
- Alibaba DAMO: Lingshu-7B (medical MLLM)
- BAAI: RoboBrain (Robotics)

✨ So many small models!
- OpenBMB: MiniCPM4 (on-device)
- Qwen: Embedding/Reranker (0.6B)
- Alibaba: Ovis-U1-3B
- Moonshot AI: Kimi-VL-A3B
- Bytedance: SeedVR2-3B
merve 
posted an update 3 days ago
SOOOO MANY MODEL RELEASES 😍
Here's some picks from past week 🤗

> ByteDance/XVerse is a new identity-preserving image generation model 🖼️
> google/gemma-3n-E4B-it, any-to-text model supported by transformers 🤗
> nvidia/llama-nemoretriever-colembed-3b-v1 two new state-of-the-art visual document retrievers 📑
> New version of Dia TTS model is up nari-labs/Dia-1.6B-0626
> Black Forest Labs releases Kontext benchmark black-forest-labs/kontext-bench

Find more here merve/releases-june-27-6864e8eb17f7e3a8b444083c
AdinaY 
posted an update 3 days ago
MTVCraft 🔥 Veo3 style Audio-Video model by BAAI

Model:
BAAI/MTVCraft
Demo:
BAAI/MTVCraft

✨ Text > [Speech + SFX + BGM] > Synchronized Video
✨ Built with Qwen3 + ElevenLabs + MTV
merve 
posted an update 3 days ago
AdinaY 
posted an update 3 days ago
GLM-4.1V-Thinking 🔥 New open vision reasoning model by Zhipu AI

THUDM/glm-41v-thinking-6862bbfc44593a8601c2578d

✨ 9B Base & Thinking models, MIT license
✨ CoT + RL with Curriculum Sampling
✨ 64k context, 4K image resolution, any aspect ratio
✨ Supports English & Chinese
✨ Outperforms GPT-4o (2024-11-20)
sergiopaniego 
posted an update 5 days ago
📣 CALL FOR CONTRIBUTORS! 📣

Following last week’s full release of Gemma 3n, we launched a dedicated recipes repo to explore and share use cases. We already added some! 🧑‍🍳

Now we’re inviting the community to contribute and showcase how these models shine! ✨

Let them cook.

Check it out: https://github.com/huggingface/huggingface-gemma-recipes/issues/4
merve 
posted an update 5 days ago
AdinaY 
posted an update 5 days ago
AdinaY 
posted an update 5 days ago
Baidu kept its promise, releasing 10 open models on the very last day of June 🚀 Let's meet ERNIE 4.5 🔥

baidu/ernie-45-6861cd4c9be84540645f35c9

✨ From 0.3B to 424B total params
✨ Includes 47B & 3B active param MoE models + a 0.3B dense model
✨ Apache 2.0
✨ 128K context length
✨ Text+Vision co-training with ViT & UPO
AdinaY 
posted an update 8 days ago
Hunyuan-A13B 🔥 New MoE LLM by TencentHunyuan

tencent/Hunyuan-A13B-Instruct

✨80B total / 13B active params
✨256K context window
✨Dual-mode reasoning: fast & slow thinking
✨Efficient inference (GQA + quantization)
merve 
posted an update 9 days ago
Dataset Viewer for PDFs just landed on Hugging Face 📖🤗 you can now preview PDFs more easily than before!

on top of this, there's the PdfFolder format to load PDF datasets quicker 💨
> to use it, your dataset should follow a directory format like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels etc., you can keep them in a metadata.csv file in the same folder 🤝
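The directory layout above can be built with the standard library alone; here's a minimal sketch (the `folder` root, file names, and `label` column are illustrative, and the placeholder bytes stand in for real PDFs):

```python
import csv
from pathlib import Path

root = Path("folder")            # hypothetical dataset root
train = root / "train"
train.mkdir(parents=True, exist_ok=True)

# Placeholder PDFs; a real dataset would hold actual documents
for name in ("doc1.pdf", "doc2.pdf"):
    (train / name).write_bytes(b"%PDF-1.4\n%%EOF\n")

# Optional metadata.csv in the same folder for labels, bounding boxes, etc.
# file_name must match the PDF names relative to this folder.
with open(train / "metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file_name", "label"])
    writer.writerow(["doc1.pdf", "invoice"])
    writer.writerow(["doc2.pdf", "report"])
```

With that layout in place, the document dataset docs linked in the post describe how the folder is picked up as a `train` split, with the metadata columns attached to each PDF.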

read document dataset docs https://huggingface.co/docs/datasets/main/en/document_dataset
check all the document datasets here https://huggingface.co/datasets?modality=modality:document&sort=trending 📖
freddyaboulton 
posted an update 10 days ago