AI & ML interests

None defined yet.

Recent Activity

andito
posted an update 3 days ago
🧠👁️ Can AI visualize solutions?

Humans often solve visual problems by sketching ideas in their minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal "mental sketches"?

That's the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.

These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.

🔧 Mirage is trained in two phases:

1) Grounding: It learns to produce latent tokens anchored in real images.
2) Refinement: The model drops the images and learns to generate visual tokens on its own.
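
To make the two-phase recipe concrete, here's a toy PyTorch sketch of the idea; this is not the authors' code, and every module name, shape, and loss below is invented for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 64
vision_encoder = nn.Linear(128, d_model)   # stand-in for a real image encoder
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
sketch_head = nn.Linear(d_model, d_model)  # predicts latent visual tokens

def forward_with_sketch(text_emb, n_sketch=4):
    h = backbone(text_emb)
    sketches = sketch_head(h[:, -n_sketch:, :])           # the "mental sketch"
    return backbone(torch.cat([text_emb, sketches], dim=1)), sketches

text = torch.randn(2, 10, d_model)                        # toy text embeddings
image_feats = vision_encoder(torch.randn(2, 4, 128))      # real visual anchors

# Phase 1 (grounding): latent tokens are pulled toward real image features
_, sketches = forward_with_sketch(text)
F.mse_loss(sketches, image_feats).backward()

# Phase 2 (refinement): images are dropped, only the end-task loss remains,
# so the model must learn to generate useful sketches on its own
out, _ = forward_with_sketch(text)
out.pow(2).mean().backward()                              # placeholder task loss
```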

📈 And yes, it works!
On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines.
Smart sketches > empty words.

By mimicking the way humans visualize solutions, Mirage gives AI a new kind of imagination, one that's faster, more efficient, and more human-like.
Kudos to the teams at UMass Amherst and MIT behind this exciting work.
Check the paper: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (2506.17218)
merve
posted an update 3 days ago
SOOOO MANY MODEL RELEASES 😍
Here are some picks from the past week 🤗

> ByteDance/XVerse is a new identity-preserving image generation model 🖼️
> google/gemma-3n-E4B-it, an any-to-text model supported by transformers 🤗
> nvidia/llama-nemoretriever-colembed-3b-v1, one of two new state-of-the-art visual document retrievers 📑
> A new version of the Dia TTS model is up: nari-labs/Dia-1.6B-0626
> Black Forest Labs released the Kontext benchmark: black-forest-labs/kontext-bench

Find more here: merve/releases-june-27-6864e8eb17f7e3a8b444083c
m-ric
posted an update 3 days ago
If you're using any HF libraries, you should enable the Hub MCP in your agentic coding tool!

The brand-new Docs Semantic Search tool is an intravenous caffeine supply for Cursor: it lets you fix API errors in a few seconds. gj @mishig ⚡️⚡️

👉 To enable the Hub MCP, head to your account settings, under MCP, and it will give you everything you need!
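
If your editor reads MCP servers from a JSON config (Cursor uses .cursor/mcp.json), the setup might look roughly like this sketch; the endpoint URL and auth header are my assumptions, so prefer whatever your settings page actually shows:

```python
# a minimal sketch, assuming Cursor-style .cursor/mcp.json and an MCP
# endpoint at https://huggingface.co/mcp; trust your settings page over this
import json
import pathlib

config = {
    "mcpServers": {
        "hf-mcp-server": {
            "url": "https://huggingface.co/mcp",  # assumed Hub MCP endpoint
            "headers": {"Authorization": "Bearer <YOUR_HF_TOKEN>"},
        }
    }
}
path = pathlib.Path(".cursor/mcp.json")
path.parent.mkdir(exist_ok=True)
path.write_text(json.dumps(config, indent=2))
```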
merve
posted an update 4 days ago
merve
posted an update 6 days ago
merve
posted an update 10 days ago
Dataset Viewer for PDFs just landed on Hugging Face 📖🤗 you can now preview PDFs more easily than before!

on top of this, there's the PdfFolder format to load PDF datasets quicker 💨
> to use it, your dataset should follow a directory structure like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels, etc., you can keep them in a metadata.csv file in the same folder 🤝
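
loading such a folder could then look like the sketch below; the "pdffolder" builder name and the file_name convention in metadata.csv are assumptions by analogy with ImageFolder, so check the docs linked below:

```python
# a minimal sketch, assuming PdfFolder mirrors the ImageFolder builder
# ("pdffolder" builder name, file_name column in metadata.csv)
from datasets import load_dataset

# layout: folder/train/doc1.pdf, folder/train/doc2.pdf, folder/train/metadata.csv
ds = load_dataset("pdffolder", data_dir="folder")
print(ds["train"][0])  # a PDF document plus any metadata.csv columns
```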

read the document dataset docs: https://huggingface.co/docs/datasets/main/en/document_dataset
check out all the document datasets here: https://huggingface.co/datasets?modality=modality:document&sort=trending 📖
merve
posted an update 12 days ago
we've merged the LightGlue keypoint matcher into Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector 🙏🏻

it works very well, try it yourself: ETH-CVG/LightGlue

here's an in-the-wild test with two images of the same place ⤵️
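
to run a similar test locally, here's a minimal sketch; the checkpoint name and post-processing helper are assumptions based on transformers' existing keypoint-matching API, so check the model card:

```python
# a minimal sketch, assuming the ETH-CVG/lightglue_superpoint checkpoint
# and transformers' keypoint-matching post-processing helper
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("ETH-CVG/lightglue_superpoint")
model = AutoModel.from_pretrained("ETH-CVG/lightglue_superpoint")

image1, image2 = Image.open("view1.jpg"), Image.open("view2.jpg")
inputs = processor([[image1, image2]], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# map matches back to original pixel coordinates
sizes = [[(image1.height, image1.width), (image2.height, image2.width)]]
matches = processor.post_process_keypoint_matching(outputs, sizes, threshold=0.2)
print(matches[0]["keypoints0"][:5], matches[0]["keypoints1"][:5])
```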
merve
posted an update 12 days ago
Release picks of the past week are here! Find more models, datasets, and Spaces here: merve/june-20-releases-68594824d1f4dfa61aee3433

🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM with 3B active params, smarter with fewer tokens, and supports long documents and videos 👍 (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-param OCR model based on Qwen2.5VL-3B-Instruct (OS)

💬 LLMs
> moonshotai/Kimi-Dev-72B is a strong coding model based on Qwen2.5-72B (OS)
> Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506, an update to their former model with better function calling & instruction following (OS)

🗣️ Audio
> Google released google/magenta-realtime for real-time music generation & audio synthesis (CC-BY-4.0)
> kyutai released new speech-to-text models in 1B & 2B sizes (kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay

3D
> Tencent released tencent/Hunyuan3D-2.1, an image-to-3D model (see below)
merve
posted an update 14 days ago
merve
posted an update 16 days ago
merve
posted an update 16 days ago
stop using VLMs blindly ✋🏻

compare different VLM outputs on a huge variety of inputs (from reasoning to OCR!) 🔥 visionLMsftw/comparevlms

> supports multiple VLMs: google/gemma-3-27b-it, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen2.5-VL-32B-Instruct, meta-llama/Llama-4-Maverick-17B-128E-Instruct, HuggingFaceTB/SmolVLM2-2.2B-Instruct
> recommend new models or inputs and we'll add them 🫡

so far I've figured out:
> for fact-checking, you need a relatively bigger size (7B is OK!)
> Gemma 3 degrades without pan-and-scan (especially for 📑)
> Qwen2.5VL-32B is very talkative; great for reasoning but not good for simple tasks 🗣️
multimodalart
posted an update 17 days ago
Self-Forcing, a real-time distilled video model from Wan 2.1 by @adobe, is out, and they open-sourced it

I've built a live real-time demo on Spaces 📹💨

multimodalart/self-forcing
merve
posted an update 17 days ago
Releases of the past week are here: merve/releases-june-13-6852c3c1eaf1e0c24c958860

Here are our picks 🤓
So many interesting models were released in open AI this past week! 🤖

🖼️ Computer Vision/VLMs
> nanonets/Nanonets-OCR-s is the new state-of-the-art OCR model that can handle checkboxes, watermarks, and tables (OS)
> Meta released facebook/v-jepa-2-6841bad8413014e185b497a6, new SOTA video embeddings with two new classification models (OS)
> ByteDance-Seed/SeedVR2-3B is a new 3B video restoration model (OS)

Audio
> Stepfun released stepfun-ai/Step-Audio-AQAA, a new large (137B 🤯) audio language model that takes in audio and generates audio (OS)

🤖 Robotics
> NVIDIA released nvidia/GR00T-N1.5-3B, a new open foundation vision-language-action model

3D
> tencent/Hunyuan3D-2.1 is the new version of Hunyuan by Tencent that can generate 3D assets from text and image prompts
merve
posted an update 18 days ago
IN: video fine-tuning support for facebook V-JEPA 2 in HF transformers 🔥

it comes with:
> four models fine-tuned on the Diving48 and SSv2 datasets facebook/v-jepa-2-6841bad8413014e185b497a6
> a FastRTC demo of V-JEPA 2 on SSv2 qubvel-hf/vjepa2-streaming-video-classification
> a fine-tuning script on UCF-101 https://gist.github.com/ariG23498/28bccc737c11d1692f6d0ad2a0d7cddb
> a fine-tuning notebook on UCF-101 https://colab.research.google.com/drive/16NWUReXTJBRhsN3umqznX4yoZt2I7VGc?usp=sharing
we're looking forward to seeing what you will build! 🤗
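
for quick inference with one of the fine-tuned checkpoints, something like this sketch should work; the auto-class and checkpoint names are assumptions on my part, so double-check the collection above:

```python
# a minimal sketch, assuming an SSv2 fine-tuned checkpoint id and the video
# auto-classes from the transformers release; verify names on the model cards
import torch
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

ckpt = "facebook/vjepa2-vitl-fpc16-256-ssv2"  # assumed checkpoint id
processor = AutoVideoProcessor.from_pretrained(ckpt)
model = AutoModelForVideoClassification.from_pretrained(ckpt)

video = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)  # T, C, H, W
inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```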
merve
posted an update 19 days ago
#CVPR2025 Paper Picks #1
VisionZip is a compression technique that reduces the number of visual tokens to improve performance AND prefill time for vision language models
demo: Senqiao/VisionZip
paper: VisionZip: Longer is Better but Not Necessary in Vision Language Models (2412.04467)
most of the image tokens are redundant for the LLM, so the authors ask "are all visual tokens necessary?"

the method is simple:
keep the tokens with the highest attention scores (dominant tokens), merge the rest based on similarity (contextual tokens), then concatenate the two sets
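
here's a toy sketch of that recipe; the token counts and the merging rule (nearest-center averaging) are my simplifications, not the paper's exact method:

```python
# a toy sketch of dominant-token selection + similarity-based merging
import torch
import torch.nn.functional as F

def visionzip(tokens, attn, n_dominant=54, n_context=10):
    # tokens: (N, D) visual tokens; attn: (N,) attention each token receives
    dom_idx = attn.topk(n_dominant).indices               # dominant tokens
    mask = torch.ones(tokens.size(0), dtype=torch.bool)
    mask[dom_idx] = False
    rest = tokens[mask]                                   # redundant tokens
    centers = rest[:n_context]                            # crude merge targets
    sim = F.normalize(rest, dim=-1) @ F.normalize(centers, dim=-1).T
    assign = sim.argmax(-1)                               # nearest center
    context = torch.stack([
        rest[assign == c].mean(0) if (assign == c).any() else centers[c]
        for c in range(n_context)
    ])                                                    # contextual tokens
    return torch.cat([tokens[dom_idx], context], dim=0)

pruned = visionzip(torch.randn(576, 1024), torch.rand(576))
print(pruned.shape)  # torch.Size([64, 1024]) -- 576 tokens down to 64
```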

the method comes in both training-free and fine-tuning variants
the authors report a 5-point improvement on average across vision language tasks + an 8x improvement in prefilling time for LLaVA-NeXT 7B and 13B 🤯

removing redundant tokens improves image token quality too 🥹
merve
posted an update 20 days ago
stop writing CUDA kernels yourself

we have launched Kernel Hub: easy optimized kernels for all models on Hugging Face 🔥 use them right away!
it's where the community publishes optimized kernels 🤝

this release comes in three parts:
> Kernel Hub: contains (as of now) 14 kernels
> kernels: a Python library to load kernels from the Kernel Hub
> kernel-builder: a Nix package to build kernels for PyTorch (made using the PyTorch C++ frontend)

when building models, your regular workflow should be pulling kernels from the Hub and building your model with them 🤗
here's a practical example with RMSNorm:
1. pull the kernel from the Hub with get_kernel
2. decorate your layer with use_kernel_forward_from_hub
3. inject it into your model
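
put together, a minimal sketch of those three steps might look like this; the kernel repo and function names follow the kernels-community examples, so verify them against the blog linked below:

```python
# a minimal sketch of the three steps above; needs a CUDA GPU and the
# kernels library, and assumes the kernels-community/activation repo layout
import torch
from torch import nn
from kernels import get_kernel, use_kernel_forward_from_hub

# 1. pull a kernel from the Hub
activation = get_kernel("kernels-community/activation")
x = torch.randn(8, 16, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)
activation.gelu_fast(out, x)          # call the optimized kernel directly

# 2. decorate a layer so its forward can be swapped for a Hub kernel
@use_kernel_forward_from_hub("RMSNorm")
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.variance_epsilon = eps

    def forward(self, hidden):        # plain PyTorch fallback implementation
        var = hidden.pow(2).mean(-1, keepdim=True)
        return self.weight * hidden * torch.rsqrt(var + self.variance_epsilon)

# 3. inject: build your model with this RMSNorm; on supported hardware the
# kernelized forward is used in place of the fallback above
```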
we'd love to hear your feedback! 🙏🏻
we also welcome kernel contributions from the community 🥹💗

- request kernels here: kernels-community/README#1
- check out this org: kernels-community
- read the blog: https://huggingface.co/blog/hello-hf-kernels
merve
posted an update 23 days ago
Dolphin: a new OCR model by ByteDance with an MIT license 🐬

the model first detects elements in the layout (tables, formulas, etc.) and then parses each element in parallel during generation
Model: ByteDance/Dolphin
Try the demo: ByteDance/Dolphin
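
the two-stage flow is easy to sketch; here's a hypothetical orchestration (detect_layout and parse_element are stand-ins for Dolphin's actual model calls, not its real API):

```python
# a hypothetical sketch of detect-then-parse-in-parallel; the two helper
# functions are stand-ins for Dolphin's real two-stage model calls
from concurrent.futures import ThreadPoolExecutor

def detect_layout(page):
    # stage 1: one model call returning layout elements in reading order
    return [{"type": "table", "crop": page}, {"type": "formula", "crop": page}]

def parse_element(element):
    # stage 2: one model call per element (e.g. table -> markdown)
    return f"parsed {element['type']}"

def parse_document(page):
    elements = detect_layout(page)
    with ThreadPoolExecutor() as pool:   # elements are parsed in parallel
        return list(pool.map(parse_element, elements))

print(parse_document("page-1.png"))
```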
merve
posted an update 24 days ago
stop building parser pipelines 👋🏻
there's a new document parser that is small, fast, Apache 2.0 licensed, and better than all the others! 😱

echo840/MonkeyOCR is a 3B model that can parse everything (charts, formulas, tables, etc.) in a document 🤠
> the authors show in the paper that document parsing pipelines often suffer from errors propagating across stages
> single end-to-end models do better, but they're too heavy to use

this model addresses both: it's lighter, faster, and stronger 🔥