AI & ML interests

None defined yet.

Recent Activity

sergiopaniego posted an update about 17 hours ago
Test SmolLM3, the newest fully open model released by @HuggingFaceTB!

It's smol (3B), multilingual (6 languages), comes with dual-mode reasoning (think/no_think) and supports long context (128k).

Try it now in the notebook below!! ⬇️

Colab notebook: https://colab.research.google.com/github/sergiopaniego/samples/blob/main/smollm3_3b_inference.ipynb
GitHub notebook: https://github.com/sergiopaniego/samples/blob/main/smollm3_3b_inference.ipynb
blog: https://huggingface.co/blog/smollm3
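For a quick local test before opening the notebook, a minimal sketch like this should work (the `enable_thinking` chat-template flag is an assumption based on the model card; prepending /no_think to the system prompt is the documented fallback):

```python
# Minimal inference sketch, assuming a recent transformers release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain gravity to a 9-year-old."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,  # toggles the dual think/no_think mode (assumed flag)
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```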
merve posted an update about 22 hours ago
GitHub has been refusing to render notebooks for a long time now 💔

so smol-vision now lives in a Hugging Face model repository 🤗 merve/smol-vision
merve posted an update 2 days ago
ByteDance released Tar 1.5B and 7B: image-text in, image-text out models, fully open-source 👏 ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260

They have an image tokenizer unified with text, and they de-tokenize using either of two models (an LLM or a diffusion model)
The model is actually a full LLM (Qwen2); the tokenizer converts the image tokens 🤯
merve posted an update 3 days ago
Huge drops in open AI over the past week!
Find more models, datasets, and demos here merve/releases-july-4-686bcc54ed7c45c341fbf654
Some of our picks 🫡
⏯️ BAAI/MTVCraft is a new Veo3-like text-to-video model, demo is here BAAI/MTVCraft
🧑🏻‍💻 apple/diffucoder-6868139f56672ae046fe04e8 is a new family of diffusion LLMs (7B base and instruct) for coding
🗣️ kyutai/tts-1.6b-en_fr is a new small TTS model for English and French
👀 aharley/alltracker is a new pixel tracking model by Stanford, demo is here aharley/alltracker
📖 racineai/OGC_MEGA_MultiDomain_DocRetrieval is a new large visual document retrieval dataset
andito posted an update 7 days ago
πŸ§ πŸ‘οΈ Can AI visualize solutions?

Humans often solve visual problems by sketching ideas in their minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal “mental sketches”?

That's the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.

These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.

🔧 Mirage is trained in two phases:

1) Grounding: It learns to produce latent tokens anchored in real images.
2) Refinement: The model drops the images and learns to generate visual tokens on its own.

📈 And yes, it works!
On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines.
Smart sketches > empty words.

By mimicking the way humans visualize solutions, Mirage gives AI a new kind of imagination, one that's faster, more efficient, and more human-like.
Kudos to the teams at UMass Amherst and MIT behind this exciting work.
Check the paper: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (2506.17218)
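In spirit, the mechanism can be pictured with a toy sketch like the one below (illustrative only, not the authors' code; every name and shape here is invented): learned latent vectors are appended to the text embedding stream, and the model reasons over both. In the grounding phase those latents would be supervised against encodings of real helper images; in the refinement phase that supervision is dropped.

```python
# Toy sketch of reasoning over interleaved text + latent "sketch" tokens.
# Illustrative only; this is not the Mirage implementation.
import torch
import torch.nn as nn

class ToyLatentReasoner(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_latent=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # learned queries that play the role of the internal "mental sketch"
        self.latent_queries = nn.Parameter(torch.randn(n_latent, d_model))
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids):
        text_emb = self.embed(text_ids)                        # (B, T, D)
        latents = self.latent_queries.expand(text_ids.size(0), -1, -1)
        # reason over text tokens and latent visual tokens jointly
        hidden = self.backbone(torch.cat([text_emb, latents], dim=1))
        return self.lm_head(hidden[:, -1])                     # next-token logits

logits = ToyLatentReasoner()(torch.randint(0, 32000, (2, 16)))
print(logits.shape)  # torch.Size([2, 32000])
```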
sergiopaniego posted an update 7 days ago
Updated my HF Space for vibe testing smol VLMs on object detection, visual grounding, keypoint detection & counting! 👓

🆕 Compare Qwen2.5 VL 3B vs Moondream 2B side by side with annotated images & text outputs.

Try the examples or test your own images! 🏃

📱 Space: sergiopaniego/vlm_object_understanding
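If you'd rather poke at the Space from Python, gradio_client works with any public Space; the endpoint names aren't documented in the post, so list them first (a sketch, not the Space's official API):

```python
# Sketch: query the Space programmatically. Endpoint names and signatures
# are unknown here, so inspect them with view_api() before calling predict().
from gradio_client import Client

client = Client("sergiopaniego/vlm_object_understanding")
client.view_api()  # prints the Space's callable endpoints and parameters
```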
merve posted an update 8 days ago
SOOOO MANY MODEL RELEASES 😍
Here are some picks from the past week 🤗

> ByteDance/XVerse is a new identity-preserving image generation model 🖼️
> google/gemma-3n-E4B-it, an any-to-text model supported by transformers 🤗
> nvidia/llama-nemoretriever-colembed-3b-v1, two new state-of-the-art visual document retrievers 📑
> A new version of the Dia TTS model is up nari-labs/Dia-1.6B-0626
> Black Forest Labs releases the Kontext benchmark black-forest-labs/kontext-bench

Find more here merve/releases-june-27-6864e8eb17f7e3a8b444083c
sergiopaniego posted an update 10 days ago
📣 CALL FOR CONTRIBUTORS! 📣

Following last week's full release of Gemma 3n, we launched a dedicated recipes repo to explore and share use cases. We've already added some! 🧑‍🍳

Now we’re inviting the community to contribute and showcase how these models shine! ✨

Let them cook.

Check it out: https://github.com/huggingface/huggingface-gemma-recipes/issues/4
merve posted an update 14 days ago
Dataset Viewer for PDFs just landed on Hugging Face 📖🤗 you can now preview PDFs more easily than before!

on top of this, there's the PdfFolder format to load PDF datasets quicker 💨
> to use it, your dataset should follow a directory format like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels etc., you can keep them in a metadata.csv file in the same folder 🤝

read the document dataset docs https://huggingface.co/docs/datasets/main/en/document_dataset
check all the document datasets here https://huggingface.co/datasets?modality=modality:document&sort=trending 📖
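Loading such a folder is then a one-liner; a minimal sketch, assuming the pdffolder builder described in the docs linked above:

```python
# Minimal sketch: load a PDF dataset laid out as folder/train/doc1.pdf, ...
# Assumes a recent `datasets` release with the PdfFolder builder.
from datasets import load_dataset

ds = load_dataset("pdffolder", data_dir="folder")
print(ds["train"][0])  # a pdf column plus any metadata.csv columns
```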
merve posted an update 16 days ago
we've merged the LightGlue keypoint matcher into Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector 🙏🏻

it works very well, try it yourself: ETH-CVG/LightGlue

here's an in-the-wild test with two images of the same place ⬇️
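A rough usage sketch, modeled on transformers' earlier SuperGlue keypoint-matching integration; the checkpoint id and pairing convention below are assumptions, so check the LightGlue model docs for the exact names:

```python
# Sketch: match keypoints between two images with LightGlue in transformers.
# Checkpoint id and input convention are assumed from the SuperGlue API.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "ETH-CVG/lightglue_superpoint"  # assumed checkpoint id
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

image1 = Image.open("view_a.jpg")
image2 = Image.open("view_b.jpg")
inputs = processor([image1, image2], return_tensors="pt")  # one image pair
with torch.no_grad():
    outputs = model(**inputs)

# Matched keypoint indices and scores live in `outputs`; the processor's
# post-processing helper maps them back to pixel coordinates.
```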
merve posted an update 17 days ago
Release picks of the past week are here! Find more models, datasets, and Spaces here merve/june-20-releases-68594824d1f4dfa61aee3433

πŸ–ΌοΈ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM, 3B active params, smarter with less tokens, supports long documents, videos πŸ‘ (OS)
> nanonets/Nanonets-OCR-s is 3.75B params OCR model based on Qwen2.5VL-3B-Instruct (OS)

💬 LLMs
> moonshotai/Kimi-Dev-72B is a strong coding model based on Qwen2.5-72B (OS)
> Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506, an update to their former model with better function calling & instruction following (OS)

πŸ—£οΈ Audio
> Google released google/magenta-realtime, real time music generation & audio synthesis (cc-by-4)
> kyutai released new speech-to-text models that come in 1B & 2B ( kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay

3D
> Tencent released tencent/Hunyuan3D-2.1, an image-to-3D model (see below)
merve posted an update 21 days ago
stop using VLMs blindly ✋🏻

compare different VLM outputs on a huge variety of inputs (from reasoning to OCR!) 🔥 visionLMsftw/comparevlms

> has support for multiple VLMs: google/gemma-3-27b-it, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen2.5-VL-32B-Instruct, meta-llama/Llama-4-Maverick-17B-128E-Instruct, HuggingFaceTB/SmolVLM2-2.2B-Instruct
> recommend us new models or inputs and we'll add them 🫡

so far I've figured out:
> for fact-checks, you need a relatively bigger model (7B is ok!)
> Gemma 3 degrades without pan-and-scan (especially for 📑)
> Qwen2.5VL-32B is very talkative, great for reasoning but not good for simple tasks 🗣️
merve posted an update 22 days ago
Releases of the past week are here merve/releases-june-13-6852c3c1eaf1e0c24c958860

Here are our picks 🤓
So many interesting models released in open AI this past week! 🤖

πŸ–ΌοΈ Computer Vision/VLMs
> nanonets/Nanonets-OCR-s is the new state-of-the-art OCR model that can handle checkboxes, watermarks, tables (OS)
> Meta released facebook/v-jepa-2-6841bad8413014e185b497a6, new sota video embeddings with two new classification models (OS)
> ByteDance-Seed/SeedVR2-3B is a new 3B video restoration model (OS)

Audio
> Stepfun released stepfun-ai/Step-Audio-AQAA, a new large (137B 🤯) audio language model that takes in audio and generates audio (OS)

🤖 Robotics
> nvidia released nvidia/GR00T-N1.5-3B, a new open foundation vision-language-action model

3D
> tencent/Hunyuan3D-2.1 is the new version of Hunyuan by Tencent that can generate 3D assets from text and image prompts