Recent Activity

Molbap
posted an update 1 day ago
🚀 New blog: Maintain the unmaintainable – 1M+ Python LOC, 400+ models

How do you stop a million-line library built by thousands of contributors from collapsing under its own weight?
At 🤗 Transformers, we do it with explicit software-engineering tenets: principles that make the codebase hackable at scale.

🔍 Inside the post:
– One Model, One File: readability first; you can still open a modeling file and see the full logic, top to bottom.
– Modular Transformers: visible inheritance that cuts maintenance cost by ~15× while keeping models readable.
– Config-Driven Performance: FlashAttention, tensor parallelism, and attention scheduling are config-level features, not rewrites.

Written with @lysandre, @pcuenq and @yonigozlan, this is a deep dive into how Transformers stays fast, open, and maintainable.

Read it here → transformers-community/Transformers-tenets
merve
posted an update 15 days ago
large AI labs open-sourced a ton of models last week 🔥
here are a few picks, find even more here: merve/sep-16-releases-68d13ea4c547f02f95842f05 🤝
> IBM released a new Docling model with 258M params based on Granite (Apache 2.0) 📝 ibm-granite/granite-docling-258M
> Xiaomi released a 7B audio LM with base and instruct variants (MIT) XiaomiMiMo/mimo-audio-68cc7202692c27dae881cce0
> DecartAI released Lucy Edit, open Nano Banana 🍌 (NC) decart-ai/Lucy-Edit-Dev
> OpenGVLab released a family of agentic computer use models (3B/7B/32B) with the dataset 💻 OpenGVLab/scalecua-68c912cf56f7ff4c8e034003
> Meituan Longcat released a thinking version of LongCat-Flash 💭 meituan-longcat/LongCat-Flash-Thinking
merve
posted an update 20 days ago
IBM just released a small Swiss Army knife for document models: granite-docling-258M on Hugging Face 🔥

> not only a document converter, it can also do document question answering and understand multiple languages 🤯
> best part: released with Apache 2.0 license 👏 use it in your commercial projects!
> it supports transformers, vLLM and MLX from the get-go! 🤗
> built on SigLIP2 & granite-165M

model: ibm-granite/granite-docling-258M
demo: ibm-granite/granite-docling-258m-demo 💗
lysandre
posted an update 22 days ago
We're kick-starting the process of Transformers v5, with @ArthurZ and @cyrilvallez!

v5 should be significant: we're using it as a milestone for performance optimizations, saner defaults, and a much cleaner code base worthy of 2025.

Fun fact: v4.0.0-rc-1 came out on Nov 19, 2020, nearly five years ago!
merve
posted an update 22 days ago
a ton of image/video generation models and LLMs from big labs 🔥

> Meta released facebook/mobilellm-r1-68c4597b104fac45f28f448e, smol LLMs for on-device use 💬
> Tencent released tencent/SRPO, a high-res image generation model, and tencent/POINTS-Reader, a cutting-edge OCR model 📝
> ByteDance released bytedance-research/HuMo, video generation from any input ⏯️

find more models, datasets, demos here: merve/sep-11-releases-68c7dbfa26bea8cd921fa0ac
merve
posted an update 26 days ago
fan-favorite vision LM Florence-2 is now officially supported in transformers 🤗

find all the models in the florence-community org 🫡
merve
posted an update 28 days ago
merve
posted an update 29 days ago
merve
posted an update about 1 month ago
large AI labs dropped so many open models last week 🔥 don't miss out on them

→ Apple released on-device vision LMs apple/fastvlm-68ac97b9cd5cacefdd04872e & apple/mobileclip2-68ac947dcb035c54bcd20c47
→ OpenGVLab released InternVL3.5, 32 new vision LMs with one based on gpt-oss! (OS) OpenGVLab/internvl35-68ac87bd52ebe953485927fb
→ MSFT released a killer small TTS model (OS) microsoft/VibeVoice-1.5B

find more here: https://huggingface.co/collections/merve/august-29-releases-68b5a3754cfb8abf59e2b486
merve
posted an update about 1 month ago
first vision language model built off openai/gpt-oss-20b just dropped! 🔥

InternVL3.5 comes with 32 models 🤯 pre-trained, fine-tuned, aligned in various sizes OpenGVLab/internvl35-68ac87bd52ebe953485927fb
comes with gpt-oss or Qwen3 for the LLM part
merve
posted an update 2 months ago
GPT-4.1-mini level model right on your iPhone 🤯

openbmb/MiniCPM-V-4 is only 4B while surpassing GPT-4.1-mini in vision benchmarks 🔥

allows commercial use as well!
merve
posted an update 2 months ago
we're all sleeping on this OCR model rednote-hilab/dots.ocr 🔥

dots.ocr is a new 3B model with SOTA performance, support for 100 languages, and a license that allows commercial use! 🤯

single e2e model to extract images and convert tables, formulas, and more into markdown 📝
try it: MohamedRashad/Dots-OCR
merve
posted an update 2 months ago
massive releases and tons of FLUX.1 Krea LoRAs this past week!
here are some of the picks, find more models in the collection 🫡 merve/releases-august-2-6890c14248203522b7d0267f

LLMs 💬
> Tencent dropped tencent/Hunyuan-7B-Instruct
> Qwen released Qwen/Qwen3-Coder-30B-A3B-Instruct, 30B MoE with 3B active params for coding (OS)

vision/multimodal
> RedNote released rednote-hilab/dots.ocr - 3B OCR model (OS)
> Cohere released CohereLabs/command-a-vision-07-2025 - 112B (dense!) VLM for 6 languages
> StepFun-AI shipped stepfun-ai/step3 - 321B MoE VLM (OS)
> Skywork shipped Skywork/Skywork-UniPic-1.5B - new any-to-any model (image+text → image+text) (OS)
merve
posted an update 2 months ago
merve
posted an update 2 months ago
past week in open AI was insane 🔥 here are some of the picks, find more here: merve/releases-july-25-688768ca47fe3693407e02d1

💬 LLMs & VLMs
> Qwen/Qwen3-235B-A22B-Thinking-2507 had a new update (OS)
> Qwen/Qwen3-Coder-480B-A35B-Instruct is out with 480B total, 35B active params 🤯 (OS)
> AllenAI dropped an update to allenai/olmOCR-7B-0725 📝
> InternLM released internlm/Intern-S1 - 235B Qwen3 MoE + 6B InternViT encoder (OS)
> OmniSVG/OmniSVG is a new SVG generation VLM (OS)

🖼️ image/video/3D generation
> WanAI released Wan2.2 series - both T2V and I2V 14B models for high-quality video generation (OS) multimodalart/wan-22-688767e313337b434ed55112
> Tencent dropped tencent/HunyuanWorld-1 - image-to-3D scene generation model
merve
posted an update 2 months ago
🤯 241B VLM with apache-2.0 license internlm/Intern-S1

internlm released Intern-S1: multimodal reasoning model based on 235B MoE Qwen3 and 6B InternViT 😍

benchmarks look great (👑 best model ✅ best open model)
merve
posted an update 3 months ago
so many open LLMs and image LoRAs dropped this past week, here are some picks for you 🫡 merve/releases-july-18-687e3fbd2ab9b39c51f9238b

LLMs
> ByteDance released a bunch of translation models called Seed-X-RM (7B) ByteDance-Seed/Seed-X-RM-7B
> NVIDIA released reasoning models, of which the 32B surpasses the giant Qwen3-235B, with cc-by-4.0 license 👏 nvidia/openreasoning-nemotron-687730dae0170059860f1f01
> LG released a new EXAONE model (32B) LGAI-EXAONE/EXAONE-4.0-32B

VLMs/any-to-any
> vidore/colqwen-omni-v0.1 is a new any-to-any retriever (MIT)
> HiDream-ai/HiDream-E1-1 is an image+text in, image+text out model (MIT)

LoRAs
> There's a bunch of LoRAs based on Flux Kontext, gotta check out the collection 🤠
merve
posted an update 3 months ago
merve
posted an update 3 months ago
merve
posted an update 3 months ago
Fine-tune Gemma3n on videos with audio on a Colab A100 🔥
Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!

keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to be able to train with <40GB VRAM

stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision