Hugging Face Fellows

non-profit

AI & ML interests

The Fellowship is a network of exceptional people from different backgrounds who contribute to open-source machine learning πŸ§™β€β™‚οΈπŸ¦Έβ€β™€οΈπŸ¦ΉπŸ§β€β™‚οΈ

Recent Activity

hugging-fellows's activity

merveΒ 
posted an update about 19 hours ago
view post
Post
690
Everything that happened this week in open AI, a recap 🀠 merve/jan-17-releases-678a673a9de4a4675f215bf5

πŸ‘€ Multimodal
- MiniCPM-o 2.6 is a new sota any-to-any model by OpenBMB
(vision, speech and text!)
- VideoChat-Flash-Qwen2.5-2B is new video multimodal models by OpenGVLab that come in sizes 2B & 7B in resolutions 224 & 448
- ByteDance released larger SA2VA that comes in 26B parameters
- Dataset: VRC-Bench is a new diverse benchmark for multimodal LLM reasoning performance

πŸ’¬ LLMs
- MiniMax-Text-01 is a new huge language model (456B passive 45.9B active params) by MiniMaxAI with context length of 4M tokens 🀯
- Dataset: Sky-T1-data-17k is a diverse dataset used to train Sky-T1-32B
- kyutai released Helium-1-Preview-2B is a new small multilingual LM
- Wayfarer-12B is a new LLM able to write D&D πŸ§™πŸ»β€β™‚οΈ
- ReaderLM-v2 is a new HTML parsing model by Jina AI

- Dria released, Dria-Agent-a-3B, new agentic coding model (Pythonic function calling) based on Qwen2.5 Coder
- Unsloth released Phi-4, faster and memory efficient Llama 3.3

πŸ–ΌοΈ Vision
- MatchAnything is a new foundation model for matching
- FitDit is a high-fidelity VTON model based on DiT architecture

πŸ—£οΈ Audio
- OuteTTS-0.3-1B is a new multilingual text-to-speech model with voice cloning and emotion control capabilities

πŸ“– Retrieval
- lightblue released a new reranker based on Qwen2.5 LB-reranker-0.5B-v1.0 that can handle 95+ languages
- cde-small-v2 is a new sota small retrieval model by
@jxm
not-lainΒ 
posted an update about 20 hours ago
view post
Post
231
we now have more than 2000 public AI models using ModelHubMixinπŸ€—
merveΒ 
posted an update 1 day ago
ariG23498Β 
posted an update 2 days ago
tomaarsenΒ 
posted an update 3 days ago
view post
Post
4090
🏎️ Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models: training scripts, datasets, metrics.

We apply our recipe to train 2 Static Embedding models that we release today! We release:
2️⃣ an English Retrieval model and a general-purpose Multilingual similarity model (e.g. classification, clustering, etc.), both Apache 2.0
🧠 my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
πŸ“œ my training scripts, using the Sentence Transformers library
πŸ“Š my Weights & Biases reports with losses & metrics
πŸ“• my list of 30 training and 13 evaluation datasets

The 2 Static Embedding models have the following properties:
🏎️ Extremely fast, e.g. 107500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
0️⃣ Zero active parameters: No Transformer blocks, no attention, not even a matrix multiplication. Super speed!
πŸ“ No maximum sequence length! Embed texts at any length (note: longer texts may embed worse)
πŸ“ Linear instead of exponential complexity: 2x longer text takes 2x longer, instead of 2.5x or more.
πŸͺ† Matryoshka support: allow you to truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)

Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://huggingface.co/blog/static-embeddings

The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.

Alternatively, check out the models:
* sentence-transformers/static-retrieval-mrl-en-v1
* sentence-transformers/static-similarity-mrl-multilingual-v1
  • 1 reply
Β·
merveΒ 
posted an update 5 days ago
view post
Post
3779
there's a new multimodal retrieval model in town 🀠
LlamaIndex released vdr-2b-multi-v1
> uses 70% less image tokens, yet outperforming other dse-qwen2 based models
> 3x faster inference with less VRAM πŸ’¨
> shrinkable with matryoshka πŸͺ†
> can do cross-lingual retrieval!
Collection: llamaindex/visual-document-retrieval-678151d19d2758f78ce910e1 (with models and datasets)
Demo: llamaindex/multimodal_vdr_demo
Learn more from their blog post here https://huggingface.co/blog/vdr-2b-multilingual πŸ“–
not-lainΒ 
posted an update 6 days ago
view post
Post
3633
Published a new blogpost πŸ“–
In this blogpost I have gone through the transformers' architecture emphasizing how shapes propagate throughout each layer.
πŸ”— https://huggingface.co/blog/not-lain/tensor-dims
some interesting takeaways :
merveΒ 
posted an update 8 days ago
view post
Post
3540
What a beginning to this year in open ML 🀠
Let's unwrap! merve/jan-10-releases-677fe34177759de0edfc9714

Multimodal πŸ–ΌοΈ
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released multimodal textbook β€” 22k hours worth of samples from instruction videos 🀯
> Dataset: SciCap captioning on scientific documents benchmark dataset is released along with the challenge!

LLMs πŸ’¬
> Microsoft released Phi-4, sota open-source 14B language model πŸ”₯
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B 🐬🐬
> Prime-RL released Eurus-2-7B-PRIME a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Owen2.5-3B-Instruct πŸ’­
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview πŸ“•
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of code instruction-code pairs πŸ“•
> Dataset: Qwen team is on the roll, they just released CodeElo, a dataset of code preferences πŸ‘©πŸ»β€πŸ’»

Embeddings πŸ”–
> @MoritzLaurer released zero-shot version of ModernBERT large πŸ‘
> KaLM is a new family of performant multilingual embedding models with MIT license built using Qwen2-0.5B

Image/Video Generation ⏯️
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts πŸ”₯
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m Cosmos-tokenized OpenVid-1M with samples from OpenVid-1M

Others
> Prior Labs released TabPFNv2, the best tabular transformer is out for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
merveΒ 
posted an update 9 days ago
view post
Post
1746
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license πŸ’— ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093

> The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation) both for images and videos ⏯️

> The models come in 1B, 4B and 8B and are based on InternVL2.5 for base architecture and Qwen2, Qwen2.5 and InternLM2 for language model part (depending on the checkpoint)

> The model is very interesting, it has different encoders for different modalities each (visual prompt, text prompt, image and video) then it concatenates these to feed into LLM πŸ’¬

the output segmentation tokens are passed to SAM2, to sort of match text (captions or semantic classes) to masks ‡️

> Their annotation pipeline is also interesting, they seems to use two open large vision LMs to refine the annotations, and have different levels of descriptions to provide consistency.
  • 1 reply
Β·
clemΒ 
posted an update 15 days ago
view post
Post
3997
Cool to see @ylecun joining the top 10 of most followed on HF!

(and leaderboard by @mvaloatto is here: mvaloatto/TCTF)
  • 2 replies
Β·
tomaarsenΒ 
posted an update 18 days ago
view post
Post
2818
That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!

Details:
πŸ€– Based on ModernBERT-base with 149M parameters.
πŸ“Š Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
🏎️ Immediate FA2 and unpacking support for super efficient inference.
πŸͺ† Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256.
➑️ Maximum sequence length of 8192 tokens!
2️⃣ Trained in 2 stages: unsupervised contrastive data -> high quality labeled datasets.
βž• Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
πŸ›οΈ Apache 2.0 licensed: fully commercially permissible

Try it out here: nomic-ai/modernbert-embed-base

Very nice work by Zach Nussbaum and colleagues at Nomic AI.
merveΒ 
posted an update 18 days ago
view post
Post
4782
supercharge your LLM apps with smolagents πŸ”₯

however cool your LLM is, without being agentic it can only go so far

enter smolagents: a new agent library by Hugging Face to make the LLM write code, do analysis and automate boring stuff!

Here's our blog for you to get started https://huggingface.co/blog/smolagents
merveΒ 
posted an update 25 days ago
merveΒ 
posted an update about 1 month ago
view post
Post
2798
Aya by Cohere For AI can now see! πŸ‘€

C4AI community has built Maya 8B, a new open-source multilingual VLM built on SigLIP and Aya 8B 🌱 works on 8 languages! πŸ—£οΈ

The authors extend Llava dataset using Aya's translation capabilities with 558k examples!
ry it here kkr5155/maya_demo

Dataset maya-multimodal/pretrain

Model maya-multimodal/maya πŸ‘
kudos @nahidalam and team
  • 1 reply
Β·
clemΒ 
posted an update about 1 month ago
view post
Post
1886
Coming back to Paris Friday to open our new Hugging Face office!

We're at capacity for the party but add your name in the waiting list as we're trying to privatize the passage du Caire for extra space for robots πŸ€–πŸ¦ΎπŸ¦Ώ

https://t.co/enkFXjWndJ
  • 1 reply
Β·
merveΒ 
posted an update about 1 month ago
view post
Post
3348
Apollo is a new family of open-source video language models by Meta, where 3B model outperforms most 7B models and 7B outperforms most 30B models 🧢

✨ the models come in 1.5B https://huggingface.co/Apollo-LMMs/Apollo-1_5B-t32, 3B https://huggingface.co/Apollo-LMMs/Apollo-3B-t32 and 7B https://huggingface.co/Apollo-LMMs/Apollo-7B-t32 with A2.0 license, based on Qwen1.5 & Qwen2
✨ the authors also release a benchmark dataset https://huggingface.co/spaces/Apollo-LMMs/ApolloBench

The paper has a lot of experiments (they trained 84 models!) about what makes the video LMs work ⏯️

Try the demo for best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
they evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find out that whatever design decision was applied to small models also scale properly when the model and dataset are scaled πŸ“ˆ scaling dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies, and find that FPS sampling is better than uniform sampling, and they find 8-32 tokens per frame optimal
> They also compare image encoders, they try a variation of models from shape optimized SigLIP to DINOv2
they find google/siglip-so400m-patch14-384 to be most powerful πŸ”₯
> they also compare freezing different parts of models, training all stages with some frozen parts give the best yield

They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo 7B outperforms 30B models πŸ”₯
Β·
merveΒ 
posted an update about 1 month ago
view post
Post
1773
A complete RAG pipeline includes a reranker, which ranks the documents to find the best document πŸ““
Same goes for multimodal RAG, multimodal rerankers which we can integrate to multimodal RAG pipelines!
Learn how to build a complete multimodal RAG pipeline with vidore/colqwen2-v1.0 as retriever, lightonai/MonoQwen2-VL-v0.1 as reranker, Qwen/Qwen2-VL-7B-Instruct as VLM in this notebook that runs on a GPU as small as L4 πŸ”₯ https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_reranker_and_vlms