The Fellowship is a network of exceptional people from different backgrounds who contribute to open-source machine learning π§ββοΈπ¦ΈββοΈπ¦Ήπ§ββοΈ
βΌοΈ Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, improvements to the encode methods, a Router module for asymmetric models & much more. Sparse + Dense = π₯ hybrid search performance! Details:
1οΈβ£ Sparse Encoder Models
Brand new support for sparse embedding models that generate high-dimensional embeddings (30,000+ dims) where <1% of the values are non-zero:
- Full SPLADE, Inference-free SPLADE, and CSR architecture support
- 4 new modules, 12 new losses, 9 new evaluators
- Integration with @elastic-co, @opensearch-project, @NAVER LABS Europe, @qdrant, @IBM, etc.
- Decode interpretable embeddings to understand token importance (see the sketch below)
- Hybrid search integration to get the best of both worlds
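Here's roughly what that looks like in code. A minimal sketch, assuming the new SparseEncoder class and its decode() method from the v5.0 release notes; the SPLADE checkpoint below is just an example:

```python
# minimal sketch: SparseEncoder usage as described in the v5.0 release notes;
# the checkpoint name is an example, swap in any sparse embedding model
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-v3")

embeddings = model.encode([
    "Sparse embeddings are mostly zeros.",
    "Hybrid search combines sparse and dense retrieval.",
])
print(embeddings.shape)  # (2, vocab_size): 30k+ dims, <1% of them non-zero

# decode back to (token, weight) pairs to see which terms drive the embedding
for doc_tokens in model.decode(embeddings, top_k=10):
    print(doc_tokens)
```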
2οΈβ£ Enhanced Encode Methods & Multi-Processing
- New encode_query & encode_document methods automatically use predefined prompts
- No more manual pool management: just pass a device list directly to encode()
- Much cleaner and easier to use than the old multi-process approach
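A rough sketch of the new encode methods; predefined prompts are only applied when the checkpoint defines them, and the device-list behaviour follows the v5.0 release notes:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = [
    "Hybrid search combines sparse and dense retrieval.",
    "Sparse embeddings are mostly zeros.",
]

# encode_query / encode_document pick up the model's predefined query/document prompts
query_emb = model.encode_query("how does hybrid search work?")
doc_embs = model.encode_document(corpus)

# multi-GPU / multi-process without manual pool management:
# just hand the device list straight to encode()
doc_embs = model.encode(corpus, device=["cuda:0", "cuda:1"])
```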
3οΈβ£ Router Module & Advanced Training
- Router module with different processing paths for queries vs documents
- Custom learning rates for different parameter groups
- Composite loss logging: see individual loss components
- Perfect for two-tower architectures
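A rough sketch of an asymmetric (two-tower style) model built with the new Router module; Router.for_query_document and the module layout follow the v5.0 release notes, exact argument names may differ slightly:

```python
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.models import Router

# separate branches for queries and documents (here both are small transformers)
query_encoder = models.Transformer("distilbert-base-uncased")
query_pooling = models.Pooling(query_encoder.get_word_embedding_dimension())
doc_encoder = models.Transformer("distilbert-base-uncased")
doc_pooling = models.Pooling(doc_encoder.get_word_embedding_dimension())

router = Router.for_query_document(
    query_modules=[query_encoder, query_pooling],
    document_modules=[doc_encoder, doc_pooling],
)
model = SentenceTransformer(modules=[router])

# encode_query / encode_document send inputs down the matching branch
q = model.encode_query("what is hybrid search?")
d = model.encode_document(["Hybrid search combines sparse and dense retrieval."])
```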
4οΈβ£ Comprehensive Documentation & Training
- New Training Overview, Loss Overview, API Reference docs
- 6 new training example documentation pages
- Full integration examples with major search engines
- Extensive blogpost on training sparse models
What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!
Dataset Viewer for PDFs just landed on Hugging Face ππ€ you can now preview all your PDFs more easily than before!
on top of this, there's a PdfFolder format to load PDF datasets quicker π¨
> to use it, your dataset should follow a directory structure like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels etc. you can keep them in a metadata.csv file in the same folder π€
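A minimal sketch, assuming the PdfFolder loader works like the existing imagefolder/audiofolder builders in π€ datasets:

```python
from datasets import load_dataset

# expected layout (see above):
# folder/
#   train/
#     doc1.pdf
#     doc2.pdf
#     metadata.csv   # optional: extra columns such as bounding boxes or labels
dataset = load_dataset("pdffolder", data_dir="folder")
print(dataset["train"][0])
```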
we've merged the LightGlue keypoint matcher into Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector ππ»
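A quick sketch of matching two images with the new LightGlue support; the checkpoint id and the post-processing call mirror the existing SuperGlue-style keypoint matching API in transformers and are assumptions, so double-check the model card:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("ETH-CVG/lightglue_superpoint")
model = AutoModel.from_pretrained("ETH-CVG/lightglue_superpoint")

image1 = Image.open("scene_view1.jpg")
image2 = Image.open("scene_view2.jpg")
images = [image1, image2]

inputs = processor(images, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# map the raw outputs back to matched keypoint coordinates in the original images
image_sizes = [[(img.height, img.width) for img in images]]
matches = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)
print(matches[0]["keypoints0"].shape, matches[0]["keypoints1"].shape)
```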
πΌοΈ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM, 3B active params, smarter with fewer tokens, supports long documents, videos π (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-param OCR model based on Qwen2.5VL-3B-Instruct (OS)
π£οΈ Audio
> Google released google/magenta-realtime, real-time music generation & audio synthesis (cc-by-4)
> kyutai released new speech-to-text models in 1B & 2B sizes (kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay
y'all have been asking my opinion on how OCR models compare to each other π I will leave three apps by @prithivMLmods to compare the newest models instead β€΅οΈ
> compare Nanonets-OCR-s, Qwen2-VL-OCR-2B-Instruct, RolmOCR, Aya-Vision: prithivMLmods/Multimodal-OCR
> SmolDocling, Nanonets-OCR-s, MonkeyOCR, Typhoon-OCR-7B: prithivMLmods/Multimodal-OCR2
> docscopeOCR, MonkeyOCR, coreOCR: prithivMLmods/core-OCR
so far I figured out:
> for fact-checks, you need a relatively bigger size (7B is ok!)
> Gemma 3 takes a hit without pan and scan (especially for π)
> Qwen2.5VL-32B is very talkative, great for reasoning but not good for simple tasks π£οΈ
the method is simple: find which tokens have the highest attention scores, merge the rest of the tokens based on similarity, then merge both groups together
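to make the intuition concrete, here's a toy sketch of that recipe (not the paper's exact algorithm; names, ratios and the attention source are made up): keep the top-attention tokens, assign every other token to its most similar kept token, and average each group together

```python
import torch
import torch.nn.functional as F

def merge_visual_tokens(tokens, attn_scores, keep_ratio=0.25):
    # tokens: (N, D) image tokens; attn_scores: (N,) e.g. attention received from the text query
    n_keep = max(1, int(tokens.size(0) * keep_ratio))
    keep_idx = attn_scores.topk(n_keep).indices
    drop_mask = torch.ones(tokens.size(0), dtype=torch.bool)
    drop_mask[keep_idx] = False

    kept, dropped = tokens[keep_idx], tokens[drop_mask]
    if dropped.numel() == 0:
        return kept

    # assign each dropped token to its most similar kept token (cosine similarity)
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)

    # merge: average each kept token with the dropped tokens assigned to it
    merged = kept.clone()
    for i in range(kept.size(0)):
        group = dropped[assign == i]
        if group.numel():
            merged[i] = torch.cat([kept[i : i + 1], group]).mean(dim=0)
    return merged

pruned = merge_visual_tokens(torch.randn(576, 1024), torch.rand(576))
print(pruned.shape)  # (144, 1024): 4x fewer image tokens
```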
their method is both training-free and usable for fine-tuning: the authors report a 5-point improvement on average across vision language tasks + an 8x improvement in prefilling time for Llava-Next 7B and 13B π€―
removing redundant tokens improves image token quality too π₯Ή
we have launched Kernel Hub: easy optimized kernels for all models on Hugging Face π₯ use them right away! it's where the community shares optimized kernels π€
this release comes in three parts
> Kernel Hub: contains (as of now) 14 kernels
> kernels: Python library to load kernels from Kernel Hub
> kernel-builder: Nix package to build kernels for PyTorch (made using the PyTorch C++ frontend)
when building models, your regular workflow should be pulling kernels from the Hub and building your model with them π€ here's a practical example with RMSNorm:
1. pull the kernel from the Hub with get_kernel
2. decorate with use_kernel_forward_from_hub
3. inject it into your model (see the sketch below)
we'd love to hear your feedback! ππ» we also welcome kernel contributions from the community π₯Ήπ
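a condensed sketch of those three steps, loosely based on the kernels library docs; the kernel repo id and the "RMSNorm" layer name are examples and may not match the exact Hub mapping:

```python
import torch
import torch.nn as nn
from kernels import get_kernel, use_kernel_forward_from_hub

# 1. pull a kernel straight from the Hub (downloaded & cached, no local build step)
activation = get_kernel("kernels-community/activation")

# 2. decorate your layer so its forward can be swapped for an optimized Hub kernel
@use_kernel_forward_from_hub("RMSNorm")
class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        # plain PyTorch fallback; the Hub kernel takes over where one is available
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

# 3. inject it into your model like any other nn.Module
norm = RMSNorm(4096).cuda()
out = norm(torch.randn(2, 16, 4096, device="cuda"))
```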
Dolphin: new OCR model by ByteDance with MIT license π¬
the model first detects elements in the layout (tables, formulas etc) and then parses each element in parallel for generation
Model: ByteDance/Dolphin
Try the demo: ByteDance/Dolphin
stop building parser pipelines ππ» there's a new document parser that is small, fast, Apache 2.0-licensed and better than all the other ones! π±
echo840/MonkeyOCR is a 3B model that can parse everything (charts, formulas, tables etc) in a document π€
> the authors show in the paper that document parsing pipelines often have errors that propagate through the stages
> single e2e models do better, but they're too heavy to use
this model addresses both: it's lighter, faster, stronger π₯
> based on ViT, different sizes (L/G/H) and resolutions (286/384)
> 0-day support in π€ transformers
> comes with a physical reasoning (from video) benchmark: MVPBench, IntPhys 2, and CausalVQA facebook/physical_reasoning_leaderboard
Qwen2.5-Omni is soooo good that people build multimodal reasoning models off of it π₯Ή
> KE-Team/Ke-Omni-R-3B is an open-source audio reasoning model, SOTA on the average of benchmarks, based on Qwen/Qwen2.5-Omni-3B π£οΈ
> Haoz0206/Omni-R1 is a video reasoning model with pixel-level grounding (see below) and it's super competitive β―οΈ based on Qwen/Qwen2.5-Omni-7B