sentence-transformers (Sentence Transformers)

posted an update about 12 hours ago

Post

233

‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, encode methods improvements, Router module & much more. Sparse + Dense = 🔥 hybrid search performance!

1️⃣ Sparse Encoder Models - New support for sparse embeddings (30k+ dims, <1% non-zero)

* Full SPLADE, Inference-free SPLADE, CSR support
* 4 new modules, 12 losses, 9 evaluators
* Integration with elastic, opensearch-project, Qdrant, ibm-granite
* Decode interpretable embeddings
* Hybrid search integration

2️⃣ Enhanced Encode Methods

* encode_query & encode_document with auto prompts
* Direct device list passing to encode()
* Cleaner multi-processing

3️⃣ Router Module & Training

* Different paths for queries vs documents
* Custom learning rates per parameter group
* Composite loss logging
* Perfect for two-tower architectures

4️⃣ Documentation & Training

* New Training/Loss Overview docs
* 6 training example pages
* Search engine integration examples

Read the comprehensive blogpost about training sparse embedding models: https://huggingface.co/blog/train-sparse-encoder

See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/v5.0.0

What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!

I'm incredibly excited to see the community explore sparse embeddings and hybrid search! The interpretability alone makes this a game-changer for understanding what your models are actually doing.

🙏 Thanks to @tomaarsen for this incredible opportunity!

tomaarsen

posted an update about 12 hours ago

Post

102

‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, encode methods improvements, Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance! Details:

1️⃣ Sparse Encoder Models
Brand new support for sparse embedding models that generate high-dimensional embeddings (30,000+ dims) where <1% are non-zero:

- Full SPLADE, Inference-free SPLADE, and CSR architecture support
- 4 new modules, 12 new losses, 9 new evaluators
- Integration with @elastic-co , @opensearch-project , @NAVER LABS Europe, @qdrant , @IBM , etc.
- Decode interpretable embeddings to understand token importance
- Hybrid search integration to get the best of both worlds

2️⃣ Enhanced Encode Methods & Multi-Processing
- Introduce encode_query & encode_document automatically use predefined prompts
- No more manual pool management - just pass device list directly to encode()
- Much cleaner and easier to use than the old multi-process approach

3️⃣ Router Module & Advanced Training
- Router module with different processing paths for queries vs documents
- Custom learning rates for different parameter groups
- Composite loss logging - see individual loss components
- Perfect for two-tower architectures

4️⃣ Comprehensive Documentation & Training
- New Training Overview, Loss Overview, API Reference docs
- 6 new training example documentation pages
- Full integration examples with major search engines
- Extensive blogpost on training sparse models

Read the comprehensive blogpost about training sparse embedding models: https://huggingface.co/blog/train-sparse-encoder

See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/v5.0.0

What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!

tomaarsen

updated a collection 8 days ago

MS MARCO Mined Triplets

Collection

These datasets contain MS MARCO Triplets gathered by mining hard negatives using various models. Each dataset has various subsets. • 15 items • Updated 8 days ago • 11

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 13 days ago

Languages efficiency

👍 1

1

#120 opened 13 days ago by

cepe78

tomaarsen

in sentence-transformers/static-similarity-mrl-multilingual-v1 about 1 month ago

dim is 1024？

2

#5 opened about 1 month ago by

chaochaoli

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 about 2 months ago

Updated feature-extraction API URL

👍 🤗 6

#116 opened about 2 months ago by

tomaarsen

Why pipeline/feature-extraction API does not work?

➕ 1

2

#115 opened about 2 months ago by

gkwan-guides-3

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 2 months ago

Error 422

4

#111 opened 2 months ago by

SparkFounders

tomaarsen

posted an update 3 months ago

Post

4045

I just released Sentence Transformers v4.1; featuring ONNX and OpenVINO backends for rerankers offering 2-3x speedups and improved hard negatives mining which helps prepare stronger training datasets. Details:

🏎️ ONNX, OpenVINO, Optimization, Quantization
- I've added ONNX and OpenVINO support with just one extra argument: "backend" when loading the CrossEncoder reranker, e.g.: CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
- The export_optimized_onnx_model, export_dynamic_quantized_onnx_model, and export_static_quantized_openvino_model functions now work with CrossEncoder rerankers, allowing you to optimize (e.g. fusions, gelu approximations, etc.) or quantize (int8 weights) rerankers.
- I've uploaded ~340 ONNX & OpenVINO models for all existing models under the cross-encoder Hugging Face organization. You can use these without having to export when loading.

⛏ Improved Hard Negatives Mining
- Added 'absolute_margin' and 'relative_margin' arguments to mine_hard_negatives.
- absolute_margin ensures that sim(query, negative) < sim(query, positive) - absolute_margin, i.e. an absolute margin between the negative & positive similarities.
- relative_margin ensures that sim(query, negative) < sim(query, positive) * (1 - relative_margin), i.e. a relative margin between the negative & positive similarities.
- Inspired by the excellent NV-Retriever paper from NVIDIA.

And several other small improvements. Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v4.1.0

With this release, I introduce near-feature parity between the SentenceTransformer embedding & CrossEncoder reranker models, which I've wanted to do for quite some time! With rerankers very strongly supported now, it's time to look forward to other useful architectures!

tomaarsen

in sentence-transformers/stsb 3 months ago

Unable to download the Dataset

1

#1 opened 3 months ago by

ajaykrishna2222

pcuenq

authored a paper 3 months ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 192

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 3 months ago

Error using model in python: NameError: name 'init_empty_weights' is not defined

👀 1

2

#108 opened 3 months ago by

Shira1234

tomaarsen

updated a collection 3 months ago

Embedding Model Datasets

Collection

A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers • 70 items • Updated Apr 7 • 131

tomaarsen

updated a dataset 3 months ago

sentence-transformers/msmarco

Viewer • Updated Mar 31 • 527M • 1.31k • 5

tomaarsen

posted an update 3 months ago

Post

2657

‼️Sentence Transformers v4.0 is out! You can now train and finetune reranker models with multi-GPU training, bf16 support, loss logging, callbacks & much more. I also prove that finetuning on your domain helps much more than you might think.

1️⃣ Reranker Training Refactor
Reranker models can now be trained using an extensive trainer with a lot of powerful features:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support; loss logging
- Evaluation datasets + evaluation loss
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more!

Read my detailed blogpost to learn about the components that make up this new training approach: https://huggingface.co/blog/train-reranker
Notably, the release is fully backwards compatible: all deprecations are soft, meaning that they still work but emit a warning informing you how to upgrade.

2️⃣ New Reranker Losses
- 11 new losses:
- 2 traditional losses: BinaryCrossEntropy and CrossEntropy
- 2 distillation losses: MSE and MarginMSE
- 2 in-batch negatives losses: MNRL (a.k.a. InfoNCE) and CMNRL
- 5 learning to rank losses: Lambda, p-ListMLE, ListNet, RankNet, ListMLE

3️⃣ New Reranker Documentation
- New Training Overview, Loss Overview, API Reference docs
- 5 new, 1 refactored training examples docs pages
- 13 new, 6 refactored training scripts
- Migration guides (2.x -> 3.x, 3.x -> 4.x)

4️⃣ Blogpost
Alongside the release, I've written a blogpost where I finetune ModernBERT on a generic question-answer dataset. My finetunes easily outperform all general-purpose reranker models, even models 4x as big. Finetuning on your domain is definitely worth it: https://huggingface.co/blog/train-reranker

See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/v4.0.1

tomaarsen

in sentence-transformers/distiluse-base-multilingual-cased-v2 4 months ago

Update README.md

👍 1

1

#11 opened 4 months ago by

FischerWells

tomaarsen

posted an update 4 months ago

Post

6836

An assembly of 18 European companies, labs, and universities have banded together to launch 🇪🇺 EuroBERT! It's a state-of-the-art multilingual encoder for 15 European languages, designed to be finetuned for retrieval, classification, etc.

🇪🇺 15 Languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
3️⃣ 3 model sizes: 210M, 610M, and 2.1B parameters - very very useful sizes in my opinion
➡️ Sequence length of 8192 tokens! Nice to see these higher sequence lengths for encoders becoming more common.
⚙️ Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
🔥 A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
📊 Evaluated against mDeBERTa, mGTE, XLM-RoBERTa for Retrieval, Classification, and Regression (after finetuning for each task separately): EuroBERT punches way above its weight.
📝 Detailed paper with all details, incl. data: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.

Check out the release blogpost here: https://huggingface.co/blog/EuroBERT/release
* EuroBERT/EuroBERT-210m
* EuroBERT/EuroBERT-610m
* EuroBERT/EuroBERT-2.1B

The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!

1 reply

·

tomaarsen

updated a dataset 4 months ago

sentence-transformers/msmarco-msmarco-MiniLM-L6-v3

Viewer • Updated Mar 6 • 80.6M • 443 • 2

tomaarsen

updated a model 4 months ago

sentence-transformers/average_word_embeddings_levy_dependency

Sentence Similarity • Updated Mar 6

Sentence Transformers

AI & ML interests