What libraries can I use for Sentence Similarity?

The sentence-transformers, spacy, and transformers.js libraries are compatible with Sentence Similarity.

What models can I use for Sentence Similarity?

The sentence-transformers/all-mpnet-base-v2, BAAI/bge-m3, and HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5 models can be used for Sentence Similarity.

What datasets can I use for Sentence Similarity?

The microsoft/ms_marco dataset can be used for Sentence Similarity.

What metrics can I use for Sentence Similarity?

The Mean Reciprocal Rank and Cosine Similarity metrics can be used for Sentence Similarity.

Sentence Similarity

Sentence Similarity is the task of determining how similar two texts are. Sentence similarity models convert input texts into vectors (embeddings) that capture semantic information, then calculate how close (similar) those vectors are to each other. This task is particularly useful for information retrieval and clustering/grouping.

Inputs

Source sentence

Machine learning is so easy.

Sentences to compare to

Deep learning is so straightforward.

This is so difficult, like rocket science.

I can't believe how much I struggled with this.

Sentence Similarity Model

Output

Deep learning is so straightforward.

0.623

This is so difficult, like rocket science.

0.413

I can't believe how much I struggled with this.

0.256

About Sentence Similarity

Use Cases 🔍

Information Retrieval

You can extract information from documents using Sentence Similarity models. The first step is to rank documents using Passage Ranking models. You can then take the top-ranked document and search it with a Sentence Similarity model, selecting the sentence most similar to the input query.
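
The two steps above can be sketched in plain Python. This is only an illustration: the document names, passage-ranking scores, and 3-dimensional embeddings below are hypothetical stand-ins for what real models would return.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical data: passage-ranking scores per document, and toy
# sentence embeddings standing in for real model output.
doc_scores = {"doc_a": 0.41, "doc_b": 0.87}
doc_sentences = {
    "doc_a": {"Cats sleep a lot.": [0.9, 0.1, 0.0]},
    "doc_b": {
        "Embeddings are vectors.": [0.1, 0.9, 0.2],
        "The weather was nice.": [0.0, 0.1, 0.9],
    },
}
query_embedding = [0.2, 0.8, 0.1]

# Step 1: keep the document the passage ranker scored highest.
top_doc = max(doc_scores, key=doc_scores.get)

# Step 2: inside that document, pick the sentence closest to the query.
best_sentence = max(
    doc_sentences[top_doc],
    key=lambda s: cosine(query_embedding, doc_sentences[top_doc][s]),
)
print(top_doc, "->", best_sentence)
```

In practice the scores and embeddings would come from a Passage Ranking model and a Sentence Similarity model respectively; only the selection logic is shown here.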

The Sentence Transformers library

The Sentence Transformers library computes embeddings of sentences, paragraphs, and entire documents. An embedding is a vector representation of a text, and it is useful for measuring how similar two texts are.

You can find and use thousands of Sentence Transformers models from the Hub by directly using the library, playing with the widgets in the browser or using Inference Endpoints.

Task Variants

Passage Ranking

Passage Ranking is the task of ranking documents based on their relevance to a given query. The task is evaluated on Mean Reciprocal Rank. These models take one query and multiple documents and return the documents ranked by their relevance to the query. 📄

You can run inference with Passage Ranking models using Inference Endpoints. The inputs are a query and the documents to search for content relevant to it. The model returns a relevance score for each document with respect to the query.

import os
import requests

API_URL = "https://router.huggingface.co/hf-inference/models/sentence-transformers/msmarco-distilbert-base-tas-b"
# Assumption: your Hugging Face access token is stored in the HF_TOKEN environment variable.
api_token = os.environ["HF_TOKEN"]
headers = {"Authorization": f"Bearer {api_token}"}

def query(payload):
    # Send the payload as JSON and return the decoded response.
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

data = query(
    {
        "inputs": {
            "source_sentence": "That is a happy person",
            "sentences": [
                "That is a happy dog",
                "That is a very happy person",
                "Today is a sunny day"
            ]
        }
    }
)
print(data)
## [0.853, 0.981, 0.655]
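
The response holds one score per input sentence, in input order, so producing the actual ranking is left to the caller. A minimal sketch, reusing the sentences and scores from the example above:

```python
# Sentences and scores copied from the example request above, in input order.
documents = [
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]
scores = [0.853, 0.981, 0.655]

# Pair each document with its score and sort by score, highest first.
ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)

for rank, (doc, score) in enumerate(ranked, start=1):
    print(rank, score, doc)
```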

Semantic Textual Similarity

Semantic Textual Similarity is the task of evaluating how similar two texts are in terms of meaning. These models take a source sentence and a list of sentences to compare against it, and return a list of similarity scores. The benchmark dataset is the Semantic Textual Similarity Benchmark. The task is evaluated with Pearson's correlation or Spearman's rank correlation between the predicted scores and human ratings.

import os
import requests

API_URL = "https://router.huggingface.co/hf-inference/models/sentence-transformers/all-MiniLM-L6-v2"
# Assumption: your Hugging Face access token is stored in the HF_TOKEN environment variable.
api_token = os.environ["HF_TOKEN"]
headers = {"Authorization": f"Bearer {api_token}"}

def query(payload):
    # Send the payload as JSON and return the decoded response.
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

data = query(
    {
        "inputs": {
            "source_sentence": "I'm very happy",
            "sentences": ["I'm filled with happiness", "I'm happy"]
        }
    }
)
print(data)
## [0.605, 0.894]
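
Evaluating such a model on a benchmark typically means correlating its scores with human similarity ratings. A minimal sketch in plain Python, computing both Pearson's correlation and Spearman's rank correlation; the gold ratings and predictions below are hypothetical, and the rank helper assumes there are no tied values:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def to_ranks(values):
    """Rank positions (1 = smallest value); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(to_ranks(x), to_ranks(y))

# Hypothetical data: human ratings for four sentence pairs and the
# similarity scores a model predicted for the same pairs.
gold = [4.8, 1.2, 3.5, 0.4]
predicted = [0.91, 0.30, 0.62, 0.35]

print("Pearson:", round(pearson(gold, predicted), 3))
print("Spearman:", round(spearman(gold, predicted), 3))
```

Spearman's variant only looks at the ordering of the scores, so it is unaffected by the two lists being on different scales.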

You can also run models from the Hub locally using the Sentence Transformers library.

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer, util
sentences = ["I'm happy", "I'm full of happiness"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Compute an embedding for each sentence
embedding_1 = model.encode(sentences[0], convert_to_tensor=True)
embedding_2 = model.encode(sentences[1], convert_to_tensor=True)

# Cosine similarity between the two embeddings
print(util.pytorch_cos_sim(embedding_1, embedding_2))
## tensor([[0.6003]])

Useful Resources

Would you like to learn more about Sentence Transformers and Sentence Similarity? Awesome! Here you can find some curated resources that you may find helpful!

Compatible libraries

sentence-transformers

spaCy

Transformers.js

Models for Sentence Similarity

sentence-transformers/all-mpnet-base-v2

Note: This model works well for sentences and paragraphs and can be used for clustering/grouping and semantic searches.

BAAI/bge-m3

Note: A robust multilingual sentence similarity model.

HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5

Note: A robust sentence similarity model.

Datasets for Sentence Similarity

microsoft/ms_marco

Note: Bing queries with relevant passages from various web sources.

Spaces using Sentence Similarity

Gradio-Blocks/Ask_Questions_To_YouTube_Videos

Note: An application that leverages sentence similarity to answer questions from YouTube videos.

Gradio-Blocks/pubmed-abstract-retriever

Note: An application that retrieves PubMed abstracts relevant to a given online article, which can then be used as further references.

nickmuchi/article-text-summarizer

Note: An application that leverages sentence similarity to summarize text.

Metrics for Sentence Similarity

Mean Reciprocal Rank: Reciprocal Rank measures how highly the first relevant document is ranked among the documents retrieved for a query. It is the reciprocal of that document's rank: if the rank is 3, the Reciprocal Rank is 1/3 ≈ 0.33; if the rank is 1, the Reciprocal Rank is 1. Mean Reciprocal Rank is the average of the Reciprocal Rank over all queries.
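
As a rough illustration, Mean Reciprocal Rank can be computed from the rank of the first relevant document for each query. The ranks below are made up, and treating a query with no relevant result as contributing 0 is a common convention assumed here, not something stated above.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank over all queries.

    Each entry is the 1-based rank of the first relevant document for one
    query, or None if no relevant document was retrieved (counted as 0,
    which is an assumed convention).
    """
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank is not None)
    return total / len(first_relevant_ranks)

# Hypothetical ranks for three queries: reciprocal ranks are 1, 1/3 and 1/2.
ranks = [1, 3, 2]
mrr = mean_reciprocal_rank(ranks)
print(round(mrr, 3))  # 0.611
```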

Cosine Similarity: The similarity of the embeddings is evaluated mainly on cosine similarity. It is calculated as the cosine of the angle between two vectors. It is particularly useful when your texts are not the same length, since it depends only on the direction of the vectors and not on their magnitude.
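
A minimal sketch of the calculation in plain Python, using made-up vectors rather than real embeddings. It also shows that scaling a vector leaves the result unchanged, because only the angle between the vectors matters.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (not real embeddings).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice the magnitude
c = [3.0, -1.5, 0.0]   # perpendicular to a (dot product is 0)

print(cosine_similarity(a, b))  # 1.0, up to floating-point rounding
print(cosine_similarity(a, c))  # 0.0
```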