What libraries can I use for Feature Extraction?

The sentence-transformers, transformers, and transformers.js libraries are compatible with Feature Extraction.

What models can I use for Feature Extraction?

The thenlper/gte-largeand Alibaba-NLP/gte-Qwen1.5-7B-instruct models can be used for Feature Extraction.

What datasets can I use for Feature Extraction?

The and wikipedia dataset can be used for Feature Extraction.

Tasks

Feature Extraction

Feature extraction is the task of extracting features learnt in a model.

Inputs

Input

India, officially the Republic of India, is a country in South Asia.

Feature Extraction Model

Output

Dimension 1	Dimension 2	Dimension 3
2.583383083343506	2.757075071334839	0.9023529887199402
8.29393482208252	1.1071064472198486	2.03399395942688
-0.7754912972450256	-1.647324562072754	-0.6113331913948059
0.07087723910808563	1.5942802429199219	1.4610432386398315

About Feature Extraction

Use Cases

Transfer Learning

Models trained on a specific dataset can learn features about the data. For instance, a model trained on an English poetry dataset learns English grammar at a very high level. This information can be transferred to a new model that is going to be trained on tweets. This process of extracting features and transferring to another model is called transfer learning. One can pass their dataset through a feature extraction pipeline and feed the result to a classifier.

Retrieval and Reranking

Retrieval is the process of obtaining relevant documents or information based on a user's search query. In the context of NLP, retrieval systems aim to find relevant text passages or documents from a large corpus of data that match the user's query. The goal is to return a set of results that are likely to be useful to the user. On the other hand, reranking is a technique used to improve the quality of retrieval results by reordering them based on their relevance to the query.

Retrieval Augmented Generation

Retrieval-augmented generation (RAG) is a technique in which user inputs to generative models are first queried through a knowledge base, and the most relevant information from the knowledge base is used to augment the prompt to reduce hallucinations during generation. Feature extraction models (primarily retrieval and reranking models) can be used in RAG to reduce model hallucinations and ground the model.

Inference

You can infer feature extraction models using pipeline of transformers library.

from transformers import pipeline
checkpoint = "facebook/bart-base"
feature_extractor = pipeline("feature-extraction", framework="pt", model=checkpoint)
text = "Transformers is an awesome library!"

#Reducing along the first dimension to get a 768 dimensional array
feature_extractor(text,return_tensors = "pt")[0].numpy().mean(axis=0)

'''tensor([[[ 2.5834,  2.7571,  0.9024,  ...,  1.5036, -0.0435, -0.8603],
         [-1.2850, -1.0094, -2.0826,  ...,  1.5993, -0.9017,  0.6426],
         [ 0.9082,  0.3896, -0.6843,  ...,  0.7061,  0.6517,  1.0550],
         ...,
         [ 0.6919, -1.1946,  0.2438,  ...,  1.3646, -1.8661, -0.1642],
         [-0.1701, -2.0019, -0.4223,  ...,  0.3680, -1.9704, -0.0068],
         [ 0.2520, -0.6869, -1.0582,  ...,  0.5198, -2.2106,  0.4547]]])'''

A very popular library for training similarity and search models is called sentence-transformers. To get started, install the library.

pip install -U sentence-transformers

You can infer with sentence-transformers models as follows.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])