Feature Extraction
Feature extraction is the task of extracting the features a model has learned, typically as numerical vectors (embeddings) that represent the input.
Input
India, officially the Republic of India, is a country in South Asia.
Output
Dimension 1 | Dimension 2 | Dimension 3 |
---|---|---|
2.583383083343506 | 2.757075071334839 | 0.9023529887199402 |
8.29393482208252 | 1.1071064472198486 | 2.03399395942688 |
-0.7754912972450256 | -1.647324562072754 | -0.6113331913948059 |
0.07087723910808563 | 1.5942802429199219 | 1.4610432386398315 |
About Feature Extraction
Use Cases
Transfer Learning
Models trained on a specific dataset can learn features about the data. For instance, a model trained on an English poetry dataset learns English grammar at a very high level. This information can be transferred to a new model that is going to be trained on tweets. This process of extracting features and transferring them to another model is called transfer learning. You can pass your dataset through a feature extraction pipeline and feed the resulting features to a classifier.
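As a rough illustration of feeding extracted features to a classifier, the sketch below trains a nearest-centroid classifier on feature vectors. The vectors are made-up 3-dimensional toy values standing in for real extracted features, and the helper names `fit_centroids` and `predict` are hypothetical; any classifier that accepts fixed-size vectors could be used instead.

```python
from collections import defaultdict
from math import dist


def fit_centroids(features, labels):
    """Average the feature vectors of each class into one centroid per label."""
    grouped = defaultdict(list)
    for vector, label in zip(features, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


# Toy values (assumption): real extracted features usually have hundreds of dimensions.
train_features = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
train_labels = ["positive", "positive", "negative", "negative"]

centroids = fit_centroids(train_features, train_labels)
prediction = predict(centroids, [0.7, 0.3, 0.1])
print(prediction)  # positive
```

The feature extractor stays frozen in this setup; only the small classifier on top is fitted to the new data.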
Retrieval and Reranking
Retrieval is the process of obtaining relevant documents or information based on a user's search query. In the context of NLP, retrieval systems aim to find text passages or documents in a large corpus that match the user's query. The goal is to return a set of results that are likely to be useful to the user. Reranking is a technique for improving the quality of retrieval results by reordering them based on their relevance to the query.
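The two stages can be sketched as follows, assuming the query and documents have already been turned into embedding vectors. The 2-dimensional vectors are toy values, and the word-overlap scorer is only a stand-in for a real reranking model; `retrieve`, `rerank`, and `overlap_score` are hypothetical names.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, top_k):
    """Stage 1: return indices of the top_k documents by cosine similarity."""
    order = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return order[:top_k]


def rerank(candidates, score_fn):
    """Stage 2: reorder the candidates by a second relevance score."""
    return sorted(candidates, key=score_fn, reverse=True)


docs = [
    "India is a country in South Asia.",
    "Paris is the capital of France.",
    "New Delhi is the capital of India.",
]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]  # toy embeddings (assumption)
query, query_vec = "capital of India", [1.0, 0.2]


def overlap_score(index):
    """Stand-in reranker: count the words shared by the query and the document."""
    doc_words = set(docs[index].lower().replace(".", "").split())
    return len(set(query.lower().split()) & doc_words)


candidates = retrieve(query_vec, doc_vecs, top_k=2)
reranked = rerank(candidates, overlap_score)
print(candidates, reranked)  # [0, 2] [2, 0]
```

Retrieval is kept cheap so it can scan the whole corpus, while the reranker only scores the short candidate list.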
Retrieval Augmented Generation
Retrieval-augmented generation (RAG) is a technique in which the user's input to a generative model is first used to query a knowledge base, and the most relevant information found there is added to the prompt. Feature extraction models (primarily retrieval and reranking models) can be used in RAG to ground the model and reduce hallucinations during generation.
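A minimal sketch of the prompt augmentation step is shown below. It assumes the knowledge base is a list of passages with precomputed embeddings (toy 2-dimensional values here) and ranks them by dot product; the helper names and the prompt template are illustrative choices, not a fixed format.

```python
def top_passages(query_vec, knowledge_base, k):
    """Return the k passages whose vectors score highest against the query."""
    def score(entry):
        _, vector = entry
        return sum(q * v for q, v in zip(query_vec, vector))

    best = sorted(knowledge_base, key=score, reverse=True)[:k]
    return [passage for passage, _ in best]


def augment_prompt(question, passages):
    """Prepend the retrieved passages to the question as grounding context."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy knowledge base (assumption): (passage, embedding) pairs.
knowledge_base = [
    ("India is a country in South Asia.", [0.9, 0.1]),
    ("Transformers is a Python library.", [0.1, 0.9]),
]

question = "Where is India?"
question_vec = [1.0, 0.0]  # toy query embedding (assumption)
prompt = augment_prompt(question, top_passages(question_vec, knowledge_base, k=1))
print(prompt)
```

The resulting prompt would then be sent to the generative model in place of the bare question.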
Inference
You can run inference with feature extraction models using the pipeline API of the transformers library.
from transformers import pipeline

checkpoint = "facebook/bart-base"
feature_extractor = pipeline("feature-extraction", framework="pt", model=checkpoint)
text = "Transformers is an awesome library!"

# Token-level features: a tensor of shape (1, number of tokens, 768)
features = feature_extractor(text, return_tensors="pt")
print(features)
'''tensor([[[ 2.5834, 2.7571, 0.9024, ..., 1.5036, -0.0435, -0.8603],
[-1.2850, -1.0094, -2.0826, ..., 1.5993, -0.9017, 0.6426],
[ 0.9082, 0.3896, -0.6843, ..., 0.7061, 0.6517, 1.0550],
...,
[ 0.6919, -1.1946, 0.2438, ..., 1.3646, -1.8661, -0.1642],
[-0.1701, -2.0019, -0.4223, ..., 0.3680, -1.9704, -0.0068],
[ 0.2520, -0.6869, -1.0582, ..., 0.5198, -2.2106, 0.4547]]])'''

# Averaging over the token dimension gives a single 768-dimensional array
embedding = features[0].numpy().mean(axis=0)
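The reduction over the token dimension used above is commonly called mean pooling. The sketch below shows the same computation on a tiny made-up matrix in plain Python. The optional `attention_mask` argument is an addition for illustration: it is an assumption about batched inputs, where padding tokens are usually excluded from the average, and is not part of the pipeline call above.

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average token vectors into one vector, skipping tokens masked with 0."""
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no tokens left to pool")
    # zip(*kept) iterates over feature dimensions (columns) of the kept tokens.
    return [sum(column) / len(kept) for column in zip(*kept)]


# Three toy token vectors with 3 features each; the last one plays the role of padding.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [100.0, 100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0, 4.0]
```

The output has one value per feature dimension, so a (number of tokens, 768) matrix pools down to 768 values regardless of the input length.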
A popular library for training similarity and search models is sentence-transformers. To get started, install the library.
pip install -U sentence-transformers
You can run inference with sentence-transformers models as follows.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
# [0.6660, 1.0000, 0.1411],
# [0.1046, 0.1411, 1.0000]])
Text Embeddings Inference
Text Embeddings Inference (TEI) is a toolkit for serving feature extraction models with a few lines of code.
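Once a TEI server is running, embeddings are requested over HTTP. The sketch below builds such a request with the Python standard library. It assumes a server is already listening at `http://127.0.0.1:8080` and exposes the `/embed` route with an `inputs` field; check the address and route against your own deployment. The helper names `build_embed_request` and `embed` are hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Assumption: a TEI server has been started locally on this address.
DEFAULT_URL = "http://127.0.0.1:8080"


def build_embed_request(texts, base_url=DEFAULT_URL):
    """Build a POST request asking the server to embed a list of texts."""
    payload = json.dumps({"inputs": texts}).encode("utf-8")
    return Request(
        f"{base_url}/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url=DEFAULT_URL):
    """Send the request and return the parsed JSON response (needs a live server)."""
    with urlopen(build_embed_request(texts, base_url)) as response:
        return json.loads(response.read())


request = build_embed_request(["What is Deep Learning?"])
print(request.full_url)  # http://127.0.0.1:8080/embed
```

Calling `embed(["What is Deep Learning?"])` performs the actual network call, so it only works while the server is up.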