metadata
license: mit
datasets:
- mteb/twentynewsgroups-clustering
- mteb/biorxiv-clustering-s2s
- mteb/biorxiv-clustering-p2p
language:
- en
pipeline_tag: text-classification
library_name: sentence-transformers
tags:
- mteb
- text
- transformers
- text-embeddings-inference
- sparse-encoder
- sparse
- csr
model-index:
- name: CSR
results:
- dataset:
name: MTEB BiorxivClusteringP2P.v2
type: mteb/biorxiv_clustering_p2p
revision: f5dbc242e11dd8e24def4c4268607a49e02946dc
config: default
split: test
languages:
- eng-Latn
metrics:
- type: v_measure
value: 0.579338
- type: v_measure_std
value: 0.00337
- type: main_score
value: 0.579338
task:
type: Clustering
- dataset:
name: MTEB BiorxivClusteringS2S.v2
type: mteb/biorxiv_clustering_s2s
revision: eb4edb10386758d274cd161093eb351381a16dbf
config: default
split: test
languages:
- eng-Latn
metrics:
- type: v_measure
value: 0.540989
- type: v_measure_std
value: 0.005707
- type: main_score
value: 0.540989
task:
type: Clustering
- dataset:
name: MTEB TwentyNewsgroupsClustering
type: mteb/twenty_newsgroups_clustering
revision: 6125ec4e24fa026cec8a478383ee943acfbd5449
config: default
split: test
languages:
- eng-Latn
metrics:
- type: v_measure
value: 0.630936
- type: v_measure_std
value: 0.007942
- type: main_score
value: 0.630936
task:
type: Clustering
base_model:
- nvidia/NV-Embed-v2
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our GitHub repository.
Usage
📌 Tip: For NV-Embed-V2, using Transformers versions later than 4.47.0 may lead to performance degradation, as model_type=bidir_mistral in config.json is no longer supported. We recommend using Transformers 4.47.0.
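A quick way to confirm the environment matches the tip above is to compare the installed Transformers version against 4.47.0 before loading the model. The snippet below is a minimal sketch using only the standard library; the helper names `parse_version` and `is_newer_than_recommended` are hypothetical and not part of any library, and the parser only looks at the leading numeric `major.minor.patch` components.

```python
from importlib import metadata

# Version recommended in the tip above.
RECOMMENDED = (4, 47, 0)


def parse_version(text: str) -> tuple:
    """Parse the leading numeric 'major.minor.patch' part of a version string.

    Hypothetical helper: suffixes such as 'rc1' or '.dev0' are ignored.
    """
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_newer_than_recommended(installed: str) -> bool:
    """Return True if `installed` is strictly newer than the recommended version."""
    return parse_version(installed) > RECOMMENDED


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        if is_newer_than_recommended(installed):
            print(f"transformers {installed} is newer than 4.47.0; results may degrade")
        else:
            print(f"transformers {installed} is within the recommended range")
```

If the check reports a newer version, pinning with `pip install transformers==4.47.0` is one way to follow the recommendation.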
Sentence Transformers Usage
You can evaluate this model loaded by Sentence Transformers with the following code snippet:
import mteb
from sentence_transformers import SparseEncoder

# Load the CSR model; trust_remote_code is required for the custom model code.
model = SparseEncoder(
    "CSR-NV_Embed_v2-Clustering-Biorxiv_TwentyNews",
    trust_remote_code=True
)

# Task-specific instruction prompts, keyed by MTEB task name.
model.prompts = {
    "BiorxivClusteringP2P.v2": "Instruct: Identify the main category of Biorxiv papers based on the titles and abstracts\nQuery:",
    "BiorxivClusteringS2S.v2": "Instruct: Identify the main category of Biorxiv papers based on the titles\nQuery:",
    "TwentyNewsgroupsClustering": "Instruct: Identify the topic or theme of the given news articles\nQuery:"
}

task = mteb.get_tasks(tasks=["BiorxivClusteringP2P.v2", "BiorxivClusteringS2S.v2", "TwentyNewsgroupsClustering"])
evaluation = mteb.MTEB(tasks=task)
evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="./results/clustering",
    show_progress_bar=True,
    encode_kwargs={"convert_to_sparse_tensor": False, "batch_size": 8},
)  # MTEB doesn't support sparse tensors yet, so we need to convert to dense tensors
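After the run finishes, the scores can be read back from the output folder. The following is a rough sketch, not an official MTEB utility: it assumes each task result is written as a JSON file containing a `task_name` field and a `scores` mapping from split name to a list of entries with a `main_score` key. That layout may differ between MTEB versions, so check one of the generated files first. The function name `collect_main_scores` is hypothetical.

```python
import json
from pathlib import Path


def collect_main_scores(folder, split="test"):
    """Return {task_name: main_score} for result files found under `folder`.

    Assumed layout (verify against your MTEB version): each result file is a
    JSON object with "task_name" and "scores" -> {split: [{"main_score": ...}]}.
    """
    root = Path(folder)
    if not root.is_dir():
        return {}
    scores = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Skip files that are not task results (for example, model metadata).
        if not isinstance(data, dict) or "scores" not in data:
            continue
        entries = data["scores"].get(split, [])
        if not entries or "main_score" not in entries[0]:
            continue
        task_name = data.get("task_name", path.stem)
        scores[task_name] = entries[0]["main_score"]
    return scores


if __name__ == "__main__":
    # Same folder as output_folder in the evaluation snippet above.
    for task, score in collect_main_scores("./results/clustering").items():
        print(f"{task}: {score:.4f}")
```

Only the first entry per split is read, which is enough for tasks with a single `default` subset such as the three clustering tasks evaluated here.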
Citation
@inproceedings{wenbeyond,
title={Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation},
author={Wen, Tiansheng and Wang, Yifei and Zeng, Zequn and Peng, Zhong and Su, Yudi and Liu, Xinyang and Chen, Bo and Liu, Hongwei and Jegelka, Stefanie and You, Chenyu},
booktitle={Forty-second International Conference on Machine Learning}
}