gte-large-zh

General Text Embeddings (GTE) model, introduced in the paper "Towards General Text Embeddings with Multi-stage Contrastive Learning".

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and are currently available in several sizes for both Chinese and English. They are trained on a large-scale corpus of relevant text pairs covering a wide range of domains and scenarios, which allows them to be applied to various downstream text embedding tasks, including information retrieval, semantic textual similarity, and text reranking.

Model List

| Models | Language | Max Sequence Length | Dimension | Model Size |
|---|---|---|---|---|
| GTE-large-zh | Chinese | 512 | 1024 | 0.67GB |
| GTE-base-zh | Chinese | 512 | 768 | 0.21GB |
| GTE-small-zh | Chinese | 512 | 512 | 0.10GB |
| GTE-large | English | 512 | 1024 | 0.67GB |
| GTE-base | English | 512 | 768 | 0.21GB |
| GTE-small | English | 512 | 384 | 0.10GB |
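
The Dimension column also determines how much space stored embeddings take up. As a rough sketch, assuming vectors are stored as raw float32 values (4 bytes each) with no index or compression overhead, the footprint is the number of vectors times the dimension times 4. The helper name `embedding_storage_bytes` and the corpus size of one million are illustrative assumptions, not figures from the source.

```python
def embedding_storage_bytes(num_vectors, dimension, bytes_per_value=4):
    """Raw storage for `num_vectors` dense embeddings (4 bytes per value for float32)."""
    return num_vectors * dimension * bytes_per_value

# One million embeddings at the 1024 dimensions listed for GTE-large-zh.
large = embedding_storage_bytes(1_000_000, 1024)
# The same corpus at the 384 dimensions listed for GTE-small.
small = embedding_storage_bytes(1_000_000, 384)
print(f"{large / 1e9:.2f} GB vs {small / 1e9:.2f} GB")
```

A real vector index adds its own overhead on top of this raw figure, so treat the result as a lower bound.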

Metrics

We compared the GTE models with other popular text embedding models on the MTEB benchmark (CMTEB for Chinese). For more detailed comparison results, please refer to the MTEB leaderboard.

  • Evaluation results on CMTEB
| Model | Model Size (GB) | Embedding Dimensions | Sequence Length | Average (35 datasets) | Classification (9 datasets) | Clustering (4 datasets) | Pair Classification (2 datasets) | Reranking (4 datasets) | Retrieval (8 datasets) | STS (8 datasets) |
|---|---|---|---|---|---|---|---|---|---|---|
| gte-large-zh | 0.65 | 1024 | 512 | 66.72 | 71.34 | 53.07 | 81.14 | 67.42 | 72.49 | 57.82 |
| gte-base-zh | 0.20 | 768 | 512 | 65.92 | 71.26 | 53.86 | 80.44 | 67.00 | 71.71 | 55.96 |
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | 65.13 | 69.05 | 49.16 | 82.68 | 66.41 | 70.14 | 58.66 |
| stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
| bge-large-zh-v1.5 | 1.3 | 1024 | 512 | 64.53 | 69.13 | 48.99 | 81.6 | 65.84 | 70.46 | 56.25 |
| stella-base-zh-v2 | 0.21 | 768 | 1024 | 64.36 | 68.29 | 49.4 | 79.96 | 66.1 | 70.08 | 56.92 |
| stella-base-zh | 0.21 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |
| piccolo-large-zh | 0.65 | 1024 | 512 | 64.11 | 67.03 | 47.04 | 78.38 | 65.98 | 70.93 | 58.02 |
| piccolo-base-zh | 0.2 | 768 | 512 | 63.66 | 66.98 | 47.12 | 76.61 | 66.68 | 71.2 | 55.9 |
| gte-small-zh | 0.1 | 512 | 512 | 60.04 | 64.35 | 48.95 | 69.99 | 66.21 | 65.50 | 49.72 |
| bge-small-zh-v1.5 | 0.1 | 512 | 512 | 57.82 | 63.96 | 44.18 | 70.4 | 60.92 | 61.77 | 49.1 |
| m3e-base | 0.41 | 768 | 512 | 57.79 | 67.52 | 47.68 | 63.99 | 59.54 | 56.91 | 50.47 |
| text-embedding-ada-002(openai) | - | 1536 | 8192 | 53.02 | 64.31 | 45.68 | 69.56 | 54.28 | 52.0 | 43.35 |

Usage

Code example

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# The first text is the query; the remaining texts are candidates to score against it
input_texts = [
    "中国的首都是哪里",
    "你喜欢去哪里旅游",
    "北京",
    "今天中午吃什么"
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")
model = AutoModel.from_pretrained("thenlper/gte-large-zh")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**batch_dict)

# Use the final hidden state of the first ([CLS]) token as the sentence embedding
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# Score the query (first text) against each of the remaining texts
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
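
The scoring step in the code example is cosine similarity scaled by 100: once two vectors are L2-normalized, their dot product equals the cosine of the angle between them. The following is a minimal pure-Python sketch of that arithmetic on made-up 3-dimensional vectors (real gte-large-zh embeddings have 1024 dimensions). The helper names `l2_normalize` and `scaled_scores` are illustrative and not part of any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm, mirroring F.normalize(..., p=2) on one row."""
    norm = math.sqrt(sum(x * x for x in vec))
    # Guard against division by zero for an all-zero vector.
    return [x / norm for x in vec] if norm > 0 else list(vec)

def scaled_scores(query, candidates, scale=100.0):
    """Dot product of the normalized query with each normalized candidate, times `scale`."""
    q = l2_normalize(query)
    return [scale * sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidates]

# Toy vectors standing in for embeddings (hypothetical values, not model output).
query = [1.0, 0.0, 0.0]
candidates = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]

scores = scaled_scores(query, candidates)
# Rank candidate indices from most to least similar to the query.
ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Because of the normalization, vector length does not affect the score: the first candidate points in the same direction as the query and scores highest even though it is twice as long.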

Use with sentence-transformers:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['中国的首都是哪里', '北京']

model = SentenceTransformer('thenlper/gte-large-zh')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))

Limitation

This model supports Chinese text only, and inputs longer than 512 tokens are truncated to the first 512 tokens.
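
One common workaround for a fixed token limit, which the source does not describe and which is offered here only as an assumption, is to split a long token sequence into overlapping windows, embed each window separately, and combine the resulting vectors. The sketch below shows only the windowing step on a plain list of token ids. The name `split_into_windows`, the window size of 510 (leaving room for the two special tokens that BERT-style tokenizers add), and the stride are illustrative choices.

```python
def split_into_windows(token_ids, window=510, stride=255):
    """Split token ids into overlapping windows of at most `window` tokens.

    `stride` is how far the window start advances each step; a stride smaller
    than `window` makes consecutive windows overlap.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += stride
    return windows

# Hypothetical input: 1200 token ids standing in for a tokenized long document.
token_ids = list(range(1200))
windows = split_into_windows(token_ids)
print([len(w) for w in windows])
```

Each window could then be embedded as in the code example and the per-window vectors averaged and re-normalized. Whether that preserves retrieval quality for this model is not something the source reports.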

Citation

If you find our paper or models helpful, please consider citing them as follows:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}