bi-matrix/gmatrix-embedding

This model is built on the KF-DeBERTa model and the KorSTS and KorNLI datasets, and was trained as follows using the continue-learning method described in the official sentence-transformers documentation.

Multi-task learning for 10 epochs: MultipleNegativesRankingLoss on the NLI dataset after negative sampling, and CosineSimilarityLoss on the STS dataset
A further 4 epochs of multi-task training with the learning rate lowered to 1e-06
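
The card does not say how the negatives were sampled from the NLI data. A minimal sketch of one common approach, assuming each premise's entailment hypothesis is used as the positive and its contradiction hypothesis as the hard negative, is shown below. The function names `build_nli_triplets` and `scale_sts_score` and the toy rows are hypothetical and are not taken from the author's training script. The resulting triplets are the kind of input MultipleNegativesRankingLoss accepts, and the rescaled STS labels are the kind of target CosineSimilarityLoss expects.

```python
from collections import defaultdict


def build_nli_triplets(rows):
    """Turn NLI rows into (anchor, positive, hard_negative) triplets.

    rows: iterable of (premise, hypothesis, label), where label is one of
    "entailment", "neutral", or "contradiction". Neutral rows are ignored.
    """
    by_premise = defaultdict(lambda: {"entailment": [], "contradiction": []})
    for premise, hypothesis, label in rows:
        if label in ("entailment", "contradiction"):
            by_premise[premise][label].append(hypothesis)

    triplets = []
    for premise, hyps in by_premise.items():
        # A premise contributes triplets only if it has both label types.
        for positive in hyps["entailment"]:
            for negative in hyps["contradiction"]:
                triplets.append((premise, positive, negative))
    return triplets


def scale_sts_score(score, max_score=5.0):
    """Map a 0-5 STS label to the 0-1 range used as a cosine-similarity target."""
    return score / max_score


# Hypothetical toy rows, for illustration only.
toy_rows = [
    ("A man is eating.", "Someone is eating.", "entailment"),
    ("A man is eating.", "Nobody is eating.", "contradiction"),
    ("A man is eating.", "The man is hungry.", "neutral"),
    ("A dog runs.", "An animal moves.", "entailment"),
]
triplets = build_nli_triplets(toy_rows)
```

In this toy input only the first premise yields a triplet, because the second premise has no contradiction hypothesis to serve as a negative.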

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("bi-matrix/gmatrix-embedding")
embeddings = model.encode(sentences)
print(embeddings)
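
Once sentences are embedded, semantic search reduces to ranking candidate vectors by their similarity to a query vector. The following is a minimal, dependency-free sketch of that step. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the toy 3-dimensional vectors stand in for the 768-dimensional output of `model.encode`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (corpus_index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real sentence embeddings.
query = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, corpus)
```

With real embeddings, `query` would be one row of the array returned by `model.encode` and `corpus` the remaining rows.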

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first, you pass your input through the transformer model, then you apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("bi-matrix/gmatrix-embedding")
model = AutoModel.from_pretrained("bi-matrix/gmatrix-embedding")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

Evaluation Results

These are the results on the KorSTS evaluation dataset.

Cosine Pearson: 85.77
Cosine Spearman: 86.30
Manhattan Pearson: 84.84
Manhattan Spearman: 85.33
Euclidean Pearson: 84.82
Euclidean Spearman: 85.29
Dot Pearson: 83.19
Dot Spearman: 83.19

model	cosine_pearson	cosine_spearman	euclidean_pearson	euclidean_spearman	manhattan_pearson	manhattan_spearman	dot_pearson	dot_spearman
gmatrix-embedding	85.77	86.30	84.82	85.29	84.84	85.33	83.19	83.19
kf-deberta-multitask	85.75	86.25	84.79	85.25	84.80	85.27	82.93	82.86
ko-sroberta-multitask	84.77	85.6	83.71	84.40	83.70	84.38	82.42	82.33
ko-sbert-multitask	84.13	84.71	82.42	82.66	82.41	82.69	80.05	79.69
ko-sroberta-base-nli	82.83	83.85	82.87	83.29	82.88	83.28	80.34	79.69
ko-sbert-nli	82.24	83.16	82.19	82.31	82.18	82.3	79.3	78.78
ko-sroberta-sts	81.84	81.82	81.15	81.25	81.14	81.25	79.09	78.54
ko-sbert-sts	81.55	81.23	79.94	79.79	79.9	79.75	76.02	75.31
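
The metric columns in the table above correspond to four ways of scoring a sentence pair from its two embeddings. A minimal sketch of those four scores follows. It assumes the convention used by sentence-transformers-style evaluators, where Euclidean and Manhattan distances are negated so that a higher value always means "more similar". The function name `similarity_scores` is hypothetical.

```python
import math


def similarity_scores(u, v):
    """Compute the four pairwise scores for two equal-length embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return {
        "cosine": dot / (norm_u * norm_v),
        # Distances are negated so that larger means more similar.
        "euclidean": -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v))),
        "manhattan": -sum(abs(a - b) for a, b in zip(u, v)),
        "dot": dot,
    }


# Toy vectors, for illustration only.
scores = similarity_scores([1.0, 2.0, 2.0], [2.0, 2.0, 1.0])
```

Each score is computed for every sentence pair in the evaluation set, and the resulting list is then correlated with the human labels.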

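These are the results on the G-MATRIX Embedding dataset. Three human annotators rated the similarity of each sentence pair on a 0-5 scale, and the ratings were averaged. For each model, cosine similarity, Euclidean distance, Manhattan distance, and dot product were computed from its embeddings, and the Pearson and Spearman correlation coefficients against the averaged ratings are reported.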

Cosine Pearson: 75.86
Cosine Spearman: 65.75
Manhattan Pearson: 72.65
Manhattan Spearman: 65.20
Euclidean Pearson: 72.48
Euclidean Spearman: 65.32
Dot Pearson: 64.71
Dot Spearman: 53.90

model	cosine_pearson	cosine_spearman	euclidean_pearson	euclidean_spearman	manhattan_pearson	manhattan_spearman	dot_pearson	dot_spearman
gmatrix-embedding	75.86	65.75	72.65	65.20	72.48	65.32	64.71	53.90
ko-sroberta-multitask	71.78	63.16	70.80	63.47	70.89	63.72	53.57	44.23
bge-m3	64.15	60.65	61.88	60.68	61.88	60.19	64.16	60.71
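
The G-MATRIX evaluation procedure described above (average three annotator ratings, then correlate them with model similarity scores) can be sketched in plain Python as follows. The annotations and model scores are made-up illustrative values, not data from the G-MATRIX dataset, and the helper names are hypothetical. Correlations are multiplied by 100 to match the scale used in the tables.

```python
import math


def mean_label(annotator_scores):
    """Average the 0-5 ratings given by the annotators for one pair."""
    return sum(annotator_scores) / len(annotator_scores)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings from three annotators for four sentence pairs.
annotations = [(5, 5, 4), (3, 2, 4), (0, 1, 0), (1, 2, 2)]
gold = [mean_label(a) for a in annotations]
# Hypothetical cosine similarities produced by a model for the same pairs.
model_sims = [0.92, 0.70, 0.10, 0.30]

pearson_pct = 100 * pearson(gold, model_sims)
spearman_pct = 100 * spearman(gold, model_sims)
```

Spearman depends only on the ordering of the scores, which is why a model can have a noticeably lower Spearman than Pearson value when it ranks mid-range pairs inconsistently.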

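G-MATRIX Embedding labeling criteria (with reference to the STS data creation for KLUE-RoBERTa)

Rate how similar the two sentences are on a 0-5 scale
Differences in spelling, spacing, periods, or commas are not considered
Compare the intent of the sentences and the meaning their expressions carry
Compare whether the sentences are similar in meaning, rather than looking for words the two sentences share
0 means there is no semantic similarity; 5 means the sentences are semantically equivalent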

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 329 with parameters:

{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': True}) with Transformer model: DeBERTaV2Model 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Citing & Authors

[MINSANG SONG] at BI-Matrix
