This is a sentence-transformer model derived from xlm-roberta-base. It was fine-tuned on the English MNLI data, a Czech machine-translated version of MNLI, and, for Arabic, German and Chinese, the machine-translated NLI training data distributed with XNLI. The model is thus tuned on equivalent data in all five languages, but not on explicitly parallel data; in particular, it is a multilingual S-BERT model trained without a teacher-student setup.
We used a training script provided by the sentence-transformers library: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py
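That script trains with a MultipleNegativesRankingLoss objective on NLI triplets (anchor, entailment hypothesis, contradiction hypothesis). As a rough, hypothetical sketch of such a setup (placeholder triplets and hyperparameters, not our exact configuration; see the linked script for the real details):

from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

# Build a sentence-transformer from xlm-roberta-base with mean pooling
word_embedding_model = models.Transformer('xlm-roberta-base', max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Placeholder NLI triplets (anchor, entailment, contradiction); the real training data is
# the English, Czech, Arabic, German and Chinese NLI material described above
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating.", "A man is running."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune; epochs, batch size and warmup here are illustrative only
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)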
Usage (Sentence-Transformers)
Using this model is straightforward once you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer, util
# Sentences to encode (one German, one English, to illustrate the multilingual model)
sentences = ["Etwa 9 Millionen Menschen leben in London.", "London is known for its financial district."]

# Load the model from the Hugging Face Hub and compute sentence embeddings
model = SentenceTransformer('kathaem/xlm-roberta-base-sentence-transformer-nli-5langs')
embeddings = model.encode(sentences)
print(embeddings)
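The resulting embeddings are plain vectors that can be compared directly, for example with the cosine-similarity helper from sentence_transformers.util (an illustrative extra step, continuing the snippet above):

# Cosine similarity between the two sentence embeddings
cosine_scores = util.cos_sim(embeddings[0], embeddings[1])
print(cosine_scores)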
Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply a pooling operation on top of the contextualized token embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
sentences = ["Etwa 9 Millionen Menschen leben in London.", "London ist für sein Bankenviertel bekannt."]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('kathaem/xlm-roberta-base-sentence-transformer-nli-5langs')
model = AutoModel.from_pretrained('kathaem/xlm-roberta-base-sentence-transformer-nli-5langs')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling to get sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(sentence_embeddings)
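If you also want similarity scores in this setup, one option is to L2-normalize the pooled embeddings so that their dot product equals the cosine similarity (again an optional, illustrative step continuing the code above):

import torch.nn.functional as F

# Normalize the embeddings, then a dot product gives the cosine similarity
normalized_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized_embeddings[0] @ normalized_embeddings[1])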
Citation
If you find this model useful in your work, please cite our paper:
@inproceedings{haemmerl-etal-2023-speaking,
    title = "Speaking Multiple Languages Affects the Moral Bias of Language Models",
    author = {H{\"a}mmerl, Katharina and
      Deiseroth, Bjoern and
      Schramowski, Patrick and
      Libovick{\'y}, Jind{\v{r}}ich and
      Rothkopf, Constantin and
      Fraser, Alexander and
      Kersting, Kristian},
    editor = "Rogers, Anna and
      Boyd-Graber, Jordan and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.134/",
    doi = "10.18653/v1/2023.findings-acl.134",
    pages = "2137--2156",
}