This is a sentence-transformer model derived from xlm-roberta-base. It was fine-tuned on the English MNLI data, a Czech machine-translated version of MNLI, and, for Arabic, German and Chinese, the machine-translated NLI training data distributed with XNLI. The model is thus tuned on equivalent data in all five languages, but not on explicitly parallel data; in particular, it is a multilingual S-BERT model trained without a teacher-student setup.
We used a training script provided by the sentence-transformers library: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py
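That script trains with a MultipleNegativesRankingLoss objective on NLI triplets (anchor, entailment hypothesis, contradiction hypothesis). As a rough, hypothetical sketch of such a setup (placeholder triplets and hyperparameters, not our exact configuration; see the linked script for the real details):

from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

# Build a sentence-transformer from xlm-roberta-base with mean pooling
word_embedding_model = models.Transformer('xlm-roberta-base', max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Placeholder NLI triplets (anchor, entailment, contradiction); the real training data is
# the English, Czech, Arabic, German and Chinese NLI material described above
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating.", "A man is running."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune; epochs, batch size and warmup here are illustrative only
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)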
Usage (Sentence-Transformers)
Using this model is straightforward once you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer, util
# Sentences to encode (one German, one English, to illustrate the multilingual model)
sentences = ["Etwa 9 Millionen Menschen leben in London.", "London is known for its financial district."]

# Load the model from the Hugging Face Hub and compute sentence embeddings
model = SentenceTransformer('kathaem/xlm-roberta-base-sentence-transformer-nli-5langs')
embeddings = model.encode(sentences)
print(embeddings)
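The resulting embeddings are plain vectors that can be compared directly, for example with the cosine-similarity helper from sentence_transformers.util (an illustrative extra step, continuing the snippet above):

# Cosine similarity between the two sentence embeddings
cosine_scores = util.cos_sim(embeddings[0], embeddings[1])
print(cosine_scores)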
Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply a pooling operation on top of the contextualized token embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
sentences = ["Etwa 9 Millionen Menschen leben in London.", "London ist für sein Bankenviertel bekannt."]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('kathaem/xlm-roberta-base-sentence-transformer-nli-5langs')
model = AutoModel.from_pretrained('kathaem/xlm-roberta-base-sentence-transformer-nli-5langs')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling to get sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(sentence_embeddings)
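If you also want similarity scores in this setup, one option is to L2-normalize the pooled embeddings so that their dot product equals the cosine similarity (again an optional, illustrative step continuing the code above):

import torch.nn.functional as F

# Normalize the embeddings, then a dot product gives the cosine similarity
normalized_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized_embeddings[0] @ normalized_embeddings[1])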
Citation
If you find this model useful in your work, please cite our paper:
@inproceedings{haemmerl-etal-2023-speaking,
    title = "Speaking Multiple Languages Affects the Moral Bias of Language Models",
    author = {H{\"a}mmerl, Katharina and
      Deiseroth, Bjoern and
      Schramowski, Patrick and
      Libovick{\'y}, Jind{\v{r}}ich and
      Rothkopf, Constantin and
      Fraser, Alexander and
      Kersting, Kristian},
    editor = "Rogers, Anna and
      Boyd-Graber, Jordan and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.134/",
    doi = "10.18653/v1/2023.findings-acl.134",
    pages = "2137--2156",
}