MarcGrumpyOlejak committed on
Commit
7440eb8
·
verified ·
1 Parent(s): fb3490a

Upload folder using huggingface_hub

README.md CHANGED
@@ -3,4 +3,132 @@ license: eupl-1.2
3
  language:
4
  - de
5
  - en
6
- ---
6
+ tags:
7
+ - sentence-transformers
8
+ - sentence-similarity
9
+ - feature-extraction
10
+ - dense
11
+ - generated_from_trainer
12
+ - dataset_size:16753490
13
+ - loss:MatryoshkaLoss
14
+ - loss:MultipleNegativesRankingLoss
15
+ datasets:
16
+ - avemio/German-RAG-EMBEDDING-TRIPLES-HESSIAN-AI
17
+ - MarcGrumpyOlejak/germanrag-scored
18
+ - MarcGrumpyOlejak/ultradistil-intel-orca-dpo-de-scored
19
+ - Short-Answer-Feedback/saf_legal_domain_german
20
+ - jfeil/GermanDefinitionGeneration-Distillation
21
+ - google/wmt24pp
22
+ - jphme/synthia_german_experimental
23
+ - google-research-datasets/paws-x
24
+ - jinaai/parallel-sentences
25
+ - Polyglot-or-Not/Fact-Completion
26
+ pipeline_tag: sentence-similarity
27
+ library_name: sentence-transformers
28
+ ---
29
+
30
+ # A static embedding model using the dbmdz/bert-base-german-uncased tokenizer, mainly built on DE/EN datasets, as a base for further experiments
31
+
32
+ This is a [sentence-transformers](https://www.SBERT.net) model trained on 74 datasets (full list at the bottom). It maps sentences & paragraphs to a 2048-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
33
+
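A minimal usage sketch (the model id below is a placeholder for this repository's id or a local path):

```python
from sentence_transformers import SentenceTransformer

# Placeholder: replace with this repository's model id or a local path.
model = SentenceTransformer("path-or-id-of-this-model")

sentences = [
    "Wie beantrage ich einen Reisepass?",
    "How do I apply for a passport?",
    "Der Zug nach Hamburg ist verspätet.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 2048)

# Cosine similarity, as configured for this model
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```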
34
+ You can find further explanations of how to build such a model in the [Static Embeddings blogpost](https://huggingface.co/blog/static-embeddings) by [Tom Aarsen](https://huggingface.co/tomaarsen) from January 2025. It took me until the end of May to start this tiny spare-time experiment.
35
+
36
+ After some tests with different tokenizers, I decided to pick one of the oldest, as it performed best while delivering the smallest model size (~240MB): [bert-base-german-uncased by the dbmdz team](https://huggingface.co/dbmdz/bert-base-german-uncased).
37
+
38
+ * **99% performance:** Unexpectedly, this model scored nearly 99% of the performance of [e5-base-sts-en-de](https://huggingface.co/danielheinz/e5-base-sts-en-de) on the GermanGovServiceRetrieval task in MTEB, while needing only about an 80th of the time (0.49 seconds vs. 40.3 seconds).
39
+ * **Matryoshka:** This model was trained with a [Matryoshka loss](https://huggingface.co/blog/matryoshka), allowing you to truncate the embeddings for faster retrieval at minimal performance cost (see the sketch after this list).
40
+ * **Evaluations:** See [Evaluations](#evaluation) for details on performance on German MTEB, special GermanGovService retrieval, embedding speed, and Matryoshka dimensionality truncation.
41
+ * **Training Script:** See [train_base.py](train_base.py) for the training script used to train this model from scratch (be warned - it is wildly commented).
42
+
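As referenced in the Matryoshka bullet above, a minimal truncation sketch using the standard `truncate_dim` option of sentence-transformers (the model id is a placeholder and 256 is just an example dimension):

```python
from sentence_transformers import SentenceTransformer

# Placeholder id/path; 256 is an arbitrary truncated Matryoshka dimension.
model = SentenceTransformer("path-or-id-of-this-model", truncate_dim=256)

embeddings = model.encode(["Ein kurzer Beispielsatz.", "A short example sentence."])
print(embeddings.shape)  # (2, 256) instead of (2, 2048)
```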
43
+ ## Model Details
44
+
45
+ ### Model Description
46
+ - **Model Type:** Sentence Transformer
47
+ <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
48
+ - **Maximum Sequence Length:** inf tokens
49
+ - **Output Dimensionality:** 2048 dimensions
50
+ - **Similarity Function:** Cosine Similarity
51
+
52
+ ## Citation
53
+
54
+ ### BibTeX
55
+
56
+ #### Sentence Transformers
57
+ ```bibtex
58
+ @inproceedings{reimers-2019-sentence-bert,
59
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
60
+ author = "Reimers, Nils and Gurevych, Iryna",
61
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
62
+ month = "11",
63
+ year = "2019",
64
+ publisher = "Association for Computational Linguistics",
65
+ url = "https://arxiv.org/abs/1908.10084",
66
+ }
67
+ ```
68
+
69
+ #### MatryoshkaLoss
70
+ ```bibtex
71
+ @misc{kusupati2024matryoshka,
72
+ title={Matryoshka Representation Learning},
73
+ author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
74
+ year={2024},
75
+ eprint={2205.13147},
76
+ archivePrefix={arXiv},
77
+ primaryClass={cs.LG}
78
+ }
79
+ ```
80
+
81
+ #### MultipleNegativesRankingLoss
82
+ ```bibtex
83
+ @misc{henderson2017efficient,
84
+ title={Efficient Natural Language Response Suggestion for Smart Reply},
85
+ author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
86
+ year={2017},
87
+ eprint={1705.00652},
88
+ archivePrefix={arXiv},
89
+ primaryClass={cs.CL}
90
+ }
91
+ ```
92
+
93
+ #### GermanGovServiceRetrieval
94
+ ```bibtex
95
+ @software{lhm-dienstleistungen-qa,
96
+ author = {Schröder, Leon Marius and
97
+ Gutknecht, Clemens and
98
+ Alkiddeh, Oubada and
99
+ Susanne Weiß,
100
+ Lukas, Leon},
101
+ month = nov,
102
+ publisher = {it@M},
103
+ title = {LHM-Dienstleistungen-QA - german public domain question-answering dataset},
104
+ url = {https://huggingface.co/datasets/it-at-m/LHM-Dienstleistungen-QA},
105
+ year = {2022},
106
+ }
107
+ ```
108
+
109
+ #### MMTEB
110
+ ```bibtex
111
+ @article{enevoldsen2025mmtebmassivemultilingualtext,
112
+ title={MMTEB: Massive Multilingual Text Embedding Benchmark},
113
+ author={Kenneth Enevoldsen and Isaac Chung and Imene Kerboua and Márton Kardos and Ashwin Mathur and David Stap and Jay Gala and Wissam Siblini and Dominik Krzemiński and Genta Indra Winata and Saba Sturua and Saiteja Utpala and Mathieu Ciancone and Marion Schaeffer and Gabriel Sequeira and Diganta Misra and Shreeya Dhakal and Jonathan Rystrøm and Roman Solomatin and Ömer Çağatan and Akash Kundu and Martin Bernstorff and Shitao Xiao and Akshita Sukhlecha and Bhavish Pahwa and Rafał Poświata and Kranthi Kiran GV and Shawon Ashraf and Daniel Auras and Björn Plüster and Jan Philipp Harries and Loïc Magne and Isabelle Mohr and Mariya Hendriksen and Dawei Zhu and Hippolyte Gisserot-Boukhlef and Tom Aarsen and Jan Kostkan and Konrad Wojtasik and Taemin Lee and Marek Šuppa and Crystina Zhang and Roberta Rocca and Mohammed Hamdy and Andrianos Michail and John Yang and Manuel Faysse and Aleksei Vatolin and Nandan Thakur and Manan Dey and Dipam Vasani and Pranjal Chitale and Simone Tedeschi and Nguyen Tai and Artem Snegirev and Michael Günther and Mengzhou Xia and Weijia Shi and Xing Han Lù and Jordan Clive and Gayatri Krishnakumar and Anna Maksimova and Silvan Wehrli and Maria Tikhonova and Henil Panchal and Aleksandr Abramov and Malte Ostendorff and Zheng Liu and Simon Clematide and Lester James Miranda and Alena Fenogenova and Guangyu Song and Ruqiya Bin Safi and Wen-Ding Li and Alessia Borghini and Federico Cassano and Hongjin Su and Jimmy Lin and Howard Yen and Lasse Hansen and Sara Hooker and Chenghao Xiao and Vaibhav Adlakha and Orion Weller and Siva Reddy and Niklas Muennighoff},
114
+ publisher = {arXiv},
115
+ journal={arXiv preprint arXiv:2502.13595},
116
+ year={2025},
117
+ url={https://arxiv.org/abs/2502.13595},
118
+ doi = {10.48550/arXiv.2502.13595},
119
+ }
120
+ ```
121
+
122
+ #### MTEB
123
+ ```bibtex
124
+ @article{muennighoff2022mteb,
125
+ author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
126
+ title = {MTEB: Massive Text Embedding Benchmark},
127
+ publisher = {arXiv},
128
+ journal={arXiv preprint arXiv:2210.07316},
129
+ year = {2022},
130
+ url = {https://arxiv.org/abs/2210.07316},
131
+ doi = {10.48550/ARXIV.2210.07316},
132
+ }
133
+ ```
134
+
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "model_type": "SentenceTransformer",
3
+ "__version__": {
4
+ "sentence_transformers": "5.0.0",
5
+ "transformers": "4.51.3",
6
+ "pytorch": "2.1.0+cu121"
7
+ },
8
+ "prompts": {
9
+ "query": "",
10
+ "document": ""
11
+ },
12
+ "default_prompt_name": null,
13
+ "similarity_fn_name": "cosine"
14
+ }
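The prompts above are intentionally empty and the similarity function is cosine; a quick sanity check against the loaded model (model id again a placeholder) might look like this:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("path-or-id-of-this-model")  # placeholder
print(model.similarity_fn_name)  # "cosine"
print(model.prompts)             # {"query": "", "document": ""}: nothing is prepended to the inputs
```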
img/time_vs_MTEB-GGSR.png ADDED
img/time_vs_MTEB-deuV1.png ADDED
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f04f3a9ab3a5073cf11ca19b4b3114f7279c42a710ebdae2d9a69f4e3ffed414
3
+ size 254787680
modules.json ADDED
@@ -0,0 +1,8 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.StaticEmbedding"
7
+ }
8
+ ]
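`modules.json` shows that the whole model consists of a single `StaticEmbedding` module. A minimal sketch of how such a module is typically initialized before training, following the Static Embeddings blogpost (it mirrors the tokenizer and dimensionality of this repository, but it is not the exact training code; see [train_base.py](train_base.py) for that):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models.StaticEmbedding import StaticEmbedding
from tokenizers import Tokenizer

# Tokenizer and embedding dimension mirror this repository; the weights start untrained.
tokenizer = Tokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
static_embedding = StaticEmbedding(tokenizer, embedding_dim=2048)
model = SentenceTransformer(modules=[static_embedding])
```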
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
train_base.py ADDED
@@ -0,0 +1,1267 @@
1
+ # German base static model for sentence comparisons, RAG & classifications.
2
+ # Inspired in January 2025 by Tom Aarsen's "Train 400x faster Static Embedding Models with Sentence Transformers"
3
+ # check: https://huggingface.co/blog/static-embeddings#code
4
+ # and check: https://sbert.net/docs/sentence_transformer/training_overview.html
5
+ # for training parameters, check also: https://huggingface.co/docs/transformers/en/main_classes/trainer
6
+ # First test build since May 25th, as I found the time.
7
+ # The datasets are mainly based upon German and English European table dataset training snippets.
8
+ # The main idea is to use only openly licensed material that can also be used commercially.
9
+ #
10
+ # This is experimental: minimal EN and mainly DE only.
11
+ #
12
+ # With locally prepared texts, building the train/test split takes about 3 minutes.
13
+ # Training on an RTX 2070 SUPER 8GB (with prepared training material) needs ~2h.
14
+
15
+ from timeit import default_timer as timer
16
+ import gc
17
+ import os
18
+ import random
19
+ import logging
20
+ import datasets
21
+ from datasets import load_dataset, Dataset, DatasetDict, concatenate_datasets
22
+ from sentence_transformers import (
23
+ SentenceTransformer,
24
+ SentenceTransformerTrainer,
25
+ SentenceTransformerTrainingArguments,
26
+ SentenceTransformerModelCardData,
27
+ SimilarityFunction,
28
+ )
29
+ from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
30
+ from sentence_transformers.training_args import BatchSamplers, MultiDatasetBatchSamplers
31
+ from sentence_transformers.models.StaticEmbedding import StaticEmbedding
32
+ from sentence_transformers.util import paraphrase_mining
33
+ from sentence_transformers.evaluation import NanoBEIREvaluator
34
+
35
+ from transformers import AutoTokenizer # sadly no blingfire
36
+
37
+ # as Sentence Transformers uses PyTorch AND TensorFlow - I need to tune it for my system
38
+ import tensorflow as tf
39
+ import torch
40
+
41
+ ## Model Version
42
+ version = '1'
43
+ sts_basename = 'sts-mrl-en-de-base'
44
+
45
+ ## MULTILINGUAL bert-base (original): ~414MB model
46
+ #tokenizer_model = 'google-bert/bert-base-multilingual-uncased'
47
+ ### The following are some different tokenizers to play around with - all of them were tested and 'dbmdz/bert-base-german-uncased' proved the most effective for German at a size of only 243MB.
48
+ ## GERMAN ONLY: ~243MB model
49
+ tokenizer_model = 'dbmdz/bert-base-german-uncased'
50
+ ## GERMAN ONLY: ~122MB model
51
+ #tokenizer_model = 'deepset/gelectra-base'
52
+ ## GERMAN ONLY; ~243MB model
53
+ #tokenizer_model = 'deepset/gbert-base'
54
+ ## MULTILINGUAL roBERTa: ~977MB model
55
+ #tokenizer_model = 'FacebookAI/xlm-roberta-base'
56
+ ## ModernBert: ~197MB model – as a test for v0.05a
57
+ #tokenizer_model = 'answerdotai/ModernBERT-base'
58
+
59
+ logging.basicConfig(
60
+ format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO
61
+ )
62
+ random.seed(12)
63
+
64
+ def load_train_eval_datasets():
65
+ """
66
+ Either load the train and eval datasets from disk or load them from the datasets library & save them to disk.
67
+
68
+ Upon saving to disk, we quit() to ensure that the datasets are not loaded into memory before training.
69
+
70
+ The order of sets here is not the same as later on in the full training/eval-sets!!!
71
+ """
72
+ try:
73
+ train_dataset = DatasetDict.load_from_disk("base_datasets/train_dataset")
74
+ eval_dataset = DatasetDict.load_from_disk("base_datasets/eval_dataset")
75
+ return train_dataset, eval_dataset
76
+ except FileNotFoundError:
77
+ print("No prepared dataset found. Building ...")
78
+ #
79
+ # Build the datasets.
80
+ # we do the biggest thing in the beginning
81
+ print("Loading mMARCO-distilled-de-hn dataset...")
82
+ # source: https://huggingface.co/datasets/MarcGrumpyOlejak/mmarco-de-distilled-scored
83
+ # original: https://huggingface.co/datasets/unicamp-dl/mmarco
84
+ # git: https://github.com/unicamp-dl/mMARCO
85
+ # License: Apache-2.0
86
+ # distilled & filtered: 254660
87
+ # Original set without Hard Negatives unused
88
+ #mmarco_de_scored = load_dataset('MarcGrumpyOlejak/mmarco-de-distilled-scored', split="train").filter(lambda _: _['score_sts'] >= 0.26)
89
+ #mmarco_de_scored = mmarco_de_scored.select_columns(['query', 'positive', 'negative'])
90
+ #mmarco_de_scored = mmarco_de_scored.train_test_split(test_size=10000, seed=12)
91
+ #mmarco_de_scored_train_ds: Dataset = mmarco_de_scored["train"]
92
+ #mmarco_de_scored_eval_ds: Dataset = mmarco_de_scored["test"]
93
+ #
94
+ # filtered, split as/with hard negatives and remaining sentences
95
+ mmarco_de_3hn_ds = load_dataset('parquet', data_files={'mmarco-de-distilled_3hn/3_hard_negatives/*.parquet'}, split="train")
96
+ mmarco_de_3hn_ds = mmarco_de_3hn_ds.train_test_split(test_size=0.02, seed=12)
97
+ mmarco_de_3hn_train_dataset: Dataset = mmarco_de_3hn_ds["train"]
98
+ mmarco_de_3hn_eval_dataset: Dataset = mmarco_de_3hn_ds["test"]
99
+ #
100
+ mmarco_de_2hn_ds = load_dataset('parquet', data_files={'mmarco-de-distilled_3hn/2_hard_negatives/*.parquet'}, split="train")
101
+ mmarco_de_2hn_ds = mmarco_de_2hn_ds.train_test_split(test_size=0.02, seed=12)
102
+ mmarco_de_2hn_train_dataset: Dataset = mmarco_de_2hn_ds["train"]
103
+ mmarco_de_2hn_eval_dataset: Dataset = mmarco_de_2hn_ds["test"]
104
+ #
105
+ mmarco_de_1hn_ds = load_dataset('parquet', data_files={'mmarco-de-distilled_3hn/1_hard_negatives/*.parquet'}, split="train")
106
+ mmarco_de_1hn_ds = mmarco_de_1hn_ds.train_test_split(test_size=0.02, seed=12)
107
+ mmarco_de_1hn_train_dataset: Dataset = mmarco_de_1hn_ds["train"]
108
+ mmarco_de_1hn_eval_dataset: Dataset = mmarco_de_1hn_ds["test"]
109
+ #
110
+ mmarco_de_0hn_ds = load_dataset('parquet', data_files={'mmarco-de-distilled_3hn/0_hard_negatives/*.parquet'}, split="train")
111
+ mmarco_de_0hn_ds = mmarco_de_0hn_ds.train_test_split(test_size=0.02, seed=12)
112
+ mmarco_de_0hn_train_dataset: Dataset = mmarco_de_0hn_ds["train"]
113
+ mmarco_de_0hn_eval_dataset: Dataset = mmarco_de_0hn_ds["test"]
114
+ print("Loaded mMARCO-distilled-de-hn dataset.")
115
+ #
116
+ print("Loading local prepared wikipedia-22-12-de datasets...")
117
+ # (need to upload the local version to build it)
118
+ # check: load_dataset('deutsche-telekom/wikipedia-22-12-de-dpr')
119
+ # License: MIT
120
+ # Copyright (c) 2023-2024 Philip May, Deutsche Telekom AG
121
+ # Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file [LICENSE](https://github.com/telekom/mltb2/blob/main/LICENSE) in the repository.
122
+ # version without hard negatives not loaded
123
+ # reversed!!! deactivate hard negatives!
124
+ name_local = 'wikipedia-22-12-de-scored'
125
+ wp_2212_de_ds = DatasetDict.load_from_disk(f'{name_local}/{name_local}.hf')
126
+ wp_2212_de_train_dataset: Dataset = wp_2212_de_ds["train"].select_columns(['question', 'context'])
127
+ wp_2212_de_eval_dataset: Dataset = wp_2212_de_ds["test"].select_columns(['question', 'context'])
128
+ #
129
+ # instead load the hard negative version
130
+ #name_local = 'wikipedia-22-12-de_hn'
131
+ #wp_2212_de_train_dataset: Dataset = load_dataset('parquet', data_files={f'{name_local}/3_hard_negatives/train-*.parquet'}, split="train")
132
+ #wp_2212_de_eval_dataset: Dataset = load_dataset('parquet', data_files={f'{name_local}/3_hard_negatives/test-*.parquet'}, split="train")
133
+ #wp_2212_de_0_train_dataset: Dataset = load_dataset('parquet', data_files={f'{name_local}/0_hard_negatives/train-*.parquet'}, split="train")
134
+ #wp_2212_de_0_eval_dataset: Dataset = load_dataset('parquet', data_files={f'{name_local}/0_hard_negatives/test-*.parquet'}, split="train")
135
+
136
+ print("Loaded prepared full wikipedia-22-12-de dataset...")
137
+ #
138
+ print("Loading swim-ir-monolingual-de-scored dataset...")
139
+ # source: https://huggingface.co/datasets/MarcGrumpyOlejak/swim-ir-monolingual-de-scored
140
+ # original: https://huggingface.co/datasets/nthakur/swim-ir-monolingual
141
+ # entries: ~447000
142
+ # filtered: 356552
143
+ # combined: 713104
144
+ # License: CC-BY-SA-4.0
145
+ # Original set without Hard Negatives unused
146
+ #swim_ir_de_ds = load_dataset("MarcGrumpyOlejak/swim-ir-monolingual-de-scored", split="train").filter(lambda _: _['score_sts'] >= 0.26 and _['score_sts'] < 0.99 and _['query'] != '')
147
+ #swim_ir_de_key_ds = swim_ir_de_ds.select_columns(['text', 'title'])
148
+ #swim_ir_de_key_ds = swim_ir_de_key_ds.rename_columns({'text': 'sentence1', 'title': 'sentence2'})
149
+ #swim_ir_de_ds = swim_ir_de_ds.select_columns(['query', 'text'])
150
+ #swim_ir_de_ds = swim_ir_de_ds.rename_columns({'query': 'sentence1', 'text': 'sentence2'})
151
+ #swim_ir_de_ds = concatenate_datasets([swim_ir_de_ds, swim_ir_de_key_ds])
152
+ #swim_ir_de_ds = swim_ir_de_ds.train_test_split(test_size=10000, seed=12)
153
+ #swim_ir_de_train_dataset: Dataset = swim_ir_de_ds["train"]
154
+ #swim_ir_de_eval_dataset: Dataset = swim_ir_de_ds["test"]
155
+ #
156
+ # filtered, split and with hard negatives and remaining sentences
157
+ swim_ir_de_ds = load_dataset('parquet', data_files={'swim-ir-monolingual-de_3hn/0_hard_negatives/*.parquet'}, split="train")
158
+ swim_ir_de_ds = swim_ir_de_ds.train_test_split(test_size=0.02, seed=12)
159
+ swim_ir_de_train_dataset: Dataset = swim_ir_de_ds["train"]
160
+ swim_ir_de_eval_dataset: Dataset = swim_ir_de_ds["test"]
161
+ swim_ir_de_3hn_ds = load_dataset('parquet', data_files={'swim-ir-monolingual-de_3hn/3_hard_negatives/*.parquet'}, split="train")
162
+ swim_ir_de_3hn_ds = swim_ir_de_3hn_ds.train_test_split(test_size=0.02, seed=12)
163
+ swim_ir_de_3hn_train_dataset: Dataset = swim_ir_de_3hn_ds["train"]
164
+ swim_ir_de_3hn_eval_dataset: Dataset = swim_ir_de_3hn_ds["test"]
165
+ #
166
+ swim_ir_de_title_ds = load_dataset('parquet', data_files={'swim-ir-monolingual-titles-de_3hn/0_hard_negatives/*.parquet'}, split="train")
167
+ swim_ir_de_title_3hn_ds = load_dataset('parquet', data_files={'swim-ir-monolingual-titles-de_3hn/3_hard_negatives/*.parquet'}, split="train")
168
+ swim_ir_de_title_ds = swim_ir_de_title_ds.train_test_split(test_size=0.02, seed=12)
169
+ swim_ir_de_title_3hn_ds = swim_ir_de_title_3hn_ds.train_test_split(test_size=0.02, seed=12)
170
+ swim_ir_de_title_train_dataset: Dataset = swim_ir_de_title_ds['train']
171
+ swim_ir_de_title_eval_dataset: Dataset = swim_ir_de_title_ds["test"]
172
+ swim_ir_de_title_3hn_train_dataset: Dataset = swim_ir_de_title_3hn_ds['train']
173
+ swim_ir_de_title_3hn_eval_dataset: Dataset = swim_ir_de_title_3hn_ds['test']
174
+ print("Loaded swim-ir-monolingual-de-scored dataset.")
175
+ #
176
+ print("Loading avemio_triples dataset...")
177
+ # source: https://huggingface.co/datasets/avemio/German-RAG-EMBEDDING-TRIPLES-HESSIAN-AI
178
+ # entries: 294234
179
+ # License: Apache-2.0
180
+ avemio_triples_dataset = load_dataset("avemio/German-RAG-EMBEDDING-TRIPLES-HESSIAN-AI", split="train")
181
+ avemio_triples_dataset_dict = avemio_triples_dataset.train_test_split(test_size=10000, seed=12)
182
+ avemio_triples_train_dataset: Dataset = avemio_triples_dataset_dict["train"]
183
+ avemio_triples_eval_dataset: Dataset = avemio_triples_dataset_dict["test"]
184
+ print("Loaded avemio_triples dataset.")
185
+ #
186
+ print("Loading avemio_pairs-hn dataset...")
187
+ # source: https://huggingface.co/datasets/avemio/German-RAG-EMBEDDING-PAIRS-HESSIAN-AI
188
+ # entries: 1036940
189
+ # License: Apache-2.0
190
+ # Original dataset unused
191
+ #avemio_pairs_dataset = load_dataset("avemio/German-RAG-EMBEDDING-PAIRS-HESSIAN-AI", split="train")
192
+ #avemio_pairs_dataset_dict = avemio_pairs_dataset.train_test_split(test_size=10000, seed=12)
193
+ #avemio_pairs_train_dataset: Dataset = avemio_pairs_dataset_dict["train"]
194
+ #avemio_pairs_eval_dataset: Dataset = avemio_pairs_dataset_dict["test"]
195
+ #
196
+ # filtered, split and with hard negatives and remaining sentences
197
+ avemio_pairs_3hn_ds = load_dataset('parquet', data_files={'German-RAG-EMBEDDING-PAIRS-HESSIAN-AI-3hn-350_3hn/3_hard_negatives/*.parquet', 'German-RAG-EMBEDDING-PAIRS-HESSIAN-AI-3hn-600_3hn/3_hard_negatives/*.parquet', 'German-RAG-EMBEDDING-PAIRS-HESSIAN-AI-3hn-600plus_3hn/3_hard_negatives/*.parquet',}, split="train")
198
+ avemio_pairs_3hn_ds = avemio_pairs_3hn_ds.train_test_split(test_size=10000, seed=12)
199
+ avemio_pairs_3hn_train_ds: Dataset = avemio_pairs_3hn_ds["train"]
200
+ avemio_pairs_3hn_eval_ds: Dataset = avemio_pairs_3hn_ds["test"]
201
+ del avemio_pairs_3hn_ds
202
+ #
203
+ avemio_pairs_0hn_ds = load_dataset('parquet', data_files={'German-RAG-EMBEDDING-PAIRS-HESSIAN-AI-3hn-350_3hn/0_hard_negatives/*.parquet', 'German-RAG-EMBEDDING-PAIRS-HESSIAN-AI-3hn-600_3hn/0_hard_negatives/*.parquet', 'German-RAG-EMBEDDING-PAIRS-HESSIAN-AI-3hn-600plus_3hn/0_hard_negatives/*.parquet',}, split="train")
204
+ avemio_pairs_0hn_ds = avemio_pairs_0hn_ds.train_test_split(test_size=10000, seed=12)
205
+ avemio_pairs_0hn_train_ds: Dataset = avemio_pairs_0hn_ds["train"]
206
+ avemio_pairs_0hn_eval_ds: Dataset = avemio_pairs_0hn_ds["test"]
207
+ del avemio_pairs_0hn_ds
208
+ print("Loaded avemio_pairs-hn dataset.")
209
+ #
210
+ print("Loading nq_german-hn dataset...")
211
+ # source: https://huggingface.co/datasets/oliverguhr/natural-questions-german
212
+ # entries: 100231
213
+ # original source: https://ai.google.com/research/NaturalQuestions
214
+ # License: cc-by-sa-3.0
215
+ # without hard negatives but unused
216
+ #nq_german_dataset = load_dataset("oliverguhr/natural-questions-german", split="train").select_columns(['query_de', 'answer_de'])
217
+ #nq_german_dataset_dict = nq_german_dataset.train_test_split(test_size=0.02, seed=12)
218
+ #nq_german_train_dataset: Dataset = nq_german_dataset_dict["train"]
219
+ #nq_german_eval_dataset: Dataset = nq_german_dataset_dict["test"]
220
+ #
221
+ # filtered, split and with hard negatives and remaining sentences
222
+ nq_german_en_de_a_3hn_ds = load_dataset('parquet', data_files={'natural-questions-german-en_de-a-sts_3hn/3_hard_negatives/*.parquet'}, split="train")
223
+ nq_german_en_de_a_3hn_ds = nq_german_en_de_a_3hn_ds.train_test_split(test_size=0.02, seed=12)
224
+ nq_german_en_de_a_3hn_train_ds: Dataset = nq_german_en_de_a_3hn_ds['train']
225
+ nq_german_en_de_a_3hn_eval_ds: Dataset = nq_german_en_de_a_3hn_ds['test']
226
+ #
227
+ nq_german_en_de_3hn_ds = load_dataset('parquet', data_files={'natural-questions-german-en_de-sts_3hn/3_hard_negatives/*.parquet'}, split="train")
228
+ nq_german_en_de_3hn_ds = nq_german_en_de_3hn_ds.train_test_split(test_size=0.02, seed=12)
229
+ nq_german_en_de_3hn_train_ds: Dataset = nq_german_en_de_3hn_ds['train']
230
+ nq_german_en_de_3hn_eval_ds: Dataset = nq_german_en_de_3hn_ds['test']
231
+ #
232
+ nq_german_3hn_ds = load_dataset('parquet', data_files={'natural-questions-german-sts_3hn/3_hard_negatives/*.parquet'}, split="train")
233
+ nq_german_3hn_ds = nq_german_3hn_ds.train_test_split(test_size=0.02, seed=12)
234
+ nq_german_3hn_train_ds: Dataset = nq_german_3hn_ds['train']
235
+ nq_german_3hn_eval_ds: Dataset = nq_german_3hn_ds['test']
236
+ #
237
+ nq_german_1hn_ds = load_dataset('parquet', data_files={'natural-questions-german-sts_3hn/1_hard_negatives/*.parquet'}, split="train")
238
+ nq_german_1hn_ds = nq_german_1hn_ds.train_test_split(test_size=0.02, seed=12)
239
+ nq_german_1hn_train_ds: Dataset = nq_german_1hn_ds['train']
240
+ nq_german_1hn_eval_ds: Dataset = nq_german_1hn_ds['test']
241
+ print("Loaded nq_german-hn dataset.")
242
+ #
243
+ print("Loading german-oasst1-qa-format-scored dataset...")
244
+ # source: https://huggingface.co/datasets/MarcGrumpyOlejak/german-oasst1-qa-format-scored
245
+ # original: https://huggingface.co/datasets/AgentWaller/german-oasst1-qa-format
246
+ # entries: ~9800
247
+ # License: apache-2.0
248
+ #german_oasst1 = load_dataset("MarcGrumpyOlejak/german-oasst1-qa-format-scored").filter(lambda _: _['score_sts'] >= 0.16 and _['score_sts'] < 0.99)
249
+ #german_oasst1_train_dataset: Dataset = german_oasst1["train"].select_columns(['input', 'output'])
250
+ #german_oasst1_eval_dataset: Dataset = german_oasst1['validation'].select_columns(['input', 'output'])
251
+ #
252
+ name_local = 'german-oasst1-qa-format-hn'
253
+ german_oasst1_hn_train_dataset: Dataset = load_dataset('parquet', data_files={f'{name_local}/3_hard_negatives/train-*.parquet'}, split="train")
254
+ german_oasst1_hn_eval_dataset: Dataset = load_dataset('parquet', data_files={f'{name_local}/3_hard_negatives/test-*.parquet'}, split="train")
255
+ print("Loaded german-oasst1-qa-format-scored dataset.")
256
+ #
257
+ print("Loading germanrag-scored dataset...")
258
+ # source: https://huggingface.co/datasets/MarcGrumpyOlejak/germanrag-scored
259
+ # german original: https://huggingface.co/datasets/DiscoResearch/germanrag
260
+ # original: https://huggingface.co/datasets/deepset/germandpr
261
+ # entries: ~3300
262
+ # filtered & modified: 4556
263
+ # License: cc-by-4.0
264
+ # Hint: one could 'refilter' the 'contexts' down to the selected 'answer' in 'positive_ctx_idx' and use the other answers as hard negatives.
265
+ def list_to_string(_):
266
+ _['contexts'] = ' '.join(_['contexts'])
267
+ return _
268
+ germanrag_short = load_dataset("MarcGrumpyOlejak/germanrag-scored", split='train').filter(lambda _: _['score_sts'] >= 0.16 and _['score_sts'] < 0.98 and _['positive_ctx_idx'] != -1)
269
+ germanrag_context = germanrag_short.select_columns(['answer', 'contexts'])
270
+ germanrag_context = germanrag_context.map(list_to_string)
271
+ germanrag_context = germanrag_context.rename_columns({'answer': 'sentence1', 'contexts': 'sentence2'})
272
+ germanrag_short = germanrag_short.select_columns(['question', 'answer'])
273
+ germanrag_short = germanrag_short.rename_columns({'question': 'sentence1', 'answer': 'sentence2'})
274
+ germanrag_short = concatenate_datasets([germanrag_short, germanrag_context])
275
+ germanrag_short = germanrag_short.train_test_split(test_size=0.02, seed=12)
276
+ germanrag_short_train_dataset: Dataset = germanrag_short["train"]
277
+ germanrag_short_eval_dataset: Dataset = germanrag_short["test"]
278
+ print("Loaded germanrag dataset.")
279
+ #
280
+ print("Loading slimorca_dedup_german_experimental-scored dataset...")
281
+ # source: https://huggingface.co/datasets/MarcGrumpyOlejak/slimorca_dedup_german_experimental-scored
282
+ # german original: https://huggingface.co/datasets/jphme/slimorca_dedup_german_experimental
283
+ # original: https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup
284
+ # entries: ~322000
285
+ # filtered: 305406
286
+ # License: MIT
287
+ # Original set without Hard Negatives unused
288
+ #slimorca_dedup_german = load_dataset("MarcGrumpyOlejak/slimorca_dedup_german_experimental-scored").filter(lambda _: _['score_sts'] >= 0.16 and _['score_sts'] < 0.98)
289
+ #slimorca_dedup_german = slimorca_dedup_german.select_columns(['instruction', 'response'])
290
+ #slimorca_dedup_german = slimorca_dedup_german['train'].train_test_split(test_size=0.02, seed=12)
291
+ #slimorca_dedup_german_train_dataset: Dataset = slimorca_dedup_german["train"]
292
+ #slimorca_dedup_german_eval_dataset: Dataset = slimorca_dedup_german["test"]
293
+ #
294
+ # FILTERED, SPLIT AND WITH HARD NEGATIVES
295
+ slimorca_dedup_3hn_ds = load_dataset('parquet', data_files={'slimorca_dedup_german_experimental-sts-negatives_3hn/3_hard_negatives/*.parquet'}, split="train")
296
+ slimorca_dedup_3hn_ds = slimorca_dedup_3hn_ds.train_test_split(test_size=0.02, seed=12)
297
+ slimorca_dedup_3hn_train_ds: Dataset = slimorca_dedup_3hn_ds['train']
298
+ slimorca_dedup_3hn_eval_ds: Dataset = slimorca_dedup_3hn_ds['test']
299
+ #
300
+ slimorca_dedup_2hn_ds = load_dataset('parquet', data_files={'slimorca_dedup_german_experimental-sts-negatives_3hn/2_hard_negatives/*.parquet'}, split="train")
301
+ slimorca_dedup_2hn_ds = slimorca_dedup_2hn_ds.train_test_split(test_size=0.02, seed=12)
302
+ slimorca_dedup_2hn_train_ds: Dataset = slimorca_dedup_2hn_ds['train']
303
+ slimorca_dedup_2hn_eval_ds: Dataset = slimorca_dedup_2hn_ds['test']
304
+ #
305
+ slimorca_dedup_1hn_ds = load_dataset('parquet', data_files={'slimorca_dedup_german_experimental-sts-negatives_3hn/1_hard_negatives/*.parquet'}, split="train")
306
+ slimorca_dedup_1hn_ds = slimorca_dedup_1hn_ds.train_test_split(test_size=0.02, seed=12)
307
+ slimorca_dedup_1hn_train_ds: Dataset = slimorca_dedup_1hn_ds['train']
308
+ slimorca_dedup_1hn_eval_ds: Dataset = slimorca_dedup_1hn_ds['test']
309
+ #
310
+ slimorca_dedup_0hn_ds = load_dataset('parquet', data_files={'slimorca_dedup_german_experimental-sts-negatives_3hn/0_hard_negatives/*.parquet'}, split="train")
311
+ slimorca_dedup_0hn_ds = slimorca_dedup_0hn_ds.train_test_split(test_size=0.02, seed=12)
312
+ slimorca_dedup_0hn_train_ds: Dataset = slimorca_dedup_0hn_ds['train']
313
+ slimorca_dedup_0hn_eval_ds: Dataset = slimorca_dedup_0hn_ds['test']
314
+ print("Loaded slimorca_dedup_german_experimental-scored dataset.")
315
+ #
316
+ print("Loading gpt-4-self-instruct-german-scored dataset...")
317
+ # source: https://huggingface.co/datasets/MarcGrumpyOlejak/gpt-4-self-instruct-german-scored
318
+ # original: https://huggingface.co/datasets/CausalLM/GPT-4-Self-Instruct-German
319
+ # entries: ~10000
320
+ # filtered: 9776
321
+ # License: CC-BY-4.0
322
+ #german_gpt4 = load_dataset("MarcGrumpyOlejak/gpt-4-self-instruct-german-scored").filter(lambda _: _['score_sts'] >= 0.16 and _['score_sts'] < 0.98).select_columns(['instruction', 'output'])
323
+ #german_gpt4 = german_gpt4['train'].train_test_split(test_size=0.02, seed=12)
324
+ #german_gpt4_train_dataset: Dataset = german_gpt4["train"]
325
+ #german_gpt4_eval_dataset: Dataset = german_gpt4["test"]
326
+ #
327
+ name_local = 'gpt-4-self-instruct-german-hn'
328
+ german_gpt4 = load_dataset('parquet', data_files={f'{name_local}/3_hard_negatives/train-*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
329
+ german_gpt4_3hn_train_dataset: Dataset = german_gpt4["train"]
330
+ german_gpt4_3hn_eval_dataset: Dataset = german_gpt4["test"]
331
+ print("Loaded GPT-4-Self-Instruct-German dataset.")
332
+ #
333
+ print("Loading ultradistil-intel-orca-dpo-de-scored dataset...")
334
+ # source: https://huggingface.co/datasets/MarcGrumpyOlejak/ultradistil-intel-orca-dpo-de-scored
335
+ # original: https://huggingface.co/datasets/aari1995/ultradistil-intel-orca-dpo-de
336
+ # entries: ~6000
337
+ # filtered: ~5547
338
+ # License: apache-2.0
339
+ german_orca_dpo_ds = load_dataset("MarcGrumpyOlejak/ultradistil-intel-orca-dpo-de-scored").filter(lambda _: _['score_sts'] >= 0.16 and _['score_sts'] < 0.98)
340
+ german_orca_dpo_ds = german_orca_dpo_ds.select_columns(['input', 'chosen', 'rejected'])
341
+ german_orca_dpo_ds = german_orca_dpo_ds['train'].train_test_split(test_size=0.02, seed=12)
342
+ german_orca_dpo_train_dataset: Dataset = german_orca_dpo_ds["train"]
343
+ german_orca_dpo_eval_dataset: Dataset = german_orca_dpo_ds["test"]
344
+ print("Loaded ultradistil-intel-orca-dpo-de-scored dataset.")
345
+ #
346
+ #scored version of alpaca-gpt4_de-scored
347
+ print("Loading alpaca-gpt4_de-scored dataset...")
348
+ # source: https://huggingface.co/datasets/MarcGrumpyOlejak/alpaca-gpt4_de-scored
349
+ # german original: https://huggingface.co/datasets/mayflowergmbh/alpaca-gpt4_de
350
+ # original: https://huggingface.co/datasets/FreedomIntelligence/alpaca-gpt4-deutsch
351
+ # entries: ~50000
352
+ # filtered ~44845
353
+ # License: apache-2.0
354
+ # Original unused
355
+ #alpaca_gpt4_de_ds = load_dataset("MarcGrumpyOlejak/alpaca-gpt4_de-scored").filter(lambda _: _['score_sts'] >= 0.16 and _['score_sts'] < 0.94)
356
+ #alpaca_gpt4_de_ds = alpaca_gpt4_de_ds.select_columns(['instruction', 'output'])
357
+ #alpaca_gpt4_de_ds = alpaca_gpt4_de_ds['train'].train_test_split(test_size=0.02, seed=12)
358
+ #alpaca_gpt4_de_train_dataset: Dataset = alpaca_gpt4_de_ds["train"]
359
+ #alpaca_gpt4_de_eval_dataset: Dataset = alpaca_gpt4_de_ds["test"]
360
+ # filtered and hard negatives
361
+ alpaca_gpt4_de_3hn_ds = load_dataset('parquet', data_files={'alpaca-gpt4_de_3hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
362
+ alpaca_gpt4_de_3hn_train_dataset: Dataset = alpaca_gpt4_de_3hn_ds['train']
363
+ alpaca_gpt4_de_3hn_eval_dataset: Dataset = alpaca_gpt4_de_3hn_ds['test']
364
+ alpaca_gpt4_de_0hn_ds = load_dataset('parquet', data_files={'alpaca-gpt4_de_3hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
365
+ alpaca_gpt4_de_0hn_train_dataset: Dataset = alpaca_gpt4_de_0hn_ds['train']
366
+ alpaca_gpt4_de_0hn_eval_dataset: Dataset = alpaca_gpt4_de_0hn_ds['test']
367
+ print("Loaded alpaca-gpt4_de dataset.")
368
+ #
369
+ print("Loading DOLLY-15k (en-de) dataset...")
370
+ # source: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual
371
+ # entries: ~15000
372
+ # License: cc-by-sa-3.0
373
+ # Original combined merged dataset unused
374
+ #db_dolly = load_dataset("argilla/databricks-dolly-15k-curated-multilingual", split="de")
375
+ #db_dolly_en_de_inststruction = db_dolly.select_columns(['instruction_original_en', 'instruction']).filter(lambda _: _['instruction_original_en'] != "" and _['instruction'] != '')
376
+ #db_dolly_en_de_inststruction = db_dolly_en_de_inststruction.rename_columns({'instruction_original_en': 'sentence1', 'instruction': 'sentence2'})
377
+ #db_dolly_en_de_context = db_dolly.select_columns(['context_original_en', 'context']).filter(lambda _: _['context_original_en'] != "" and _['context'] != '')
378
+ #db_dolly_en_de_context = db_dolly_en_de_context.rename_columns({'context_original_en': 'sentence1', 'context': 'sentence2'})
379
+ #db_dolly_en_de_response = db_dolly.select_columns(['response_original_en', 'response']).filter(lambda _: _['response_original_en'] != "" and _['response'] != '')
380
+ #db_dolly_en_de_response = db_dolly_en_de_response.rename_columns({'response_original_en': 'sentence1', 'response': 'sentence2'})
381
+ #db_dolly_qa_de = db_dolly.select_columns(['instruction', 'response']).filter(lambda _: _['instruction'] != "" and _['response'] != '')
382
+ #db_dolly_qa_de = db_dolly_qa_de.rename_columns({'instruction': 'sentence1', 'response': 'sentence2'})
383
+ #db_dolly_qcontext_de = db_dolly.select_columns(['response', 'context']).filter(lambda _: _['response'] != "" and _['context'] != '')
384
+ #db_dolly_qcontext_de = db_dolly_qcontext_de.rename_columns({'response': 'sentence1', 'context': 'sentence2'})
385
+ #db_dolly_contextq_de = db_dolly.select_columns(['context', 'instruction']).filter(lambda _: _['context'] != "" and _['instruction'] != '')
386
+ #db_dolly_contextq_de = db_dolly_contextq_de.rename_columns({'context': 'sentence1', 'instruction': 'sentence2'})
387
+ # concat all small tables
388
+ #db_dolly = concatenate_datasets([db_dolly_en_de_inststruction, db_dolly_en_de_context, db_dolly_en_de_response, db_dolly_qa_de, db_dolly_qcontext_de, db_dolly_contextq_de])
389
+ #db_dolly_ds = db_dolly.train_test_split(test_size=0.02, seed=12)
390
+ #db_dolly_train_dataset: Dataset = db_dolly_ds["train"]
391
+ #db_dolly_eval_dataset: Dataset = db_dolly_ds["test"]
392
+ #
393
+ # hard negative versions and remaining sentences
394
+ dolly_context_de_3hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/context-de-hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
395
+ dolly_context_de_3hn_train_ds: Dataset = dolly_context_de_3hn_ds['train']
396
+ dolly_context_de_3hn_eval_ds: Dataset = dolly_context_de_3hn_ds['test']
397
+ dolly_context_de_0hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/context-de-hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
398
+ dolly_context_de_0hn_train_ds: Dataset = dolly_context_de_0hn_ds['train']
399
+ dolly_context_de_0hn_eval_ds: Dataset = dolly_context_de_0hn_ds['test']
400
+ dolly_context_ende_3hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/context-en_de-hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
401
+ dolly_context_ende_3hn_train_ds: Dataset = dolly_context_ende_3hn_ds['train']
402
+ dolly_context_ende_3hn_eval_ds: Dataset = dolly_context_ende_3hn_ds['test']
403
+ # the next set is empty :D
404
+ #dolly_context_ende_0hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/context-en_de-hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
405
+ dolly_instructions_de_3hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/instructions-de-hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
406
+ dolly_instructions_de_3hn_train_ds: Dataset = dolly_instructions_de_3hn_ds['train']
407
+ dolly_instructions_de_3hn_eval_ds: Dataset = dolly_instructions_de_3hn_ds['test']
408
+ dolly_instructions_de_0hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/instructions-de-hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
409
+ dolly_instructions_de_0hn_train_ds: Dataset = dolly_instructions_de_0hn_ds['train']
410
+ dolly_instructions_de_0hn_eval_ds: Dataset = dolly_instructions_de_0hn_ds['test']
411
+ dolly_instructions_ende_3hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/instructions-en_de-hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
412
+ dolly_instructions_ende_3hn_train_ds: Dataset = dolly_instructions_ende_3hn_ds['train']
413
+ dolly_instructions_ende_3hn_eval_ds: Dataset = dolly_instructions_ende_3hn_ds['test']
414
+ dolly_instructions_ende_0hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/instructions-en_de-hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
415
+ dolly_instructions_ende_0hn_train_ds: Dataset = dolly_instructions_ende_0hn_ds['train']
416
+ dolly_instructions_ende_0hn_eval_ds: Dataset = dolly_instructions_ende_0hn_ds['test']
417
+ dolly_responses_de_3hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/response-de-hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
418
+ dolly_responses_de_3hn_train_ds: Dataset = dolly_responses_de_3hn_ds['train']
419
+ dolly_responses_de_3hn_eval_ds: Dataset = dolly_responses_de_3hn_ds['test']
420
+ dolly_responses_de_0hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/response-de-hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
421
+ dolly_responses_de_0hn_train_ds: Dataset = dolly_responses_de_0hn_ds['train']
422
+ dolly_responses_de_0hn_eval_ds: Dataset = dolly_responses_de_0hn_ds['test']
423
+ dolly_responses_ende_3hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/response-en_de-hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
424
+ dolly_responses_ende_3hn_train_ds: Dataset = dolly_responses_ende_3hn_ds['train']
425
+ dolly_responses_ende_3hn_eval_ds: Dataset = dolly_responses_ende_3hn_ds['test']
426
+ dolly_responses_ende_0hn_ds = load_dataset('parquet', data_files={'databricks-dolly-15k-curated-de/response-en_de-hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
427
+ dolly_responses_ende_0hn_train_ds: Dataset = dolly_responses_ende_0hn_ds['train']
428
+ dolly_responses_ende_0hn_eval_ds: Dataset = dolly_responses_ende_0hn_ds['test']
429
+ print("Loaded DOLLY-15k (en-de) dataset.")
430
+ #
431
+ print("Loading 'saf-legal_domain_german' dataset...")
432
+ # source: https://huggingface.co/datasets/Short-Answer-Feedback/saf_legal_domain_german
433
+ # License: CC-BY-4.0
434
+ # entries: ~1600
435
+ # filtered: ~1100 (score >= 0.75) and recombined
436
+ saf_legal_de_train = load_dataset("Short-Answer-Feedback/saf_legal_domain_german", split="train").filter(lambda _: _['score'] >= 0.75)
437
+ saf_legal_de_qa_train = saf_legal_de_train.select_columns(['question', 'provided_answer']).rename_columns({'question': 'sentence1', 'provided_answer': 'sentence2'})
438
+ saf_legal_de_a_train = saf_legal_de_train.select_columns(['provided_answer', 'reference_answer']).rename_columns({'provided_answer': 'sentence1', 'reference_answer': 'sentence2'})
439
+ saf_legal_de_train_ds: Dataset = concatenate_datasets([saf_legal_de_qa_train, saf_legal_de_a_train])
440
+ # Loading & Preparing validation set
441
+ saf_legal_de_eval = load_dataset("Short-Answer-Feedback/saf_legal_domain_german", split="validation").filter(lambda _: _['score'] >= 0.75)
442
+ saf_legal_de_qa_eval = saf_legal_de_eval.select_columns(['question', 'provided_answer']).rename_columns({'question': 'sentence1', 'provided_answer': 'sentence2'})
443
+ saf_legal_de_a_eval = saf_legal_de_eval.select_columns(['provided_answer', 'reference_answer']).rename_columns({'provided_answer': 'sentence1', 'reference_answer': 'sentence2'})
444
+ saf_legal_de_eval_ds: Dataset = concatenate_datasets([saf_legal_de_qa_eval, saf_legal_de_a_eval])
445
+ print("Loaded 'saf-legal_domain_german' dataset.")
446
+ #
447
+ print("Loading GLS dataset...")
448
+ # German Legal Sentences (GLS)
449
+ # source: https://huggingface.co/datasets/lavis-nlp/german_legal_sentences
450
+ # https://lavis-nlp.github.io/german_legal_sentences/
451
+ # uses "custom code": https://huggingface.co/datasets/lavis-nlp/german_legal_sentences/blob/main/german_legal_sentences.py
452
+ # License: MIT - see https://github.com/lavis-nlp/GerDaLIR
453
+ # Original License: https://github.com/openlegaldata/oldp#MIT-1-ov-file
454
+ # interesting fields: query.text, related.text
455
+ # entries: 1404271
456
+ #
457
+ # Original unused
458
+ #gls_pairs_dataset_dict = load_dataset("lavis-nlp/german_legal_sentences", "pairs").select_columns(['query.text', 'related.text'])
459
+ #gls_pairs_train_dataset: Dataset = gls_pairs_dataset_dict["train"]
460
+ #gls_pairs_eval_dataset: Dataset = gls_pairs_dataset_dict["validation"]
461
+ #
462
+ # Distilled and hard mined negatives
463
+ gls_3hn = load_dataset('parquet', data_files={'german_legal_sentences_dist_3hn/3_hard_negatives/*.parquet'})['train'].train_test_split(test_size=0.02, seed=12)
464
+ gls_3hn_train_dataset: Dataset = gls_3hn['train']
465
+ gls_3hn_eval_dataset: Dataset = gls_3hn['test']
466
+ gls_2hn = load_dataset('parquet', data_files={'german_legal_sentences_dist_3hn/2_hard_negatives/*.parquet'})['train'].train_test_split(test_size=0.02, seed=12)
467
+ gls_2hn_train_dataset: Dataset = gls_2hn['train']
468
+ gls_2hn_eval_dataset: Dataset = gls_2hn['test']
469
+ gls_1hn = load_dataset('parquet', data_files={'german_legal_sentences_dist_3hn/1_hard_negatives/*.parquet'})['train'].train_test_split(test_size=0.02, seed=12)
470
+ gls_1hn_train_dataset: Dataset = gls_1hn['train']
471
+ gls_1hn_eval_dataset: Dataset = gls_1hn['test']
472
+ gls_0hn = load_dataset('parquet', data_files={'german_legal_sentences_dist_3hn/0_hard_negatives/*.parquet'})['train'].train_test_split(test_size=0.02, seed=12)
473
+ gls_0hn_train_dataset: Dataset = gls_0hn['train']
474
+ gls_0hn_eval_dataset: Dataset = gls_0hn['test']
475
+ print("Loaded GLS dataset.")
476
+ #
477
+ print("Loading europarl EN-DE dataset...")
478
+ # source: https://huggingface.co/datasets/sentence-transformers/parallel-sentences-europarl
479
+ # original: https://opus.nlpl.eu/Europarl/corpus/version/Europarl
480
+ # Info: https://opus.nlpl.eu/legacy/LREC2012.txt
481
+ # entries: ~1.9m
482
+ #europarl_dataset = load_dataset("sentence-transformers/parallel-sentences-europarl", "en-de", split="train")
483
+ #europarl_dataset_dict = europarl_dataset.train_test_split(test_size=10000, seed=12)
484
+ #europarl_train_dataset: Dataset = europarl_dataset_dict["train"]
485
+ #europarl_eval_dataset: Dataset = europarl_dataset_dict["test"]
486
+ #
487
+ # filtered and 3 hard negatives and 0 negatives
488
+ europarl_dataset_3hn = load_dataset('parquet', data_files={'parallel-sentences-europarl-redux_3hn/3_hard_negatives/*.parquet'})['train'].train_test_split(test_size=10000, seed=12)
489
+ europarl_3hn_train_dataset: Dataset = europarl_dataset_3hn["train"]
490
+ europarl_3hn_eval_dataset: Dataset = europarl_dataset_3hn["test"]
491
+ #
492
+ europarl_dataset_0hn = load_dataset('parquet', data_files={'parallel-sentences-europarl-redux_3hn/0_hard_negatives/*.parquet'})['train'].train_test_split(test_size=0.02, seed=12)
493
+ europarl_0hn_train_dataset: Dataset = europarl_dataset_0hn["train"]
494
+ europarl_0hn_eval_dataset: Dataset = europarl_dataset_0hn["test"]
495
+ print("Loaded europarl EN-DE dataset.")
496
+ #
497
+ print("Loading tatoeba EN-DE dataset...")
498
+ # source: https://huggingface.co/datasets/sentence-transformers/parallel-sentences-tatoeba
499
+ # original: https://tatoeba.org/
500
+ # entries: ~330k
501
+ #tatoeba_dataset = load_dataset("sentence-transformers/parallel-sentences-tatoeba", "en-de", split="train")
502
+ #tatoeba_dataset_dict = tatoeba_dataset.train_test_split(test_size=10000, seed=12)
503
+ #tatoeba_train_dataset: Dataset = tatoeba_dataset_dict["train"]
504
+ #tatoeba_eval_dataset: Dataset = tatoeba_dataset_dict["test"]
505
+ #
506
+ tatoeba_dataset_3hn = load_dataset('parquet', data_files={'parallel-sentences-tatoeba-en-de-hn/3_hard_negatives/*.parquet'})['train'].train_test_split(test_size=10000, seed=12)
507
+ tatoeba_3hn_train_dataset: Dataset = tatoeba_dataset_3hn["train"]
508
+ tatoeba_3hn_eval_dataset: Dataset = tatoeba_dataset_3hn["test"]
509
+ #
510
+ tatoeba_dataset_0hn = load_dataset('parquet', data_files={'parallel-sentences-tatoeba-en-de-hn/0_hard_negatives/*.parquet'})['train'].train_test_split(test_size=0.02, seed=12)
511
+ tatoeba_0hn_train_dataset: Dataset = tatoeba_dataset_0hn["train"]
512
+ tatoeba_0hn_eval_dataset: Dataset = tatoeba_dataset_0hn["test"]
513
+ print("Loaded tatoeba EN-DE dataset.")
514
+ #
515
+ print("Loading WikiMatrix EN-DE dataset...")
516
+ # source: (EN-DE) https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikimatrix
517
+ # License: CC BY-SA 4.0
518
+ # entries: ~344k
519
+ # Original dataset not used
520
+ #wikimatrix_dataset = load_dataset("sentence-transformers/parallel-sentences-wikimatrix", "en-de", split="train")
521
+ #wikimatrix_dataset_dict = wikimatrix_dataset.train_test_split(test_size=10000, seed=12)
522
+ #wikimatrix_train_dataset: Dataset = wikimatrix_dataset_dict["train"]
523
+ #wikimatrix_eval_dataset: Dataset = wikimatrix_dataset_dict["test"]
524
+ #
525
+ # scored and filtered hard negative version and remaining sentences
526
+ wikimatrix_3hn_ds = load_dataset('parquet', data_files={'parallel-sentences-wikimatrix-hn_3hn/3_hard_negatives/train-*.parquet'}, split='train')
527
+ wikimatrix_3hn_ds = wikimatrix_3hn_ds.train_test_split(test_size=10000, seed=12)
528
+ wikimatrix_3hn_train_ds: Dataset = wikimatrix_3hn_ds["train"]
529
+ wikimatrix_3hn_eval_ds: Dataset = wikimatrix_3hn_ds["test"]
530
+ #
531
+ wikimatrix_0hn_ds = load_dataset('parquet', data_files={'parallel-sentences-wikimatrix-hn_3hn/0_hard_negatives/train-*.parquet'}, split='train')
532
+ wikimatrix_0hn_ds = wikimatrix_0hn_ds.train_test_split(test_size=0.02, seed=12)
533
+ wikimatrix_0hn_train_ds: Dataset = wikimatrix_0hn_ds["train"]
534
+ wikimatrix_0hn_eval_ds: Dataset = wikimatrix_0hn_ds["test"]
535
+ #
536
+ print("Loaded WikiMatrix EN-DE dataset.")
537
+ #
538
+ print("Loading Wikipedia-Abstract DE dataset...")
539
+ # source: https://huggingface.co/datasets/laion/Wikipedia-Abstract
540
+ # License: MIT
541
+ # entries: 2.57M
542
+ # comment: relicensing Wikipedia text to MIT is a bit unusual, as it was Creative Commons Attribution-ShareAlike 4.0 and/or GNU Free Documentation License
543
+ # original version unused
544
+ #wikipedia_abstract_ds = load_dataset("laion/Wikipedia-Abstract", "German", split="train").select_columns(['Title', 'Abstract'])
545
+ #wikipedia_abstract_ds = wikipedia_abstract_ds.train_test_split(test_size=10000, seed=12)
546
+ #wikipedia_abstract_train_dataset: Dataset = wikipedia_abstract_ds["train"]
547
+ #wikipedia_abstract_eval_dataset: Dataset = wikipedia_abstract_ds["test"]
548
+ #
549
+ # hard negative version and remaining sentences
550
+ wikipedia_abstract_3hn_ds = load_dataset('parquet', data_files={'Wikipedia-Abstract-distilled_3hn/3_hard_negatives/train-*.parquet'}, split='train')
551
+ wikipedia_abstract_3hn_ds = wikipedia_abstract_3hn_ds.train_test_split(test_size=10000, seed=12)
552
+ wikipedia_abstract_3hn_train_dataset: Dataset = wikipedia_abstract_3hn_ds["train"]
553
+ wikipedia_abstract_3hn_eval_dataset: Dataset = wikipedia_abstract_3hn_ds["test"]
554
+ #
555
+ wikipedia_abstract_0hn_ds = load_dataset('parquet', data_files={'Wikipedia-Abstract-distilled_3hn/0_hard_negatives/train-*.parquet'}, split='train')
556
+ wikipedia_abstract_0hn_ds = wikipedia_abstract_0hn_ds.train_test_split(test_size=0.02, seed=12)
557
+ wikipedia_abstract_0hn_train_dataset: Dataset = wikipedia_abstract_0hn_ds["train"]
558
+ wikipedia_abstract_0hn_eval_dataset: Dataset = wikipedia_abstract_0hn_ds["test"]
559
+ print("Loaded Wikipedia-Abstract DE dataset.")
560
+ #
561
+ print("Loading wiktionary GDG-D DE dataset...")
562
+ # source: https://huggingface.co/jfeil/GermanDefinitionGeneration-Distillation
563
+ # License: gpl-3.0
564
+ # entries: ~900k
565
+ #
566
+ # GermanDefinitionGeneration-Distillation_3hn
567
+ wiktionary_gdg_de_3hn_train_ds: Dataset = load_dataset('parquet', data_files={'GermanDefinitionGeneration-Distillation_3hn/3_hard_negatives/train-*.parquet'}, split='train')
568
+ wiktionary_gdg_de_3hn_eval_ds: Dataset = load_dataset('parquet', data_files={'GermanDefinitionGeneration-Distillation_3hn/3_hard_negatives/validation-*.parquet'}, split='train')
569
+ #
570
+ # still needs optimisation
571
+ wiktionary_gdg_de_short_ds = load_dataset("jfeil/GermanDefinitionGeneration-Distillation")
572
+ wiktionary_gdg_de_short_ds = wiktionary_gdg_de_short_ds.select_columns(['context_sentence', 'title'])
573
+ wiktionary_gdg_de_short_train_dataset: Dataset = wiktionary_gdg_de_short_ds["train"]
574
+ wiktionary_gdg_de_short_eval_dataset: Dataset = wiktionary_gdg_de_short_ds["test"]
575
+ print("Loaded GDG-D DE dataset.")
576
+ #
577
+ print("Loading wmt24pp dataset...")
578
+ # source: https://huggingface.co/datasets/google/wmt24pp
579
+ # License: Apache-2.0
580
+ # interesting fields: source, target
581
+ # entries: 960 (after filtering of 'is_bad_source')
582
+ wmt24pp_dataset = load_dataset("google/wmt24pp", "en-de_DE", split="train").filter(lambda _: _["is_bad_source"] == False)
583
+ wmt24pp_dataset = wmt24pp_dataset.select_columns(['source', 'target'])
584
+ wmt24pp_dataset_dict = wmt24pp_dataset.train_test_split(test_size=0.02, seed=12)
585
+ wmt24pp_train_dataset: Dataset = wmt24pp_dataset_dict["train"]
586
+ wmt24pp_eval_dataset: Dataset = wmt24pp_dataset_dict["test"]
587
+ print("Loaded wmt24pp dataset.")
588
+ #
589
+ print("Loading synthia_german_experimental dataset...")
590
+ # source: https://huggingface.co/datasets/jphme/synthia_german_experimental
591
+ # original: https://huggingface.co/datasets/migtissera/Synthia-v1.3
592
+ # License: Apache-2.0
593
+ # interesting fields: instruction, response
594
+ # entries: ~100000
595
+ # final: 14453
596
+ # notes: filtered on scores, take only if all scores are "3" (best).
597
+ synthia_de_ds = load_dataset("jphme/synthia_german_experimental", split="train").filter(lambda _: _["score_deutsch"] == 3 and _["score_antwort"] == 3)
598
+ synthia_de_ds = synthia_de_ds.select_columns(["instruction", "response"])
599
+ synthia_de_ds = synthia_de_ds.train_test_split(test_size=0.02, seed=12)
600
+ synthia_de_train_dataset: Dataset = synthia_de_ds["train"]
601
+ synthia_de_eval_dataset: Dataset = synthia_de_ds["test"]
602
+ print("Loaded synthia_german_experimental dataset.")
603
+ #
604
+ print("Loading ger-backtrans-paraphrase dataset...")
605
+ # source: https://huggingface.co/datasets/deutsche-telekom/ger-backtrans-paraphrase
606
+ # License: CC-BY-SA-4.0
607
+ # entries: 21292789
608
+ # filtered: 862574 (tokens >= 25, cos_sim >=0.9)
609
+ # filtered: ~2.1M (tokens >= 17, cos_sim >=0.8) (once a try - results were really bad)
610
+ # notes: also thanks to Daniel Heinze for more filter examples
611
+ # source: https://huggingface.co/datasets/danielheinz/telekom-backtrans-paraphrase-filtered
612
+ # original dataset without hard negatives unused
613
+ #telekom_gbp_dataset = load_dataset("deutsche-telekom/ger-backtrans-paraphrase", split="train")
614
+ #telekom_gbp_dataset = telekom_gbp_dataset.filter(lambda _: _["cos_sim"] >= 0.9 and _["cos_sim"] < 0.999 and _["jaccard_similarity"] >= 0.3 and _["en_de_token_count"] >= 25 and _["de_token_count"] >= 25)
615
+ #telekom_gbp_dataset = telekom_gbp_dataset.select_columns(['en', 'de', 'en_de'])
616
+ # make a copy - but only with 'en_de' and 'de'
617
+ #telekom_gbp_ende_dataset = telekom_gbp_dataset.select_columns(['en_de', 'de'])
618
+ # build the 'original' set
619
+ #telekom_gbp_dataset_dict = telekom_gbp_dataset.train_test_split(test_size=0.05, seed=12)
620
+ #telekom_gbp_train_dataset: Dataset = telekom_gbp_dataset_dict["train"]
621
+ #telekom_gbp_eval_dataset: Dataset = telekom_gbp_dataset_dict["test"]
622
+ # now build a second set of 'bad' to 'good'
623
+ #telekom_gbp_ende_dataset_dict = telekom_gbp_ende_dataset.train_test_split(test_size=0.05, seed=12)
624
+ #telekom_gbp_ende_train_dataset: Dataset = telekom_gbp_ende_dataset_dict["train"]
625
+ #telekom_gbp_ende_eval_dataset: Dataset = telekom_gbp_ende_dataset_dict["test"]
626
+ #
627
+ # FILTERED, SPLIT AND WITH HARD NEGATIVES
628
+ gbp_3hn_ds = load_dataset('parquet', data_files={'ger-backtrans-paraphrase-350c-sts_3hn/3_hard_negatives/*.parquet'}, split="train")
629
+ gbp_3hn_add_ds = load_dataset('parquet', data_files={'ger-backtrans-paraphrase-200c-sts_3hn/3_hard_negatives/*.parquet'}, split="train")
630
+ gbp_3hn_ds = concatenate_datasets([gbp_3hn_ds, gbp_3hn_add_ds])
631
+ gbp_3hn_ds = gbp_3hn_ds.train_test_split(test_size=0.02, seed=12)
632
+ gbp_3hn_train_ds: Dataset = gbp_3hn_ds['train']
633
+ gbp_3hn_eval_ds: Dataset = gbp_3hn_ds['test']
634
+ #
635
+ gbp_0hn_ds = load_dataset('parquet', data_files={'ger-backtrans-paraphrase-350c-sts_3hn/0_hard_negatives/*.parquet'}, split="train")
636
+ gbp_0hn_add_ds = load_dataset('parquet', data_files={'ger-backtrans-paraphrase-200c-sts_3hn/0_hard_negatives/*.parquet'}, split="train")
637
+ gbp_0hn_ds = concatenate_datasets([gbp_0hn_ds, gbp_0hn_add_ds])
638
+ gbp_0hn_add_ds = load_dataset('parquet', data_files={'ger-backtrans-paraphrase-150c-sts_3hn/0_hard_negatives/*.parquet'}, split="train")
639
+ gbp_0hn_ds = concatenate_datasets([gbp_0hn_ds, gbp_0hn_add_ds])
640
+ gbp_0hn_ds = gbp_0hn_ds.train_test_split(test_size=0.02, seed=12)
641
+ gbp_0hn_train_ds: Dataset = gbp_0hn_ds['train']
642
+ gbp_0hn_eval_ds: Dataset = gbp_0hn_ds['test']
643
+ #
644
+ gbp_ende_3hn_ds = load_dataset('parquet', data_files={'ger-backtrans-paraphrase-en_de-350c-sts_3hn/3_hard_negatives/*.parquet'}, split="train")
645
+ gbp_ende_3hn_add_ds = load_dataset('parquet', data_files={'ger-backtrans-paraphrase-en_de-200c-sts_3hn/3_hard_negatives/*.parquet'}, split="train")
646
+ gbp_ende_3hn_ds = concatenate_datasets([gbp_ende_3hn_ds, gbp_ende_3hn_add_ds])
647
+ gbp_ende_3hn_add_ds = load_dataset('parquet', data_files={'ger-backtrans-paraphrase-en_de-150c-sts_3hn/3_hard_negatives/*.parquet'}, split="train")
648
+ gbp_ende_3hn_ds = concatenate_datasets([gbp_ende_3hn_ds, gbp_ende_3hn_add_ds])
649
+ gbp_ende_3hn_ds = gbp_ende_3hn_ds.train_test_split(test_size=0.02, seed=12)
650
+ gbp_ende_3hn_train_ds: Dataset = gbp_ende_3hn_ds['train']
651
+ gbp_ende_3hn_eval_ds: Dataset = gbp_ende_3hn_ds['test']
652
+ #
653
+ gbp_ende_0hn_ds = load_dataset('parquet', data_files={'ger-backtrans-paraphrase-en_de-350c-sts_3hn/0_hard_negatives/*.parquet'}, split="train")
654
+ gbp_ende_0hn_add_ds = load_dataset('parquet', data_files={'ger-backtrans-paraphrase-en_de-200c-sts_3hn/0_hard_negatives/*.parquet'}, split="train")
655
+ gbp_ende_0hn_ds = concatenate_datasets([gbp_ende_0hn_ds, gbp_ende_0hn_add_ds])
656
+ gbp_ende_0hn_add_ds = load_dataset('parquet', data_files={'ger-backtrans-paraphrase-en_de-150c-sts_3hn/0_hard_negatives/*.parquet'}, split="train")
657
+ gbp_ende_0hn_ds = concatenate_datasets([gbp_ende_0hn_ds, gbp_ende_0hn_add_ds])
658
+ gbp_ende_0hn_ds = gbp_ende_0hn_ds.train_test_split(test_size=0.02, seed=12)
659
+ gbp_ende_0hn_train_ds: Dataset = gbp_ende_0hn_ds['train']
660
+ gbp_ende_0hn_eval_ds: Dataset = gbp_ende_0hn_ds['test']
661
+ print("Loaded ger-backtrans-paraphrase dataset.")
662
+ #
663
+ print("Loading STSb Multi MT (de) dataset...")
664
+ # source: https://huggingface.co/datasets/PhilipMay/stsb_multi_mt
665
+ # License: CC-BY-SA-4.0 - https://github.com/PhilipMay/stsb-multi-mt/blob/main/LICENSE
666
+ # Original: https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark
667
+ # entries: 5749
668
+ #stbs_de_dataset = load_dataset("PhilipMay/stsb_multi_mt", "de").filter(lambda _: _["similarity_score"] >= 1 and _["similarity_score"] < 5)
669
+ #stbs_de_dataset = stbs_de_dataset.select_columns(['sentence1', 'sentence2'])
670
+ #stbs_de_train_dataset: Dataset = stbs_de_dataset["train"]
671
+ #stbs_de_eval_dataset: Dataset = stbs_de_dataset["dev"]
672
+ #
673
+ stbs_de_3hn_train_dataset = load_dataset('parquet', data_files={'stsb_multi_mt-de-hn/3_hard_negatives/train*.parquet'}, split="train")
674
+ stbs_de_3hn_eval_dataset = load_dataset('parquet', data_files={'stsb_multi_mt-de-hn/3_hard_negatives/test*.parquet'}, split="train")
675
+ print("Loaded STSb Multi MT (de) dataset.")
676
+ #
677
+ print("Loading STSb Multi MT (en) dataset...")
678
+ # source: https://huggingface.co/datasets/PhilipMay/stsb_multi_mt
679
+ # License: CC-BY-SA-4.0 - https://github.com/PhilipMay/stsb-multi-mt/blob/main/LICENSE
680
+ # Original: https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark
681
+ # entries: 5749
682
+ #stbs_en_dataset = load_dataset("PhilipMay/stsb_multi_mt", "en").filter(lambda _: _["similarity_score"] >= 1 and _["similarity_score"] < 5)
683
+ #stbs_en_dataset = stbs_en_dataset.select_columns(['sentence1', 'sentence2'])
684
+ #stbs_en_train_dataset: Dataset = stbs_en_dataset["train"]
685
+ #stbs_en_eval_dataset: Dataset = stbs_en_dataset["dev"]
686
+ #
687
+ stbs_en_3hn_train_dataset = load_dataset('parquet', data_files={'stsb_multi_mt-en-hn/3_hard_negatives/train*.parquet'}, split="train")
688
+ stbs_en_3hn_eval_dataset = load_dataset('parquet', data_files={'stsb_multi_mt-en-hn/3_hard_negatives/test*.parquet'}, split="train")
689
+ print("Loaded STSb Multi MT (en) dataset.")
690
+ #
691
+ print("Loading paws-x (de) dataset...")
692
+ # source: https://huggingface.co/datasets/google-research-datasets/paws-x
693
+ # License: Other - https://github.com/google-research-datasets/paws/blob/master/LICENSE
694
+ # License note: The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated.
695
+ # entries: 49401
696
+ # Info: filtered only for "true" answers (["label"] == 1)
697
+ pawsx_de_dataset = load_dataset("google-research-datasets/paws-x", "de").filter(lambda _: _["label"] == 1)
698
+ pawsx_de_dataset = pawsx_de_dataset.select_columns(['sentence1', 'sentence2'])
699
+ pawsx_de_train_dataset: Dataset = pawsx_de_dataset["train"]
700
+ pawsx_de_eval_dataset: Dataset = pawsx_de_dataset["validation"]
701
+
702
+ print("Loaded paws-x (de) dataset.")
703
+ #
704
+ print("Loading paws-x (en) dataset...")
705
+ # source: https://huggingface.co/datasets/google-research-datasets/paws-x
706
+ # License: Other - https://github.com/google-research-datasets/paws/blob/master/LICENSE
707
+ # entries: 49401
708
+ pawsx_en_dataset = load_dataset("google-research-datasets/paws-x", "en").filter(lambda _: _["label"] == 1)
709
+ pawsx_en_dataset = pawsx_en_dataset.select_columns(['sentence1', 'sentence2'])
710
+ pawsx_en_train_dataset: Dataset = pawsx_en_dataset["train"]
711
+ pawsx_en_eval_dataset: Dataset = pawsx_en_dataset["validation"]
712
+ print("Loaded paws-x (en) dataset.")
713
+ #
714
+ print("Loading all NLI-26lang-2mil7 (local) datasets...")
715
+ # source: https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7
716
+ # License: MIT
717
+ # License-source: https://github.com/easonnie/combine-FEVER-NSMN
718
+ # entries: 25000
719
+ # info: 'label' – entailment (0), neutral (1), contradiction (2).
720
+ # for simple translations
721
+ main_name = 'multilingual-NLI-26lang-2mil7'
722
+ language = 'de'
723
+ entail = 'de_entailment'
724
+ transl = 'en_de'
725
+ subset = 'anli'
726
+ # anli entailments 3hn - de_anli_entail_3hn_train_ds
727
+ de_anli_entail_3hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{entail}_hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
728
+ de_anli_entail_3hn_train_ds: Dataset = de_anli_entail_3hn_ds['train']
729
+ de_anli_entail_3hn_eval_ds: Dataset = de_anli_entail_3hn_ds['test']
730
+ # anli entailments 0hn - de_anli_entail_0hn_train_ds
731
+ de_anli_entail_0hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{entail}_hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
732
+ de_anli_entail_0hn_train_ds: Dataset = de_anli_entail_0hn_ds['train']
733
+ de_anli_entail_0hn_eval_ds: Dataset = de_anli_entail_0hn_ds['test']
734
+ # anli translation 3hn - de_anli_transl_3hn_train_ds
735
+ de_anli_transl_3hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{transl}_hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
736
+ de_anli_transl_3hn_train_ds: Dataset = de_anli_transl_3hn_ds['train']
737
+ de_anli_transl_3hn_eval_ds: Dataset = de_anli_transl_3hn_ds['test']
738
+ # anli translation 0hn - de_anli_transl_0hn_train_ds
739
+ de_anli_transl_0hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{transl}_hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
740
+ de_anli_transl_0hn_train_ds: Dataset = de_anli_transl_0hn_ds['train']
741
+ de_anli_transl_0hn_eval_ds: Dataset = de_anli_transl_0hn_ds['test']
742
+ #
743
+ subset = 'fever'
744
+ # fever entailments 3hn - de_fever_entail_3hn_train_ds
745
+ de_fever_entail_3hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{entail}_hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
746
+ de_fever_entail_3hn_train_ds: Dataset = de_fever_entail_3hn_ds['train']
747
+ de_fever_entail_3hn_eval_ds: Dataset = de_fever_entail_3hn_ds['test']
748
+ # fever entailments 0hn - de_fever_entail_0hn_train_ds
749
+ de_fever_entail_0hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{entail}_hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
750
+ de_fever_entail_0hn_train_ds: Dataset = de_fever_entail_0hn_ds['train']
751
+ de_fever_entail_0hn_eval_ds: Dataset = de_fever_entail_0hn_ds['test']
752
+ # fever translation 3hn - de_fever_transl_3hn_train_ds
753
+ de_fever_transl_3hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{transl}_hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
754
+ de_fever_transl_3hn_train_ds: Dataset = de_fever_transl_3hn_ds['train']
755
+ de_fever_transl_3hn_eval_ds: Dataset = de_fever_transl_3hn_ds['test']
756
+ # fever translation 0hn - de_fever_transl_0hn_train_ds
757
+ de_fever_transl_0hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{transl}_hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
758
+ de_fever_transl_0hn_train_ds: Dataset = de_fever_transl_0hn_ds['train']
759
+ de_fever_transl_0hn_eval_ds: Dataset = de_fever_transl_0hn_ds['test']
760
+ #
761
+ subset = 'ling'
762
+ # ling entailments 3hn - de_ling_entail_3hn_train_ds
763
+ de_ling_entail_3hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{entail}_hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
764
+ de_ling_entail_3hn_train_ds: Dataset = de_ling_entail_3hn_ds['train']
765
+ de_ling_entail_3hn_eval_ds: Dataset = de_ling_entail_3hn_ds['test']
766
+ # ling entailments 0hn - de_ling_entail_0hn_train_ds
767
+ de_ling_entail_0hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{entail}_hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
768
+ de_ling_entail_0hn_train_ds: Dataset = de_ling_entail_0hn_ds['train']
769
+ de_ling_entail_0hn_eval_ds: Dataset = de_ling_entail_0hn_ds['test']
770
+ # ling translation 3hn - de_ling_transl_3hn_train_ds
771
+ de_ling_transl_3hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{transl}_hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
772
+ de_ling_transl_3hn_train_ds: Dataset = de_ling_transl_3hn_ds['train']
773
+ de_ling_transl_3hn_eval_ds: Dataset = de_ling_transl_3hn_ds['test']
774
+ # ling translation 0hn - de_ling_transl_0hn_train_ds
775
+ # this set is empty :D
776
+ #de_ling_transl_0hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{transl}_hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
777
+ #de_ling_transl_0hn_train_ds: Dataset = de_ling_transl_0hn_ds['train']
778
+ #de_ling_transl_0hn_eval_ds: Dataset = de_ling_transl_0hn_ds['test']
779
+ #
780
+ subset = 'mnli'
781
+ # mnli entailments 3hn - de_mnli_entail_3hn_train_ds
782
+ de_mnli_entail_3hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{entail}_hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
783
+ de_mnli_entail_3hn_train_ds: Dataset = de_mnli_entail_3hn_ds['train']
784
+ de_mnli_entail_3hn_eval_ds: Dataset = de_mnli_entail_3hn_ds['test']
785
+ # mnli entailments 0hn - de_mnli_entail_0hn_train_ds
786
+ de_mnli_entail_0hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{entail}_hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
787
+ de_mnli_entail_0hn_train_ds: Dataset = de_mnli_entail_0hn_ds['train']
788
+ de_mnli_entail_0hn_eval_ds: Dataset = de_mnli_entail_0hn_ds['test']
789
+ # mnli translation 3hn - de_mnli_transl_3hn_train_ds
790
+ de_mnli_transl_3hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{transl}_hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
791
+ de_mnli_transl_3hn_train_ds: Dataset = de_mnli_transl_3hn_ds['train']
792
+ de_mnli_transl_3hn_eval_ds: Dataset = de_mnli_transl_3hn_ds['test']
793
+ # mnli translation 0hn - de_mnli_transl_0hn_train_ds
794
+ de_mnli_transl_0hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{transl}_hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
795
+ de_mnli_transl_0hn_train_ds: Dataset = de_mnli_transl_0hn_ds['train']
796
+ de_mnli_transl_0hn_eval_ds: Dataset = de_mnli_transl_0hn_ds['test']
797
+ #
798
+ subset = 'wanli'
799
+ # wanli entailments 3hn - de_wanli_entail_3hn_train_ds
800
+ de_wanli_entail_3hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{entail}_hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
801
+ de_wanli_entail_3hn_train_ds: Dataset = de_wanli_entail_3hn_ds['train']
802
+ de_wanli_entail_3hn_eval_ds: Dataset = de_wanli_entail_3hn_ds['test']
803
+ # wanli entailments 0hn - de_wanli_entail_0hn_train_ds
804
+ de_wanli_entail_0hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{entail}_hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
805
+ de_wanli_entail_0hn_train_ds: Dataset = de_wanli_entail_0hn_ds['train']
806
+ de_wanli_entail_0hn_eval_ds: Dataset = de_wanli_entail_0hn_ds['test']
807
+ # wanli translation 3hn - de_wanli_transl_3hn_train_ds
808
+ de_wanli_transl_3hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{transl}_hn/3_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
809
+ de_wanli_transl_3hn_train_ds: Dataset = de_wanli_transl_3hn_ds['train']
810
+ de_wanli_transl_3hn_eval_ds: Dataset = de_wanli_transl_3hn_ds['test']
811
+ # wanli translation 0hn - de_wanli_transl_0hn_train_ds
812
+ de_wanli_transl_0hn_ds = load_dataset('parquet', data_files={f'{main_name}-{language}_{subset}-{transl}_hn/0_hard_negatives/*.parquet'}, split="train").train_test_split(test_size=0.02, seed=12)
813
+ de_wanli_transl_0hn_train_ds: Dataset = de_wanli_transl_0hn_ds['train']
814
+ de_wanli_transl_0hn_eval_ds: Dataset = de_wanli_transl_0hn_ds['test']
815
+ #
816
+ print("Loaded all NLI-26lang-2mil7 (local hn) datasets...")
817
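+ # Editor's note: the five subset blocks above repeat the same load/split pattern; a more compact alternative
+ # (an illustrative sketch only, keeping the same file layout, split size and seed) could look like this:
+ # nli_hn_splits = {}
+ # for sub in ['anli', 'fever', 'ling', 'mnli', 'wanli']:
+ #     for kind, col in [('entail', entail), ('transl', transl)]:
+ #         files = f'{main_name}-{language}_{sub}-{col}_hn/3_hard_negatives/*.parquet'
+ #         ds = load_dataset('parquet', data_files={files}, split='train').train_test_split(test_size=0.02, seed=12)
+ #         nli_hn_splits[f'{sub}_{kind}_3hn'] = (ds['train'], ds['test'])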
+ #
818
+ # regular dataset unused
819
+ #print("Loading NLI-26lang-2mil7 (anli) dataset...")
820
+ # source: https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7
821
+ # License: MIT
822
+ # License-source: https://github.com/easonnie/combine-FEVER-NSMN
823
+ # entries: 25000
824
+ # info: 'label' – entailment (0), neutral (1), contradiction (2).
825
+ # for simple translations
826
+ #NLI_de_anli_dataset = load_dataset("MoritzLaurer/multilingual-NLI-26lang-2mil7", split="de_anli")
827
+ #NLI_de_anli_ende_dataset = NLI_de_anli_dataset.select_columns(['hypothesis_original', 'hypothesis']).rename_columns({'hypothesis_original': 'sentence1', 'hypothesis': 'sentence2'})
828
+ #NLI_de_anli_ende_dataset2 = NLI_de_anli_dataset.select_columns(['premise_original', 'premise']).rename_columns({'premise_original': 'sentence1', 'premise': 'sentence2'})
829
+ #NLI_de_anli_ende_dataset = concatenate_datasets([NLI_de_anli_ende_dataset, NLI_de_anli_ende_dataset2])
830
+ #del NLI_de_anli_ende_dataset2
831
+ #NLI_de_anli_ende_dataset = NLI_de_anli_ende_dataset.train_test_split(test_size=0.05, seed=12)
832
+ #NLI_de_anli_ende_train_dataset: Dataset = NLI_de_anli_ende_dataset["train"]
833
+ #NLI_de_anli_ende_eval_dataset: Dataset = NLI_de_anli_ende_dataset["test"]
834
+ #
835
+ # for simple entailments from "long" to "conclusion" (like classification)
836
+ #NLI_de_anli_de_entailment_dataset = NLI_de_anli_dataset.filter(lambda _: _["label"] == 0).select_columns(['premise', 'hypothesis']).rename_columns({'premise': 'sentence1', 'hypothesis': 'sentence2'})
837
+ #del NLI_de_anli_dataset
838
+ #NLI_de_anli_de_entailment_dataset = NLI_de_anli_de_entailment_dataset.train_test_split(test_size=0.05, seed=12)
839
+ #NLI_de_anli_entailment_train_dataset: Dataset = NLI_de_anli_de_entailment_dataset["train"]
840
+ #NLI_de_anli_entailment_eval_dataset: Dataset = NLI_de_anli_de_entailment_dataset["test"]
841
+ #print("Loaded NLI-26lang-2mil7 (anli) dataset.")
842
+ #
843
+ #print("Loading NLI-26lang-2mil7 (fever) dataset...")
844
+ # source: https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7
845
+ # License: MIT
846
+ # License-source: https://github.com/easonnie/combine-FEVER-NSMN
847
+ # entries: 25000
848
+ #NLI_de_fever_dataset = load_dataset("MoritzLaurer/multilingual-NLI-26lang-2mil7", split="de_fever")
849
+ #NLI_de_fever_dataset2 = NLI_de_fever_dataset
850
+ #NLI_de_fever_dataset3 = NLI_de_fever_dataset.filter(lambda _: _["label"] == 0).select_columns(['premise', 'hypothesis'])
851
+ #NLI_de_fever_dataset = NLI_de_fever_dataset.remove_columns(['label', 'hypothesis_original', 'hypothesis'])
852
+ #NLI_de_fever_dataset2 = NLI_de_fever_dataset2.remove_columns(['label', 'premise_original', 'premise'])
853
+ #NLI_de_fever_dataset2 = NLI_de_fever_dataset2.rename_column('hypothesis_original', 'sentence1')
854
+ #NLI_de_fever_dataset2 = NLI_de_fever_dataset2.rename_column('hypothesis', 'sentence2')
855
+ #NLI_de_fever_dataset3 = NLI_de_fever_dataset3.rename_column('hypothesis', 'sentence2')
856
+ #NLI_de_fever_dataset_dict = NLI_de_fever_dataset.train_test_split(test_size=0.05, seed=12)
857
+ #NLI_de_fever_dataset2_dict = NLI_de_fever_dataset2.train_test_split(test_size=0.05, seed=12)
858
+ #NLI_de_fever_dataset3_dict = NLI_de_fever_dataset3.train_test_split(test_size=0.05, seed=12)
859
+ #NLI_de_fever_train_dataset: Dataset = NLI_de_fever_dataset_dict["train"]
860
+ #NLI_de_fever_eval_dataset: Dataset = NLI_de_fever_dataset_dict["test"]
861
+ #NLI_de_fever_train2_dataset: Dataset = NLI_de_fever_dataset2_dict["train"]
862
+ #NLI_de_fever_eval2_dataset: Dataset = NLI_de_fever_dataset2_dict["test"]
863
+ #NLI_de_fever_train3_dataset: Dataset = NLI_de_fever_dataset3_dict["train"]
864
+ #NLI_de_fever_eval3_dataset: Dataset = NLI_de_fever_dataset3_dict["test"]
865
+ #print("Loaded NLI-26lang-2mil7 (fever) dataset.")
866
+ #
867
+ print("Loading Jina AI dataset...")
868
+ # source: https://huggingface.co/datasets/jinaai/parallel-sentences
869
+ # License: Apache-2.0
870
+ # entries: 1000
871
+ # info: sadly JinaAI delivers only 1000 pairs (we know we could do better by …)
872
+ # Info: Multilingual in different columns
873
+ jina_ai_ps_dataset = load_dataset("jinaai/parallel-sentences", split="train")
874
+ jina_ai_ps_dataset_3en = jina_ai_ps_dataset.select_columns(['anchor', 'entailment', 'negative'])
875
+ jina_ai_ps_dataset_en_de = jina_ai_ps_dataset.select_columns(['anchor', 'anchor_de'])
876
+ jina_ai_ps_dataset_de_de = jina_ai_ps_dataset.select_columns(['anchor_de', 'entailment_de'])
877
+ # splits
878
+ jina_ai_ps_dataset_3en_dict = jina_ai_ps_dataset_3en.train_test_split(test_size=0.05, seed=12)
879
+ jina_ai_ps_dataset_en_de_dict = jina_ai_ps_dataset_en_de.train_test_split(test_size=0.05, seed=12)
880
+ jina_ai_ps_dataset_de_de_dict = jina_ai_ps_dataset_de_de.train_test_split(test_size=0.05, seed=12)
881
+ jina_ai_ps_train_3en: Dataset = jina_ai_ps_dataset_3en_dict["train"]
882
+ jina_ai_ps_eval_3en: Dataset = jina_ai_ps_dataset_3en_dict["test"]
883
+ jina_ai_ps_train_en_de: Dataset = jina_ai_ps_dataset_en_de_dict["train"]
884
+ jina_ai_ps_eval_en_de: Dataset = jina_ai_ps_dataset_en_de_dict["test"]
885
+ jina_ai_ps_train_de_de: Dataset = jina_ai_ps_dataset_de_de_dict["train"]
886
+ jina_ai_ps_eval_de_de: Dataset = jina_ai_ps_dataset_de_de_dict["test"]
887
+ print("Loaded Jina AI dataset.")
888
+ #
889
+ print("Loading Polyglot-or-Not (de) dataset...")
890
+ # source: https://huggingface.co/datasets/Polyglot-or-Not/Fact-Completion/
891
+ # License: Apache-2.0
892
+ # entries: 16287
893
+ polyglot_de_dataset = load_dataset("Polyglot-or-Not/Fact-Completion", split="German").select_columns(['stem', 'true', 'false'])
894
+ polyglot_de_dict = polyglot_de_dataset.train_test_split(test_size=0.05, seed=12)
895
+ polyglot_de_train_dataset: Dataset = polyglot_de_dict["train"]
896
+ polyglot_de_eval_dataset: Dataset = polyglot_de_dict["test"]
897
+ print("Loaded Polyglot-or-Not (de) dataset.")
898
+ #
899
+ print("Loading Polyglot-or-Not (en) dataset...")
900
+ # source: https://huggingface.co/datasets/Polyglot-or-Not/Fact-Completion/
901
+ # License: Apache-2.0
902
+ # entries: 26254
903
+ polyglot_en_dataset = load_dataset("Polyglot-or-Not/Fact-Completion", split="English").select_columns(['stem', 'true', 'false'])
904
+ polyglot_en_dict = polyglot_en_dataset.train_test_split(test_size=0.05, seed=12)
905
+ polyglot_en_train_dataset: Dataset = polyglot_en_dict["train"]
906
+ polyglot_en_eval_dataset: Dataset = polyglot_en_dict["test"]
907
+ print("Loaded Polyglot-or-Not (de) dataset.")
908
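+ # Editor's note: with the three selected columns ('stem', 'true', 'false') these splits act as
+ # (anchor, positive, negative) triplets for MultipleNegativesRankingLoss, i.e. the 'false' statement is an explicit
+ # hard negative on top of the in-batch negatives. Illustrative row shape only (placeholder values):
+ # {'stem': 'Die Hauptstadt von Frankreich ist', 'true': 'Paris', 'false': 'Rom'}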
+ #
909
+ print("Loading Tilde_MODEL_EESC (en_de) dataset...")
910
+ # Tilde MODEL - EESC is a multilingual corpus compiled from document texts of the European Economic and Social Committee document portal. Source: http://dm.eesc.europa.eu/
911
+ # License: CC-BY - Creative Commons with Attribution
912
+ # Roberts Rozis, Raivis Skadins, 2017, Tilde MODEL - Multilingual Open Data for EU Languages. Proceedings of the 21st Nordic Conference on Computational Linguistics, NODALIDA 2017.
913
+ # https://tilde-model.s3-eu-west-1.amazonaws.com/nodalida2017_Tilde_MODEL.pdf
914
+ # https://tilde-model.s3-eu-west-1.amazonaws.com/Tilde_MODEL_Corpus.html
915
+ #
916
+ # entries: 1860675
917
+ # filtered: 1683698
918
+ # Original (local) version without hard negatives ignored
919
+ #tilde_EESC_dataset = load_dataset("parquet", data_files={'Tilde_MODEL_EESC/EESC.de-en-distilled-scored.parquet.br'}, split='train').filter(lambda _: _['score_sts'] > 0.5 and _['score_sts'] < 1).select_columns(['en', 'de'])
920
+ #tilde_EESC_dataset = tilde_EESC_dataset.train_test_split(test_size=10000, seed=12)
921
+ #tilde_EESC_train_dataset: Dataset = tilde_EESC_dataset["train"]
922
+ #tilde_EESC_eval_dataset: Dataset = tilde_EESC_dataset["test"]
923
+ #del tilde_EESC_dataset
924
+ #
925
+ # loading the version with 3 hard negatives, ignoring the folder with 0 negatives
926
+ tilde_EESC_dataset = load_dataset("parquet", data_files={'Tilde_EESC-en-de_hn/3_hard_negatives/train-*.parquet'}, split='train')
927
+ tilde_EESC_dataset = tilde_EESC_dataset.train_test_split(test_size=10000, seed=12)
928
+ tilde_EESC_train_dataset: Dataset = tilde_EESC_dataset["train"]
929
+ tilde_EESC_eval_dataset: Dataset = tilde_EESC_dataset["test"]
930
+ del tilde_EESC_dataset
931
+ #
932
+ print("Loaded Tilde_MODEL_EESC (en_de) dataset.")
933
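+ # Editor's note: train_test_split() accepts either an absolute row count or a fraction, so test_size=10000 above
+ # reserves exactly 10000 rows, while test_size=0.02 elsewhere reserves 2% of the rows. Tiny illustration on a
+ # hypothetical 1000-row dataset ds:
+ # ds.train_test_split(test_size=100, seed=12)   # -> 900 train / 100 test rows
+ # ds.train_test_split(test_size=0.1, seed=12)   # -> 900 train / 100 test rows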
+ #
934
+ print("Loading Tilde_MODEL_RAPID (en_de) dataset...")
935
+ # Tilde MODEL - RAPID is a multilingual parallel corpus compiled from all press releases of the Press Release Database of the European Commission released between 1975 and the end of 2016, as available from http://europa.eu/rapid/.
936
+ # License: CC-BY - Creative Commons with Attribution
937
+ # Roberts Rozis, Raivis Skadins, 2017, Tilde MODEL - Multilingual Open Data for EU Languages. Proceedings of the 21st Nordic Conference on Computational Linguistics, NODALIDA 2017.
938
+ # https://tilde-model.s3-eu-west-1.amazonaws.com/nodalida2017_Tilde_MODEL.pdf
939
+ # https://tilde-model.s3-eu-west-1.amazonaws.com/Tilde_MODEL_Corpus.html
940
+ #
941
+ # entries: 779236
942
+ # filtered: 727743
943
+ # original scored set needs to be uploaded
944
+ # Original (local) version without hard negatives ignored
945
+ #tilde_RAPID_dataset = load_dataset("parquet", data_files={'Tilde_MODEL_RAPID/RAPID_2019.UNIQUE.de-en-distilled-scored.parquet'}, split='train').filter(lambda _: _['score_sts'] > 0.5 and _['score_sts'] < 1).select_columns(['en', 'de'])
946
+ #tilde_RAPID_dataset = tilde_RAPID_dataset.train_test_split(test_size=10000, seed=12)
947
+ #tilde_RAPID_train_dataset: Dataset = tilde_RAPID_dataset["train"]
948
+ #tilde_RAPID_eval_dataset: Dataset = tilde_RAPID_dataset["test"]
949
+ #del tilde_RAPID_dataset
950
+ #
951
+ # loading the version with 3 hard negatives, ignoring the folder with 0 negatives
952
+ tilde_RAPID_dataset = load_dataset("parquet", data_files={'Tilde_RAPID_2019-en-de-hn/3_hard_negatives/train-*.parquet'}, split='train')
953
+ tilde_RAPID_dataset = tilde_RAPID_dataset.train_test_split(test_size=10000, seed=12)
954
+ tilde_RAPID_train_dataset: Dataset = tilde_RAPID_dataset["train"]
955
+ tilde_RAPID_eval_dataset: Dataset = tilde_RAPID_dataset["test"]
956
+ del tilde_RAPID_dataset
957
+ print("Loaded Tilde_MODEL_RAPID (en_de) dataset.")
958
+ #
959
+ print("Loading miracl (de) as classification dataset...")
960
+ miracl_de_dataset = load_dataset('parquet', data_files={'miracl-corpus-de-hn-*/3_hard_negatives/train-*.parquet'}, split='train')
961
+ miracl_de_dataset = miracl_de_dataset.train_test_split(test_size=10000, seed=12)
962
+ miracl_de_train_dataset: Dataset = miracl_de_dataset["train"]
963
+ miracl_de_eval_dataset: Dataset = miracl_de_dataset["test"]
964
+ #
965
+ miracl_de_0hn_dataset = load_dataset('parquet', data_files={'miracl-corpus-de-hn_hn/0_hard_negatives/train-*.parquet'}, split='train')
966
+ miracl_de_0hn_dataset = miracl_de_0hn_dataset.train_test_split(test_size=0.02, seed=12)
967
+ miracl_de_0hn_train_dataset: Dataset = miracl_de_0hn_dataset['train']
968
+ miracl_de_0hn_eval_dataset: Dataset = miracl_de_0hn_dataset['test']
969
+ print("Loaded miracl (de) as classification dataset.")
970
+ #
971
+ train_dataset = DatasetDict({
972
+ 'mmarco_3hn': mmarco_de_3hn_train_dataset,
973
+ 'mmarco_2hn': mmarco_de_2hn_train_dataset,
974
+ 'mmarco_1hn': mmarco_de_1hn_train_dataset,
975
+ 'mmarco_0hn': mmarco_de_0hn_train_dataset,
976
+ 'wp-22-12-de': wp_2212_de_train_dataset,
977
+ #'wp-22-12-de_3hn': wp_2212_de_train_dataset,
978
+ #'wp-22-12-de_0hn': wp_2212_de_0_train_dataset,
979
+ 'swim_ir_de': swim_ir_de_train_dataset,
980
+ 'swim_ir_de_3hn': swim_ir_de_3hn_train_dataset,
981
+ 'swim_ir_de_title_3hn': swim_ir_de_title_3hn_train_dataset,
982
+ 'swim_ir_de_title': swim_ir_de_title_train_dataset,
983
+ 'avemio_triples': avemio_triples_train_dataset,
984
+ 'avemio_pairs_3hn': avemio_pairs_3hn_train_ds,
985
+ 'avemio_pairs_0hn': avemio_pairs_0hn_train_ds,
986
+ 'nq_german_en_de_a_3hn': nq_german_en_de_a_3hn_train_ds,
987
+ 'nq_german_en_de_3hn': nq_german_en_de_3hn_train_ds,
988
+ 'nq_german_3hn': nq_german_3hn_train_ds,
989
+ 'nq_german_1hn': nq_german_1hn_train_ds,
990
+ #'german_oasst1': german_oasst1_train_dataset,
991
+ 'german_oasst1_hn': german_oasst1_hn_train_dataset,
992
+ 'germanrag_short': germanrag_short_train_dataset,
993
+ 'slimorca_dedup_3hn': slimorca_dedup_3hn_train_ds,
994
+ 'slimorca_dedup_2hn': slimorca_dedup_2hn_train_ds,
995
+ 'slimorca_dedup_1hn': slimorca_dedup_1hn_train_ds,
996
+ 'slimorca_dedup_0hn': slimorca_dedup_0hn_train_ds,
997
+ #'german_gpt4': german_gpt4_train_dataset,
998
+ 'german_gpt4_3hn': german_gpt4_3hn_train_dataset,
999
+ 'german_orca_dpo': german_orca_dpo_train_dataset,
1000
+ 'alpaca_gpt4_3hn': alpaca_gpt4_de_3hn_train_dataset,
1001
+ 'alpaca_gpt4_0hn': alpaca_gpt4_de_0hn_train_dataset,
1002
+ 'dolly_context_de_3hn': dolly_context_de_3hn_train_ds,
1003
+ #'dolly_context_de_0hn': dolly_context_de_0hn_train_ds,
1004
+ 'dolly_context_ende_3hn': dolly_context_ende_3hn_train_ds,
1005
+ 'dolly_instructions_de_3hn': dolly_instructions_de_3hn_train_ds,
1006
+ 'dolly_instructions_de_0hn': dolly_instructions_de_0hn_train_ds,
1007
+ 'dolly_instructions_ende_3hn': dolly_instructions_ende_3hn_train_ds,
1008
+ #'dolly_instructions_ende_0hn': dolly_instructions_ende_0hn_train_ds,
1009
+ 'dolly_responses_de_3hn': dolly_responses_de_3hn_train_ds,
1010
+ 'dolly_responses_de_0hn': dolly_responses_de_0hn_train_ds,
1011
+ 'dolly_responses_ende_3hn': dolly_responses_ende_3hn_train_ds,
1012
+ #'dolly_responses_ende_0hn': dolly_responses_ende_0hn_train_ds,
1013
+ 'saf_legal_de': saf_legal_de_train_ds,
1014
+ 'gls_3hn': gls_3hn_train_dataset,
1015
+ 'gls_2hn': gls_2hn_train_dataset,
1016
+ 'gls_1hn': gls_1hn_train_dataset,
1017
+ 'gls_0hn': gls_0hn_train_dataset,
1018
+ 'europarl_3hn': europarl_3hn_train_dataset,
1019
+ 'europarl_0hn': europarl_0hn_train_dataset,
1020
+ #'tatoeba': tatoeba_train_dataset,
1021
+ 'tatoeba_3hn': tatoeba_3hn_train_dataset,
1022
+ 'tatoeba_0hn': tatoeba_0hn_train_dataset,
1023
+ 'wikimatrix_3hn': wikimatrix_3hn_train_ds,
1024
+ #'wikimatrix_0hn': wikimatrix_0hn_train_ds,
1025
+ 'wikipedia_abstract_3hn': wikipedia_abstract_3hn_train_dataset,
1026
+ 'wikipedia_abstract_0hn': wikipedia_abstract_0hn_train_dataset,
1027
+ 'wiktionary_gdg_de_3hn': wiktionary_gdg_de_3hn_train_ds,
1028
+ 'wiktionary_gdg_de_short': wiktionary_gdg_de_short_train_dataset,
1029
+ 'wmt24pp': wmt24pp_train_dataset,
1030
+ 'synthia_de': synthia_de_train_dataset,
1031
+ 'gbp_3hn': gbp_3hn_train_ds,
1032
+ #'gbp_0hn': gbp_0hn_train_ds,
1033
+ 'gbp_ende_3hn': gbp_ende_3hn_train_ds,
1034
+ #'gbp_ende_0hn': gbp_ende_0hn_train_ds,
1035
+ #'stbs_de': stbs_de_train_dataset,
1036
+ 'stbs_de_3hn': stbs_de_3hn_train_dataset,
1037
+ #'stbs_en': stbs_en_train_dataset,
1038
+ 'stbs_en_3hn': stbs_en_3hn_train_dataset,
1039
+ 'pawsx_de': pawsx_de_train_dataset,
1040
+ 'pawsx_en': pawsx_en_train_dataset,
1041
+ 'nli_anli_entail_3hn': de_anli_entail_3hn_train_ds,
1042
+ 'nli_fever_entail_3hn': de_fever_entail_3hn_train_ds,
1043
+ 'nli_ling_entail_3hn': de_ling_entail_3hn_train_ds,
1044
+ 'nli_mnli_entail_3hn': de_mnli_entail_3hn_train_ds,
1045
+ 'nli_wanli_entail_3hn': de_wanli_entail_3hn_train_ds,
1046
+ #'nli_anli_entail_0hn': de_anli_entail_0hn_train_ds,
1047
+ #'nli_fever_entail_0hn': de_fever_entail_0hn_train_ds,
1048
+ #'nli_ling_entail_0hn': de_ling_entail_0hn_train_ds,
1049
+ #'nli_mnli_entail_0hn': de_mnli_entail_0hn_train_ds,
1050
+ #'nli_wanli_entail_0hn': de_wanli_entail_0hn_train_ds,
1051
+ 'nli_anli_transl_3hn': de_anli_transl_3hn_train_ds,
1052
+ 'nli_fever_transl_3hn': de_fever_transl_3hn_train_ds,
1053
+ 'nli_ling_transl_3hn': de_ling_transl_3hn_train_ds,
1054
+ 'nli_mnli_transl_3hn': de_mnli_transl_3hn_train_ds,
1055
+ 'nli_wanli_transl_3hn': de_wanli_transl_3hn_train_ds,
1056
+ #'nli_anli_transl_0hn': de_anli_transl_0hn_train_ds,
1057
+ #'nli_fever_transl_0hn': de_fever_transl_0hn_train_ds,
1058
+ #'nli_ling_transl_0hn': de_ling_transl_0hn_train_ds,
1059
+ #'nli_mnli_transl_0hn': de_mnli_transl_0hn_train_ds,
1060
+ #'nli_wanli_transl_0hn': de_wanli_transl_0hn_train_ds,
1061
+ 'jina_ai_3en': jina_ai_ps_train_3en,
1062
+ 'jina_ai_ende': jina_ai_ps_train_en_de,
1063
+ 'jina_ai_dede': jina_ai_ps_train_de_de,
1064
+ 'polyglot_de': polyglot_de_train_dataset,
1065
+ 'polyglot_en': polyglot_en_train_dataset,
1066
+ 'tilde_EESC': tilde_EESC_train_dataset,
1067
+ #'tilde_RAPID': tilde_RAPID_train_dataset,
1068
+ 'miracl_de_3hn': miracl_de_train_dataset,
1069
+ 'miracl_de_0hn': miracl_de_0hn_train_dataset,
1070
+ })
1071
+ eval_dataset = DatasetDict({
1072
+ 'mmarco_3hn': mmarco_de_3hn_eval_dataset,
1073
+ 'mmarco_2hn': mmarco_de_2hn_eval_dataset,
1074
+ 'mmarco_1hn': mmarco_de_1hn_eval_dataset,
1075
+ 'mmarco_0hn': mmarco_de_0hn_eval_dataset,
1076
+ 'wp-22-12-de': wp_2212_de_eval_dataset,
1077
+ #'wp-22-12-de_3hn': wp_2212_de_eval_dataset,
1078
+ #'wp-22-12-de_0hn': wp_2212_de_0_eval_dataset,
1079
+ 'swim_ir_de': swim_ir_de_eval_dataset,
1080
+ 'swim_ir_de_3hn': swim_ir_de_3hn_eval_dataset,
1081
+ 'swim_ir_de_title_3hn': swim_ir_de_title_3hn_eval_dataset,
1082
+ 'swim_ir_de_title': swim_ir_de_title_eval_dataset,
1083
+ 'avemio_triples': avemio_triples_eval_dataset,
1084
+ 'avemio_pairs_3hn': avemio_pairs_3hn_eval_ds,
1085
+ 'avemio_pairs_0hn': avemio_pairs_0hn_eval_ds,
1086
+ 'nq_german_en_de_a_3hn': nq_german_en_de_a_3hn_eval_ds,
1087
+ 'nq_german_en_de_3hn': nq_german_en_de_3hn_eval_ds,
1088
+ 'nq_german_3hn': nq_german_3hn_eval_ds,
1089
+ 'nq_german_1hn': nq_german_1hn_eval_ds,
1090
+ #'german_oasst1': german_oasst1_eval_dataset,
1091
+ 'german_oasst1_hn': german_oasst1_hn_eval_dataset,
1092
+ 'germanrag_short': germanrag_short_eval_dataset,
1093
+ 'slimorca_dedup_3hn': slimorca_dedup_3hn_eval_ds,
1094
+ 'slimorca_dedup_2hn': slimorca_dedup_2hn_eval_ds,
1095
+ 'slimorca_dedup_1hn': slimorca_dedup_1hn_eval_ds,
1096
+ 'slimorca_dedup_0hn': slimorca_dedup_0hn_eval_ds,
1097
+ #'german_gpt4': german_gpt4_eval_dataset,
1098
+ 'german_gpt4_3hn': german_gpt4_3hn_eval_dataset,
1099
+ 'german_orca_dpo': german_orca_dpo_eval_dataset,
1100
+ 'alpaca_gpt4_3hn': alpaca_gpt4_de_3hn_eval_dataset,
1101
+ 'alpaca_gpt4_0hn': alpaca_gpt4_de_0hn_eval_dataset,
1102
+ 'dolly_context_de_3hn': dolly_context_de_3hn_eval_ds,
1103
+ #'dolly_context_de_0hn': dolly_context_de_0hn_eval_ds,
1104
+ 'dolly_context_ende_3hn': dolly_context_ende_3hn_eval_ds,
1105
+ 'dolly_instructions_de_3hn': dolly_instructions_de_3hn_eval_ds,
1106
+ 'dolly_instructions_de_0hn': dolly_instructions_de_0hn_eval_ds,
1107
+ 'dolly_instructions_ende_3hn': dolly_instructions_ende_3hn_eval_ds,
1108
+ #'dolly_instructions_ende_0hn': dolly_instructions_ende_0hn_eval_ds,
1109
+ 'dolly_responses_de_3hn': dolly_responses_de_3hn_eval_ds,
1110
+ 'dolly_responses_de_0hn': dolly_responses_de_0hn_eval_ds,
1111
+ 'dolly_responses_ende_3hn': dolly_responses_ende_3hn_eval_ds,
1112
+ #'dolly_responses_ende_0hn': dolly_responses_ende_0hn_eval_ds,
1113
+ 'saf_legal_de': saf_legal_de_eval_ds,
1114
+ 'gls_3hn': gls_3hn_eval_dataset,
1115
+ 'gls_2hn': gls_2hn_eval_dataset,
1116
+ 'gls_1hn': gls_1hn_eval_dataset,
1117
+ 'gls_0hn': gls_0hn_eval_dataset,
1118
+ 'europarl_3hn': europarl_3hn_eval_dataset,
1119
+ 'europarl_0hn': europarl_0hn_eval_dataset,
1120
+ #'tatoeba': tatoeba_eval_dataset,
1121
+ 'tatoeba_3hn': tatoeba_3hn_eval_dataset,
1122
+ 'tatoeba_0hn': tatoeba_0hn_eval_dataset,
1123
+ 'wikimatrix_3hn': wikimatrix_3hn_eval_ds,
1124
+ #'wikimatrix_0hn': wikimatrix_0hn_eval_ds,
1125
+ 'wikipedia_abstract_3hn': wikipedia_abstract_3hn_eval_dataset,
1126
+ 'wikipedia_abstract_0hn': wikipedia_abstract_0hn_eval_dataset,
1127
+ 'wiktionary_gdg_de_3hn': wiktionary_gdg_de_3hn_eval_ds,
1128
+ 'wiktionary_gdg_de_short': wiktionary_gdg_de_short_eval_dataset,
1129
+ 'wmt24pp': wmt24pp_eval_dataset,
1130
+ 'synthia_de': synthia_de_eval_dataset,
1131
+ 'gbp_3hn': gbp_3hn_eval_ds,
1132
+ #'gbp_0hn': gbp_0hn_eval_ds,
1133
+ 'gbp_ende_3hn': gbp_ende_3hn_eval_ds,
1134
+ #'gbp_ende_0hn': gbp_ende_0hn_eval_ds,
1135
+ #'stbs_de': stbs_de_eval_dataset,
1136
+ 'stbs_de_3hn': stbs_de_3hn_eval_dataset,
1137
+ #'stbs_en': stbs_en_eval_dataset,
1138
+ 'stbs_en_3hn': stbs_en_3hn_eval_dataset,
1139
+ 'pawsx_de': pawsx_de_eval_dataset,
1140
+ 'pawsx_en': pawsx_en_eval_dataset,
1141
+ 'nli_anli_entail_3hn': de_anli_entail_3hn_eval_ds,
1142
+ 'nli_fever_entail_3hn': de_fever_entail_3hn_eval_ds,
1143
+ 'nli_ling_entail_3hn': de_ling_entail_3hn_eval_ds,
1144
+ 'nli_mnli_entail_3hn': de_mnli_entail_3hn_eval_ds,
1145
+ 'nli_wanli_entail_3hn': de_wanli_entail_3hn_eval_ds,
1146
+ #'nli_anli_entail_0hn': de_anli_entail_0hn_eval_ds,
1147
+ #'nli_fever_entail_0hn': de_fever_entail_0hn_eval_ds,
1148
+ #'nli_ling_entail_0hn': de_ling_entail_0hn_eval_ds,
1149
+ #'nli_mnli_entail_0hn': de_mnli_entail_0hn_eval_ds,
1150
+ #'nli_wanli_entail_0hn': de_wanli_entail_0hn_eval_ds,
1151
+ 'nli_anli_transl_3hn': de_anli_transl_3hn_eval_ds,
1152
+ 'nli_fever_transl_3hn': de_fever_transl_3hn_eval_ds,
1153
+ 'nli_ling_transl_3hn': de_ling_transl_3hn_eval_ds,
1154
+ 'nli_mnli_transl_3hn': de_mnli_transl_3hn_eval_ds,
1155
+ 'nli_wanli_transl_3hn': de_wanli_transl_3hn_eval_ds,
1156
+ #'nli_anli_transl_0hn': de_anli_transl_0hn_eval_ds,
1157
+ #'nli_fever_transl_0hn': de_fever_transl_0hn_eval_ds,
1158
+ #'nli_ling_transl_0hn': de_ling_transl_0hn_eval_ds,
1159
+ #'nli_mnli_transl_0hn': de_mnli_transl_0hn_eval_ds,
1160
+ #'nli_wanli_transl_0hn': de_wanli_transl_0hn_eval_ds,
1161
+ 'jina_ai_3en': jina_ai_ps_eval_3en,
1162
+ 'jina_ai_ende': jina_ai_ps_eval_en_de,
1163
+ 'jina_ai_dede': jina_ai_ps_eval_de_de,
1164
+ 'polyglot_de': polyglot_de_eval_dataset,
1165
+ 'polyglot_en': polyglot_en_eval_dataset,
1166
+ 'tilde_EESC': tilde_EESC_eval_dataset,
1167
+ #'tilde_RAPID': tilde_RAPID_eval_dataset,
1168
+ 'miracl_de_3hn': miracl_de_eval_dataset,
1169
+ 'miracl_de_0hn': miracl_de_0hn_eval_dataset,
1170
+ })
1171
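+ # Editor's note: an optional sanity check (illustrative only) before writing to disk; it prints the row counts per
+ # sub-dataset and would surface empty or misnamed splits early:
+ # for name, ds in train_dataset.items():
+ #     print(f"{name}: {len(ds)} train / {len(eval_dataset[name])} eval rows")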
+ #
1172
+ train_dataset.save_to_disk("base_datasets/train_dataset")
1173
+ eval_dataset.save_to_disk("base_datasets/eval_dataset")
1174
+ #
1175
+ end_time = timer()
1176
+ print('Time for preprocessing (minutes): '+str(round((end_time - start_time)/60, 3))) # the cheapest full timer one can get.
1177
+ # The `train_test_split` calls have put a lot of the datasets in memory, while we want it to just be on disk
1178
+ # So we're calling quit() here. Running the script again will load the datasets from disk.
1179
+ quit()
1180
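+ # Editor's note: the preprocessing above presumably lives inside load_train_eval_datasets(), which main() calls below;
+ # a minimal sketch of the load-from-disk branch implied by the comment above (illustrative, not the author's exact code):
+ # def load_train_eval_datasets():
+ #     train_ds = DatasetDict.load_from_disk("base_datasets/train_dataset")
+ #     eval_ds = DatasetDict.load_from_disk("base_datasets/eval_dataset")
+ #     return train_ds, eval_ds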
+
1181
+ def main():
1182
+ # 1. Load a model to finetune with 2. (Optional) model card data
1183
+ static_embedding = StaticEmbedding(AutoTokenizer.from_pretrained(f"{tokenizer_model}"), embedding_dim=2048)
1184
+ model = SentenceTransformer(
1185
+ modules=[static_embedding],
1186
+ model_card_data=SentenceTransformerModelCardData(
1187
+ language="de, en",
1188
+ license="eupl-1.2",
1189
+ model_name=f"A static embedding model tokenized with {tokenizer_model} and mainly built on DE/EN-datasets.",
1190
+ ),
1191
+ )
1192
+ #
1193
+ # 3. Set up training & evaluation datasets - each dataset is trained with MNRL (with MRL)
1194
+ train_dataset, eval_dataset = load_train_eval_datasets()
1195
+ print(train_dataset)
1196
+ #
1197
+ # 4. Define a loss function
1198
+ # sadly, at the moment neither CachedMultipleNegativesRankingLoss nor GISTEmbedLoss works with StaticEmbedding.
1199
+ loss = MultipleNegativesRankingLoss(model)
1200
+ loss = MatryoshkaLoss(model, loss, matryoshka_dims=[32, 64, 128, 256, 512, 1024, 2048])
1201
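+ # Editor's note: the Matryoshka dimensions above can be used at inference time by loading the trained model with a
+ # truncated output size; a minimal sketch, assuming the model has already been saved in step 7 below:
+ # model_256 = SentenceTransformer(f"models/{run_name}/final", truncate_dim=256)
+ # embeddings = model_256.encode(["Berlin ist die Hauptstadt von Deutschland."])  # shape: (1, 256)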
+ #
1202
+ # 5. (Optional) Specify training arguments
1203
+ # check for GPU support (using already loaded tensorflow)
1204
+ if len(tf.config.list_physical_devices('GPU')) > 0:
1205
+ fp16=True
1206
+ bf16=False
1207
+ else:
1208
+ fp16=False
1209
+ bf16=True
1210
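+ # Editor's note: an alternative precision check without the tensorflow dependency; a sketch only, and note that it
+ # prefers bf16 on GPUs that support it, which differs slightly from the logic above:
+ # import torch
+ # bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
+ # fp16 = torch.cuda.is_available() and not bf16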
+ ## manual override
1211
+ #fp16=False
1212
+ #bf16=False
1213
+ run_name = f"{sts_basename}-v{version}"
1214
+ args = SentenceTransformerTrainingArguments(
1215
+ # Required parameter:
1216
+ output_dir=f"models/{run_name}",
1217
+ # Optional training parameters:
1218
+ num_train_epochs=1, # original 1 - if 2 epochs deliver worse results, it's already overfitting.
1219
+ per_device_train_batch_size=1024 * 4, # original 2048 - suggestions are 16384 (but beware of the GPU-RAM(!))
1220
+ per_device_eval_batch_size=1024 * 4, # original 2048
1221
+ learning_rate=2e-1,
1222
+ lr_scheduler_type="cosine", # instead of 'linear'
1223
+ warmup_ratio=0.1,
1224
+ fp16=fp16, # Set to False if you get an error that your GPU can't run on FP16
1225
+ bf16=bf16, # Set to True if you have a GPU that supports BF16
1226
+ batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
1227
+ multi_dataset_batch_sampler=MultiDatasetBatchSamplers.PROPORTIONAL,
1228
+ # Optional tracking/debugging parameters:
1229
+ eval_strategy="steps",
1230
+ eval_steps=500,
1231
+ save_strategy="steps",
1232
+ save_steps=1000,
1233
+ save_total_limit=2,
1234
+ logging_steps=500,
1235
+ logging_first_step=True,
1236
+ run_name=run_name, # Will be used in W&B if `wandb` is installed
1237
+ )
1238
+ #
1239
+ # 6. Create a trainer & train
1240
+ trainer = SentenceTransformerTrainer(
1241
+ model=model,
1242
+ args=args,
1243
+ train_dataset=train_dataset,
1244
+ eval_dataset=eval_dataset,
1245
+ loss=loss,
1246
+ )
1247
+ trainer.train()
1248
+ #
1249
+ # 7. Save the trained model
1250
+ model.save_pretrained(f"models/{run_name}/final")
1251
+ #
1252
+ # 8. (Optional) Push it to the Hugging Face Hub
1253
+ #model.push_to_hub(run_name, private=True)
1254
+ #
1255
+ # 9. Quick testing the model with NanoBEIR
1256
+ ## found at: https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#nanobeirevaluator
1257
+ evaluator = NanoBEIREvaluator(show_progress_bar=True)
1258
+ results = evaluator(model)
1259
+ print('\n' + str(results[evaluator.primary_metric]))
1260
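+ # Editor's note: NanoBEIR only covers English retrieval datasets, so this is a rough sanity check rather than a
+ # German benchmark; a subset can be selected to speed it up (the dataset names below are assumptions based on the
+ # NanoBEIR collection):
+ # evaluator = NanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus", "nq"], show_progress_bar=True)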
+
1261
+ # STARTER
1262
+ if __name__ == "__main__":
1263
+ start_time = timer()
1264
+ main()
1265
+ end_time = timer()
1266
+ print('Time for training (minutes): '+str(round((end_time - start_time)/60, 3))) # the cheapest full timer one can get.
1267
+