---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:120000
- multilingual
base_model: Alibaba-NLP/gte-multilingual-base
widget:
- source_sentence: Who is filming along?
  sentences:
  - Wién filmt mat?
  - >-
    Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
    krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
  - Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
  sentences:
  - >-
    Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do
    gëtt jo een ganz neie Wunnquartier gebaut.
  - >-
    D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor re
    eso'gucr me' we' 90 prozent.
  - Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
    Non-profit organisation Passerell, which provides legal council to refugees
    in Luxembourg, announced that it has to make four employees redundant in
    August due to a lack of funding.
  sentences:
  - Oetringen nach Remich....8.20» 215»
  - >-
    D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
    këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
  - D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
  sentences:
  - Six Jours vu New-York si fir d’équipe Girgetti — Debacco
  - Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
  - ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
  sentences:
  - D'grenzarbechetr missten och me' lo'n kre'en.
  - >-
    De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
    gemâcht!
  - >-
    D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
    verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: >-
    SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
  results:
  - task:
      type: contemporary-lb
      name: Contemporary-lb
    dataset:
      name: Contemporary-lb
      type: contemporary-lb
    metrics:
    - type: accuracy
      value: 0.6216
      name: SIB-200(LB) accuracy
    - type: accuracy
      value: 0.6282
      name: ParaLUX accuracy
  - task:
      type: bitext-mining
      name: LBHistoricalBitextMining
    dataset:
      name: LBHistoricalBitextMining
      type: lb-en
    metrics:
    - type: accuracy
      value: 0.9683
      name: LB<->FR accuracy
    - type: accuracy
      value: 0.9715
      name: LB<->EN accuracy
    - type: accuracy
      value: 0.9793
      name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---

# Luxembourgish adaptation of Alibaba-NLP/gte-multilingual-base

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

This model is specialised in cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.
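As an illustration of this cross-lingual use case, here is a minimal sketch that embeds a few Luxembourgish passages (taken from the widget examples above) and retrieves the best match for an English query; the tiny "archive" is illustrative, not part of the training data:

```python
from sentence_transformers import SentenceTransformer

# Load the adapted model (remote code is required by the GTE architecture)
model = SentenceTransformer(
    "impresso-project/histlux-gte-multilingual-base", trust_remote_code=True
)

# A tiny illustrative archive of contemporary and historical Luxembourgish passages
passages = [
    "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.",
    "D'grenzarbechetr missten och me' lo'n kre'en.",  # historical orthography
    "Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.",
]

# Query in English against the Luxembourgish passages
query = "The cross-border workers should also receive more wages."
query_emb = model.encode([query])
passage_embs = model.encode(passages)

# Cosine similarity (the model's configured similarity function)
scores = model.similarity(query_emb, passage_embs)  # shape: (1, 3)
print(passages[int(scores.argmax())])  # expected: the historical wage sentence
```

Note that `model.similarity` requires sentence-transformers v3 or later; on older versions, the same scores can be computed with `sentence_transformers.util.cos_sim`.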
This is an [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) model that was further adapted by Michail et al. (2025).

## Limitations

We also release a model that performs considerably better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use [histlux-paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2).

### Model Description

- **Model Type:** GTE-Multilingual-Base
- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
  - LB-FR, LB-EN, LB-DE parallel sentences (Historical, Modern)

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# trust_remote_code is required by the GTE architecture
model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
```

## Evaluation Results

### Metrics (see introducing paper)

**Historical bitext mining (accuracy):**

| Direction | Accuracy |
|-----------|----------|
| LB → FR   | 96.8     |
| FR → LB   | 96.9     |
| LB → EN   | 97.2     |
| EN → LB   | 97.2     |
| LB → DE   | 98.0     |
| DE → LB   | 91.8     |

**Contemporary Luxembourgish (accuracy):**

| Benchmark    | Accuracy |
|--------------|----------|
| ParaLUX      | 62.82    |
| SIB-200 (LB) | 62.16    |

## Training Details

### Training Dataset

The parallel-sentence data mix is the following:

[impresso-project/HistLuxAlign](https://huggingface.co/datasets/impresso-project/HistLuxAlign):
- LB-FR (20,000)
- LB-EN (20,000)
- LB-DE (20,000)

[fredxlpy/LuxAlign](https://huggingface.co/datasets/fredxlpy/LuxAlign):
- LB-FR (40,000)
- LB-EN (20,000)

Total: 120,000 sentence pairs, trained in mixed batches of size 8.

### Contrastive Training

The model was trained with the following parameters (a runnable sketch of this setup is included at the end of this card):

**Loss**: `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:

```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```

Parameters of the `fit()` method:

```
{
    "epochs": 1,
    "evaluation_steps": 520,
    "max_grad_norm": 1,
    "optimizer_class": "",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear"
}
```

## Citation

### BibTeX

#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)

```bibtex
@misc{michail2025adaptingmultilingualembeddingmodels,
      title={Adapting Multilingual Embedding Models to Historical Luxembourgish},
      author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
      year={2025},
      eprint={2502.07938},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07938},
}
```

#### Original Multilingual GTE Model

```bibtex
@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}
```
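## Appendix: Training Sketch

For completeness, here is a minimal, hypothetical sketch of the contrastive setup described under Training Details, using the sentence-transformers `fit()` API. The two example pairs stand in for the 120,000-pair data mix listed above; everything besides the loss and the reported hyperparameters is illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Start from the multilingual base model
model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

# Two illustrative parallel pairs; the actual training mix is the
# 120,000 LB-FR/LB-EN/LB-DE pairs listed under "Training Dataset"
train_examples = [
    InputExample(texts=[
        "The cross-border workers should also receive more wages.",
        "D'grenzarbechetr missten och me' lo'n kre'en.",
    ]),
    InputExample(texts=[
        "This regulation was temporarily lifted during the Covid pandemic.",
        "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.",
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# In-batch negatives: each pair's translation is the positive,
# all other sentences in the batch act as negatives
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    evaluation_steps=520,
    optimizer_params={"lr": 2e-05},
    scheduler="WarmupLinear",
    max_grad_norm=1,
)
```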