opus-mt-tc-bible-big-deu_eng_fra_por_spa-afa

Model Details

Neural machine translation model for translating from German, English, French, Portuguese, and Spanish (deu+eng+fra+por+spa) to Afro-Asiatic languages (afa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the Marian NMT framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and training pipelines use the procedures of OPUS-MT-train.

Model Description:

  • Developed by: Language Technology Research Group at the University of Helsinki
  • Model Type: Translation (transformer-big)
  • Release: 2024-05-29
  • License: Apache-2.0
  • Language(s):
    • Source Language(s): deu eng fra por spa
    • Target Language(s): aar acm afb amh apc ara arc arq arz bcw byn cop daa dsh gde gnd hau hbo heb hig irk jpa kab ker kqp ktb kxc lln lme meq mfh mfi mfk mif mlt mpg mqb muy oar orm pbi phn rif sgw shi shy som sur syc syr taq thv tig tir tmc tmh tmr ttr tzm wal xed zgh
    • Valid Target Language Labels: >>aal<< >>aar<< >>aas<< >>acm<< >>afb<< >>agj<< >>ahg<< >>aij<< >>aiw<< >>ajw<< >>akk<< >>alw<< >>amh<< >>amw<< >>anc<< >>ank<< >>apc<< >>ara<< >>arc<< >>arq<< >>arv<< >>arz<< >>auj<< >>auo<< >>awn<< >>bbt<< >>bcq<< >>bcw<< >>bcy<< >>bde<< >>bdm<< >>bdn<< >>bds<< >>bej<< >>bhm<< >>bhn<< >>bhs<< >>bid<< >>bjf<< >>bji<< >>bnl<< >>bob<< >>bol<< >>bsw<< >>bta<< >>btf<< >>bux<< >>bva<< >>bvf<< >>bvh<< >>bvw<< >>bwo<< >>bwr<< >>bxe<< >>bxq<< >>byn<< >>cie<< >>ckl<< >>ckq<< >>cky<< >>cla<< >>cnu<< >>cop<< >>cop_Copt<< >>cuv<< >>daa<< >>dal<< >>dbb<< >>dbp<< >>dbq<< >>dbr<< >>dgh<< >>dim<< >>dkx<< >>dlk<< >>dme<< >>dot<< >>dox<< >>doz<< >>drs<< >>dsh<< >>dwa<< >>egy<< >>elo<< >>fie<< >>fkk<< >>fli<< >>gab<< >>gde<< >>gdf<< >>gdk<< >>gdl<< >>gdq<< >>gdu<< >>gea<< >>gek<< >>gew<< >>gex<< >>gez<< >>gft<< >>gha<< >>gho<< >>gid<< >>gis<< >>giz<< >>gji<< >>glo<< >>glw<< >>gnc<< >>gnd<< >>gou<< >>gow<< >>gqa<< >>grd<< >>grr<< >>gru<< >>gwd<< >>gwn<< >>har<< >>hau<< >>hau_Latn<< >>hbb<< >>hbo<< >>hbo_Hebr<< >>hdy<< >>heb<< >>hed<< >>hia<< >>hig<< >>hna<< >>hod<< >>hoh<< >>hrt<< >>hss<< >>huy<< >>hwo<< >>hya<< >>inm<< >>ior<< >>irk<< >>jaf<< >>jbe<< >>jbn<< >>jeu<< >>jia<< >>jie<< >>jii<< >>jim<< >>jmb<< >>jmi<< >>jnj<< >>jpa<< >>jpa_Hebr<< >>jrb<< >>juu<< >>kab<< >>kai<< >>kbz<< >>kcn<< >>kcs<< >>ker<< >>kil<< >>kkr<< >>kks<< >>kna<< >>kof<< >>kot<< >>kpa<< >>kqd<< >>kqp<< >>kqx<< >>ksq<< >>ktb<< >>ktc<< >>kuh<< >>kul<< >>kvf<< >>kvi<< >>kvj<< >>kwl<< >>kxc<< >>ldd<< >>lhs<< >>liq<< >>lln<< >>lme<< >>lsd<< >>maf<< >>mcn<< >>mcw<< >>mdx<< >>meq<< >>mes<< >>mew<< >>mey<< >>mfh<< >>mfi<< >>mfj<< >>mfk<< >>mfl<< >>mfm<< >>mid<< >>mif<< >>mje<< >>mjs<< >>mkf<< >>mlj<< >>mlr<< >>mlt<< >>mlw<< >>mmf<< >>mmy<< >>mou<< >>moz<< >>mpg<< >>mpi<< >>mpk<< >>mqb<< >>mrt<< >>mse<< >>msv<< >>mtl<< >>mub<< >>mug<< >>muj<< >>muu<< >>muy<< >>mvh<< >>mvz<< >>mxf<< >>mxu<< >>mys<< >>myz<< >>mzb<< >>nbh<< >>ndm<< >>ngi<< >>ngs<< >>ngw<< >>ngx<< >>nja<< >>nmi<< >>nnc<< 
>>nnn<< >>noz<< >>nxm<< >>oar<< >>oar_Hebr<< >>oar_Syrc<< >>orm<< >>oua<< >>pbi<< >>pcw<< >>phn<< >>phn_Phnx<< >>pip<< >>piy<< >>plj<< >>pqa<< >>rel<< >>rif<< >>rif_Latn<< >>rzh<< >>saa<< >>sam<< >>say<< >>scw<< >>sds<< >>sgw<< >>she<< >>shi<< >>shi_Latn<< >>shv<< >>shy<< >>shy_Latn<< >>sid<< >>sir<< >>siz<< >>sjs<< >>smp<< >>sok<< >>som<< >>sor<< >>sqr<< >>sqt<< >>ssn<< >>ssy<< >>stv<< >>sur<< >>swn<< >>swq<< >>swy<< >>syc<< >>syk<< >>syn<< >>syr<< >>tak<< >>tal<< >>tan<< >>taq<< >>tax<< >>tdk<< >>tez<< >>tgd<< >>thv<< >>tia<< >>tig<< >>tir<< >>tjo<< >>tmc<< >>tmh<< >>tmr<< >>tmr_Hebr<< >>tng<< >>tqq<< >>trg<< >>trj<< >>tru<< >>tsb<< >>tsh<< >>ttr<< >>twc<< >>tzm<< >>tzm_Latn<< >>tzm_Tfng<< >>ubi<< >>udl<< >>uga<< >>vem<< >>wal<< >>wbj<< >>wji<< >>wka<< >>wle<< >>xaa<< >>xan<< >>xeb<< >>xed<< >>xhd<< >>xmd<< >>xmj<< >>xna<< >>xpu<< >>xqt<< >>xsa<< >>ymm<< >>zah<< >>zay<< >>zaz<< >>zen<< >>zgh<< >>zim<< >>ziz<< >>zns<< >>zrn<< >>zua<< >>zuy<< >>zwa<<
  • Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.zip
  • Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token of the form >>id<< is required, where id is a valid target language ID, e.g. >>aar<<.
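
The token requirement above can be sketched as a small helper. This is only an illustration: `add_target_token` and `KNOWN_LABELS` are hypothetical names, not part of transformers or OPUS-MT, and the label set is a small subset of the valid labels listed above.

```python
# Hypothetical helper (not part of transformers or OPUS-MT): prefixes a
# source sentence with the >>id<< target-language token the model expects.

# A small subset of the valid target language labels, for illustration only.
KNOWN_LABELS = {"aar", "amh", "ara", "hau", "heb", "kab", "mlt", "som", "tir"}


def add_target_token(text: str, lang: str, known=KNOWN_LABELS) -> str:
    """Return text prefixed with the sentence-initial token >>lang<<."""
    if lang not in known:
        raise ValueError(f"unknown target language label: {lang!r}")
    return f">>{lang}<< {text.strip()}"


print(add_target_token("Let's get out of here while we can.", "heb"))
# prints: >>heb<< Let's get out of here while we can.
```

Checking the label before building the input is a design choice of this sketch; it makes a mistyped ID fail early instead of reaching the model.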

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    ">>kab<< Tu seras parmi nous demain.",
    ">>heb<< Let's get out of here while we can."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-afa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Azekka ad tiliḍ yid-i
#     בוא נצא מכאן כל עוד אנחנו יכולים.

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-afa")
print(pipe(">>kab<< Tu seras parmi nous demain."))

# expected output: Azekka ad tiliḍ yid-i
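
Because the target language is selected per sentence by its token, one sentence can be sent to several target languages in a single batch. The sketch below assumes a translation pipeline that returns one dict with a `translation_text` key per input; `translate_to_many` is a hypothetical helper name, and the call that downloads the model is left commented out.

```python
from typing import Callable, Dict, List


def translate_to_many(pipe: Callable, text: str, langs: List[str]) -> Dict[str, str]:
    """Translate one sentence into several target languages in a single call."""
    # One input per target language, each with its own >>id<< token.
    batch = [f">>{lang}<< {text}" for lang in langs]
    results = pipe(batch)  # assumed: one {"translation_text": ...} dict per input
    return {lang: r["translation_text"] for lang, r in zip(langs, results)}


# Usage with the real model (downloads weights, so it is left commented out):
# from transformers import pipeline
# pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-afa")
# print(translate_to_many(pipe, "Let's get out of here while we can.", ["heb", "mlt"]))
```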

Training

Evaluation

langpair testset chr-F BLEU #sent #words
deu-ara tatoeba-test-v2021-08-07 0.49517 20.2 1209 6324
deu-heb tatoeba-test-v2021-08-07 0.56943 35.8 3090 20341
eng-ara tatoeba-test-v2021-08-07 0.46273 17.3 10305 61356
eng-heb tatoeba-test-v2021-08-07 0.57708 34.9 10519 63628
eng-mlt tatoeba-test-v2021-08-07 0.61044 29.5 203 899
fra-ara tatoeba-test-v2021-08-07 0.42223 10.4 1569 7956
fra-heb tatoeba-test-v2021-08-07 0.58681 37.5 3281 20655
por-heb tatoeba-test-v2021-08-07 0.61593 41.0 719 4423
spa-ara tatoeba-test-v2021-08-07 0.53669 23.9 1511 7547
spa-heb tatoeba-test-v2021-08-07 0.61966 41.2 1849 12112
deu-ara flores101-devtest 0.47927 15.7 1012 21357
eng-hau flores101-devtest 0.47807 19.0 1012 27730
eng-mlt flores101-devtest 0.67196 32.9 1012 22169
fra-mlt flores101-devtest 0.56271 19.9 1012 22169
por-heb flores101-devtest 0.49378 19.6 1012 20749
spa-ara flores101-devtest 0.44988 11.7 1012 21357
deu-ara flores200-devtest 0.661 0.0 1012 5
deu-hau flores200-devtest 0.40471 11.4 1012 27730
deu-heb flores200-devtest 0.48645 18.1 1012 20238
deu-mlt flores200-devtest 0.54079 17.5 1012 22169
eng-ara flores200-devtest 0.627 0.0 1012 5
eng-arz flores200-devtest 0.42804 11.1 1012 21034
eng-hau flores200-devtest 0.49023 20.4 1012 27730
eng-heb flores200-devtest 0.56635 27.1 1012 20238
eng-mlt flores200-devtest 0.68334 34.9 1012 22169
eng-som flores200-devtest 0.42814 9.9 1012 25991
fra-ara flores200-devtest 0.631 0.0 1012 5
fra-hau flores200-devtest 0.42731 13.2 1012 27730
fra-heb flores200-devtest 0.49683 19.1 1012 20238
fra-mlt flores200-devtest 0.56844 20.4 1012 22169
por-ara flores200-devtest 0.622 0.0 1012 5
por-hau flores200-devtest 0.42593 13.6 1012 27730
por-heb flores200-devtest 0.50345 19.7 1012 20238
por-mlt flores200-devtest 0.58913 21.5 1012 22169
spa-ara flores200-devtest 0.587 0.0 1012 5
spa-hau flores200-devtest 0.40309 9.4 1012 27730
spa-heb flores200-devtest 0.45249 13.5 1012 20238
spa-mlt flores200-devtest 0.51077 12.7 1012 22169
eng-hau newstest2021 0.43617 13.1 1000 32966
deu-hau ntrex128 0.41931 12.5 1997 54982
deu-heb ntrex128 0.43961 13.3 1997 39624
deu-mlt ntrex128 0.49871 15.1 1997 43308
eng-hau ntrex128 0.51601 23.2 1997 54982
eng-heb ntrex128 0.50625 20.3 1997 39624
eng-mlt ntrex128 0.62552 29.0 1997 43308
eng-som ntrex128 0.46845 13.5 1997 49351
fra-hau ntrex128 0.43729 14.5 1997 54982
fra-heb ntrex128 0.43855 13.9 1997 39624
fra-mlt ntrex128 0.51640 17.3 1997 43308
fra-som ntrex128 0.41813 9.6 1997 49351
por-hau ntrex128 0.44408 15.1 1997 54982
por-heb ntrex128 0.45739 15.0 1997 39624
por-mlt ntrex128 0.53719 18.2 1997 43308
por-som ntrex128 0.41367 9.3 1997 49351
spa-hau ntrex128 0.44695 14.8 1997 54982
spa-heb ntrex128 0.45509 14.5 1997 39624
spa-mlt ntrex128 0.53631 17.7 1997 43308
spa-som ntrex128 0.41755 9.1 1997 49351
eng-ara tico19-test 0.56288 25.4 2100 51339
eng-hau tico19-test 0.50060 22.2 2100 64509
fra-amh tico19-test 3.575 1.3 2100 44782
fra-hau tico19-test 5.071 1.8 2100 64509
fra-orm tico19-test 4.044 1.8 2100 50032
fra-som tico19-test 2.698 0.9 2100 63654
fra-tir tico19-test 4.151 1.4 2100 46685
por-amh tico19-test 3.799 1.4 2100 44782
por-ara tico19-test 0.44442 16.0 2100 51339
por-hau tico19-test 5.786 2.0 2100 64509
por-orm tico19-test 4.613 2.0 2100 50032
por-som tico19-test 3.413 1.2 2100 63654
por-tir tico19-test 5.092 1.6 2100 46685
spa-amh tico19-test 3.831 1.4 2100 44782
spa-ara tico19-test 0.45429 16.5 2100 51339
spa-hau tico19-test 5.790 1.9 2100 64509
spa-orm tico19-test 4.617 1.9 2100 50032
spa-som tico19-test 3.402 1.2 2100 63654
spa-tir tico19-test 5.033 1.6 2100 46685
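
The whitespace-separated rows above are easy to process programmatically. The following is a minimal sketch that parses a handful of rows copied from the table and reports the highest-BLEU language pair per test set within that sample; `parse_rows` and `best_bleu_per_testset` are hypothetical helper names.

```python
# Sample rows copied from the table above.
# Columns: langpair testset chr-F BLEU #sent #words
ROWS = """\
eng-mlt tatoeba-test-v2021-08-07 0.61044 29.5 203 899
spa-heb tatoeba-test-v2021-08-07 0.61966 41.2 1849 12112
eng-mlt flores200-devtest 0.68334 34.9 1012 22169
eng-heb flores200-devtest 0.56635 27.1 1012 20238
eng-hau ntrex128 0.51601 23.2 1997 54982
eng-mlt ntrex128 0.62552 29.0 1997 43308
"""


def parse_rows(text):
    """Parse whitespace-separated benchmark rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        langpair, testset, chrf, bleu, n_sent, n_words = line.split()
        rows.append({
            "langpair": langpair,
            "testset": testset,
            "chrf": float(chrf),
            "bleu": float(bleu),
            "sentences": int(n_sent),
            "words": int(n_words),
        })
    return rows


def best_bleu_per_testset(rows):
    """Return the row with the highest BLEU score for each test set."""
    best = {}
    for row in rows:
        current = best.get(row["testset"])
        if current is None or row["bleu"] > current["bleu"]:
            best[row["testset"]] = row
    return best


for testset, row in best_bleu_per_testset(parse_rows(ROWS)).items():
    print(f"{testset}: {row['langpair']} (BLEU {row['bleu']})")
# prints:
# tatoeba-test-v2021-08-07: spa-heb (BLEU 41.2)
# flores200-devtest: eng-mlt (BLEU 34.9)
# ntrex128: eng-mlt (BLEU 29.0)
```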

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 08:58:38 EEST 2024
  • port machine: LM0-400-22516.local
  • model size: 240M params
  • tensor type: F32 (Safetensors)