Moi!
I used sentence pairs from https://tatoeba.org/ to finetune an NLLB model for Gronings. Consider this an early beta version!
I am a linguist and a speaker of Gronings, so evaluation was done by expert eyeball: I haven't systematically measured performance with BLEU or similar metrics for this version (a sketch of how that could be done follows below). The model produces output that is recognizable as Gronings when the input language is Dutch. I found that interesting enough for a proof of concept, so I decided to publish.
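For anyone who wants to put numbers on it, here is a minimal sketch of how corpus-level BLEU and chrF could be computed with the sacrebleu library. The source and reference lists are hypothetical placeholders, and translate() refers to the helper defined in the snippet further down.

```python
# Minimal evaluation sketch (pip install sacrebleu). The sentence lists are
# placeholders; substitute a real held-out set of nld-gos pairs.
import sacrebleu

sources = ["Dit is een testzin om te kijken of de code werkt."]  # Dutch inputs
references = ["(gold-standard Gronings translation here)"]       # Gronings references

hypotheses = translate(sources)  # the translate() helper below accepts a list

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```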
The training hyperparameters have not been tuned yet, so I am planning to upload an improved version in the future.
Update 10 September 2025: I've updated the code to the latest version of transformers, so the model can immediately be used by anyone, without any tokenizer black magic needed. About 500 additional parallel nld-gos sentence pairs were also added to the training data. The only requirement is that the extra Gronings language token (gos_Latn) is added to the tokenizer at initialization; after that, everything should work.
Here is a minimal code snippet to get the model up and running:
```python
from transformers import AutoModelForSeq2SeqLM
from transformers import NllbTokenizer

MODEL_URL = 'Tom9358/nllb-tatoeba-gos-nld-v1'

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
# gos_Latn is not in the stock NLLB vocabulary, so register it as an extra
# language token; force_download ensures you get the latest tokenizer files
# rather than an older cached copy.
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True,
                                          additional_special_tokens=["gos_Latn"])

def translate(text, src_lang: str = "nld_Latn", tgt_lang: str = "gos_Latn", **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(
        text,
        return_tensors='pt',
        padding='longest',
        truncation=True,
        max_length=120,
    )
    result = model.generate(
        **inputs.to(model.device),
        # force the first generated token to be the target language tag
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # heuristic output budget that scales with the input length
        max_new_tokens=int(16 + 1.5 * inputs.input_ids.shape[1]),
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

print(translate("Dit is een testzin om te kijken of de code werkt."))
```
See https://github.com/tom9358/nllb-tryout for everything (code, more documentation, and references) except the model itself and the training data.
A (rather slow, but at least free and accessible to everyone) way to try out the model: https://colab.research.google.com/drive/1b5dn3VT4fvOBKly1CIx4Qwo59GDM1H-M
The code there is also a minimal example of how to use this model.
Don't hesitate to contact me if anything comes up!
Base model: facebook/nllb-200-distilled-1.3B