Moi!
I used sentence pairs from https://tatoeba.org/ to finetune an NLLB model for Gronings. Consider this an early beta version!
I am a linguist and a speaker of Gronings, so evaluation was done by expert eyeball: I haven't systematically measured performance with BLEU or similar metrics for this version (a sketch of how that could be done follows below). The model produces output that is recognizable as Gronings when the input language is Dutch. I found that interesting enough for a proof of concept, so I decided to publish.
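For anyone who wants to put numbers on it, here is a minimal sketch of how corpus-level BLEU and chrF could be computed with the sacrebleu library. The source and reference lists are hypothetical placeholders, and translate() refers to the helper defined in the snippet further down.

```python
# Minimal evaluation sketch (pip install sacrebleu). The sentence lists are
# placeholders; substitute a real held-out set of nld-gos pairs.
import sacrebleu

sources = ["Dit is een testzin om te kijken of de code werkt."]  # Dutch inputs
references = ["(gold-standard Gronings translation here)"]       # Gronings references

hypotheses = translate(sources)  # the translate() helper below accepts a list

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```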
The training hyperparameters have not been tuned yet, so I am planning to upload an improved version in the future.
Update 10 September 2025: I've updated the code to the latest version of transformers, so the model can immediately be used by anyone, without any tokenizer black magic needed. About 500 additional parallel nld-gos sentence pairs were also added to the training data. The only requirement is that the extra Gronings language token (gos_Latn) is added to the tokenizer at initialization; after that, everything should work.
Here is a minimal code snippet to get the model up and running:
```python
from transformers import AutoModelForSeq2SeqLM
from transformers import NllbTokenizer

MODEL_URL = 'Tom9358/nllb-tatoeba-gos-nld-v1'

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
# gos_Latn is not in the stock NLLB vocabulary, so register it as an extra
# language token; force_download ensures you get the latest tokenizer files
# rather than an older cached copy.
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True,
                                          additional_special_tokens=["gos_Latn"])

def translate(text, src_lang: str = "nld_Latn", tgt_lang: str = "gos_Latn", **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(
        text,
        return_tensors='pt',
        padding='longest',
        truncation=True,
        max_length=120,
    )
    result = model.generate(
        **inputs.to(model.device),
        # force the first generated token to be the target language tag
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # heuristic output budget that scales with the input length
        max_new_tokens=int(16 + 1.5 * inputs.input_ids.shape[1]),
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

print(translate("Dit is een testzin om te kijken of de code werkt."))
```
See https://github.com/tom9358/nllb-tryout for everything (code, more documentation, and references) except the model itself and the training data.
A (rather slow, but at least free and accessible to everyone) way to try out the model: https://colab.research.google.com/drive/1b5dn3VT4fvOBKly1CIx4Qwo59GDM1H-M
The code there is also a minimal example of how to use this model.
Don't hesitate to contact me if anything comes up!
Base model: facebook/nllb-200-distilled-1.3B