Model Description:
The model was fine-tuned from the mT5-base architecture using the ChavacanoMT corpus within a many-to-many training framework. The training involved Philippine languages—Cebuano, Chavacano, and Hiligaynon—alongside Spanish and English as foreign auxiliary languages. The primary objective was to enhance the translation quality of Chavacano, a low-resource language, by leveraging linguistically related auxiliary languages. The resulting model illustrates the advantages of multilingual training in improving translation performance for low-resource languages.
The training was performed using Hugging Face's Transformers library and TensorFlow.
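In the many-to-many setup, a single model is trained on all language pairs at once, with a target-language token prepended to each source sentence, as in the usage example below. The following sketch only illustrates that data layout; the sentence pairs and the train_pairs variable are placeholders, and the actual ChavacanoMT preprocessing may differ.
#illustrative only: each training example pairs a tagged source sentence with its target translation
train_pairs = [
    ("<2cbk> The girl danced in the rain.", "El dalaga ta baila cuando ta cae el ulan."),  #en-cbk
    ("<2en> El dalaga ta baila cuando ta cae el ulan.", "The girl danced in the rain."),   #cbk-en
]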
Usage Examples:
#load the fine-tuned model
from transformers import TFAutoModelForSeq2SeqLM
finetune_model = TFAutoModelForSeq2SeqLM.from_pretrained('path to mt5-full.model.keras')
#load the tokenizer
from transformers import AutoTokenizer
checkpoint = 'google/mt5-base' #or point to the fine-tuned model directory
tokenizer = AutoTokenizer.from_pretrained(checkpoint,
    additional_special_tokens=['<2ceb>', '<2es>', '<2en>', '<2hil>', '<2cbk>'],
    return_token_type_ids=True, use_fast=True, legacy=False)
#tokenize a new sample; the <2cbk> prefix requests Chavacano output
sample = "<2cbk> The girl danced in the rain."
tokenized = tokenizer(sample, return_tensors="tf", max_length=128, truncation=True, padding='max_length')
#generate the output sequence from the input sample
outputs = finetune_model.generate(input_ids=tokenized.input_ids, attention_mask=tokenized.attention_mask)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
#should display: 'El dalaga ta baila cuando ta cae el ulan.'
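To translate into a different target language, change the prefix token; the <2xx> tokens added to the tokenizer above appear to select the output language, as in the Chavacano example. The snippet below reuses the model and tokenizer loaded above and only swaps the prefix to <2hil> (Hiligaynon); the Hiligaynon output itself is not reproduced here.
#request a Hiligaynon translation of the same sentence
sample_hil = "<2hil> The girl danced in the rain."
tokenized_hil = tokenizer(sample_hil, return_tensors="tf", max_length=128, truncation=True, padding='max_length')
outputs_hil = finetune_model.generate(input_ids=tokenized_hil.input_ids, attention_mask=tokenized_hil.attention_mask)
print(tokenizer.decode(outputs_hil[0], skip_special_tokens=True))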
Technical Specifications:
Running the model requires at least 18 GB of memory for optimal performance.
For longer sentences, set max_new_tokens in generate() to the desired output length (the mT5 default is 50), as in the sketch below.
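Continuing from the usage example above, a minimal sketch of raising the generation length; the value 128 is only an illustration.
#allow up to 128 new tokens for longer inputs
outputs = finetune_model.generate(input_ids=tokenized.input_ids,
    attention_mask=tokenized.attention_mask, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))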
Performance Metrics:
The model was evaluated using BLEU, ROUGE-1, and chrF++ and achieved the following results (a sketch of how such metrics can be computed follows the tables):
Overall Performance
BLEU | ROUGE-1 | chrF++ |
---|---|---|
17.83 | 0.55 | 37.26 |
Language-Pair Evaluation
Pair | BLEU | ROUGE-1 | chrF++ |
---|---|---|---|
cbk-ceb | 21.95 | 0.62 | 42.94 |
ceb-cbk | 23.54 | 0.66 | 43.23 |
hil-ceb | 22.34 | 0.60 | 42.44 |
ceb-hil | 22.79 | 0.61 | 42.96 |
en-ceb | 20.02 | 0.59 | 40.78 |
ceb-en | 26.25 | 0.61 | 43.61 |
en-cbk | 24.06 | 0.68 | 43.74 |
cbk-en | 35.30 | 0.69 | 43.57 |
ceb-es | 16.61 | 0.50 | 34.50 |
es-ceb | 16.34 | 0.53 | 36.47 |
hil-cbk | 22.10 | 0.65 | 41.55 |
cbk-hil | 23.51 | 0.64 | 43.99 |
en-hil | 16.20 | 0.57 | 37.80 |
hil-en | 23.70 | 0.59 | 41.48 |
es-hil | 15.13 | 0.55 | 36.39 |
hil-es | 16.63 | 0.52 | 35.02 |
es-cbk | 21.79 | 0.64 | 41.21 |
cbk-es | 24.68 | 0.60 | 43.04 |
en-es | 19.45 | 0.54 | 38.25 |
es-en | 25.94 | 0.60 | 43.57 |
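These metrics can be computed with the Hugging Face evaluate library. The snippet below is only a sketch of how BLEU, ROUGE-1, and chrF++ scores of this kind are obtained, not the exact evaluation script used for this model; the predictions and references lists are placeholders.
import evaluate
#placeholder system outputs and reference translations
predictions = ["El dalaga ta baila cuando ta cae el ulan."]
references = [["El dalaga ta baila cuando ta cae el ulan."]]
bleu = evaluate.load("sacrebleu")   #corpus-level BLEU
rouge = evaluate.load("rouge")      #ROUGE-1 F-score
chrf = evaluate.load("chrf")        #chrF++ when word_order=2
print(bleu.compute(predictions=predictions, references=references)["score"])
print(rouge.compute(predictions=predictions, references=[r[0] for r in references])["rouge1"])
print(chrf.compute(predictions=predictions, references=references, word_order=2)["score"])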
Limitations and Biases:
Due to hardware constraints, model training was limited to 8 epochs using the default hyperparameters of the mT5-base model. The ChavacanoMT corpus was constructed from digital resources primarily in secular and religious domains; consequently, the resulting translation model may exhibit reduced accuracy when applied to texts outside those domains.