Model Description:

The model was fine-tuned from the mT5-base architecture using the ChavacanoMT corpus within a many-to-many training framework. The training involved Philippine languages—Cebuano, Chavacano, and Hiligaynon—alongside Spanish and English as foreign auxiliary languages. The primary objective was to enhance the translation quality of Chavacano, a low-resource language, by leveraging linguistically related auxiliary languages. The resulting model illustrates the advantages of multilingual training in improving translation performance for low-resource languages.
The training was performed using Hugging Face's Transformers library and TensorFlow.
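
The exact training script is not included here, but the sketch below illustrates how an mT5 model can be fine-tuned with Transformers and TensorFlow/Keras in this kind of setup. The dataset file, column names, batch size, learning rate, and use of prepare_tf_dataset are illustrative assumptions, not the configuration actually used for this model.

# Illustrative fine-tuning sketch (assumptions, not the actual training script).
# Source sentences are assumed to already carry a <2xx> target-language tag.
import tensorflow as tf
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

checkpoint = 'google/mt5-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint,
                                          additional_special_tokens=['<2ceb>', '<2es>', '<2en>', '<2hil>', '<2cbk>'],
                                          legacy=False)
model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)
model.resize_token_embeddings(len(tokenizer))  # account for the added language tags

# hypothetical parallel corpus with 'source' and 'target' text columns
raw = load_dataset('csv', data_files='chavacanomt_pairs.csv')['train']

def preprocess(batch):
    model_inputs = tokenizer(batch['source'], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch['target'], max_length=128, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)
collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors='tf')
train_set = model.prepare_tf_dataset(tokenized, batch_size=8, shuffle=True, collate_fn=collator)

# the model computes its own sequence-to-sequence loss from the labels
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5))
model.fit(train_set, epochs=8)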

Usage Examples:

# load the fine-tuned model
from transformers import TFAutoModelForSeq2SeqLM
finetune_model = TFAutoModelForSeq2SeqLM.from_pretrained('path to mt5-full.model.keras')

# load the tokenizer
from transformers import AutoTokenizer
checkpoint = 'google/mt5-base'  # or point to the fine-tuned model's directory
tokenizer = AutoTokenizer.from_pretrained(checkpoint,
                                          additional_special_tokens=['<2ceb>', '<2es>', '<2en>', '<2hil>', '<2cbk>'],
                                          return_token_type_ids=True, use_fast=True, legacy=False)

# tokenize a new sample and test it on the fine-tuned model
sample = "<2cbk> The girl danced in the rain."
tokenized = tokenizer(sample, return_tensors="tf", max_length=128, truncation=True, padding='max_length')

# generate an output sequence from the input sample
outputs = finetune_model.generate(input_ids=tokenized.input_ids, attention_mask=tokenized.attention_mask)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# should display: 'El dalaga ta baila cuando ta cae el ulan.'
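
The <2xx> prefix selects the target language, so the same sentence can be translated into another supported language by swapping the tag. A minimal continuation of the snippet above, assuming the model and tokenizer are already loaded:

# translate the same sentence into Cebuano instead of Chavacano by changing the tag
sample_ceb = "<2ceb> The girl danced in the rain."
tokenized_ceb = tokenizer(sample_ceb, return_tensors="tf", max_length=128, truncation=True, padding='max_length')
outputs_ceb = finetune_model.generate(input_ids=tokenized_ceb.input_ids, attention_mask=tokenized_ceb.attention_mask)
print(tokenizer.decode(outputs_ceb[0], skip_special_tokens=True))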

Technical Specifications:

Running the model requires a minimum of 18 GB of memory.
For longer sentences, set max_new_tokens to the desired output length (the mT5 default is 50), as shown below.
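
For example, the generation call from the usage example above can be given a larger budget (128 here is only an illustrative value):

# allow up to 128 newly generated tokens so longer sentences are not cut off
outputs = finetune_model.generate(input_ids=tokenized.input_ids,
                                  attention_mask=tokenized.attention_mask,
                                  max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))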

Performance Metrics:

The model was evaluated using BLEU, ROUGE-1, and chrF++ and achieved the following results (a sketch of how such scores can be computed appears after the tables):
Overall Performance

| BLEU  | ROUGE-1 | chrF++ |
|-------|---------|--------|
| 17.83 | 0.55    | 37.26  |

Language-Pair Evaluation

| Pair    | BLEU  | ROUGE-1 | chrF++ |
|---------|-------|---------|--------|
| cbk-ceb | 21.95 | 0.62    | 42.94  |
| ceb-cbk | 23.54 | 0.66    | 43.23  |
| hil-ceb | 22.34 | 0.60    | 42.44  |
| ceb-hil | 22.79 | 0.61    | 42.96  |
| en-ceb  | 20.02 | 0.59    | 40.78  |
| ceb-en  | 26.25 | 0.61    | 43.61  |
| en-cbk  | 24.06 | 0.68    | 43.74  |
| cbk-en  | 35.30 | 0.69    | 43.57  |
| ceb-es  | 16.61 | 0.50    | 34.50  |
| es-ceb  | 16.34 | 0.53    | 36.47  |
| hil-cbk | 22.10 | 0.65    | 41.55  |
| cbk-hil | 23.51 | 0.64    | 43.99  |
| en-hil  | 16.20 | 0.57    | 37.80  |
| hil-en  | 23.70 | 0.59    | 41.48  |
| es-hil  | 15.13 | 0.55    | 36.39  |
| hil-es  | 16.63 | 0.52    | 35.02  |
| es-cbk  | 21.79 | 0.64    | 41.21  |
| cbk-es  | 24.68 | 0.60    | 43.04  |
| en-es   | 19.45 | 0.54    | 38.25  |
| es-en   | 25.94 | 0.60    | 43.57  |
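
The exact evaluation script is not part of this card; the sketch below shows one way scores of this kind can be computed with the evaluate library. The prediction and reference strings are placeholders, not the actual ChavacanoMT test set.

# illustrative metric computation with the evaluate library (placeholder data)
import evaluate

predictions = ["El dalaga ta baila cuando ta cae el ulan."]
references = [["El dalaga ta baila cuando ta cae el ulan."]]

bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")
chrf = evaluate.load("chrf")

print(bleu.compute(predictions=predictions, references=references)["score"])
print(rouge.compute(predictions=predictions, references=[r[0] for r in references])["rouge1"])
print(chrf.compute(predictions=predictions, references=references, word_order=2)["score"])  # word_order=2 gives chrF++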

Limitations and Biases:

Due to hardware limitations, model training was limited to 8 epochs using the default hyperparameters of the mT5-base model. The ChavacanoMT corpus was constructed from digital resources drawn primarily from secular and religious domains. Consequently, the resulting translation model may exhibit reduced accuracy when applied to texts outside these domains.
