Model Description:
The model was fine-tuned from the mT5-base architecture using the ChavacanoMT corpus within a many-to-many training framework. The training involved Philippine languages—Cebuano, Chavacano, and Hiligaynon—alongside Spanish and English as foreign auxiliary languages. The primary objective was to enhance the translation quality of Chavacano, a low-resource language, by leveraging linguistically related auxiliary languages. The resulting model illustrates the advantages of multilingual training in improving translation performance for low-resource languages.
The training was performed using Hugging Face's Transformers library and TensorFlow.
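In the many-to-many setup, a single model is trained on all language pairs at once, with a target-language token prepended to each source sentence, as in the usage example below. The following sketch only illustrates that data layout; the sentence pairs and the train_pairs variable are placeholders, and the actual ChavacanoMT preprocessing may differ.
#illustrative only: each training example pairs a tagged source sentence with its target translation
train_pairs = [
    ("<2cbk> The girl danced in the rain.", "El dalaga ta baila cuando ta cae el ulan."),  #en-cbk
    ("<2en> El dalaga ta baila cuando ta cae el ulan.", "The girl danced in the rain."),   #cbk-en
]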
Usage Examples:
#load the fine-tuned model
from transformers import TFAutoModelForSeq2SeqLM
finetune_model = TFAutoModelForSeq2SeqLM.from_pretrained('path to mt5-full.model.keras')
#load the tokenizer
from transformers import AutoTokenizer
checkpoint = 'google/mt5-base' #or point to the fine-tuned model directory
tokenizer = AutoTokenizer.from_pretrained(checkpoint,
    additional_special_tokens=['<2ceb>', '<2es>', '<2en>', '<2hil>', '<2cbk>'],
    return_token_type_ids=True, use_fast=True, legacy=False)
#tokenize a new sample; the <2cbk> prefix requests Chavacano output
sample = "<2cbk> The girl danced in the rain."
tokenized = tokenizer(sample, return_tensors="tf", max_length=128, truncation=True, padding='max_length')
#generate the output sequence from the input sample
outputs = finetune_model.generate(input_ids=tokenized.input_ids, attention_mask=tokenized.attention_mask)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
#should display: 'El dalaga ta baila cuando ta cae el ulan.'
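To translate into a different target language, change the prefix token; the <2xx> tokens added to the tokenizer above appear to select the output language, as in the Chavacano example. The snippet below reuses the model and tokenizer loaded above and only swaps the prefix to <2hil> (Hiligaynon); the Hiligaynon output itself is not reproduced here.
#request a Hiligaynon translation of the same sentence
sample_hil = "<2hil> The girl danced in the rain."
tokenized_hil = tokenizer(sample_hil, return_tensors="tf", max_length=128, truncation=True, padding='max_length')
outputs_hil = finetune_model.generate(input_ids=tokenized_hil.input_ids, attention_mask=tokenized_hil.attention_mask)
print(tokenizer.decode(outputs_hil[0], skip_special_tokens=True))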
Technical Specifications:
Running the model requires at least 18 GB of memory for optimal performance.
For longer sentences, set max_new_tokens in generate() to the desired output length (the mT5 default is 50), as in the sketch below.
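Continuing from the usage example above, a minimal sketch of raising the generation length; the value 128 is only an illustration.
#allow up to 128 new tokens for longer inputs
outputs = finetune_model.generate(input_ids=tokenized.input_ids,
    attention_mask=tokenized.attention_mask, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))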
Performance Metrics:
The model was evaluated using BLEU, ROUGE-1, and chrF++ and achieved the following results (a sketch of how such metrics can be computed follows the tables):
Overall Performance
BLEU | ROUGE-1 | chrF++ |
---|---|---|
17.83 | 0.55 | 37.26 |
Language-Pair Evaluation
Pair | BLEU | ROUGE-1 | chrF++ |
---|---|---|---|
cbk-ceb | 21.95 | 0.62 | 42.94 |
ceb-cbk | 23.54 | 0.66 | 43.23 |
hil-ceb | 22.34 | 0.60 | 42.44 |
ceb-hil | 22.79 | 0.61 | 42.96 |
en-ceb | 20.02 | 0.59 | 40.78 |
ceb-en | 26.25 | 0.61 | 43.61 |
en-cbk | 24.06 | 0.68 | 43.74 |
cbk-en | 35.30 | 0.69 | 43.57 |
ceb-es | 16.61 | 0.50 | 34.50 |
es-ceb | 16.34 | 0.53 | 36.47 |
hil-cbk | 22.10 | 0.65 | 41.55 |
cbk-hil | 23.51 | 0.64 | 43.99 |
en-hil | 16.20 | 0.57 | 37.80 |
hil-en | 23.70 | 0.59 | 41.48 |
es-hil | 15.13 | 0.55 | 36.39 |
hil-es | 16.63 | 0.52 | 35.02 |
es-cbk | 21.79 | 0.64 | 41.21 |
cbk-es | 24.68 | 0.60 | 43.04 |
en-es | 19.45 | 0.54 | 38.25 |
es-en | 25.94 | 0.60 | 43.57 |
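These metrics can be computed with the Hugging Face evaluate library. The snippet below is only a sketch of how BLEU, ROUGE-1, and chrF++ scores of this kind are obtained, not the exact evaluation script used for this model; the predictions and references lists are placeholders.
import evaluate
#placeholder system outputs and reference translations
predictions = ["El dalaga ta baila cuando ta cae el ulan."]
references = [["El dalaga ta baila cuando ta cae el ulan."]]
bleu = evaluate.load("sacrebleu")   #corpus-level BLEU
rouge = evaluate.load("rouge")      #ROUGE-1 F-score
chrf = evaluate.load("chrf")        #chrF++ when word_order=2
print(bleu.compute(predictions=predictions, references=references)["score"])
print(rouge.compute(predictions=predictions, references=[r[0] for r in references])["rouge1"])
print(chrf.compute(predictions=predictions, references=references, word_order=2)["score"])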
Limitations and Biases:
Due to hardware constraints, model training was limited to 8 epochs using the default hyperparameters of the mT5-base model. The ChavacanoMT corpus was constructed from digital resources primarily in secular and religious domains; consequently, the resulting translation model may exhibit reduced accuracy when applied to texts outside those domains.