---
library_name: transformers
tags:
- english-dhivehi-latin
license: mit
datasets:
- alakxender/dhivehi-english-parallel
- alakxender/dhivehi-translit-mixed
- alakxender/dhivehi-english-pairs-with-metadata
language:
- dv
base_model:
- google/flan-t5-base
---

# FLAN-T5-base Fine-tuned for Dhivehi-English-Latin Translation
This model is a fine-tuned version of google/flan-t5-base for translation between Dhivehi and English, and for transliteration between Dhivehi (Thaana) and Latin script.
## Model description
The model was trained on a combination of:
- English-Dhivehi parallel corpus
- Dhivehi-Latin transliteration pairs
It supports the following translation directions:
- English to Dhivehi (en2dv)
- Dhivehi to English (dv2en)
- Dhivehi to Latin script (dv2latin)
- Latin script to Dhivehi (latin2dv)
## Training and Evaluation
The model was trained for 3 epochs with the following parameters:
- Learning rate: 3e-4
- Batch size: 2
- Weight decay: 0.01
- Max sequence length: 128 tokens
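The hyperparameters above can be collected into a plain configuration object. This is a sketch only, using the argument names common in Hugging Face training setups; the actual training script is not published with this card:

```python
# Sketch of the reported training configuration as plain values.
# In practice these would be passed to a trainer such as
# transformers' Seq2SeqTrainingArguments (names assumed, not confirmed).
training_config = {
    "num_train_epochs": 3,
    "learning_rate": 3e-4,
    "per_device_train_batch_size": 2,
    "weight_decay": 0.01,
    "max_seq_length": 128,  # tokens, applied to both source and target
}
```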
Final training metrics:
- Training loss: 1.305
- Training runtime: 535,653 seconds
- Training samples per second: 3.733
Final evaluation metrics:
- Evaluation loss: 1.167
- ROUGE-1: 0.415
- ROUGE-2: 0.288
- ROUGE-L: 0.402
- ROUGE-Lsum: 0.405
- Evaluation runtime: 131,307 seconds
- Evaluation samples per second: 5.077
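The ROUGE scores above measure n-gram overlap between model output and reference translations. As a reference point for how ROUGE-1 works (a minimal sketch with whitespace tokenization, not the exact scorer used in evaluation):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference string."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Production evaluation typically uses a stemmed, Unicode-aware implementation such as the `rouge_score` package rather than raw whitespace splitting.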
## Usage
Load the model and tokenizer with the `from_pretrained` method:
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("alakxender/flan-t5-base-dhivehi-en-latin")
tokenizer = AutoTokenizer.from_pretrained("alakxender/flan-t5-base-dhivehi-en-latin")

# Supported translation directions
supported_languages = ["en2dv", "dv2en", "dv2latin", "latin2dv"]

def translate(source_text, target_language):
    # The direction tag is prepended to the source text as a task prefix
    prompt = f"{target_language.strip()} {source_text.strip()}"
    inputs = tokenizer(prompt, return_tensors="pt", max_length=128, truncation=True)
    output_ids = model.generate(
        **inputs,
        max_length=128,
        min_length=10,
        num_beams=4,
        early_stopping=True,
        no_repeat_ngram_size=3,
        repetition_penalty=1.2,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```
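Since the model expects the prompt in the form `"<direction> <text>"`, it can help to validate the direction tag before calling `generate`. A minimal, hypothetical helper (not part of the published API; names are assumptions):

```python
# Hypothetical prompt builder: validates the direction tag and builds
# the "<direction> <text>" prompt the fine-tuned model expects.
SUPPORTED_DIRECTIONS = {"en2dv", "dv2en", "dv2latin", "latin2dv"}

def build_prompt(source_text: str, direction: str) -> str:
    direction = direction.strip().lower()
    if direction not in SUPPORTED_DIRECTIONS:
        raise ValueError(f"unsupported direction: {direction!r}")
    return f"{direction} {source_text.strip()}"
```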
```python
# Example usage

source_text = "Concerns over prepayment of GST raised in parliament"
target_language = "en2dv"
print(translate(source_text, target_language))
# Output: ރައްޔިތުންގެ މަޖިލީހުގައި ޖީއެސްޓީގެ އަގު ބޮޑުވުމާ ގުޅިގެން ކަންބޮޑުވުން ފާޅުކޮށްފި

source_text = "ދުނިޔޭގެ އެކި ކަންކޮޅުތަކުން 1.4 މިލިއަން މީހުން މައްކާއަށް ޖަމާވެފައި"
target_language = "dv2en"
print(translate(source_text, target_language))
# Output: 1.4 million people gathered in Mecca from different parts of the world

source_text = "ވައިބާރުވުމުން ކުޅުދުއްފުށީ އެއާޕޯޓަށް ނުޖެއްސިގެން މޯލްޑިވިއަންގެ ބޯޓެއް އެނބުރި މާލެއަށް"
target_language = "dv2latin"
print(translate(source_text, target_language))
# Output: Vaibaruvumun kulhudhuhfushee eaapoatah nujehsigen moaldiviange boateh enburi maaleah

source_text = "Paakisthaanuge skoolu bahakah dhin hamalaaehgai thin kuhjakaai bodu dhe meehaku maruvehje"
target_language = "latin2dv"
print(translate(source_text, target_language))
# Output: ޕާކިސްތާނުގެ ސްކޫލު ބަހަކަށް ދިން ހަމަލާއެއްގައި ތިން ކުއްޖަކާއި ބޮޑު ދެ މީހަކު މަރުވެއްޖެ
```
**Note:** The model was fine-tuned mostly on English-Dhivehi local news articles, so it may not perform well on other domains.