umt5-thai-g2p-v2-0.5k

This model is a fine-tuned version of B-K/umt5-thai-g2p-v2-pretraining-0.5k on the B-K/thai-g2p dataset for Thai Grapheme-to-Phoneme (G2P) conversion. It achieves the following results on the evaluation set:

  • Loss: 1.0480
  • Cer: 0.0369

Model description

umt5-thai-g2p-v2-0.5k is designed to convert Thai text (words or sentences) into their corresponding phonemic International Phonetic Alphabet (IPA) representations.

Intended uses & limitations

Intended Uses

  • Thai Grapheme-to-Phoneme (G2P) Conversion: The primary use of this model is to generate phonemic transcriptions (IPA) for Thai text.
  • Speech Synthesis Preprocessing: Can be used as a component in a Text-to-Speech (TTS) pipeline to convert input text into phonemes before acoustic model processing (see the sketch below).
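
As a minimal sketch of that preprocessing step, the snippet below wraps the model in a small helper that a hypothetical TTS pipeline could call. The acoustic_model argument is a placeholder for whatever synthesizer consumes IPA input; it is not part of this repository.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the G2P front end once and reuse it for every utterance.
tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p-v2-0.5k")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p-v2-0.5k")

def thai_to_phonemes(text: str) -> str:
    """Convert Thai graphemes to an IPA phoneme string."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, num_beams=3, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def synthesize(text: str, acoustic_model):
    # G2P preprocessing: graphemes -> phonemes, then hand the phoneme string
    # to the downstream acoustic model / vocoder (placeholder interface).
    return acoustic_model(thai_to_phonemes(text))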

Limitations

  • Accuracy: While the model achieves a Character Error Rate (CER) of approximately 0.0369 on the evaluation set, it is not 100% accurate. Users should expect some errors in the generated phonemes (see the CER sketch after this list).
  • Out-of-Distribution Data: Performance may degrade on words, phrases, or sentence structures significantly different from those present in the B-K/thai-g2p training dataset. This includes very rare words, neologisms, or complex named entities.
  • Ambiguity: Thai orthography can sometimes be ambiguous, and the model might not always resolve such ambiguities correctly to the intended pronunciation in all contexts.
  • Sentence-Level vs. Word-Level: While trained on a dataset that includes sentences, its robustness for very long or highly complex sentences might vary.
  • Inherited Limitations: As a fine-tuned version of google/umt5-small, it inherits the general architectural limitations and scale of the base model.
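
For context on the CER figure above: CER is the character-level edit distance between the predicted and reference phoneme strings, normalized by the reference length. Below is a minimal sketch using the evaluate library; whether the original training script used this exact implementation is not confirmed, and the strings are illustrative rather than taken from the dataset.

import evaluate

# CER = character-level edit distance / number of reference characters.
cer_metric = evaluate.load("cer")  # requires `pip install evaluate jiwer`

predictions = ["sa˨˩.wat̚˨˩.diː˧.kʰap̚˦˥"]   # illustrative model output
references = ["sa˨˩.wat̚˨˩.diː˧.kʰrap̚˦˥"]   # illustrative gold transcription

print(cer_metric.compute(predictions=predictions, references=references))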

How to use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p-v2-0.5k")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p-v2-0.5k")

thai_text = "สวัสดีครับนี่คือโมเดลจีทูพีขนาดสี่สิบห้าล้านพารามิเตอร์มันเล็กมาก"  # Example Thai text: "Hello, this is a 45-million-parameter G2P model; it's very small."
inputs = tokenizer(thai_text, return_tensors="pt", padding=True, truncation=True)

outputs = model.generate(**inputs, num_beams=3, max_new_tokens=256)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Thai Text: {thai_text}")
# สวัสดีครับนี่คือโมเดลจีทูพีขนาดสี่สิบห้าล้านพารามิเตอร์มันเล็กมาก
print(f"Phonemes: {phonemes.replace(" ", "")}") # <-- Removing the space to make it more readable
# sa˨˩.wat̚˨˩.diː˧.kʰrap̚˦˥.niː˥˩.kʰɯː˧.moː˧.deːl˧.t͡ɕiː˧.tʰuː˧.pʰiː˧.kʰa˨˩.naːt̚˨˩.siː˨˩.sip̚˨˩.haː˥˩.laːn˦˥.pʰaː˧.raː˧.mi˦˥.tɤː˥˩.man˧.lek̚˦˥.maːk̚˥˩
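
The same model and tokenizer also handle batched inputs; the word list below is illustrative.

# Batched inference: tokenize a list of inputs with padding and decode each output.
thai_words = ["สวัสดี", "ขอบคุณ", "ประเทศไทย"]  # illustrative inputs
batch = tokenizer(thai_words, return_tensors="pt", padding=True, truncation=True)
batch_outputs = model.generate(**batch, num_beams=3, max_new_tokens=256)
for word, ids in zip(thai_words, batch_outputs):
    print(word, "->", tokenizer.decode(ids, skip_special_tokens=True))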

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0003
  • train_batch_size: 128
  • eval_batch_size: 128
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 50
  • label_smoothing_factor: 0.1
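
These settings map roughly onto transformers' Seq2SeqTrainingArguments as sketched below. This is a reconstruction for orientation, not the original training script; output_dir is a placeholder.

from transformers import Seq2SeqTrainingArguments

# Approximate reconstruction of the hyperparameters listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="umt5-thai-g2p-v2-0.5k",   # placeholder
    learning_rate=3e-4,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    seed=42,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_steps=500,
    num_train_epochs=50,
    label_smoothing_factor=0.1,
    predict_with_generate=True,  # needed to compute CER during evaluation
)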

Training results

| Training Loss | Epoch | Step | Validation Loss | Cer    | Gen Len | Max Gen Len |
|:-------------:|:-----:|:----:|:---------------:|:------:|:-------:|:-----------:|
| 1.9989        | 4.0   | 512  | 1.2908          | 0.253  | 28.8508 | 96          |
| 1.9989        | 5.0   | 640  | 1.2310          | 0.2622 | 30.9793 | 96          |
| 1.9989        | 6.0   | 768  | 1.2074          | 0.1624 | 27.7673 | 96          |
| 1.9989        | 7.0   | 896  | 1.1620          | 0.1404 | 28.9122 | 96          |
| 1.2098        | 8.0   | 1024 | 1.1444          | 0.1235 | 28.9108 | 96          |
| 1.2098        | 9.0   | 1152 | 1.1461          | 0.1035 | 27.1941 | 96          |
| 1.2098        | 10.0  | 1280 | 1.1199          | 0.0941 | 28.586  | 96          |
| 1.2098        | 11.0  | 1408 | 1.1191          | 0.089  | 28.349  | 96          |
| 1.1322        | 12.0  | 1536 | 1.1095          | 0.0859 | 28.2662 | 96          |
| 1.1322        | 13.0  | 1664 | 1.0993          | 0.0724 | 28.4911 | 96          |
| 1.1322        | 14.0  | 1792 | 1.0984          | 0.0749 | 28.2834 | 96          |
| 1.1322        | 15.0  | 1920 | 1.0943          | 0.0638 | 28.2684 | 96          |
| 1.0961        | 16.0  | 2048 | 1.0915          | 0.065  | 28.2241 | 96          |
| 1.0961        | 17.0  | 2176 | 1.0845          | 0.0613 | 28.349  | 96          |
| 1.0961        | 18.0  | 2304 | 1.0830          | 0.0626 | 28.1406 | 96          |
| 1.0961        | 19.0  | 2432 | 1.0803          | 0.058  | 28.3812 | 96          |
| 1.0729        | 20.0  | 2560 | 1.0749          | 0.0562 | 28.5453 | 96          |
| 1.0729        | 21.0  | 2688 | 1.0744          | 0.0683 | 29.0835 | 96          |
| 1.0729        | 22.0  | 2816 | 1.0734          | 0.0534 | 28.5018 | 96          |
| 1.0729        | 23.0  | 2944 | 1.0689          | 0.0562 | 28.8658 | 96          |
| 1.0581        | 24.0  | 3072 | 1.0672          | 0.0534 | 28.4768 | 96          |
| 1.0581        | 25.0  | 3200 | 1.0614          | 0.0469 | 28.6838 | 96          |
| 1.0581        | 26.0  | 3328 | 1.0598          | 0.0448 | 28.6067 | 96          |
| 1.0581        | 27.0  | 3456 | 1.0577          | 0.0443 | 28.6495 | 96          |
| 1.0458        | 28.0  | 3584 | 1.0568          | 0.0429 | 28.4611 | 96          |
| 1.0458        | 29.0  | 3712 | 1.0601          | 0.0454 | 28.5168 | 96          |
| 1.0458        | 30.0  | 3840 | 1.0579          | 0.046  | 28.636  | 96          |
| 1.0458        | 31.0  | 3968 | 1.0559          | 0.0464 | 28.5339 | 96          |
| 1.0372        | 32.0  | 4096 | 1.0532          | 0.0423 | 28.6338 | 96          |
| 1.0372        | 33.0  | 4224 | 1.0519          | 0.0432 | 28.6474 | 96          |
| 1.0372        | 34.0  | 4352 | 1.0533          | 0.0378 | 28.3983 | 96          |
| 1.0372        | 35.0  | 4480 | 1.0521          | 0.04   | 28.3733 | 96          |
| 1.0307        | 36.0  | 4608 | 1.0511          | 0.04   | 28.601  | 96          |
| 1.0307        | 37.0  | 4736 | 1.0507          | 0.0401 | 28.5282 | 96          |
| 1.0307        | 38.0  | 4864 | 1.0507          | 0.0414 | 28.5682 | 96          |
| 1.0307        | 39.0  | 4992 | 1.0488          | 0.0376 | 28.5382 | 96          |
| 1.0256        | 40.0  | 5120 | 1.0488          | 0.0382 | 28.546  | 96          |
| 1.0256        | 41.0  | 5248 | 1.0491          | 0.0386 | 28.5025 | 96          |
| 1.0256        | 42.0  | 5376 | 1.0483          | 0.0373 | 28.5118 | 96          |
| 1.0229        | 43.0  | 5504 | 1.0479          | 0.0378 | 28.5182 | 96          |
| 1.0229        | 44.0  | 5632 | 1.0481          | 0.0376 | 28.5403 | 96          |
| 1.0229        | 45.0  | 5760 | 1.0480          | 0.0391 | 28.5532 | 96          |
| 1.0229        | 46.0  | 5888 | 1.0485          | 0.0374 | 28.5339 | 96          |
| 1.0211        | 47.0  | 6016 | 1.0482          | 0.0371 | 28.5125 | 96          |
| 1.0211        | 48.0  | 6144 | 1.0480          | 0.0372 | 28.5168 | 96          |
| 1.0211        | 49.0  | 6272 | 1.0479          | 0.0376 | 28.5253 | 96          |
| 1.0211        | 50.0  | 6400 | 1.0480          | 0.0369 | 28.5032 | 96          |

Framework versions

  • Transformers 4.47.0
  • Pytorch 2.5.1
  • Datasets 3.6.0
  • Tokenizers 0.21.0
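
To reproduce this environment, the versions above can be pinned, for example in a requirements.txt (exact pins are a suggestion; nearby versions will likely also work, and the PyTorch pip package is named torch):

transformers==4.47.0
torch==2.5.1
datasets==3.6.0
tokenizers==0.21.0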