umt5-thai-g2p-v2-0.5k

This model is a fine-tuned version of B-K/umt5-thai-g2p-v2-pretraining-0.5k on the B-K/thai-g2p dataset for Thai Grapheme-to-Phoneme (G2P) conversion. It achieves the following results on the evaluation set:

  • Loss: 1.0480
  • Cer: 0.0369

Model description

umt5-thai-g2p-v2-0.5k is designed to convert Thai text (words or sentences) into their corresponding phonemic International Phonetic Alphabet (IPA) representations.

Intended uses & limitations

Intended Uses

  • Thai Grapheme-to-Phoneme (G2P) Conversion: The primary use of this model is to generate phonemic transcriptions (IPA) for Thai text.
  • Speech Synthesis Preprocessing: Can be used as a component in a Text-to-Speech (TTS) pipeline to convert input text into phonemes before acoustic model processing (see the sketch below).
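
As a minimal sketch of that preprocessing step, the snippet below wraps the model in a small helper that a hypothetical TTS pipeline could call. The acoustic_model argument is a placeholder for whatever synthesizer consumes IPA input; it is not part of this repository.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the G2P front end once and reuse it for every utterance.
tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p-v2-0.5k")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p-v2-0.5k")

def thai_to_phonemes(text: str) -> str:
    """Convert Thai graphemes to an IPA phoneme string."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, num_beams=3, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def synthesize(text: str, acoustic_model):
    # G2P preprocessing: graphemes -> phonemes, then hand the phoneme string
    # to the downstream acoustic model / vocoder (placeholder interface).
    return acoustic_model(thai_to_phonemes(text))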

Limitations

  • Accuracy: While the model achieves a Character Error Rate (CER) of approximately 0.0369 on the evaluation set, it is not 100% accurate. Users should expect some errors in the generated phonemes (see the CER sketch after this list).
  • Out-of-Distribution Data: Performance may degrade on words, phrases, or sentence structures significantly different from those present in the B-K/thai-g2p training dataset. This includes very rare words, neologisms, or complex named entities.
  • Ambiguity: Thai orthography can sometimes be ambiguous, and the model might not always resolve such ambiguities correctly to the intended pronunciation in all contexts.
  • Sentence-Level vs. Word-Level: While trained on a dataset that includes sentences, its robustness for very long or highly complex sentences might vary.
  • Inherited Limitations: As a fine-tuned version of google/umt5-small, it inherits the general architectural limitations and scale of the base model.
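
For context on the CER figure above: CER is the character-level edit distance between the predicted and reference phoneme strings, normalized by the reference length. Below is a minimal sketch using the evaluate library; whether the original training script used this exact implementation is not confirmed, and the strings are illustrative rather than taken from the dataset.

import evaluate

# CER = character-level edit distance / number of reference characters.
cer_metric = evaluate.load("cer")  # requires `pip install evaluate jiwer`

predictions = ["sa˨˩.wat̚˨˩.diː˧.kʰap̚˦˥"]   # illustrative model output
references = ["sa˨˩.wat̚˨˩.diː˧.kʰrap̚˦˥"]   # illustrative gold transcription

print(cer_metric.compute(predictions=predictions, references=references))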

How to use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p-v2-0.5k")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p-v2-0.5k")

thai_text = "สวัสดีครับนี่คือโมเดลจีทูพีขนาดสี่สิบห้าล้านพารามิเตอร์มันเล็กมาก"  # Example Thai text: "Hello, this is a 45-million-parameter G2P model; it's very small."
inputs = tokenizer(thai_text, return_tensors="pt", padding=True, truncation=True)

outputs = model.generate(**inputs, num_beams=3, max_new_tokens=256)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Thai Text: {thai_text}")
# สวัสดีครับนี่คือโมเดลจีทูพีขนาดสี่สิบห้าล้านพารามิเตอร์มันเล็กมาก
print(f"Phonemes: {phonemes.replace(" ", "")}") # <-- Removing the space to make it more readable
# sa˨˩.wat̚˨˩.diː˧.kʰrap̚˦˥.niː˥˩.kʰɯː˧.moː˧.deːl˧.t͡ɕiː˧.tʰuː˧.pʰiː˧.kʰa˨˩.naːt̚˨˩.siː˨˩.sip̚˨˩.haː˥˩.laːn˦˥.pʰaː˧.raː˧.mi˦˥.tɤː˥˩.man˧.lek̚˦˥.maːk̚˥˩
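
The same model and tokenizer also handle batched inputs; the word list below is illustrative.

# Batched inference: tokenize a list of inputs with padding and decode each output.
thai_words = ["สวัสดี", "ขอบคุณ", "ประเทศไทย"]  # illustrative inputs
batch = tokenizer(thai_words, return_tensors="pt", padding=True, truncation=True)
batch_outputs = model.generate(**batch, num_beams=3, max_new_tokens=256)
for word, ids in zip(thai_words, batch_outputs):
    print(word, "->", tokenizer.decode(ids, skip_special_tokens=True))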

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0003
  • train_batch_size: 128
  • eval_batch_size: 128
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 50
  • label_smoothing_factor: 0.1
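
These settings map roughly onto transformers' Seq2SeqTrainingArguments as sketched below. This is a reconstruction for orientation, not the original training script; output_dir is a placeholder.

from transformers import Seq2SeqTrainingArguments

# Approximate reconstruction of the hyperparameters listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="umt5-thai-g2p-v2-0.5k",   # placeholder
    learning_rate=3e-4,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    seed=42,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_steps=500,
    num_train_epochs=50,
    label_smoothing_factor=0.1,
    predict_with_generate=True,  # needed to compute CER during evaluation
)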

Training results

| Training Loss | Epoch | Step | Validation Loss | Cer    | Gen Len | Max Gen Len |
|:-------------:|:-----:|:----:|:---------------:|:------:|:-------:|:-----------:|
| 1.9989        | 4.0   | 512  | 1.2908          | 0.253  | 28.8508 | 96          |
| 1.9989        | 5.0   | 640  | 1.2310          | 0.2622 | 30.9793 | 96          |
| 1.9989        | 6.0   | 768  | 1.2074          | 0.1624 | 27.7673 | 96          |
| 1.9989        | 7.0   | 896  | 1.1620          | 0.1404 | 28.9122 | 96          |
| 1.2098        | 8.0   | 1024 | 1.1444          | 0.1235 | 28.9108 | 96          |
| 1.2098        | 9.0   | 1152 | 1.1461          | 0.1035 | 27.1941 | 96          |
| 1.2098        | 10.0  | 1280 | 1.1199          | 0.0941 | 28.586  | 96          |
| 1.2098        | 11.0  | 1408 | 1.1191          | 0.089  | 28.349  | 96          |
| 1.1322        | 12.0  | 1536 | 1.1095          | 0.0859 | 28.2662 | 96          |
| 1.1322        | 13.0  | 1664 | 1.0993          | 0.0724 | 28.4911 | 96          |
| 1.1322        | 14.0  | 1792 | 1.0984          | 0.0749 | 28.2834 | 96          |
| 1.1322        | 15.0  | 1920 | 1.0943          | 0.0638 | 28.2684 | 96          |
| 1.0961        | 16.0  | 2048 | 1.0915          | 0.065  | 28.2241 | 96          |
| 1.0961        | 17.0  | 2176 | 1.0845          | 0.0613 | 28.349  | 96          |
| 1.0961        | 18.0  | 2304 | 1.0830          | 0.0626 | 28.1406 | 96          |
| 1.0961        | 19.0  | 2432 | 1.0803          | 0.058  | 28.3812 | 96          |
| 1.0729        | 20.0  | 2560 | 1.0749          | 0.0562 | 28.5453 | 96          |
| 1.0729        | 21.0  | 2688 | 1.0744          | 0.0683 | 29.0835 | 96          |
| 1.0729        | 22.0  | 2816 | 1.0734          | 0.0534 | 28.5018 | 96          |
| 1.0729        | 23.0  | 2944 | 1.0689          | 0.0562 | 28.8658 | 96          |
| 1.0581        | 24.0  | 3072 | 1.0672          | 0.0534 | 28.4768 | 96          |
| 1.0581        | 25.0  | 3200 | 1.0614          | 0.0469 | 28.6838 | 96          |
| 1.0581        | 26.0  | 3328 | 1.0598          | 0.0448 | 28.6067 | 96          |
| 1.0581        | 27.0  | 3456 | 1.0577          | 0.0443 | 28.6495 | 96          |
| 1.0458        | 28.0  | 3584 | 1.0568          | 0.0429 | 28.4611 | 96          |
| 1.0458        | 29.0  | 3712 | 1.0601          | 0.0454 | 28.5168 | 96          |
| 1.0458        | 30.0  | 3840 | 1.0579          | 0.046  | 28.636  | 96          |
| 1.0458        | 31.0  | 3968 | 1.0559          | 0.0464 | 28.5339 | 96          |
| 1.0372        | 32.0  | 4096 | 1.0532          | 0.0423 | 28.6338 | 96          |
| 1.0372        | 33.0  | 4224 | 1.0519          | 0.0432 | 28.6474 | 96          |
| 1.0372        | 34.0  | 4352 | 1.0533          | 0.0378 | 28.3983 | 96          |
| 1.0372        | 35.0  | 4480 | 1.0521          | 0.04   | 28.3733 | 96          |
| 1.0307        | 36.0  | 4608 | 1.0511          | 0.04   | 28.601  | 96          |
| 1.0307        | 37.0  | 4736 | 1.0507          | 0.0401 | 28.5282 | 96          |
| 1.0307        | 38.0  | 4864 | 1.0507          | 0.0414 | 28.5682 | 96          |
| 1.0307        | 39.0  | 4992 | 1.0488          | 0.0376 | 28.5382 | 96          |
| 1.0256        | 40.0  | 5120 | 1.0488          | 0.0382 | 28.546  | 96          |
| 1.0256        | 41.0  | 5248 | 1.0491          | 0.0386 | 28.5025 | 96          |
| 1.0256        | 42.0  | 5376 | 1.0483          | 0.0373 | 28.5118 | 96          |
| 1.0229        | 43.0  | 5504 | 1.0479          | 0.0378 | 28.5182 | 96          |
| 1.0229        | 44.0  | 5632 | 1.0481          | 0.0376 | 28.5403 | 96          |
| 1.0229        | 45.0  | 5760 | 1.0480          | 0.0391 | 28.5532 | 96          |
| 1.0229        | 46.0  | 5888 | 1.0485          | 0.0374 | 28.5339 | 96          |
| 1.0211        | 47.0  | 6016 | 1.0482          | 0.0371 | 28.5125 | 96          |
| 1.0211        | 48.0  | 6144 | 1.0480          | 0.0372 | 28.5168 | 96          |
| 1.0211        | 49.0  | 6272 | 1.0479          | 0.0376 | 28.5253 | 96          |
| 1.0211        | 50.0  | 6400 | 1.0480          | 0.0369 | 28.5032 | 96          |

Framework versions

  • Transformers 4.47.0
  • Pytorch 2.5.1
  • Datasets 3.6.0
  • Tokenizers 0.21.0
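
To reproduce this environment, the versions above can be pinned, for example in a requirements.txt (exact pins are a suggestion; nearby versions will likely also work, and the PyTorch pip package is named torch):

transformers==4.47.0
torch==2.5.1
datasets==3.6.0
tokenizers==0.21.0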