Naozumi0512 committed on
Commit 6c1850b · verified · 1 Parent(s): 9ae6426

Upload 5 files

Files changed (5)
  1. MONOPHONIC_CHARS.txt +0 -0
  2. POLYPHONIC_CHARS.txt +0 -0
  3. README.md +49 -1
  4. config.py +39 -0
  5. g2pw.onnx +3 -0
MONOPHONIC_CHARS.txt ADDED
The diff for this file is too large to render. See raw diff
 
POLYPHONIC_CHARS.txt ADDED
The diff for this file is too large to render. See raw diff
 
README.md CHANGED
@@ -1,3 +1,51 @@
  ---
- license: cc-by-4.0
+ language:
+ - yue
+ pretty_name: "Cantonese (yue) G2PW model - bert base"
+ tags:
+ - g2p
+ license: "cc-by-4.0"
+ task_categories:
+ - text2text-generation
+ datasets:
+ - Naozumi0512/g2p-Cantonese-aggregate
  ---
+
+ # g2pW-canto-20241201-bert-base
+
+ This is a **G2P (Grapheme-to-Phoneme)** model trained on the [Naozumi0512/g2p-Cantonese-aggregate](https://huggingface.co/datasets/Naozumi0512/g2p-Cantonese-aggregate) dataset and evaluated on the [yue-g2p-benchmark](https://github.com/hon9kon9ize/yue-g2p-benchmark).
+
+ ## Model Overview
+
+ The model uses **[hon9kon9ize/bert-base-cantonese](https://huggingface.co/hon9kon9ize/bert-base-cantonese)**. For more details see https://github.com/Naozumi520/g2pW-Cantonese .
+
+ ---
+
+ ## Dataset
+
+ The model was trained on the [Naozumi0512/g2p-Cantonese-aggregate](https://huggingface.co/datasets/Naozumi0512/g2p-Cantonese-aggregate) dataset, which includes:
+
+ - **68,500 Cantonese words/phrases** with corresponding phonetic transcriptions.
+ - Data is formatted to align with the **CPP (Chinese Polyphones with Pinyin)** structure.
+ - Sources include:
+   - Rime Cantonese Input Schema (`jyut6ping3.words.dict.yaml`)
+   - 粵典 Words.hk
+   - CantoDict
+
+ ---
+
+ ## Evaluation
+
+ The model was evaluated on the [yue-g2p-benchmark](https://github.com/hon9kon9ize/yue-g2p-benchmark):
+
+ | Metric | Score |
+ |-------------------------|--------|
+ | **Accuracy** | 0.6873 |
+ | **Levenshtein Distance**| 0.1789 |
+ | **Phoneme Error Rate** | 0.2083 |
+
+ ---
+
+ ## Inference
+
+ https://github.com/Naozumi520/g2pW-Cantonese
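
The evaluation table in the README reports accuracy, Levenshtein distance, and phoneme error rate without defining them. The following is a minimal sketch of one common way to compute this kind of score over space-separated Jyutping syllables. The function names, the exact-match definition of accuracy, and the normalization by reference length are assumptions made for illustration; yue-g2p-benchmark may define its metrics differently.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference token
                            curr[j - 1] + 1,      # insert a hypothesis token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def score(refs: List[str], hyps: List[str]) -> Dict[str, float]:
    """Exact-match accuracy and token-level error rate (hypothetical definitions)."""
    if len(refs) != len(hyps) or not refs:
        raise ValueError("refs and hyps must be non-empty and the same length")
    exact = 0
    edits = 0
    ref_tokens = 0
    for ref, hyp in zip(refs, hyps):
        ref_syllables, hyp_syllables = ref.split(), hyp.split()
        exact += ref_syllables == hyp_syllables
        edits += edit_distance(ref_syllables, hyp_syllables)
        ref_tokens += len(ref_syllables)
    return {
        "accuracy": exact / len(refs),
        "token_error_rate": edits / ref_tokens,
    }
```

For example, `score(["nei5 hou2", "zou2 san4"], ["nei5 hou2", "zou2 san2"])` counts one wrong syllable out of four reference syllables and one exact match out of two sentences.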
config.py ADDED
@@ -0,0 +1,39 @@
+ root = './rimeExtract_dataset/'
+
+ manual_seed = 1313
+ model_source = './bert-base-cantonese'
+ polyphonic_chars_path = root + 'POLYPHONIC_CHARS.txt'
+ window_size = 32
+ num_workers = 2
+ use_mask = True
+ use_conditional = True
+ param_conditional = {
+     'bias': True,
+     'char-linear': True,
+     'pos-linear': False,
+     'char+pos-second': True,
+ }
+
+ # for training
+ exp_name = '20241206_BERT_B_DescWS-Sec-cLin-B_POS_hkcancor_w03'
+ train_sent_path = root + 'train.sent'
+ train_lb_path = root + 'train.lb'
+ valid_sent_path = root + 'dev.sent'
+ valid_lb_path = root + 'dev.lb'
+ test_sent_path = root + 'test.sent'
+ test_lb_path = root + 'test.lb'
+ batch_size = 128
+ lr = 5e-5
+ val_interval = 200
+ num_iter = 13000
+ use_pos = True
+ param_pos = {
+     'weight': 0.3,
+     'pos_joint_training': True,
+     # 'train_pos_path': root + 'train.pos',
+     # 'valid_pos_path': root + 'dev.pos',
+     # 'test_pos_path': root + 'test.pos',
+     'train_pos_path': root + 'train_hkcancor.pos',
+     'valid_pos_path': root + 'dev_hkcancor.pos',
+     'test_pos_path': root + 'test_hkcancor.pos',
+ }
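
The `param_pos` block sets `'weight': 0.3` together with `'pos_joint_training': True`, which suggests that an auxiliary part-of-speech loss is mixed into the main pronunciation loss during training. A minimal sketch of that kind of weighting is shown below. The function `joint_loss` and its arguments are hypothetical, and the additive form is an assumption about how the weight is used, not something this commit confirms.

```python
def joint_loss(phoneme_loss: float,
               pos_loss: float,
               weight: float = 0.3,
               pos_joint_training: bool = True) -> float:
    """Combine the main loss with a weighted auxiliary POS loss.

    Assumption: the config's 'weight' scales the POS loss before it is
    added to the phoneme loss; with joint training off, the POS loss
    does not contribute at all.
    """
    if not pos_joint_training:
        return phoneme_loss
    return phoneme_loss + weight * pos_loss
```

Under this reading, a weight of 0.3 keeps pronunciation prediction as the dominant objective while still letting POS supervision shape the shared encoder.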
g2pw.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1d04732ba7697b617e17e8ffc0895cb22c5db5f96f12b481b438eeee5d53f9d7
+ size 1203023863
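
The three lines added for `g2pw.onnx` are a Git LFS pointer, not the model itself: they record the spec version, the SHA-256 digest, and the byte size (about 1.2 GB) of the real file. As a sketch, a downloaded copy could be checked against the pointer as follows. The helper names `parse_lfs_pointer` and `verify_download` are hypothetical, and the local file path is whatever the reader chooses.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,            # e.g. "sha256"
        "digest": digest,        # hex digest of the real file
        "size": int(fields["size"]),
    }


def verify_download(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` matches the pointer's size and digest."""
    file = Path(path)
    if file.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with file.open("rb") as handle:
        # Read in chunks so a file of this size is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]
```

Checking the size first is cheap and catches truncated downloads before the slower hash pass over the whole file.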