---
license: mit
---

# Homo-GE2PE: Persian Grapheme-to-Phoneme Conversion with Homograph Disambiguation

![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-orange)

**Homo-GE2PE** is a Persian grapheme-to-phoneme (G2P) model specialized in homograph disambiguation: resolving words with identical spellings but context-dependent pronunciations (e.g., *مرد* pronounced as *mard* "man" or *mord* "died"). Introduced in *[Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models](link)*, the model extends **GE2PE** by fine-tuning it on the **HomoRich** dataset, which is explicitly designed for such pronunciation challenges.

---

## Repository Structure

```
model-weights/
├── homo-ge2pe.zip        # Homo-GE2PE model checkpoint
└── homo-t5.zip           # Homo-T5 model checkpoint (T5-based G2P model)

training-scripts/
├── finetune-ge2pe.py     # Fine-tuning script for GE2PE
└── finetune-t5.py        # Fine-tuning script for T5

testing-scripts/
└── test.ipynb            # Benchmarking the models with the SentenceBench Persian G2P Benchmark

assets/
└── (files required for inference, e.g., Parsivar, GE2PE.py)
```

---

## Model Performance

Below are the performance metrics for each model variant on the SentenceBench dataset:

| Model        | PER (%) | Homograph Acc. (%) | Avg. Inference Time (s) |
|--------------|---------|--------------------|-------------------------|
| GE2PE (Base) | 4.81    | 47.17              | 0.4464                  |
| Homo-T5      | 4.12    | 76.32              | 0.4141                  |
| Homo-GE2PE   | 3.98    | 76.89              | 0.4473                  |
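
The PER column is the phoneme error rate: the edit distance between the predicted and reference phoneme sequences, normalized by the reference length. The snippet below is only a minimal illustration of that metric on space-separated phoneme strings (hypothetical helper names, not the benchmark's own scoring code):

```python
# Minimal, illustrative PER: token-level Levenshtein distance divided by the
# reference length. The benchmark's own tokenization/scoring may differ.
def levenshtein(ref, hyp):
    """Edit distance between two token sequences (1D dynamic programming)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # drop a reference token
                                     dp[j - 1] + 1,     # extra hypothesis token
                                     prev + (r != h))   # substitution (or match)
    return dp[-1]

def per(reference: str, hypothesis: str) -> float:
    """PER over space-separated phoneme strings."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)

print(per("m a r d", "m o r d"))  # 0.25: one substitution out of four phonemes
```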

---

## Inference

For inference, use the provided `inference.ipynb` notebook or the [Colab notebook](https://colab.research.google.com/drive/1Osue8HOgTGMZXIhpvCuiRyfuxpte1v0p?usp=sharing). It demonstrates how to load the checkpoints and perform grapheme-to-phoneme conversion with Homo-GE2PE and Homo-T5.
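
As a fallback outside the notebook, the T5-based checkpoint should be loadable with the `transformers` library. The sketch below assumes that unzipping `homo-t5.zip` yields a standard Hugging Face model directory and that the model maps a raw Persian sentence to a phoneme string; the directory path and generation settings are placeholders, and the notebook remains the authoritative reference (Homo-GE2PE itself is loaded through the `GE2PE.py` helper in `assets/`):

```python
# Hedged sketch: load the extracted Homo-T5 checkpoint as a seq2seq model.
# Assumes "homo-t5/" is the folder unzipped from homo-t5.zip and contains the
# usual Hugging Face config, tokenizer, and weight files.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "homo-t5"  # placeholder path to the extracted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

sentence = "مرد به خانه رفت"  # a sentence giving the homograph its context
inputs = tokenizer(sentence, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```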

---

## Dataset: HomoRich G2P Persian

The models in this repository were fine-tuned on HomoRich, the first large-scale public Persian homograph dataset for grapheme-to-phoneme (G2P) tasks, which targets pronunciation and meaning ambiguities in identically spelled words. Introduced in "Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models", the dataset is available [here](https://huggingface.co/datasets/MahtaFetrat/HomoRich).
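
For a quick look at the data, the dataset can presumably be pulled with the Hugging Face `datasets` library. Note that this README links two repository ids (`MahtaFetrat/HomoRich` above and `MahtaFetrat/HomoRich-G2P-Persian` under Additional Links); the sketch uses the latter and makes no assumption about split or column names:

```python
# Hedged sketch: download HomoRich and inspect its structure before use.
from datasets import load_dataset

ds = load_dataset("MahtaFetrat/HomoRich-G2P-Persian")  # repo id from Additional Links
print(ds)                          # available splits and their sizes

first_split = next(iter(ds.values()))
print(first_split.column_names)    # inspect column names rather than assuming them
print(first_split[0])              # peek at one example row
```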

---

## Citation

If you use this project in your work, please cite the corresponding paper:

> TODO

---

## Contributions

Contributions and pull requests are welcome. Please open an issue to discuss the changes you intend to make.

---

## Additional Links

* [Paper PDF](#) (TODO: link to paper)
* [Base GE2PE Paper](https://aclanthology.org/2024.findings-emnlp.196/)
* [Base GE2PE Model](https://github.com/Sharif-SLPL/GE2PE)
* [HomoRich Dataset](https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian)
* [SentenceBench Persian G2P Benchmark](https://huggingface.co/datasets/MahtaFetrat/SentenceBench)