---
license: mit
---

# Homo-GE2PE: Persian Grapheme-to-Phoneme Conversion with Homograph Disambiguation

**Homo-GE2PE** is a Persian grapheme-to-phoneme (G2P) model specialized in homograph disambiguation, i.e., resolving words with identical spellings but context-dependent pronunciations (e.g., *مرد* pronounced as *mard*, "man", or *mord*, "died"). Introduced in *[Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models](link)*, the model extends **GE2PE** by fine-tuning it on **HomoRich**, a dataset designed explicitly for such pronunciation challenges.

---

## Repository Structure

```
model-weights/
├── homo-ge2pe.zip        # Homo-GE2PE model checkpoint
└── homo-t5.zip           # Homo-T5 model checkpoint (T5-based G2P model)

training-scripts/
├── finetune-ge2pe.py     # Fine-tuning script for GE2PE
└── finetune-t5.py        # Fine-tuning script for T5

testing-scripts/
└── test.ipynb            # Benchmarking the models with the SentenceBench Persian G2P benchmark

assets/
└── (files required for inference, e.g., Parsivar, GE2PE.py)
```

---

## Model Performance
Below are the performance metrics for each model variant on the SentenceBench dataset:

| Model        | PER (%) | Homograph Acc. (%) | Avg. Inference Time (s) |
| ------------ | ------- | ------------------ | ----------------------- |
| GE2PE (Base) | 4.81    | 47.17              | 0.4464                  |
| Homo-T5      | 4.12    | 76.32              | 0.4141                  |
| Homo-GE2PE   | 3.98    | 76.89              | 0.4473                  |

---

## Inference
For inference, use the provided `inference.ipynb` notebook or the [Colab link](https://colab.research.google.com/drive/1Osue8HOgTGMZXIhpvCuiRyfuxpte1v0p?usp=sharing). The notebook demonstrates how to load the checkpoints and perform grapheme-to-phoneme conversion using Homo-GE2PE and Homo-T5.
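
For a quick sense of the flow before opening the notebook, here is a minimal sketch. It assumes the `GE2PE` wrapper class in `assets/GE2PE.py` keeps the interface of the base GE2PE repository (a constructor taking a checkpoint path and a `generate()` method over a list of sentences) and that `model-weights/homo-ge2pe.zip` has been extracted locally; the path is illustrative, and `inference.ipynb` remains the authoritative reference.

```python
# Minimal sketch, not a drop-in script. Assumptions: assets/GE2PE.py provides
# a GE2PE class matching the base repository's interface, and the checkpoint
# zip has been extracted to ./homo-ge2pe (path chosen for illustration).
from GE2PE import GE2PE

g2p = GE2PE(model_path="./homo-ge2pe")

# A homograph example: "مرد" reads as /mard/ ("man") or /mord/ ("died");
# the sentence context here points to the /mord/ reading.
print(g2p.generate(["او در آن سانحه مرد"]))
```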

---

## Dataset: HomoRich G2P Persian
The models in this repository were fine-tuned on **HomoRich**, the first large-scale public Persian homograph dataset for grapheme-to-phoneme (G2P) tasks; it targets pronunciation and meaning ambiguities in identically spelled words. Introduced in *Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models*, the dataset is available [here](https://huggingface.co/datasets/MahtaFetrat/HomoRich).
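
As a quick start, the dataset can be pulled with the Hugging Face `datasets` library. The sketch below uses the repository id listed under Additional Links and prints the dataset object rather than assuming split or column names.

```python
# Minimal sketch: load HomoRich from the Hugging Face Hub and inspect it.
# The repository id is taken from the links in this README; split and column
# names are not assumed -- print the object to see what is available.
from datasets import load_dataset

ds = load_dataset("MahtaFetrat/HomoRich-G2P-Persian")
print(ds)  # shows the available splits and their column names
```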

---

## Citation
If you use this project in your work, please cite the corresponding paper:
> TODO

---

## Contributions
Contributions and pull requests are welcome. Please open an issue to discuss the changes you intend to make.

---

## Additional Links

* [Paper PDF](#) (TODO: link to paper)
* [Base GE2PE Paper](https://aclanthology.org/2024.findings-emnlp.196/)
* [Base GE2PE Model](https://github.com/Sharif-SLPL/GE2PE)
* [HomoRich Dataset](https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian)
* [SentenceBench Persian G2P Benchmark](https://huggingface.co/datasets/MahtaFetrat/SentenceBench)