---
license: mit
---

# Homo-GE2PE: Persian Grapheme-to-Phoneme Conversion with Homograph Disambiguation

**Homo-GE2PE** is a Persian grapheme-to-phoneme (G2P) model specialized in homograph disambiguation, i.e., resolving words with identical spellings but context-dependent pronunciations (e.g., *مرد* pronounced as *mard*, "man", or *mord*, "died"). Introduced in *[Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models](link)*, the model extends **GE2PE** by fine-tuning it on **HomoRich**, a dataset designed explicitly for such pronunciation challenges.

---

## Repository Structure

```
model-weights/
├── homo-ge2pe.zip        # Homo-GE2PE model checkpoint
└── homo-t5.zip           # Homo-T5 model checkpoint (T5-based G2P model)

training-scripts/
├── finetune-ge2pe.py     # Fine-tuning script for GE2PE
└── finetune-t5.py        # Fine-tuning script for T5

testing-scripts/
└── test.ipynb            # Benchmarking the models with the SentenceBench Persian G2P benchmark

assets/
└── (files required for inference, e.g., Parsivar, GE2PE.py)
```

---

## Model Performance
Below are the performance metrics for each model variant on the SentenceBench dataset:

| Model        | PER (%) | Homograph Acc. (%) | Avg. Inference Time (s) |
| ------------ | ------- | ------------------ | ----------------------- |
| GE2PE (Base) | 4.81    | 47.17              | 0.4464                  |
| Homo-T5      | 4.12    | 76.32              | 0.4141                  |
| Homo-GE2PE   | 3.98    | 76.89              | 0.4473                  |

---

## Inference
For inference, use the provided `inference.ipynb` notebook or the [Colab link](https://colab.research.google.com/drive/1Osue8HOgTGMZXIhpvCuiRyfuxpte1v0p?usp=sharing). The notebook demonstrates how to load the checkpoints and perform grapheme-to-phoneme conversion using Homo-GE2PE and Homo-T5.
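
For a quick sense of the flow before opening the notebook, here is a minimal sketch. It assumes the `GE2PE` wrapper class in `assets/GE2PE.py` keeps the interface of the base GE2PE repository (a constructor taking a checkpoint path and a `generate()` method over a list of sentences) and that `model-weights/homo-ge2pe.zip` has been extracted locally; the path is illustrative, and `inference.ipynb` remains the authoritative reference.

```python
# Minimal sketch, not a drop-in script. Assumptions: assets/GE2PE.py provides
# a GE2PE class matching the base repository's interface, and the checkpoint
# zip has been extracted to ./homo-ge2pe (path chosen for illustration).
from GE2PE import GE2PE

g2p = GE2PE(model_path="./homo-ge2pe")

# A homograph example: "مرد" reads as /mard/ ("man") or /mord/ ("died");
# the sentence context here points to the /mord/ reading.
print(g2p.generate(["او در آن سانحه مرد"]))
```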

---

## Dataset: HomoRich G2P Persian
The models in this repository were fine-tuned on **HomoRich**, the first large-scale public Persian homograph dataset for grapheme-to-phoneme (G2P) tasks; it targets pronunciation and meaning ambiguities in identically spelled words. Introduced in *Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models*, the dataset is available [here](https://huggingface.co/datasets/MahtaFetrat/HomoRich).
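
As a quick start, the dataset can be pulled with the Hugging Face `datasets` library. The sketch below uses the repository id listed under Additional Links and prints the dataset object rather than assuming split or column names.

```python
# Minimal sketch: load HomoRich from the Hugging Face Hub and inspect it.
# The repository id is taken from the links in this README; split and column
# names are not assumed -- print the object to see what is available.
from datasets import load_dataset

ds = load_dataset("MahtaFetrat/HomoRich-G2P-Persian")
print(ds)  # shows the available splits and their column names
```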

---

## Citation
If you use this project in your work, please cite the corresponding paper:
> TODO

---

## Contributions
Contributions and pull requests are welcome. Please open an issue to discuss the changes you intend to make.

---

## Additional Links

* [Paper PDF](#) (TODO: link to paper)
* [Base GE2PE Paper](https://aclanthology.org/2024.findings-emnlp.196/)
* [Base GE2PE Model](https://github.com/Sharif-SLPL/GE2PE)
* [HomoRich Dataset](https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian)
* [SentenceBench Persian G2P Benchmark](https://huggingface.co/datasets/MahtaFetrat/SentenceBench)