Homo-GE2PE: Persian Grapheme-to-Phoneme Conversion with Homograph Disambiguation
Homo-GE2PE is a Persian grapheme-to-phoneme (G2P) model specialized in homograph disambiguation: words with identical spellings but context-dependent pronunciations (e.g., مرد pronounced as mard "man" or mord "died"). Introduced in "Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models", the model extends GE2PE by fine-tuning it on HomoRich, a dataset explicitly designed for such pronunciation challenges.
Repository Structure
model-weights/
├── homo-ge2pe.zip        # Homo-GE2PE model checkpoint
└── homo-t5.zip           # Homo-T5 model checkpoint (T5-based G2P model)
training-scripts/
├── finetune-ge2pe.py     # Fine-tuning script for GE2PE
└── finetune-t5.py        # Fine-tuning script for T5
testing-scripts/
└── test.ipynb            # Benchmarking the models with the SentenceBench Persian G2P benchmark
assets/
└── (files required for inference, e.g., Parsivar, GE2PE.py)
Model Performance
Below are the performance metrics for each model variant on the SentenceBench dataset:
| Model | PER (%) | Homograph Acc. (%) | Avg. Inf. Time (s) |
|---|---|---|---|
| GE2PE (Base) | 4.81 | 47.17 | 0.4464 |
| Homo-T5 | 4.12 | 76.32 | 0.4141 |
| Homo-GE2PE | 3.98 | 76.89 | 0.4473 |
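PER above is the phoneme error rate. As a point of reference, the sketch below shows one common way to compute it, as the edit distance between the predicted and reference phoneme strings divided by the reference length; this is an assumption about the metric, not the SentenceBench scorer itself.
def levenshtein(ref, hyp):
    # Edit distance between two sequences (insertions, deletions, substitutions)
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (r != h))   # substitution
    return dp[-1]

def per(reference, prediction):
    # Phoneme error rate: edit distance normalized by reference length
    return levenshtein(reference, prediction) / max(len(reference), 1)

print(per("teste model", "testa model"))  # 1 edit over 11 symbols, about 0.09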
Usage
For inference, run the provided inference.ipynb notebook either locally or via the Colab link (recommended for easy setup).
Quick Setup
Install dependencies:
pip install unidecode
Download models:
git clone https://huggingface.co/MahtaFetrat/Homo-GE2PE-Persian/
unzip -q Homo-GE2PE-Persian/assets/Parsivar.zip
unzip -q Homo-GE2PE-Persian/model-weights/homo-ge2pe.zip -d homo-ge2pe
unzip -q Homo-GE2PE-Persian/model-weights/homo-t5.zip -d homo-t5
mv Homo-GE2PE-Persian/assets/GE2PE.py ./
Fix compatibility (if needed):
sed -i 's/from collections import Iterable/from collections.abc import Iterable/g' Parsivar/token_merger.py
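If sed is unavailable (e.g., on Windows), the same one-line patch can be applied from Python; a minimal sketch, assuming Parsivar was unzipped into the current directory:
from pathlib import Path

# Swap the import removed in newer Python versions for its collections.abc equivalent
path = Path("Parsivar/token_merger.py")
text = path.read_text(encoding="utf-8")
path.write_text(text.replace("from collections import Iterable",
                             "from collections.abc import Iterable"),
                encoding="utf-8")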
Example Usage
from GE2PE import GE2PE
g2p = GE2PE(model_path='/content/homo-ge2pe')  # or the homo-t5 checkpoint
# Inputs: "test of the grapheme-to-phoneme conversion model", "this is Ali's book"
g2p.generate(['تست مدل تبدیل نویسه به واج', 'این کتابِ علی است'], use_rules=True)
# Output: ['teste model t/bdil nevise be vaj', '@in ketabe @ali @/st']
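Since generate returns one phoneme string per input sentence, in the same order as shown above, the predictions can be paired back with their inputs; a small illustrative sketch:
sentences = ['تست مدل تبدیل نویسه به واج', 'این کتابِ علی است']
phonemes = g2p.generate(sentences, use_rules=True)
# Print each sentence next to its predicted phoneme sequence
for sentence, phoneme in zip(sentences, phonemes):
    print(sentence, '->', phoneme)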
Dataset: HomoRich G2P Persian
The models in this repository were fine-tuned on HomoRich, the first large-scale public Persian homograph dataset for grapheme-to-phoneme (G2P) tasks, built to resolve pronunciation and meaning ambiguities in identically spelled words. Introduced in "Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models", the dataset is available here.
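If HomoRich is hosted on the Hugging Face Hub like the model weights, it can be loaded with the datasets library; the repository ID below is a placeholder, substitute the actual ID from the dataset page linked above:
from datasets import load_dataset

# NOTE: "<hf-user>/HomoRich" is a placeholder ID; use the one from the dataset page
homorich = load_dataset("<hf-user>/HomoRich")
print(homorich)                            # available splits and sizes
print(homorich[next(iter(homorich))][0])   # inspect one example from the first split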
Citation
If you use this project in your work, please cite the corresponding paper:
@misc{qharabagh2025fastfancyrethinkingg2p,
title={Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models},
author={Mahta Fetrat Qharabagh and Zahra Dehghanian and Hamid R. Rabiee},
year={2025},
eprint={2505.12973},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.12973},
}
Contributions
Contributions and pull requests are welcome. Please open an issue to discuss the changes you intend to make.