An ELECTRA-small model for Ancient Greek, trained on texts from Homer up to the 4th century AD, drawn from the literary GLAUx corpus and the DukeNLP papyrus corpus.
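
A minimal sketch of loading the model for feature extraction with the Hugging Face transformers library (the repository id is taken from the citation URL below; ELECTRA discriminators are typically used as encoders, so AutoModel returns the base encoder):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("mercelisw/electra-grc")
model = AutoModel.from_pretrained("mercelisw/electra-grc")

# Input text must already follow the normalization rules described below.
inputs = tokenizer("καί", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden_size)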

To combat data sparsity, the model expects its input to be normalized as follows (a plain-Python sketch of these rules follows the list):

  • Its input should always be in Unicode NFD (i.e., with diacritics encoded as separate combining characters).
  • All grave accents should be replaced with acute accents (καί, not καὶ).
  • When a word contains two accents, the second one should be removed (εἶπε μοι, not εἶπέ μοι).
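
As a plain-Python illustration of what these three rules amount to (a sketch only, not the glaux-nlp implementation mentioned below):

import unicodedata

GRAVE = "\u0300"        # combining grave accent
ACUTE = "\u0301"        # combining acute accent
PERISPOMENI = "\u0342"  # combining Greek circumflex (perispomeni)
ACCENTS = {GRAVE, ACUTE, PERISPOMENI}

def normalize_word(word: str) -> str:
    # 1. Decompose into base letters plus combining diacritics (NFD).
    word = unicodedata.normalize("NFD", word)
    # 2. Replace every grave accent with an acute accent.
    word = word.replace(GRAVE, ACUTE)
    # 3. Keep only the first accent in the word; drop any later ones.
    out, seen_accent = [], False
    for ch in word:
        if ch in ACCENTS:
            if seen_accent:
                continue
            seen_accent = True
        out.append(ch)
    return "".join(out)

print([normalize_word(w) for w in "εἶπέ μοι".split()])  # ['εἶπε', 'μοι']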

If you use the model in conjunction with glaux-nlp, you can pass the tokenized sentence to normalize_tokens from tokenization.Tokenization with normalization_rule=greek_glaux, which performs all of these normalizations for you.
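
For example (hypothetical usage based on the description above; the exact import path, and whether greek_glaux is a string or a module-level constant, may differ in your glaux-nlp version):

# Assumed import path and call signature; check the glaux-nlp docs.
from tokenization.Tokenization import normalize_tokens

tokens = ["εἶπέ", "μοι"]
normalized = normalize_tokens(tokens, normalization_rule="greek_glaux")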

Citation

@misc{mercelis_electra-grc_2022,
    title = {electra-grc},
    url = {https://huggingface.co/mercelisw/electra-grc},
    abstract = {An ELECTRA-small model for Ancient Greek, trained on texts from Homer up until the 4th century AD.},
    author = {Mercelis, Wouter and Keersmaekers, Alek},
    year = {2022},
}