An ELECTRA-small model for Ancient Greek, trained on texts from Homer up to the 4th century AD, drawn from the literary GLAUx corpus and the DukeNLP papyrus corpus.
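
A minimal sketch of loading the model for feature extraction with the Hugging Face transformers library (the repository id is taken from the citation URL below; ELECTRA discriminators are typically used as encoders, so AutoModel returns the base encoder):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("mercelisw/electra-grc")
model = AutoModel.from_pretrained("mercelisw/electra-grc")

# Input text must already follow the normalization rules described below.
inputs = tokenizer("καί", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden_size)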

To combat data sparsity, the model expects its input to be normalized as follows (a plain-Python sketch of these rules follows the list):

  • Its input should always be in Unicode NFD (i.e., with diacritics encoded as separate combining characters).
  • All grave accents should be replaced with acute accents (καί, not καὶ).
  • When a word contains two accents, the second one should be removed (εἶπε μοι, not εἶπέ μοι).
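
As a plain-Python illustration of what these three rules amount to (a sketch only, not the glaux-nlp implementation mentioned below):

import unicodedata

GRAVE = "\u0300"        # combining grave accent
ACUTE = "\u0301"        # combining acute accent
PERISPOMENI = "\u0342"  # combining Greek circumflex (perispomeni)
ACCENTS = {GRAVE, ACUTE, PERISPOMENI}

def normalize_word(word: str) -> str:
    # 1. Decompose into base letters plus combining diacritics (NFD).
    word = unicodedata.normalize("NFD", word)
    # 2. Replace every grave accent with an acute accent.
    word = word.replace(GRAVE, ACUTE)
    # 3. Keep only the first accent in the word; drop any later ones.
    out, seen_accent = [], False
    for ch in word:
        if ch in ACCENTS:
            if seen_accent:
                continue
            seen_accent = True
        out.append(ch)
    return "".join(out)

print([normalize_word(w) for w in "εἶπέ μοι".split()])  # ['εἶπε', 'μοι']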

If you use the model in conjunction with glaux-nlp, you can pass the tokenized sentence to normalize_tokens from tokenization.Tokenization with normalization_rule=greek_glaux, which performs all of these normalizations for you.
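
For example (hypothetical usage based on the description above; the exact import path, and whether greek_glaux is a string or a module-level constant, may differ in your glaux-nlp version):

# Assumed import path and call signature; check the glaux-nlp docs.
from tokenization.Tokenization import normalize_tokens

tokens = ["εἶπέ", "μοι"]
normalized = normalize_tokens(tokens, normalization_rule="greek_glaux")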

Citation

@misc{mercelis_electra-grc_2022,
    title = {electra-grc},
    url = {https://huggingface.co/mercelisw/electra-grc},
    abstract = {An ELECTRA-small model for Ancient Greek, trained on texts from Homer up until the 4th century AD.},
    author = {Mercelis, Wouter and Keersmaekers, Alek},
    year = {2022},
}