---
language:
  - grc
tags:
  - ELECTRA
  - TensorFlow
---

An ELECTRA-small model for Ancient Greek, trained on texts from Homer up until the 4th century AD from the literary GLAUx corpus and the DukeNLP papyrus corpus.
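The checkpoint should load with the standard `transformers` auto classes; the sketch below assumes TensorFlow weights are available (as the TensorFlow tag suggests) and uses the model id from the citation URL below. It is illustrative only, not an official usage snippet.

```python
import unicodedata
from transformers import AutoTokenizer, TFAutoModel

model_id = "mercelisw/electra-grc"  # id taken from the citation URL below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = TFAutoModel.from_pretrained(model_id)

# Input must follow the normalization rules described in the next section
# (NFD, acute instead of grave accents, at most one accent per word).
text = unicodedata.normalize("NFD", "μῆνιν ἄειδε θεά")
inputs = tokenizer(text, return_tensors="tf")
outputs = model(**inputs)

# Contextual embeddings from the ELECTRA discriminator.
print(outputs.last_hidden_state.shape)
```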

The model makes some design choices to combat data sparsity, so its input should be normalized as follows (a small sketch implementing these rules follows the list):

- The input should always be in Unicode NFD (i.e. with separate combining characters for the diacritics).
- All grave accents should be replaced with acute accents (καί, not καὶ).
- When a word contains two accents, the second one should be removed (εἶπε μοι, not εἶπέ μοι).
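If you want to apply these normalizations yourself, a minimal sketch using only Python's standard library could look like this (illustrative only; the normalization rules are exactly the three listed above):

```python
import unicodedata

GRAVE = "\u0300"                 # combining grave accent
ACUTE = "\u0301"                 # combining acute accent
ACCENTS = {"\u0300", "\u0301", "\u0342"}  # grave, acute, circumflex

def normalize_greek(word: str) -> str:
    # 1. Decompose to NFD so diacritics become separate combining marks.
    word = unicodedata.normalize("NFD", word)
    # 2. Replace every grave accent with an acute accent.
    word = word.replace(GRAVE, ACUTE)
    # 3. If the word carries two accents, drop the second one.
    seen = 0
    chars = []
    for ch in word:
        if ch in ACCENTS:
            seen += 1
            if seen > 1:
                continue
        chars.append(ch)
    return "".join(chars)

print(normalize_greek("καὶ"))   # -> καί
print(normalize_greek("εἶπέ"))  # -> εἶπε
```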

If you use the model in conjunction with glaux-nlp, you can pass the tokenized sentence to normalize_tokens (from tokenization.Tokenization) with normalization_rule=greek_glaux, which applies all of these normalizations for you.
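A possible call, inferred from the description above (the import path, function signature, and return value are assumptions; consult the glaux-nlp documentation for the exact API):

```python
# Hypothetical usage of glaux-nlp's normalize_tokens, as described above.
from tokenization.Tokenization import normalize_tokens

tokens = ["εἶπέ", "μοι", "καὶ"]
normalized = normalize_tokens(tokens, normalization_rule="greek_glaux")
print(normalized)  # expected: ["εἶπε", "μοι", "καί"] (in Unicode NFD)
```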

Citation

@misc{mercelis_electra-grc_2022,
    title = {electra-grc},
    url = {https://huggingface.co/mercelisw/electra-grc},
    abstract = {An ELECTRA-small model for Ancient Greek, trained on texts from Homer up until the 4th century AD.},
    author = {Mercelis, Wouter and Keersmaekers, Alek},
    year = {2022},
}