SyllaBERTa: A Syllable-Based RoBERTa for Ancient Greek

SyllaBERTa is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the syllable level.
It is specifically designed to tackle tasks involving prosody, meter, and rhyme.


Model Summary

Attribute                Value
Base architecture        RoBERTa (custom configuration)
Vocabulary size          42,042 syllabic tokens
Hidden size              768
Number of layers         12
Attention heads          12
Intermediate size        3,072
Max sequence length      514
Parameters               ~118M (float32)
Pretraining objective    Masked language modeling (MLM)
Optimizer                AdamW
Loss function            Cross-entropy with 15% token masking probability
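
For orientation, the configuration above maps onto the standard Hugging Face RoBERTa classes roughly as follows. This is a sketch read off the table, not the exact pretraining script; the data collator is only included to illustrate the 15% masking probability.

from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
)

# Hyperparameters taken from the table above.
config = RobertaConfig(
    vocab_size=42_042,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# The syllable-level tokenizer ships with the repository.
tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Standard MLM collator with the 15% masking probability listed above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)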

The tokenizer is a custom subclass of PreTrainedTokenizer, operating on syllables rather than words or characters.
It:

  • Maps each syllable to an ID.
  • Supports diphthong merging and Greek orthographic phenomena.
  • Uses space-separated syllable tokens.

Example tokenization:

Input:
Κατέβην χθὲς εἰς Πειραιᾶ

Tokens:
['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']

Note that adjacent words are fused at the syllable level: the final ς of χθὲς is resyllabified with the following εἰς, yielding the token 'σεἰσ'.
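
Because every syllable is a single vocabulary entry, tokens and IDs convert back and forth directly. A minimal sketch, using the same tokenizer as in the usage example below:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

tokens = tokenizer.tokenize("Κατέβην χθὲς εἰς Πειραιᾶ")
ids = tokenizer.convert_tokens_to_ids(tokens)

# Each syllable corresponds to exactly one vocabulary ID, and the mapping is reversible.
print(list(zip(tokens, ids)))
print(tokenizer.convert_ids_to_tokens(ids))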


Usage Example

import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model.eval()

# Tokenize a sentence into syllables
text = "Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"
tokens = tokenizer.tokenize(text)
print("Original tokens:", tokens)

# Mask one syllable at random
mask_position = random.randint(0, len(tokens) - 1)
tokens[mask_position] = tokenizer.mask_token
masked_text = tokenizer.convert_tokens_to_string(tokens)
print(f"Masked at position {mask_position}")
print("Masked text:", masked_text)

# Predict the masked syllable
inputs = tokenizer(masked_text, return_tensors="pt", padding=True, truncation=True)
inputs.pop("token_type_ids", None)
with torch.no_grad():
    outputs = model(**inputs)

# Fetch the top-5 predictions for the masked position
logits = outputs.logits
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top = logits[0, mask_token_index].topk(5, dim=-1)
predicted = tokenizer.convert_ids_to_tokens(top.indices.squeeze(0).tolist())
scores = top.values.squeeze(0).tolist()

print("Top 5 predictions for masked token:")
for token, score in zip(predicted, scores):
    print(f"{token:<12} (score: {score:.2f})")

Since the masked position is chosen at random, the exact output varies; a run that masks position 6 prints:

Original tokens: ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ', 'με', 'τὰγ', 'λαύ', 'κω', 'νοσ', 'τοῦ', 'ἀ', 'ρίσ', 'τω', 'νοσ']

Masked at position 6
Masked text: κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ με τὰγ λαύ κω νοσ τοῦ ἀ ρίσ τω νοσ

Top 5 predictions for masked token:
ραι          (score: 23.12)
ρα           (score: 14.69)
ραισ         (score: 12.63)
σαι          (score: 12.43)
ρη           (score: 12.26)
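
For quick experiments, the same masked-syllable lookup can also be attempted with the fill-mask pipeline. Whether this works out of the box depends on the custom tokenizer shipped with the repository, so treat it as a sketch rather than a supported interface:

from transformers import pipeline

fill = pipeline("fill-mask", model="Ericu950/SyllaBERTa", trust_remote_code=True)

# Use the tokenizer's own mask token rather than hard-coding "[MASK]".
masked = "κα τέ βην χθὲ σεἰσ πει " + fill.tokenizer.mask_token + " ᾶ"
for candidate in fill(masked):
    print(candidate["token_str"], candidate["score"])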

License

MIT License.


Authors

This work is part of ongoing research by Eric Cullhed (Uppsala University) and Albin Thörn Cleland (Lund University).


Acknowledgements

The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
