SyllaBERTa: A Syllable-Based RoBERTa for Ancient Greek

SyllaBERTa is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the syllable level.
It is specifically designed to tackle tasks involving prosody, meter, and rhyme.


Model Summary

Attribute                Value
Base architecture        RoBERTa (custom configuration)
Vocabulary size          42,042 syllabic tokens
Hidden size              768
Number of layers         12
Attention heads          12
Intermediate size        3,072
Max sequence length      514
Parameters               ~118M (float32)
Pretraining objective    Masked language modeling (MLM)
Optimizer                AdamW
Loss function            Cross-entropy with 15% token masking probability
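
For orientation, the configuration above maps onto the standard Hugging Face RoBERTa classes roughly as follows. This is a sketch read off the table, not the exact pretraining script; the data collator is only included to illustrate the 15% masking probability.

from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
)

# Hyperparameters taken from the table above.
config = RobertaConfig(
    vocab_size=42_042,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)

# The syllable-level tokenizer ships with the repository.
tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Standard MLM collator with the 15% masking probability listed above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)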

The tokenizer is a custom subclass of PreTrainedTokenizer, operating on syllables rather than words or characters.
It:

  • Maps each syllable to an ID.
  • Supports diphthong merging and Greek orthographic phenomena.
  • Uses space-separated syllable tokens.

Example tokenization:

Input:
Κατέβην χθὲς εἰς Πειραιᾶ

Tokens:
['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']

Note that adjacent words are fused at the syllable level: the final ς of χθὲς is resyllabified with the following εἰς, yielding the token 'σεἰσ'.
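
Because every syllable is a single vocabulary entry, tokens and IDs convert back and forth directly. A minimal sketch, using the same tokenizer as in the usage example below:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

tokens = tokenizer.tokenize("Κατέβην χθὲς εἰς Πειραιᾶ")
ids = tokenizer.convert_tokens_to_ids(tokens)

# Each syllable corresponds to exactly one vocabulary ID, and the mapping is reversible.
print(list(zip(tokens, ids)))
print(tokenizer.convert_ids_to_tokens(ids))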


Usage Example

import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model.eval()

# Tokenize a sentence into syllables
text = "Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"
tokens = tokenizer.tokenize(text)
print("Original tokens:", tokens)

# Mask one syllable at random
mask_position = random.randint(0, len(tokens) - 1)
tokens[mask_position] = tokenizer.mask_token
masked_text = tokenizer.convert_tokens_to_string(tokens)
print(f"Masked at position {mask_position}")
print("Masked text:", masked_text)

# Predict the masked syllable
inputs = tokenizer(masked_text, return_tensors="pt", padding=True, truncation=True)
inputs.pop("token_type_ids", None)
with torch.no_grad():
    outputs = model(**inputs)

# Fetch the top-5 predictions for the masked position
logits = outputs.logits
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top = logits[0, mask_token_index].topk(5, dim=-1)
predicted = tokenizer.convert_ids_to_tokens(top.indices.squeeze(0).tolist())
scores = top.values.squeeze(0).tolist()

print("Top 5 predictions for masked token:")
for token, score in zip(predicted, scores):
    print(f"{token:<12} (score: {score:.2f})")

Since the masked position is chosen at random, the exact output varies; a run that masks position 6 prints:

Original tokens: ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ', 'με', 'τὰγ', 'λαύ', 'κω', 'νοσ', 'τοῦ', 'ἀ', 'ρίσ', 'τω', 'νοσ']

Masked at position 6
Masked text: κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ με τὰγ λαύ κω νοσ τοῦ ἀ ρίσ τω νοσ

Top 5 predictions for masked token:
ραι          (score: 23.12)
ρα           (score: 14.69)
ραισ         (score: 12.63)
σαι          (score: 12.43)
ρη           (score: 12.26)
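
For quick experiments, the same masked-syllable lookup can also be attempted with the fill-mask pipeline. Whether this works out of the box depends on the custom tokenizer shipped with the repository, so treat it as a sketch rather than a supported interface:

from transformers import pipeline

fill = pipeline("fill-mask", model="Ericu950/SyllaBERTa", trust_remote_code=True)

# Use the tokenizer's own mask token rather than hard-coding "[MASK]".
masked = "κα τέ βην χθὲ σεἰσ πει " + fill.tokenizer.mask_token + " ᾶ"
for candidate in fill(masked):
    print(candidate["token_str"], candidate["score"])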

License

MIT License.


Authors

This work is part of ongoing research by Eric Cullhed (Uppsala University) and Albin Thörn Cleland (Lund University).


Acknowledgements

The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
