# SyllaBERTa: A Syllable-Based RoBERTa for Ancient Greek
SyllaBERTa is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the syllable level.
It is specifically designed to tackle tasks involving prosody, meter, and rhyme.
## Model Summary
| Attribute | Value |
|---|---|
| Base architecture | RoBERTa (custom configuration) |
| Vocabulary size | 42,042 syllabic tokens |
| Hidden size | 768 |
| Number of layers | 12 |
| Attention heads | 12 |
| Intermediate size | 3,072 |
| Max sequence length | 514 |
| Pretraining objective | Masked language modeling (MLM) |
| Optimizer | AdamW |
| Loss function | Cross-entropy with 15% token masking probability |
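For orientation, the table corresponds to a RoBERTa configuration along the following lines. This is an illustrative sketch, not the authors' training code; the shipped checkpoint carries its own configuration, and the 15% masking probability matches the default of Hugging Face's `DataCollatorForLanguageModeling`:

```python
from transformers import RobertaConfig

# Illustrative sketch: the hyperparameters from the table above,
# expressed as a RoBERTa configuration (for orientation only).
config = RobertaConfig(
    vocab_size=42_042,            # syllabic tokens
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3_072,
    max_position_embeddings=514,  # max sequence length
)
```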
## Tokenizer

The tokenizer is a custom subclass of `PreTrainedTokenizer`, operating on syllables rather than words or characters. It:
- Maps each syllable to an ID.
- Supports diphthong merging and Greek orthographic phenomena.
- Uses space-separated syllable tokens.
Example tokenization:
Input: `Κατέβην χθὲς εἰς Πειραιᾶ`
Tokens: `['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']`
Note that syllabification crosses word boundaries: the final sigma of χθὲς fuses with the following εἰς into the single token 'σεἰσ'.
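For orientation, a minimal sketch of the interface such a tokenizer implements is shown below. The class name, the toy vocabulary, and the assumption of pre-syllabified input are all hypothetical; the real tokenizer ships with the checkpoint and is loaded via `trust_remote_code=True`:

```python
from transformers import PreTrainedTokenizer

class SyllableTokenizerSketch(PreTrainedTokenizer):
    """Hypothetical sketch of a syllable-level tokenizer interface."""

    def __init__(self, vocab, unk_token="[UNK]", **kwargs):
        self._vocab = dict(vocab)                        # syllable -> id
        self._ids = {i: s for s, i in self._vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self._vocab)

    def get_vocab(self):
        return dict(self._vocab)

    def _tokenize(self, text):
        # The real implementation syllabifies running text, merging
        # diphthongs and resyllabifying across word boundaries; this
        # sketch assumes the text is already space-separated syllables.
        return text.split()

    def _convert_token_to_id(self, token):
        return self._vocab.get(token, self._vocab[self.unk_token])

    def _convert_id_to_token(self, index):
        return self._ids.get(index, self.unk_token)

    def convert_tokens_to_string(self, tokens):
        # Syllable tokens are joined with spaces, as in the example above.
        return " ".join(tokens)

# Toy usage with a hypothetical four-entry vocabulary:
tok = SyllableTokenizerSketch({"[UNK]": 0, "κα": 1, "τέ": 2, "βην": 3})
print(tok.tokenize("κα τέ βην"))  # ['κα', 'τέ', 'βην']
```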
## Usage Example
```python
import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Encode a sentence
text = "Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"
tokens = tokenizer.tokenize(text)
print("Original tokens:", tokens)

# Insert a mask at a random position
mask_position = random.randint(0, len(tokens) - 1)
tokens[mask_position] = tokenizer.mask_token
masked_text = tokenizer.convert_tokens_to_string(tokens)
print("Masked at position", mask_position)
print("Masked text:", masked_text)

# Run the model on the masked input
inputs = tokenizer(masked_text, return_tensors="pt", padding=True, truncation=True)
inputs.pop("token_type_ids", None)
with torch.no_grad():
    outputs = model(**inputs)

# Fetch the top 5 predictions at the masked position
logits = outputs.logits
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top = logits[0, mask_token_index].topk(5, dim=-1)
predicted = tokenizer.convert_ids_to_tokens(top.indices.squeeze(0).tolist())
print("Top 5 predictions for masked token:")
for token, score in zip(predicted, top.values.squeeze(0).tolist()):
    print(f"  {token} (score: {score:.2f})")
```
It should print something like the following (the masked position is chosen at random):
```
Original tokens: ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ', 'με', 'τὰγ', 'λαύ', 'κω', 'νοσ', 'τοῦ', 'ἀ', 'ρίσ', 'τω', 'νοσ']
Masked at position 6
Masked text: κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ με τὰγ λαύ κω νοσ τοῦ ἀ ρίσ τω νοσ
Top 5 predictions for masked token:
  ραι (score: 23.12)
  ρα (score: 14.69)
  ραισ (score: 12.63)
  σαι (score: 12.43)
  ρη (score: 12.26)
```
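The scores above are raw logits. If you prefer probabilities, apply a softmax over the vocabulary at the masked position; this snippet continues from the variables defined in the usage example:

```python
# Convert the raw logits at the masked position into probabilities.
probs = torch.softmax(logits[0, mask_token_index], dim=-1)
top = probs.topk(5, dim=-1)
for token, p in zip(
    tokenizer.convert_ids_to_tokens(top.indices.squeeze(0).tolist()),
    top.values.squeeze(0).tolist(),
):
    print(f"{token}: {p:.4f}")
```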
## License
MIT License.
## Authors
This work is part of ongoing research by Eric Cullhed (Uppsala University) and Albin Thörn Cleland (Lund University).
## Acknowledgements
The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.