Aramaic Targum Diacritization (Vocalization) MarianMT Model

This model is a fine-tuned version of the Helsinki-NLP/opus-mt-afa-afa MarianMT model for the task of Aramaic diacritization: adding nikkud (vowel points) to consonantal Aramaic Targum text. It is trained on a parallel corpus of consonantal and fully vocalized Aramaic, both in Hebrew script.

Model Details

  • Model Name: johnlockejrr/opus-arc-targum-vocalization
  • Base Model: Helsinki-NLP/opus-mt-afa-afa
  • Task: Aramaic diacritization (consonantal → vocalized)
  • Script: Hebrew script (consonantal and vocalized)
  • Domain: Targumic/Biblical Aramaic
  • Parameters: 61.4M (safetensors, F32)
  • License: MIT

Dataset

  • Source: Consonantal Aramaic Targum text (no nikkud)
  • Target: Fully vocalized Aramaic Targum text (with nikkud)
  • Format: CSV with columns consonantal (input) and vocalized (target); a loading sketch follows this list
  • Alignment: Verse-aligned or phrase-aligned
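
The corpus can be read with the Hugging Face datasets library, as in this minimal sketch; the file name targum_pairs.csv is illustrative and not part of this repository:

from datasets import load_dataset

# "targum_pairs.csv" is a hypothetical file name standing in for the corpus.
dataset = load_dataset("csv", data_files="targum_pairs.csv")["train"]
row = dataset[0]
print(row["consonantal"])  # plain consonantal text (no nikkud)
print(row["vocalized"])    # the same text with full vowel pointing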

Training Configuration

  • Base Model: Helsinki-NLP/opus-mt-afa-afa
  • Batch Size: 8 (per device, gradient accumulation as needed)
  • Learning Rate: 1e-5
  • Epochs: 100 (typical)
  • FP16: Enabled
  • Language Prefix: None (single language, Aramaic)
  • Tokenizer: MarianMT default
  • Max Input/Target Length: 512
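
A minimal fine-tuning sketch reflecting the configuration above might look as follows. This is not the exact training script used for this model; targum_pairs.csv is a hypothetical file name, and the preprocessing helper is illustrative.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_model = "Helsinki-NLP/opus-mt-afa-afa"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

# Load the parallel corpus (hypothetical file, see the Dataset section).
dataset = load_dataset("csv", data_files="targum_pairs.csv")["train"]

def preprocess(batch):
    # No language prefix is added: source and target are both Aramaic.
    model_inputs = tokenizer(batch["consonantal"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["vocalized"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="opus-arc-targum-vocalization",
    per_device_train_batch_size=8,   # gradient accumulation as needed
    learning_rate=1e-5,
    num_train_epochs=100,
    fp16=True,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()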

Usage

Inference Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "johnlockejrr/opus-arc-targum-vocalization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Diacritize consonantal Aramaic
consonantal = "בקדמין ברא יי ית שמיא וית ארעא"
inputs = tokenizer(consonantal, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
vocalized = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Vocalized: {vocalized}")

Intended Use

  • Primary: Automatic diacritization (vocalization) of Aramaic Targum text
  • Research: Useful for digital humanities, Semitic linguistics, and textual studies
  • Education: Can assist in language learning and textual analysis

Limitations

  • Context: The model is trained at the phrase/verse level and does not have document-level context
  • Domain: Optimized for Targumic/Biblical Aramaic; may not generalize to other dialects
  • Orthography: Input must be consonantal Aramaic in Hebrew script
  • Ambiguity: Some words have multiple valid vocalizations; the model outputs only the most likely one (see the sketch after this list for inspecting alternatives)
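
Where a phrase is ambiguous, alternative vocalizations can be inspected by returning several beam hypotheses instead of only the top one. This sketch reuses model, tokenizer, and inputs from the inference example above:

outputs = model.generate(
    **inputs,
    max_length=512,
    num_beams=8,
    num_return_sequences=4,  # must be <= num_beams
)
for i, seq in enumerate(outputs):
    print(i, tokenizer.decode(seq, skip_special_tokens=True))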

Citation

If you use this model, please cite:

@misc{opus-arc-targum-vocalization,
  author = {John Locke Jr.},
  title = {Aramaic Targum Diacritization (Vocalization) MarianMT Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/opus-arc-targum-vocalization}},
}

Acknowledgements

  • Targumic Aramaic sources: Public domain or open-access editions
  • Helsinki-NLP: For the base MarianMT model

License

MIT
