Aramaic Targum Diacritization (Vocalization) MarianMT Model

This model is a fine-tuned version of the Helsinki-NLP/opus-mt-afa-afa MarianMT model for the task of Aramaic diacritization: adding nikkud (vowel points) to consonantal Aramaic Targum text. It is trained on a parallel corpus of consonantal and fully vocalized Aramaic, both in Hebrew script.

Model Details

  • Model Name: johnlockejrr/opus-arc-targum-vocalization
  • Base Model: Helsinki-NLP/opus-mt-afa-afa
  • Task: Aramaic diacritization (consonantal → vocalized)
  • Script: Hebrew script (consonantal and vocalized)
  • Domain: Targumic/Biblical Aramaic
  • Parameters: 61.4M (safetensors, F32)
  • License: MIT

Dataset

  • Source: Consonantal Aramaic Targum text (no nikkud)
  • Target: Fully vocalized Aramaic Targum text (with nikkud)
  • Format: CSV with columns consonantal (input) and vocalized (target); a loading sketch follows this list
  • Alignment: Verse-aligned or phrase-aligned
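
The corpus can be read with the Hugging Face datasets library, as in this minimal sketch; the file name targum_pairs.csv is illustrative and not part of this repository:

from datasets import load_dataset

# "targum_pairs.csv" is a hypothetical file name standing in for the corpus.
dataset = load_dataset("csv", data_files="targum_pairs.csv")["train"]
row = dataset[0]
print(row["consonantal"])  # plain consonantal text (no nikkud)
print(row["vocalized"])    # the same text with full vowel pointing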

Training Configuration

  • Base Model: Helsinki-NLP/opus-mt-afa-afa
  • Batch Size: 8 (per device, gradient accumulation as needed)
  • Learning Rate: 1e-5
  • Epochs: 100 (typical)
  • FP16: Enabled
  • Language Prefix: None (single language, Aramaic)
  • Tokenizer: MarianMT default
  • Max Input/Target Length: 512
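
A minimal fine-tuning sketch reflecting the configuration above might look as follows. This is not the exact training script used for this model; targum_pairs.csv is a hypothetical file name, and the preprocessing helper is illustrative.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_model = "Helsinki-NLP/opus-mt-afa-afa"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

# Load the parallel corpus (hypothetical file, see the Dataset section).
dataset = load_dataset("csv", data_files="targum_pairs.csv")["train"]

def preprocess(batch):
    # No language prefix is added: source and target are both Aramaic.
    model_inputs = tokenizer(batch["consonantal"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["vocalized"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="opus-arc-targum-vocalization",
    per_device_train_batch_size=8,   # gradient accumulation as needed
    learning_rate=1e-5,
    num_train_epochs=100,
    fp16=True,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()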

Usage

Inference Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "johnlockejrr/opus-arc-targum-vocalization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Diacritize consonantal Aramaic
consonantal = "בקדמין ברא יי ית שמיא וית ארעא"
inputs = tokenizer(consonantal, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
vocalized = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Vocalized: {vocalized}")

Intended Use

  • Primary: Automatic diacritization (vocalization) of Aramaic Targum text
  • Research: Useful for digital humanities, Semitic linguistics, and textual studies
  • Education: Can assist in language learning and textual analysis

Limitations

  • Context: The model is trained at the phrase/verse level and does not have document-level context
  • Domain: Optimized for Targumic/Biblical Aramaic; may not generalize to other dialects
  • Orthography: Input must be consonantal Aramaic in Hebrew script
  • Ambiguity: Some words have multiple valid vocalizations; the model outputs only the most likely one (see the sketch after this list for inspecting alternatives)
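
Where a phrase is ambiguous, alternative vocalizations can be inspected by returning several beam hypotheses instead of only the top one. This sketch reuses model, tokenizer, and inputs from the inference example above:

outputs = model.generate(
    **inputs,
    max_length=512,
    num_beams=8,
    num_return_sequences=4,  # must be <= num_beams
)
for i, seq in enumerate(outputs):
    print(i, tokenizer.decode(seq, skip_special_tokens=True))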

Citation

If you use this model, please cite:

@misc{opus-arc-targum-vocalization,
  author = {John Locke Jr.},
  title = {Aramaic Targum Diacritization (Vocalization) MarianMT Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/opus-arc-targum-vocalization}},
}

Acknowledgements

  • Targumic Aramaic sources: Public domain or open-access editions
  • Helsinki-NLP: For the base MarianMT model

License

MIT
