# Aramaic Targum Diacritization (Vocalization) MarianMT Model
This model fine-tunes the Helsinki-NLP/opus-mt-afa-afa MarianMT model for the task of Aramaic diacritization: adding nikkud (vowel points) to consonantal Aramaic Targum text. The model is trained on a parallel corpus of consonantal and fully vocalized Aramaic, both in Hebrew script.
## Model Details

- Model Name: johnlockejrr/opus-arc-targum-vocalization
- Base Model: Helsinki-NLP/opus-mt-afa-afa
- Task: Aramaic diacritization (consonantal → vocalized)
- Script: Hebrew script (consonantal and vocalized)
- Domain: Targumic/Biblical Aramaic
- License: MIT
## Dataset

- Source: Consonantal Aramaic Targum text (no nikkud)
- Target: Fully vocalized Aramaic Targum text (with nikkud)
- Format: CSV with columns `consonantal` (input) and `vocalized` (target)
- Alignment: Verse-aligned or phrase-aligned
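The expected CSV layout can be sketched with the Python standard library. The sample row below is illustrative only (the word ברא / בְּרָא, written with Unicode escapes for clarity) and is not taken from the actual training corpus:

```python
import csv
import io

# Illustrative two-column CSV in the model's expected format:
# a consonantal word paired with its vocalized counterpart.
sample = io.StringIO(
    "consonantal,vocalized\n"
    "\u05d1\u05e8\u05d0,\u05d1\u05bc\u05b0\u05e8\u05b8\u05d0\n"  # bra / bera
)

rows = list(csv.DictReader(sample))
for row in rows:
    print(row["consonantal"], "->", row["vocalized"])
```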
## Training Configuration
- Base Model: Helsinki-NLP/opus-mt-afa-afa
- Batch Size: 8 (per device, gradient accumulation as needed)
- Learning Rate: 1e-5
- Epochs: 100 (typical)
- FP16: Enabled
- No language prefix (single language, Aramaic)
- Tokenizer: MarianMT default
- Max Input/Target Length: 512
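The hyperparameters above could be expressed as Hugging Face `Seq2SeqTrainingArguments`; this is a hedged configuration sketch, not the exact script used for this model (the output directory and accumulation steps are assumptions):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of training arguments matching the configuration listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="opus-arc-targum-vocalization",  # illustrative path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  # increase as needed for larger effective batches
    learning_rate=1e-5,
    num_train_epochs=100,
    fp16=True,                      # requires a CUDA-capable GPU
    predict_with_generate=True,
)
```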
## Usage

### Inference Example

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "johnlockejrr/opus-arc-targum-vocalization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Diacritize consonantal Aramaic
consonantal = "בקדמין ברא יי ית שמיא וית ארעא"
inputs = tokenizer(consonantal, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
vocalized = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Vocalized: {vocalized}")
```
## Intended Use
- Primary: Automatic diacritization (vocalization) of Aramaic Targum text
- Research: Useful for digital humanities, Semitic linguistics, and textual studies
- Education: Can assist in language learning and textual analysis
## Limitations
- Context: The model is trained at the phrase/verse level and does not have document-level context
- Domain: Optimized for Targumic/Biblical Aramaic; may not generalize to other dialects
- Orthography: Input must be consonantal Aramaic in Hebrew script
- Ambiguity: Some words may have multiple valid vocalizations; the model predicts the most likely
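Since the model expects consonantal input, text that already carries vowel points can be normalized first. Hebrew nikkud and cantillation marks are Unicode combining marks (category `Mn`), so the standard library suffices; the function name below is illustrative, not part of this model's API:

```python
import unicodedata

def strip_nikkud(text: str) -> str:
    """Remove Hebrew vowel points and cantillation marks.

    Nikkud and te'amim are combining marks (Unicode category 'Mn'),
    so dropping that category leaves only the consonantal skeleton.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Example: the vocalized word bera reduces to its consonants bra.
print(strip_nikkud("\u05d1\u05bc\u05b0\u05e8\u05b8\u05d0"))  # -> ברא
```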
## Citation

If you use this model, please cite:

```bibtex
@misc{opus-arc-targum-vocalization,
  author       = {John Locke Jr.},
  title        = {Aramaic Targum Diacritization (Vocalization) MarianMT Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/opus-arc-targum-vocalization}},
}
```
## Acknowledgements
- Targumic Aramaic sources: Public domain or open-access editions
- Helsinki-NLP: For the base MarianMT model
## License
MIT