Aramaic Diacritization Model (MarianMT)
This is a fine-tuned MarianMT model for Aramaic text diacritization (vocalization): it converts consonantal Aramaic text into fully vocalized text with nikkud (vowel points).
Model Description
- Model type: MarianMT (Encoder-Decoder Transformer)
- Language: Aramaic (arc2arc)
- Task: Text diacritization/vocalization
- Base model: Helsinki-NLP/opus-mt-afa-afa
- Parameters: 61,924,352 (61.9M)
Model Architecture
- Architecture: MarianMT (Marian Machine Translation)
- Encoder layers: 6
- Decoder layers: 6
- Hidden size: 512
- Attention heads: 8
- Feed-forward dimension: 2048
- Vocabulary size: 33,714
- Max sequence length: 512 tokens
- Activation function: Swish
- Position embeddings: Static (sinusoidal)
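These figures can be read back from the published configuration. The snippet below is a minimal sketch (not part of the training code) that loads the config and model with Transformers and prints the fields corresponding to the list above.

```python
# Minimal sketch: inspect the published config and parameter count.
from transformers import AutoConfig, AutoModelForSeq2SeqLM

model_name = "johnlockejrr/aramaic-diacritization-model"
config = AutoConfig.from_pretrained(model_name)

print(config.encoder_layers, config.decoder_layers)    # 6, 6
print(config.d_model, config.encoder_attention_heads)  # 512, 8
print(config.encoder_ffn_dim, config.vocab_size)       # 2048, 33714
print(config.max_position_embeddings)                  # 512
print(config.activation_function)                      # "swish"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
print(f"{model.num_parameters():,}")                   # ~61.9M parameters
```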
Training Details
Training Configuration
- Training data: 12,110 examples
- Validation data: 1,514 examples
- Batch size: 8
- Gradient accumulation steps: 2
- Effective batch size: 16
- Learning rate: 1e-5
- Warmup steps: 1,000
- Max epochs: 100
- Training completed at: Epoch 36.33
- Mixed precision: FP16 enabled
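As a rough reconstruction, the hyperparameters above map onto Hugging Face Seq2SeqTrainingArguments as shown below. This is a hedged sketch, not the author's actual training script (train_arc2arc_improved_deep.py is not reproduced here); the output directory is an assumption.

```python
# Hedged reconstruction of the configuration listed above; output_dir is assumed.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="arc2arc-diacritization",  # assumed path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,        # effective batch size 16
    learning_rate=1e-5,
    warmup_steps=1000,
    num_train_epochs=100,                 # training stopped around epoch 36.33
    fp16=True,                            # mixed precision
    predict_with_generate=True,
)
```

These arguments would be passed to a Seq2SeqTrainer together with the 12,110 training and 1,514 validation examples described below.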
Training Metrics
- Final training loss: 0.283
- Training runtime: 21,727 seconds (~6 hours)
- Training samples per second: 55.7
- Training steps per second: 3.48
Evaluation Results
Test Set Performance
- BLEU Score: 72.90
- Character Accuracy: 63.78%
- Evaluation Loss: 0.088
- Evaluation Runtime: 311.5 seconds
- Evaluation samples per second: 4.86
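The metric implementations are not published with the card. The sketch below shows one plausible way to compute the corpus BLEU score (via sacrebleu) and a character-accuracy score; the position-wise character-accuracy definition here is an assumption and may differ from the one used for the figures above.

```python
# Hedged sketch of the evaluation metrics; exact definitions are assumptions.
import sacrebleu

def char_accuracy(hypothesis: str, reference: str) -> float:
    """Position-wise character matches divided by reference length (assumed definition)."""
    if not reference:
        return 0.0
    matches = sum(h == r for h, r in zip(hypothesis, reference))
    return matches / len(reference)

predictions = ["..."]  # model outputs (vocalized text)
references = ["..."]   # gold vocalized text

bleu = sacrebleu.corpus_bleu(predictions, [references])
acc = sum(char_accuracy(p, r) for p, r in zip(predictions, references)) / len(references)

print(f"BLEU: {bleu.score:.2f}")
print(f"Character accuracy: {acc:.2%}")
```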
Usage
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "johnlockejrr/aramaic-diacritization-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example input (consonantal Aramaic text)
consonantal_text = "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗"

# Tokenize input
inputs = tokenizer(consonantal_text, return_tensors="pt", max_length=512, truncation=True)

# Generate vocalized text
outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)

# Decode output
vocalized_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {consonantal_text}")
print(f"Output: {vocalized_text}")
```
Using the Pipeline
```python
from transformers import pipeline

diacritizer = pipeline("text2text-generation", model="johnlockejrr/aramaic-diacritization-model")

# Process text
consonantal_text = "讘专讗砖讬转 讘专讗 讗诇讛讬诐 讗转 讛砖诪讬诐 讜讗转 讛讗专抓"
vocalized_text = diacritizer(consonantal_text)[0]["generated_text"]
print(vocalized_text)
```
Training Data
The model was trained on a custom Aramaic diacritization dataset with the following characteristics:
- Source: Consonantal Aramaic text (without vowel points)
- Target: Vocalized Aramaic text (with nikkud/vowel points)
- Data format: CSV with columns: consonantal, vocalized, book, chapter, verse
- Data split: 80% train, 10% validation, 10% test
- Text cleaning: Preserves nikkud in target text, removes punctuation from source
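Assuming the CSV layout described above, loading the corpus and reproducing the 80/10/10 split could look like the sketch below; the file name aramaic_diacritization.csv and the random seed are assumptions.

```python
# Hedged sketch of loading and splitting the corpus; file name and seed are assumed.
from datasets import load_dataset

ds = load_dataset("csv", data_files="aramaic_diacritization.csv")["train"]

# 80% train, 20% held out; the held-out portion is halved into validation/test
split = ds.train_test_split(test_size=0.2, seed=42)
held_out = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], held_out["train"], held_out["test"]
print(len(train_ds), len(val_ds), len(test_ds))  # roughly 12,110 / 1,514 / 1,514
```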
Data Preprocessing
- Input cleaning: Removes punctuation and formatting while preserving letters
- Target preservation: Maintains all nikkud (vowel points) and diacritical marks
- Length filtering: Removes sequences shorter than 2 characters or longer than 1000 characters
- Duplicate handling: Removes exact duplicates to prevent data leakage
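A minimal sketch of these preprocessing steps is shown below. The exact regular expressions and column handling in the original pipeline are not published, so the character ranges and file name here are assumptions.

```python
# Hedged sketch of the preprocessing steps listed above; regexes are assumptions.
import re
import pandas as pd

LETTERS = "\u05D0-\u05EA"  # Hebrew-script letters used for Aramaic
NIKKUD = "\u0591-\u05C7"   # Hebrew points and cantillation range

def clean_source(text: str) -> str:
    """Keep letters only (no punctuation, no vowel points) for the consonantal input."""
    return re.sub(rf"[^{LETTERS} ]", "", str(text)).strip()

def clean_target(text: str) -> str:
    """Keep letters plus nikkud/diacritics for the vocalized target."""
    return re.sub(rf"[^{LETTERS}{NIKKUD} ]", "", str(text)).strip()

df = pd.read_csv("aramaic_diacritization.csv")  # assumed file name
df["consonantal"] = df["consonantal"].map(clean_source)
df["vocalized"] = df["vocalized"].map(clean_target)

# Length filtering: drop sequences shorter than 2 or longer than 1000 characters
mask = df["consonantal"].str.len().between(2, 1000) & df["vocalized"].str.len().between(2, 1000)
df = df[mask]

# Duplicate handling: drop exact duplicates to avoid leakage across splits
df = df.drop_duplicates(subset=["consonantal", "vocalized"]).reset_index(drop=True)
```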
Limitations and Bias
- Domain specificity: Trained primarily on religious/biblical Aramaic texts
- Vocabulary coverage: Limited to the vocabulary present in the training corpus
- Length constraints: Maximum input/output length of 512 tokens
- Style consistency: May not handle modern Aramaic dialects or contemporary usage
- Performance: Character accuracy of ~64% indicates room for improvement
Environmental Impact
- Hardware used: NVIDIA RTX 3060 (12 GB)
- Training time: ~6 hours
- Carbon emissions: Estimated low (single GPU, moderate training time)
- Energy efficiency: FP16 mixed precision used to reduce memory usage
Citation
If you use this model in your research, please cite:
```bibtex
@misc{aramaic-diacritization-2024,
  title        = {Aramaic Diacritization Model},
  author       = {John Locke Jr.},
  year         = {2025},
  howpublished = {Hugging Face Model Hub},
  url          = {https://huggingface.co/johnlockejrr/aramaic-diacritization-model}
}
```
License
MIT
Acknowledgments
- Base model: Helsinki-NLP/opus-mt-afa-afa
- Training framework: Hugging Face Transformers
- Dataset: Custom Aramaic diacritization corpus
Model Files
- model.safetensors - Model weights (234MB)
- config.json - Model configuration
- tokenizer_config.json - Tokenizer configuration
- source.spm / target.spm - SentencePiece models
- vocab.json - Vocabulary file
- generation_config.json - Generation parameters
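To fetch all of these files for offline use, the huggingface_hub client can be used; a minimal sketch:

```python
# Sketch: download every file listed above into the local cache.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="johnlockejrr/aramaic-diacritization-model")
print(local_dir)  # local path containing model.safetensors, *.spm, config files, etc.
```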
Training Scripts
The model was trained using custom scripts:
- train_arc2arc_improved_deep.py - Main training script
- run_arc2arc_improved_deep.sh - Training execution script
- run_resume_arc2arc_deep.sh - Resume training script
Contact
For questions, issues, or contributions, please open an issue on the model repository.