Aramaic Diacritization Model (MarianMT)

This is a MarianMT model fine-tuned for Aramaic text diacritization (vocalization): it converts consonantal Aramaic text into fully vocalized text with nikkud (vowel points).

Model Description

  • Model type: MarianMT (Encoder-Decoder Transformer)
  • Language: Aramaic (arc2arc)
  • Task: Text diacritization/vocalization
  • Base model: Helsinki-NLP/opus-mt-afa-afa
  • Parameters: 61,924,352 (61.9M)

Model Architecture

  • Architecture: MarianMT (Marian Machine Translation)
  • Encoder layers: 6
  • Decoder layers: 6
  • Hidden size: 512
  • Attention heads: 8
  • Feed-forward dimension: 2048
  • Vocabulary size: 33,714
  • Max sequence length: 512 tokens
  • Activation function: Swish
  • Position embeddings: Static
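
The architecture values listed above can be read straight from the model configuration on the Hub. A minimal sketch, assuming the repository name used elsewhere in this card:

from transformers import AutoConfig

# Load the MarianMT configuration from the Hub
config = AutoConfig.from_pretrained("johnlockejrr/aramaic-diacritization-model")

print(config.encoder_layers, config.decoder_layers)  # encoder/decoder layers: 6, 6
print(config.d_model)                                # hidden size: 512
print(config.encoder_attention_heads)                # attention heads: 8
print(config.encoder_ffn_dim)                        # feed-forward dimension: 2048
print(config.vocab_size)                             # vocabulary size: 33714
print(config.max_position_embeddings)                # max sequence length: 512
print(config.activation_function)                    # "swish"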

Training Details

Training Configuration

  • Training data: 12,110 examples
  • Validation data: 1,514 examples
  • Batch size: 8
  • Gradient accumulation steps: 2
  • Effective batch size: 16
  • Learning rate: 1e-5
  • Warmup steps: 1,000
  • Max epochs: 100
  • Training completed at: Epoch 36.33
  • Mixed precision: FP16 enabled
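
The hyperparameters above map onto transformers Seq2SeqTrainingArguments roughly as follows. This is an illustrative reconstruction, not the actual training script; the output directory is an assumption.

from transformers import Seq2SeqTrainingArguments

# Sketch of the configuration listed above; output_dir is a hypothetical path
training_args = Seq2SeqTrainingArguments(
    output_dir="arc2arc-diacritization",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=1e-5,
    warmup_steps=1000,
    num_train_epochs=100,            # training stopped around epoch 36.33
    fp16=True,
    predict_with_generate=True,
)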

Training Metrics

  • Final training loss: 0.283
  • Training runtime: 21,727 seconds (~6 hours)
  • Training samples per second: 55.7
  • Training steps per second: 3.48

Evaluation Results

Test Set Performance

  • BLEU Score: 72.90
  • Character Accuracy: 63.78%
  • Evaluation Loss: 0.088
  • Evaluation Runtime: 311.5 seconds
  • Evaluation samples per second: 4.86
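
BLEU can be reproduced with sacrebleu. The character-accuracy definition below (position-wise matches divided by the longer string's length) is an assumption, since the card does not state exactly how the metric was computed.

import sacrebleu

def char_accuracy(pred: str, ref: str) -> float:
    # Assumed definition: matching characters at aligned positions,
    # normalized by the length of the longer string
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref), 1)

def evaluate(predictions, references):
    # predictions: model outputs; references: gold vocalized text from the test split
    bleu = sacrebleu.corpus_bleu(predictions, [references]).score
    char_acc = 100 * sum(char_accuracy(p, r) for p, r in zip(predictions, references)) / len(predictions)
    return bleu, char_acc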

Usage

Basic Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "johnlockejrr/aramaic-diacritization-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example input (consonantal Aramaic text)
consonantal_text = "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗"

# Tokenize input
inputs = tokenizer(consonantal_text, return_tensors="pt", max_length=512, truncation=True)

# Generate vocalized text
outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)

# Decode output
vocalized_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {consonantal_text}")
print(f"Output: {vocalized_text}")

Using the Pipeline

from transformers import pipeline

diacritizer = pipeline("text2text-generation", model="johnlockejrr/aramaic-diacritization-model")

# Process consonantal Aramaic text
consonantal_text = "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗"
vocalized_text = diacritizer(consonantal_text)[0]['generated_text']
print(vocalized_text)

Training Data

The model was trained on a custom Aramaic diacritization dataset with the following characteristics:

  • Source: Consonantal Aramaic text (without vowel points)
  • Target: Vocalized Aramaic text (with nikkud/vowel points)
  • Data format: CSV with columns: consonantal, vocalized, book, chapter, verse
  • Data split: 80% train, 10% validation, 10% test
  • Text cleaning: Preserves nikkud in target text, removes punctuation from source
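
A minimal sketch of loading such a CSV and producing an 80/10/10 split with the datasets library; the file name and random seed are assumptions, since the dataset itself is not published with the model.

from datasets import load_dataset

# Hypothetical file name for the custom CSV described above
raw = load_dataset("csv", data_files="aramaic_diacritization.csv")["train"]

# 80% train, then split the remaining 20% evenly into validation and test
split = raw.train_test_split(test_size=0.2, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)

dataset = {
    "train": split["train"],         # ~80%
    "validation": heldout["train"],  # ~10%
    "test": heldout["test"],         # ~10%
}
print({name: len(ds) for name, ds in dataset.items()})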

Data Preprocessing

  • Input cleaning: Removes punctuation and formatting while preserving letters
  • Target preservation: Maintains all nikkud (vowel points) and diacritical marks
  • Length filtering: Removes sequences shorter than 2 characters or longer than 1000 characters
  • Duplicate handling: Removes exact duplicates to prevent data leakage
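
An illustrative reconstruction of these preprocessing rules (not the exact training script); the regular expressions, Unicode ranges, and file name are assumptions.

import re
import pandas as pd

# Hebrew-script letters (used for Aramaic) and combining nikkud/cantillation marks (approximate ranges)
LETTERS = r"\u05D0-\u05EA"
NIKKUD = r"\u0591-\u05C7"

def clean_source(text: str) -> str:
    # Source side: keep only letters and spaces, dropping punctuation and formatting
    text = re.sub(fr"[^{LETTERS} ]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_target(text: str) -> str:
    # Target side: additionally preserve nikkud and other diacritical marks
    text = re.sub(fr"[^{LETTERS}{NIKKUD} ]", "", text)
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("aramaic_diacritization.csv")  # hypothetical file name
df["consonantal"] = df["consonantal"].map(clean_source)
df["vocalized"] = df["vocalized"].map(clean_target)

# Length filtering (2 to 1000 characters) and exact-duplicate removal
df = df[df["consonantal"].str.len().between(2, 1000)]
df = df.drop_duplicates(subset=["consonantal", "vocalized"])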

Limitations and Bias

  • Domain specificity: Trained primarily on religious/biblical Aramaic texts
  • Vocabulary coverage: Limited to the vocabulary present in the training corpus
  • Length constraints: Maximum input/output length of 512 tokens
  • Style consistency: May not handle modern Aramaic dialects or contemporary usage
  • Performance: Character accuracy of ~64% indicates room for improvement

Environmental Impact

  • Hardware used: NVIDIA RTX 3060 (12 GB)
  • Training time: ~6 hours
  • Carbon emissions: Estimated low (single GPU, moderate training time)
  • Energy efficiency: FP16 mixed precision used to reduce memory usage

Citation

If you use this model in your research, please cite:

@misc{aramaic-diacritization-2025,
  title={Aramaic Diacritization Model},
  author={John Locke Jr.},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/johnlockejrr/aramaic-diacritization-model}
}

License

MIT

Model Files

  • model.safetensors - Model weights (234MB)
  • config.json - Model configuration
  • tokenizer_config.json - Tokenizer configuration
  • source.spm / target.spm - SentencePiece models
  • vocab.json - Vocabulary file
  • generation_config.json - Generation parameters
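
These files can also be fetched locally in one step with huggingface_hub, outside of from_pretrained; a minimal sketch:

from huggingface_hub import snapshot_download

# Download all model files (weights, configs, SentencePiece models, vocabulary)
local_dir = snapshot_download(repo_id="johnlockejrr/aramaic-diacritization-model")
print(local_dir)  # directory containing model.safetensors, source.spm, target.spm, ...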

Training Scripts

The model was trained using custom scripts:

  • train_arc2arc_improved_deep.py - Main training script
  • run_arc2arc_improved_deep.sh - Training execution script
  • run_resume_arc2arc_deep.sh - Resume training script

Contact

For questions, issues, or contributions, please open an issue on the model repository.
