---
language:
  - arc
tags:
  - diacritization
  - aramaic
  - vocalization
  - targum
  - semitic-languages
  - sequence-to-sequence
license: mit
base_model: Helsinki-NLP/opus-mt-afa-afa
library_name: transformers
---

# Aramaic Diacritization Model (MarianMT)

This is a MarianMT model fine-tuned for Aramaic text diacritization (vocalization): it converts consonantal Aramaic text into fully vocalized text with nikkud (vowel points).

## Model Description

- Model type: MarianMT (encoder-decoder Transformer)
- Language: Aramaic (arc → arc)
- Task: Text diacritization/vocalization
- Base model: Helsinki-NLP/opus-mt-afa-afa
- Parameters: 61,924,352 (61.9M)

## Model Architecture

- Architecture: MarianMT (Marian Machine Translation)
- Encoder layers: 6
- Decoder layers: 6
- Hidden size: 512
- Attention heads: 8
- Feed-forward dimension: 2048
- Vocabulary size: 33,714
- Max sequence length: 512 tokens
- Activation function: Swish
- Position embeddings: Static (sinusoidal)
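
The architecture figures above can be checked directly against the published checkpoint. The sketch below is a convenience snippet (not part of the original release) that reads the MarianMT config and counts parameters.

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

model_name = "johnlockejrr/aramaic-diacritization-model"

# Read the architecture hyperparameters straight from the config
config = AutoConfig.from_pretrained(model_name)
print(config.encoder_layers, config.decoder_layers)    # 6, 6
print(config.d_model, config.encoder_attention_heads)  # 512, 8
print(config.encoder_ffn_dim, config.vocab_size)       # 2048, 33714
print(config.max_position_embeddings)                  # 512

# Count parameters (should come out to ~61.9M)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
print(sum(p.numel() for p in model.parameters()))
```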

## Training Details

### Training Configuration

- Training data: 12,110 examples
- Validation data: 1,514 examples
- Batch size: 8
- Gradient accumulation steps: 2
- Effective batch size: 16
- Learning rate: 1e-5
- Warmup steps: 1,000
- Max epochs: 100
- Training stopped at epoch 36.33
- Mixed precision: FP16 enabled
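
For readers who want to reproduce a comparable run, the sketch below shows how the hyperparameters above map onto Hugging Face `Seq2SeqTrainingArguments`. It is an approximation: the actual training script (`train_arc2arc_improved_deep.py`, listed under Training Scripts) may differ, and the `output_dir` is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="aramaic-diacritization-model",  # placeholder
    per_device_train_batch_size=8,              # batch size 8
    gradient_accumulation_steps=2,              # effective batch size 16
    learning_rate=1e-5,
    warmup_steps=1000,
    num_train_epochs=100,                       # run stopped around epoch 36
    fp16=True,                                  # mixed precision
    predict_with_generate=True,
)
```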

### Training Metrics

- Final training loss: 0.283
- Training runtime: 21,727 seconds (~6 hours)
- Training samples per second: 55.7
- Training steps per second: 3.48

## Evaluation Results

### Test Set Performance

- BLEU score: 72.90
- Character accuracy: 63.78%
- Evaluation loss: 0.088
- Evaluation runtime: 311.5 seconds
- Evaluation samples per second: 4.86
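
As a rough guide to how these two quality metrics can be computed, the sketch below uses `sacrebleu` for BLEU and a simple per-position definition of character accuracy. Both choices are assumptions about the evaluation, not the exact code used to produce the numbers above.

```python
import sacrebleu

def char_accuracy(pred: str, ref: str) -> float:
    """Share of aligned character positions that match."""
    if not pred and not ref:
        return 1.0
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))

# Replace with real model outputs and gold vocalized references
predictions = ["model output with nikkud"]
references = [["gold vocalized text"]]  # one reference stream, parallel to predictions

bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU: {bleu.score:.2f}")
print(f"Character accuracy: {char_accuracy(predictions[0], references[0][0]):.2%}")
```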

## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "johnlockejrr/aramaic-diacritization-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example input (consonantal Aramaic text)
consonantal_text = "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗"

# Tokenize input
inputs = tokenizer(consonantal_text, return_tensors="pt", max_length=512, truncation=True)

# Generate vocalized text
outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)

# Decode output
vocalized_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {consonantal_text}")
print(f"Output: {vocalized_text}")
```

### Using the Pipeline

```python
from transformers import pipeline

diacritizer = pipeline("text2text-generation", model="johnlockejrr/aramaic-diacritization-model")

# Process consonantal Aramaic text
consonantal_text = "讘拽讚诪讬谉 讘专讗 讬讬 讬转 砖诪讬讗 讜讬转 讗专注讗"
vocalized_text = diacritizer(consonantal_text)[0]['generated_text']
print(vocalized_text)
```

## Training Data

The model was trained on a custom Aramaic diacritization dataset with the following characteristics (a short loading and splitting sketch follows the list):

- Source: Consonantal Aramaic text (without vowel points)
- Target: Vocalized Aramaic text (with nikkud/vowel points)
- Data format: CSV with columns `consonantal`, `vocalized`, `book`, `chapter`, `verse`
- Data split: 80% train, 10% validation, 10% test
- Text cleaning: Preserves nikkud in the target text, removes punctuation from the source
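
A minimal sketch of loading data in this format and reproducing the 80/10/10 split is shown below; the file name and the random seed are placeholders, not taken from the original pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Expected columns: consonantal, vocalized, book, chapter, verse
df = pd.read_csv("aramaic_diacritization.csv")  # placeholder file name

# 80% train, 10% validation, 10% test
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

print(len(train_df), len(val_df), len(test_df))
```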

### Data Preprocessing

- Input cleaning: Removes punctuation and formatting while preserving letters
- Target preservation: Maintains all nikkud (vowel points) and diacritical marks
- Length filtering: Removes sequences shorter than 2 characters or longer than 1,000 characters
- Duplicate handling: Removes exact duplicates to prevent data leakage
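
The sketch below illustrates what such preprocessing can look like. The Unicode ranges and helper names are illustrative assumptions, not the project's actual code: Hebrew-script letters are taken as U+05D0–U+05EA and nikkud/pointing as U+0591–U+05C7.

```python
import re

# Keep Hebrew-script letters and whitespace on the consonantal (source) side
NON_LETTERS = re.compile(r"[^\u05D0-\u05EA\s]")

def clean_source(text: str) -> str:
    """Strip punctuation, formatting and stray vowel points from the source text."""
    text = NON_LETTERS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def keep_pair(src: str, tgt: str) -> bool:
    """Length filter: drop pairs shorter than 2 or longer than 1000 characters."""
    return 2 <= len(src) <= 1000 and 2 <= len(tgt) <= 1000

def deduplicate(pairs):
    """Drop exact duplicate (source, target) pairs to prevent data leakage."""
    seen, unique = set(), []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique
```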

## Limitations and Bias

- Domain specificity: Trained primarily on religious/biblical Aramaic texts
- Vocabulary coverage: Limited to the vocabulary present in the training corpus
- Length constraints: Maximum input/output length of 512 tokens
- Style consistency: May not handle modern Aramaic dialects or contemporary usage
- Performance: Character accuracy of ~64% indicates room for improvement

## Environmental Impact

- Hardware used: NVIDIA RTX 3060 (12 GB)
- Training time: ~6 hours
- Carbon emissions: Estimated low (single GPU, moderate training time)
- Energy efficiency: FP16 mixed precision used to reduce memory usage

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{aramaic-diacritization-2025,
  title={Aramaic Diacritization Model},
  author={John Locke Jr.},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/johnlockejrr/aramaic-diacritization-model}
}
```

## License

This model is released under the MIT License.

## Acknowledgments

This model builds on the Helsinki-NLP/opus-mt-afa-afa base model and the Hugging Face `transformers` library.

## Model Files

- `model.safetensors` - Model weights (234 MB)
- `config.json` - Model configuration
- `tokenizer_config.json` - Tokenizer configuration
- `source.spm` / `target.spm` - SentencePiece models
- `vocab.json` - Vocabulary file
- `generation_config.json` - Generation parameters
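
To fetch all of these files locally in one call, `huggingface_hub` can be used; this is a convenience sketch, not part of the original release.

```python
from huggingface_hub import snapshot_download

# Downloads model.safetensors, config.json, the .spm files, etc. into the local cache
local_dir = snapshot_download("johnlockejrr/aramaic-diacritization-model")
print(local_dir)
```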

## Training Scripts

The model was trained using custom scripts:

- `train_arc2arc_improved_deep.py` - Main training script
- `run_arc2arc_improved_deep.sh` - Training execution script
- `run_resume_arc2arc_deep.sh` - Resume training script

## Contact

For questions, issues, or contributions, please open an issue on the model repository.