AKK-60m

Introducing AKK-60m, a model that handles a diverse set of cuneiform translation, transliteration, and correction tasks.

1. Model description

This is an instruct model, meaning it can perform multiple tasks depending on the instruction prefix. It is intended primarily for translation and transliteration, but it can also be used for reverse translation (English to Akkadian).

Translation Instructions:

  • "Translate Akkadian cuneiform to English" + cuneiform signs → English
  • "Translate complex Akkadian transliteration to English" + complex transliteration → English
  • "Translate Akkadian simple transliteration to English" + simple transliteration → English
  • "Translate Akkadian grouped transliteration to English" + transliteration with special symbols → English
  • "Translate English to Akkadian cuneiform" + English → Akkadian cuneiform signs
  • "Translate English to simple Akkadian transliteration" + English → Akkadian simple transliteration with no special symbols
  • "Translate English to grouped Akkadian transliteration" + English → Akkadian transliteration grouped into words with special symbols

Transliteration Instructions:

  • "Transliterate Akkadian cuneiform to simple Latin Characters" + cuneiform signs → transliteration with no special symbols
  • "Transliterate Akkadian cuneiform to grouped Latin characters" + cuneiform signs → transliteration with special symbols/subscripts
  • "Group Akkadian transliteration into likely words" + simple transliteration → transliteration with special symbols/subscripts

Missing Sign Instructions:

  • "Identify the missing signs: " + string of Akkadian cuneiform or transliteration containing gaps
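
For illustration, a missing-sign prompt might be assembled like this (the cuneiform line and the asterisk gap are hypothetical; the asterisk convention for missing signs is described under the training procedure below):

# Hypothetical missing-sign prompt; "*" marks a broken/missing sign
prompt = "Identify the missing signs: "
input_text = "𒄨 𒃼 * 𒊭 𒀸"   # illustrative line with one gap
model_input = prompt + input_text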

Base model

This is a fine-tuned version of Google's t5-small.

2. Usage (code snippet)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_path = "Thalesian/AKK-60m"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# 1) Prepare your cuneiform input
prompt = "Translate Akkadian cuneiform to English: "
input_text = "𒄨 𒃼 𒁺 𒊭 𒀸 𒌅 𒆰 𒋾 𒀸 𒋩 𒂗 𒋙 𒆰 𒆳 𒆷 𒈠 𒄀 𒊑 𒋗 𒁶 𒋻 𒁁 𒋾 𒌑 𒁖 𒆥 𒄣 𒀀 𒁍 𒄫 𒄑 𒁍 𒉡 𒈠 𒍣 𒆥 𒆧 𒅎 𒉡 𒌋 "

# 2) Tokenize & get model outputs
inputs = tokenizer(prompt + input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)

# 3) Decode prediction
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Reference:", "young man valiant who through help assur lord all of not submissive one like a pottery bowl crush minutely like a flood flatten as nothing count")
print("Prediction:", prediction)

3. Training and evaluation data

Data came from the Akkademia project, previously published in PNAS Nexus. Additional data for pre-training and training came from CDLI Akkadian data. More information on the training data, as well as the test and validation splits, can be found in both the GitHub repository and the published methodology.

Training procedure

The model was trained in 5 tranches with different datasets and collators:

  • a pretraining dataset (transliterations only) of CDLI transliterated data (389,834 lines) and Akkademia + CDLI translated data (126,649 lines)
  • a training dataset which included Akkademia and CDLI (126,649 lines)

And 3 different collation methods:

  • pretraining collation which introduces an asterisk to represent missing signs
  • missing sign translations, which randomly introduces an asterisk to represent missing signs
  • translation error, which randomly introduces the wrong sign into input data to simulate transliteration or glyph error
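
The following is a minimal sketch of what this corruption-style collation could look like. It is illustrative only: the function name, corruption rates, and sign inventory are assumptions, not the project's actual collator.

import random

def corrupt_signs(signs, sign_inventory, p_missing=0.15, p_error=0.05):
    """Randomly replace signs with '*' (missing-sign task) or with a wrong sign
    (translation-error task). Rates and inventory are illustrative, not the
    values used for AKK-60m."""
    corrupted = []
    for sign in signs:
        r = random.random()
        if r < p_missing:
            corrupted.append("*")                              # simulate a broken/missing sign
        elif r < p_missing + p_error:
            corrupted.append(random.choice(sign_inventory))    # simulate a mis-read glyph
        else:
            corrupted.append(sign)
    return corrupted

# Example: corrupt a short sign sequence drawn from the usage snippet above
print(corrupt_signs("𒄨 𒃼 𒁺 𒊭 𒀸".split(), sign_inventory=["𒀭", "𒆠", "𒂗"]))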

Final stage training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 128
  • eval_batch_size: 128
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • total_train_batch_size: 256
  • total_eval_batch_size: 256
  • optimizer: AdamW (apex fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 5000
  • num_epochs: 200
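
For reference, these settings roughly correspond to the following transformers Seq2SeqTrainingArguments. This is a sketch assembled from the values above; the output directory and any options not listed are assumptions, and the apex-fused optimizer requires NVIDIA Apex to be installed.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="akk-60m-final",          # hypothetical path
    learning_rate=5e-5,
    per_device_train_batch_size=128,     # 256 total across 2 GPUs
    per_device_eval_batch_size=128,
    seed=42,
    optim="adamw_apex_fused",
    lr_scheduler_type="linear",
    warmup_steps=5000,
    num_train_epochs=200,
)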

Framework versions

  • Transformers 4.50.3
  • PyTorch 2.6.0+cu126
  • Datasets 3.3.0
  • Tokenizers 0.21.1

4. Metrics

From Language | From Script     | To Language | To Script       | BLEU
Akkadian      | Cuneiform       | English     | Latin           | 70.11
Akkadian      | Transliteration | English     | Latin           | 70.94
Akkadian      | Cuneiform       | Akkadian    | Transliteration | 93.87
English       | Latin           | Akkadian    | Transliteration | 45.51
English       | Latin           | Akkadian    | Cuneiform       | 47.10
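
Corpus-level BLEU of this kind can be computed with a standard scorer such as sacrebleu; the snippet below is a sketch with illustrative strings only, not the project's evaluation script, which operates on the held-out test split.

import sacrebleu

# One hypothesis per test line and a single reference stream (illustrative strings)
hypotheses = ["young man valiant who through help assur lord all of not submissive one"]
references = [["young man valiant who through help assur lord all of not submissive one"]]
score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {score.score:.2f}")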

5. Intended uses

– Short Akkadian lines, transliteration pipelines, reverse lookup experiments.

6. Limitations

– The context window is only 64 tokens; the model is untested on long passages.

7. How to Cite

@misc{drake2025akk60m,
  title        = {{AKK-60m}: A T5-Small for Akkadian⇄English},
  author       = {Drake, B. Lee},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Thalesian/AKK-60m}}
}