🦉 Georgian KenLM Language Model (3-gram)

This is a KenLM 3-gram language model trained on Georgian (ქართული) text data, intended for use in automatic speech recognition (ASR) and other language modeling tasks.

🧾 Model Details

Language: Georgian (ka)
Model Type: KenLM n-gram
n-gram size: 3-gram
Format: .arpa
Tooling: KenLM

📂 Files

ge_model9.arpa – ARPA plaintext format

📚 Training Data

The model was trained on a curated collection of Georgian text from various domains:

News articles
Subtitles
Books and web content

The data was cleaned, tokenized on whitespace, and normalized to standard Georgian orthography.

💬 Intended Use

This model is intended for:

Beam search decoding in ASR systems (e.g., Whisper, DeepSpeech, Vosk)
Scoring and reranking ASR hypotheses
Basic text modeling or spelling correction in Georgian
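
As a rough illustration of the reranking use case, the sketch below combines each hypothesis's ASR score with a language-model score and sorts the results. The function name `rerank_hypotheses` and the parameters `lm_weight` and `word_bonus` are hypothetical, not part of KenLM or of this model; the scorer is passed in as a plain callable so the logic does not depend on any particular toolkit.

```python
from typing import Callable, List, Tuple


def rerank_hypotheses(
    hypotheses: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    lm_weight: float = 0.5,   # hypothetical default; tune on held-out data
    word_bonus: float = 0.0,  # optional per-word bonus to offset the LM's bias toward short outputs
) -> List[Tuple[str, float]]:
    """Return (text, combined_score) pairs sorted best first.

    Each input pair is (text, asr_score), where asr_score is a log-score
    from the recognizer. The combined score is
    asr_score + lm_weight * lm_score(text) + word_bonus * word_count.
    """
    rescored = []
    for text, asr_score in hypotheses:
        n_words = len(text.split())
        combined = asr_score + lm_weight * lm_score(text) + word_bonus * n_words
        rescored.append((text, combined))
    # Higher (less negative) combined score is better.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

With this model, `lm_score` could be something like `lambda s: model.score(transliterate_georgian(s), bos=True, eos=True)`, reusing the names from the Example Usage section. KenLM scores are log10 probabilities, so `lm_weight` also has to absorb any difference in scale from the recognizer's own scores.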

🧪 Example Usage

import kenlm

def transliterate_georgian(text):
    """Map each Georgian letter to its Latin code; other characters pass through unchanged."""
    georgian_to_latin = {
        'ა': 'a', 'ბ': 'b', 'გ': 'g', 'დ': 'd', 'ე': 'e', 'ვ': 'v', 'ზ': 'z', 'თ': 'T', 'ი': 'i',
        'კ': 'k', 'ლ': 'l', 'მ': 'm', 'ნ': 'n', 'ო': 'o', 'პ': 'p', 'ჟ': 'J', 'რ': 'r', 'ს': 's',
        'ტ': 't', 'უ': 'u', 'ფ': 'f', 'ქ': 'q', 'ღ': 'R', 'ყ': 'y', 'შ': 'S', 'ჩ': 'C', 'ც': 'c',
        'ძ': 'Z', 'წ': 'w', 'ჭ': 'W', 'ხ': 'x', 'ჯ': 'j', 'ჰ': 'h',
    }
    return ''.join(georgian_to_latin.get(char, char) for char in text)

# Load the ARPA file (the path assumes it is in the working directory).
model = kenlm.Model("ge_model9.arpa")
sentence = "ეს არის ტესტი"
# score() returns the log10 probability of the whole sentence;
# bos/eos add the begin- and end-of-sentence markers.
print(model.score(transliterate_georgian(sentence), bos=True, eos=True))
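
The value printed above is a log10 probability for the whole sentence, so longer sentences score lower simply because they have more tokens. To compare sentences of different lengths, the score can be converted to a per-token perplexity. The helper below is a minimal sketch; the name `perplexity_from_log10` is hypothetical, and it assumes whitespace tokenization plus one end-of-sentence token.

```python
def perplexity_from_log10(log10_prob: float, sentence: str) -> float:
    """Convert a sentence-level log10 probability into per-token perplexity."""
    # Whitespace tokens, plus one for the end-of-sentence marker.
    n_tokens = len(sentence.split()) + 1
    # Perplexity is the inverse probability, normalized by token count.
    return 10.0 ** (-log10_prob / n_tokens)
```

For instance, `perplexity_from_log10(model.score(text, bos=True, eos=True), text)` would give a length-normalized figure for an already transliterated `text`. The kenlm Python module also provides a `perplexity` method on the model, which may be more convenient when no custom token counting is needed.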

Citation

@misc{georgian-kenlm,
  title={Georgian KenLM Language Model},
  author={Giorgi G},
  year={2025},
  howpublished={\url{https://huggingface.co/psyfreak/GEO-KenLM}}
}
