Georgian KenLM Language Model (3-gram)
This is a KenLM 3-gram language model trained on Georgian (ქართული) text data, intended for use in automatic speech recognition (ASR) and other language modeling tasks.
Model Details
- Language: Georgian (ka)
- Model Type: KenLM n-gram
- n-gram size: 3-gram
- Format: .arpa
- Tooling: KenLM
Files
- ge_model9.arpa: ARPA plaintext format
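Because the ARPA format is plain text, the n-gram order and counts can be read from the file header without loading the model. The following is a minimal sketch: `read_arpa_counts` is a hypothetical helper, and the sample header is made up for illustration and does not reflect the real counts in `ge_model9.arpa`.

```python
import io

def read_arpa_counts(lines):
    """Return {order: count} parsed from the \\data\\ header of an ARPA file."""
    counts = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header and line.startswith("ngram "):
            # Header lines look like "ngram 2=5"
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif in_header and line.startswith("\\"):
            break  # reached the first n-gram section, e.g. "\1-grams:"
    return counts

# Tiny made-up header, for illustration only (not the real model's counts).
sample = io.StringIO("\\data\\\nngram 1=4\nngram 2=5\nngram 3=2\n\n\\1-grams:\n")
counts = read_arpa_counts(sample)
print(counts, "order =", max(counts))
```

For the real file, the same function could be called on an open file handle, for example `open("ge_model9.arpa", encoding="utf-8")`. For a 3-gram model the highest order should be 3.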
Training Data
The model was trained on a curated collection of Georgian text from various domains:
- News articles
- Subtitles
- Books and web content
The data was cleaned, tokenized on whitespace, and normalized to standard Georgian orthography.
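The exact cleaning rules are not documented here, so the following is only a sketch of what such a pipeline might look like. It assumes that cleaning means keeping the 33 modern Georgian letters (U+10D0 to U+10F0) and discarding everything else. `normalize_line` is a hypothetical name.

```python
import re
import unicodedata

# Assumption: keep modern Georgian letters (U+10D0..U+10F0) and whitespace only.
_NON_GEORGIAN = re.compile(r"[^\u10D0-\u10F0\s]")

def normalize_line(line):
    """Hypothetical cleaning step: NFC-normalize, drop non-Georgian symbols, split on whitespace."""
    text = unicodedata.normalize("NFC", line)
    text = _NON_GEORGIAN.sub(" ", text)
    return text.split()

print(normalize_line("ეს  არის ტესტი, 2025!"))
```

Input for a new model or for scoring should be prepared the same way as the training text, since tokens the model never saw are treated as out-of-vocabulary.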
Intended Use
This model is intended for:
- Beam search decoding in ASR systems (e.g., Whisper, DeepSpeech, Vosk)
- Scoring and reranking ASR hypotheses
- Basic text modeling or spelling correction in Georgian
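Hypothesis reranking, as listed above, can be sketched as a weighted sum of the ASR score and the language model score. The function name `rerank`, the `lm_weight` default, and the stand-in scorer `toy_lm` are all assumptions for illustration. In practice the weight would be tuned on held-out data.

```python
from typing import Callable, List, Tuple

def rerank(
    hypotheses: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Sort (text, asr_score) pairs by asr_score + lm_weight * lm_score(text), best first."""
    rescored = [(text, asr + lm_weight * lm_score(text)) for text, asr in hypotheses]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Stand-in scorer so the sketch runs without the model file: penalizes longer strings.
def toy_lm(text: str) -> float:
    return -0.1 * len(text.split())

nbest = [("a b c d", -1.0), ("a b", -1.05)]
best_text, best_score = rerank(nbest, toy_lm)[0]
print(best_text)
```

With a loaded `kenlm.Model` named `model`, the scorer could instead be `lambda s: model.score(s, bos=True, eos=True)`. Both scores should be on a log scale for the sum to be meaningful.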
Example Usage
import kenlm

# Latin transliteration for the 33 letters of the modern Georgian alphabet.
GEORGIAN_TO_LATIN = {
    'ა': 'a', 'ბ': 'b', 'გ': 'g', 'დ': 'd', 'ე': 'e', 'ვ': 'v', 'ზ': 'z', 'თ': 'T', 'ი': 'i',
    'კ': 'k', 'ლ': 'l', 'მ': 'm', 'ნ': 'n', 'ო': 'o', 'პ': 'p', 'ჟ': 'J', 'რ': 'r', 'ს': 's',
    'ტ': 't', 'უ': 'u', 'ფ': 'f', 'ქ': 'q', 'ღ': 'R', 'ყ': 'y', 'შ': 'S', 'ჩ': 'C', 'ც': 'c',
    'ძ': 'Z', 'წ': 'w', 'ჭ': 'W', 'ხ': 'x', 'ჯ': 'j', 'ჰ': 'h',
}

def transliterate_georgian(text):
    """Map Georgian letters to Latin symbols; other characters pass through unchanged."""
    return ''.join(GEORGIAN_TO_LATIN.get(char, char) for char in text)

# Load the ARPA model file listed under Files.
model = kenlm.Model("ge_model9.arpa")
sentence = "ეს არის ტესტი"
# score() returns the log10 probability of the sentence, including <s> and </s>.
print(model.score(transliterate_georgian(sentence), bos=True, eos=True))
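The value printed by `model.score` is a log10 probability, which is hard to compare across sentences of different lengths. A common remedy is to convert it to per-token perplexity. The sketch below shows the conversion. `perplexity_from_log10` is a hypothetical helper, and the input number is illustrative rather than an actual score from this model.

```python
def perplexity_from_log10(log10_prob: float, num_words: int) -> float:
    """Convert a sentence log10 probability into per-token perplexity.

    The end-of-sentence token is counted, hence num_words + 1.
    """
    return 10.0 ** (-log10_prob / (num_words + 1))

# Illustrative number only, not an actual score from this model.
print(perplexity_from_log10(-6.0, 2))
```

Lower perplexity means the model finds the sentence more predictable. The `kenlm` Python module also exposes a `perplexity` method on `Model`, which should give the same result for whitespace-tokenized input.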
Citation
@misc{georgian-kenlm,
  title={Georgian KenLM Language Model},
  author={Giorgi G},
  year={2025},
  howpublished={\url{https://huggingface.co/psyfreak/GEO-KenLM}}
}
