ArabovMK/tajik-fasttext-model

@@ -1,5 +1,5 @@
 ---
-language: tg
 license: mit
 tags:
 - fasttext
@@ -10,11 +10,35 @@ tags:
 # Tajik FastText Word Embedding Model
-This repository contains a pretrained **FastText** model for the Tajik language, trained on an extensive corpus of Tajik texts.
-## 📊 Training Corpus Statistics
-### 📚 Books (99 total):
 - Programming: 6
 - History: 4
 - Religion: 12
@@ -24,41 +48,98 @@ This repository contains a pretrained **FastText** model for the Tajik language,
 - Poetry: 21
 - Textbooks: 28
-### 📰 Articles (134,497 total):
 - Asia-Plus: 20,471
 - Khovar: 21,557
 - Ovozi Tojik: 7,495
 - Farazh: 4,679
 - Wikipedia: 80,295
-### ✅ Total Corpus:
-- **Total documents**: 134,596 (99 books + 134,497 articles)
-- **Total tokens**: 33,535,383 words
-- **Unique lemmas**: 649,308
-## Model Details
-- **Model type**: FastText (with subword information)
-- **Vector size**: 300 dimensions
-- **Window size**: 5
-- **Min word count**: 5
-## Files Included
-| File | Description |
-|------|-------------|
-| `tajik_fasttext.model` | Gensim model file |
-| `*.npy` files | Supporting vector files |
-## Usage Example
 ```python
 from gensim.models import FastText
 model = FastText.load("tajik_fasttext.model")
-vector = model.wv["падар"]  # Get word vector
 similar_words = model.wv.most_similar("модар")  # Find similar words
 ```
-## Citation
 If you use this model, please cite:
 ```bibtex
 @misc{ArabovMK_Tajik_FastText,
   author = {ArabovMK},

 ---
+language: en
 license: mit
 tags:
 - fasttext
 # Tajik FastText Word Embedding Model
+This repository contains a pretrained **FastText** model for the **Tajik language**, trained on a corpus of 134,596 Tajik documents (about 33.5 million tokens). Because the model uses **subword information**, it can generate embeddings even for rare or out-of-vocabulary (OOV) words.
+The model is suitable for NLP tasks such as:
+- Semantic analysis
+- Text classification
+- Machine translation
+- Synonym detection and thesaurus building
+- Initializing the embedding layers of other models
+Licensed under the [MIT License](LICENSE), which allows free usage in both research and commercial applications.
+---
+## 📊 Model Overview
+| Parameter         | Value                      |
+|------------------|----------------------------|
+| Model Type        | FastText (with subwords)   |
+| Vector Size       | 300                        |
+| Vocabulary Size   | 145,232                    |
+| OOV Support       | Yes                        |
+| Context Window    | 5                          |
+| Min Word Count    | ≥ 5                        |
+---
+## 📚 Training Corpus
+### Books (Total: 99)
 - Programming: 6
 - History: 4
 - Religion: 12
 - Poetry: 21
 - Textbooks: 28
+### Articles (Total: 134,497)
 - Asia-Plus: 20,471
 - Khovar: 21,557
 - Ovozi Tojik: 7,495
 - Farazh: 4,679
 - Wikipedia: 80,295
+### Total Corpus Statistics
+- **Documents**: 134,596
+- **Tokens**: 33,535,383
+- **Unique Lemmas**: 649,308
+---
+## 🧪 Model Comparison with Meta FastText
+We compared our model with Meta’s pretrained FastText vectors on a semantic similarity task, reporting the Spearman correlation for each model:
+| Model               | Spearman Correlation | OOV Support |
+|---------------------|----------------------|-------------|
+| FastText (Meta)     | **0.703**            | Yes         |
+| **FastText (ours)** | 0.622                | Yes         |
+Meta’s FastText achieves the higher correlation (0.703 vs. 0.622). Our model's strength is Tajik-specific morphology and semantics, as illustrated by the nearest-neighbor examples in the following sections.
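For readers unfamiliar with the metric, the following is a minimal sketch of how a Spearman correlation between reference similarity judgments and model similarity scores can be computed. The word-pair scores are hypothetical placeholders, not the benchmark behind the table above, and the helper names (`average_ranks`, `pearson`, `spearman`) are introduced here purely for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: reference ratings and model cosine similarities
# for four word pairs (made up for this sketch).
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.82, 0.75, 0.40, 0.45]
rho = spearman(human_scores, model_scores)
```

In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead of a hand-written version.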
+---
+## 🔍 Example Similar Words
+| Word      | Nearest Neighbors (FastText) |
+|-----------|-------------------------------|
+| кӯдак     | кӯдаку(0.82), хурдкӯдак(0.81), кӯдакам(0.81), кӯдакат(0.81), кӯдаке(0.81) |
+| муаллим   | муаллиме(0.90), муаллимат(0.89), муаллимин(0.89), муаллиму(0.88), муаллима(0.88) |
+| об        | оби(0.79), обро(0.74), обмӯрии(0.70), обшустаи(0.68), обшуста(0.66) |
+| мард      | марда(0.87), мардхӯ(0.85), мардвор(0.85), мардро(0.83), зан(0.82) |
+| деҳа      | деҳайи(0.83), деҳаю(0.80), деҳавз(0.78), деҳакӣ(0.76), деҳодеҳ(0.74) |
+| китоб     | китобӣ(0.84), китобгуна(0.83), китобча(0.81), китобсӯзӣ(0.81), китобро(0.81) |
+| меҳмон    | меҳмонӣ(0.86), меҳмоншо(0.85), меҳмонат(0.83), меҳмонҳона(0.82), меҳмони(0.82) |
+| шаҳр      | шаҳрӯ(0.82), шаҳрча(0.80), бушаҳр(0.79), шаҳрат(0.79), навшаҳр(0.79) |
+| падар     | падаршӯ(0.89), падарӣ(0.84), падаршӯву(0.84), падаре(0.84), падаршон(0.83) |
+| модар     | модаршӯ(0.86), модаршӯяш(0.83), модару(0.81), модаре(0.81), модарвор(0.80) |
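The scores in parentheses are similarity scores. Assuming they were produced with gensim's `most_similar`, which ranks words by cosine similarity, the ranking can be sketched as follows. The three-dimensional vectors are made up for illustration and are not taken from the model, and `cosine_similarity` and `nearest` are hypothetical helper names.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scored = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (invented values, NOT real embeddings from this model).
toy_vectors = {
    "мард": [1.0, 0.0, 0.0],
    "зан": [0.9, 0.1, 0.0],
    "китоб": [0.0, 1.0, 0.0],
}
neighbors = nearest("мард", toy_vectors)
```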
+---
+## 🧩 Handling OOV (Out-of-Vocabulary) Words
+FastText builds vectors for unknown words from their subword units (character n-grams). The table lists the closest matches for several such words:
+| Unknown Word | Closest Matches (FastText) |
+|--------------|----------------------------|
+| кӯдакона     | кӯдаконаи(0.82), кӯдаконат(0.81), кӯдаконае(0.81) |
+| меҳмонамон   | меҳмон(0.77), меҳмонҳо(0.77), меҳмонам(0.76) |
+| муаллимон    | муаллимони(0.89), муаллимоне(0.88), муаллимону(0.83) |
+| деҳоти       | ҷамоати(0.81), дарҷамоати(0.79), чамоати(0.74) |
+| саводнок     | саводнокӣ(0.88), саводнокиву(0.85), саводнокии(0.84) |
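To illustrate why an unseen word can still land near related forms, here is a minimal sketch of FastText-style character n-gram extraction. It assumes the library default n-gram lengths of 3 to 6 (the settings actually used for this model are not stated), and `char_ngrams` and `shared_ngrams` are hypothetical helper names, not part of gensim.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def shared_ngrams(a, b, min_n=3, max_n=6):
    """N-grams that two words have in common."""
    return set(char_ngrams(a, min_n, max_n)) & set(char_ngrams(b, min_n, max_n))


# An unseen form shares many n-grams with a known base form, so a vector
# assembled from its n-gram vectors ends up close to the known word.
common = shared_ngrams("меҳмон", "меҳмонамон")
```

The more n-grams two words share, the more their vectors draw on the same components, which is consistent with the matches in the table being mostly inflected forms of the query.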
+---
+## 📌 Features for Tajik Language
+Our model performs well on:
+- **Semantic similarity**: e.g., "мард" ↔ "зан", "китоб" ↔ "китобгуна"
+- **Morphological variants**: e.g., "кӯдак" → "кӯдаку", "кӯдаки"
+- **Rare/compound words**: e.g., "саводнок", "деҳоти", which are handled through subword representations
+---
+## 💡 Usage Example
 ```python
 from gensim.models import FastText
 model = FastText.load("tajik_fasttext.model")
+vector = model.wv["падар"]  # Get vector for a word
 similar_words = model.wv.most_similar("модар")  # Find similar words
 ```
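As a slightly fuller sketch, the helper below reports whether a word is in the vocabulary and lists its nearest neighbors. It assumes a gensim 4.x `KeyedVectors`-style object (the `key_to_index` mapping and `most_similar` method); the function name `describe` is hypothetical, and the commented usage lines assume the model files are present locally.

```python
def describe(wv, word, topn=5):
    """Summarize a word using a gensim-style KeyedVectors object."""
    return {
        "word": word,
        # key_to_index maps each vocabulary word to its row index (gensim 4.x).
        "in_vocabulary": word in wv.key_to_index,
        # For FastText vectors, OOV words are expected to work here as well,
        # since their vectors are assembled from character n-grams.
        "neighbors": wv.most_similar(word, topn=topn),
    }


# Usage with the real model (requires gensim and the model files):
# from gensim.models import FastText
# model = FastText.load("tajik_fasttext.model")
# print(describe(model.wv, "муаллимон"))
```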
+---
+## 🗂️ Files Included
+| File               | Description                                  |
+|--------------------|----------------------------------------------|
+| `tajik_fasttext.model` | Gensim FastText model file                 |
+| `*.npy` files         | Supporting NumPy arrays for vectors        |
+---
+## 📚 Citation
 If you use this model, please cite:
 ```bibtex
 @misc{ArabovMK_Tajik_FastText,
   author = {ArabovMK},