ArabovMK committed on
Commit 96c7443 · verified · 1 Parent(s): a123bd6

Update README.md

Files changed (1)
  1. README.md +103 -22
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- language: tg
3
  license: mit
4
  tags:
5
  - fasttext
@@ -10,11 +10,35 @@ tags:
10
 
11
  # Tajik FastText Word Embedding Model
12
 
13
- This repository contains a pretrained **FastText** model for the Tajik language, trained on an extensive corpus of Tajik texts.
14
 
15
- ## 📊 Training Corpus Statistics
16
 
17
- ### 📚 Books (99 total):
18
  - Programming: 6
19
  - History: 4
20
  - Religion: 12
@@ -24,41 +48,98 @@ This repository contains a pretrained **FastText** model for the Tajik language,
24
  - Poetry: 21
25
  - Textbooks: 28
26
 
27
- ### 📰 Articles (134,497 total):
28
  - Asia-Plus: 20,471
29
  - Khovar: 21,557
30
  - Ovozi Tojik: 7,495
31
  - Farazh: 4,679
32
  - Wikipedia: 80,295
33
 
34
- ### Total Corpus:
35
- - **Total documents**: 134,596 (99 books + 134,497 articles)
36
- - **Total tokens**: 33,535,383 words
37
- - **Unique lemmas**: 649,308
38
 
39
- ## Model Details
40
- - **Model type**: FastText (with subword information)
41
- - **Vector size**: 300 dimensions
42
- - **Window size**: 5
43
- - **Min word count**: 5
44
 
45
- ## Files Included
46
- | File | Description |
47
- |------|-------------|
48
- | `tajik_fasttext.model` | Gensim model file |
49
- | `*.npy` files | Supporting vector files |
50
 
51
- ## Usage Example
  ```python
  from gensim.models import FastText
 
  model = FastText.load("tajik_fasttext.model")
- vector = model.wv["падар"] # Get word vector
  similar_words = model.wv.most_similar("модар") # Find similar words
  ```
59
 
60
- ## Citation
61
  If you use this model, please cite:
 
62
  ```bibtex
63
  @misc{ArabovMK_Tajik_FastText,
64
  author = {ArabovMK},
 
1
  ---
2
+ language: en
3
  license: mit
4
  tags:
5
  - fasttext
 
10
 
11
  # Tajik FastText Word Embedding Model
12
 
13
+ This repository contains a pretrained **FastText** model for the **Tajik language**, trained on a large corpus of Tajik texts. The model supports **subword information**, allowing it to generate embeddings even for rare or unseen (OOV) words.
14
 
15
+ The model is suitable for use in various NLP tasks such as:
16
+ - Semantic analysis
17
+ - Text classification
18
+ - Machine translation
19
+ - Synonym detection and thesaurus building
20
+ - Enhancing other models through embedding initialization
21
 
22
+ Licensed under the [MIT License](LICENSE), which permits free use in both research and commercial applications.
23
+
24
+ ---
25
+
26
+ ## 📊 Model Overview
27
+
28
+ | Parameter | Value |
29
+ |------------------|----------------------------|
30
+ | Model Type | FastText (with subwords) |
31
+ | Vector Size | 300 |
32
+ | Vocabulary Size | 145,232 |
33
+ | OOV Support | Yes |
34
+ | Context Window | 5 |
35
+ | Min Word Count | ≥ 5 |
36
+
37
+ ---
38
+
39
+ ## 📚 Training Corpus
40
+
41
+ ### Books (Total: 99)
42
  - Programming: 6
43
  - History: 4
44
  - Religion: 12
 
48
  - Poetry: 21
49
  - Textbooks: 28
50
 
51
+ ### Articles (Total: 134,497)
52
  - Asia-Plus: 20,471
53
  - Khovar: 21,557
54
  - Ovozi Tojik: 7,495
55
  - Farazh: 4,679
56
  - Wikipedia: 80,295
57
 
58
+ ### Total Corpus Statistics
59
+ - **Documents**: 134,596
60
+ - **Tokens**: 33,535,383
61
+ - **Unique Lemmas**: 649,308
62
+
63
+ ---
64
+
65
+ ## 🧪 Model Comparison with Meta FastText
66
+
67
+ We compared our model with Meta's pretrained FastText on a semantic similarity task, scored by Spearman correlation:
68
+
69
+ | Model | Spearman Correlation | OOV Support |
70
+ |------------------|----------------------|-------------|
71
+ | FastText (Meta) | **0.703** | Yes |
72
+ | **FastText (ours)** | **0.622** | **Yes** |
73
+
74
+ Meta's FastText scores higher on this benchmark (0.703 vs. 0.622), but our model still performs well on Tajik-specific morphology and semantics.
75
+
76
+ ---
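The Spearman figures in the comparison table can be read as rank correlations between model similarity scores and reference ratings for the same word pairs. The README does not describe the evaluation set or the tooling, so the following is only a minimal pure-Python sketch under that assumption. The function names and the five score pairs are hypothetical and invented for illustration.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical data: reference ratings and model cosine similarities
# for the same five word pairs (values invented for illustration).
human_scores = [9.0, 7.5, 6.0, 3.0, 1.0]
model_scores = [0.82, 0.74, 0.79, 0.35, 0.10]
rho = spearman(human_scores, model_scores)
```

In this toy data the two rankings differ by a single swapped pair, so `rho` is close to but below 1. A real evaluation would take `model_scores` from the model's similarity function and `human_scores` from an annotated word-pair dataset.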
77
+
78
+ ## 🔍 Example Similar Words
79
+
80
+ | Word | Nearest Neighbors (FastText) |
81
+ |-----------|-------------------------------|
82
+ | кӯдак | кӯдаку(0.82), хурдкӯдак(0.81), кӯдакам(0.81), кӯдакат(0.81), кӯдаке(0.81) |
83
+ | муаллим | муаллиме(0.90), муаллимат(0.89), муаллимин(0.89), муаллиму(0.88), муаллима(0.88) |
84
+ | об | оби(0.79), обро(0.74), обмӯрии(0.70), обшустаи(0.68), обшуста(0.66) |
85
+ | мард | марда(0.87), мардхӯ(0.85), мардвор(0.85), мардро(0.83), зан(0.82) |
86
+ | деҳа | деҳайи(0.83), деҳаю(0.80), деҳавз(0.78), деҳакӣ(0.76), деҳодеҳ(0.74) |
87
+ | китоб | китобӣ(0.84), китобгуна(0.83), китобча(0.81), китобсӯзӣ(0.81), китобро(0.81) |
88
+ | меҳмон | меҳмонӣ(0.86), меҳмоншо(0.85), меҳмонат(0.83), меҳмонҳона(0.82), меҳмони(0.82) |
89
+ | шаҳр | шаҳрӯ(0.82), шаҳрча(0.80), бушаҳр(0.79), шаҳрат(0.79), навшаҳр(0.79) |
90
+ | падар | падаршӯ(0.89), падарӣ(0.84), падаршӯву(0.84), падаре(0.84), падаршон(0.83) |
91
+ | модар | модаршӯ(0.86), модаршӯяш(0.83), модару(0.81), модаре(0.81), модарвор(0.80) |
92
 
93
+ ---
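The numbers in parentheses in the table above look like similarity scores. Assuming they were produced with gensim's `most_similar`, which ranks words by cosine similarity, the ranking can be sketched without the model as follows. The helper names and the 3-dimensional toy vectors are hypothetical; the real model uses 300 dimensions.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, vectors, topn=5):
    """Rank every other word in `vectors` by cosine similarity to `query`."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors invented for illustration only.
toy_vectors = {
    "китоб": [1.0, 0.0, 0.0],
    "китобча": [0.9, 0.1, 0.0],
    "об": [0.0, 1.0, 0.0],
}
neighbors = nearest_neighbors("китоб", toy_vectors, topn=2)
```

Here `neighbors` lists "китобча" first because its toy vector points in almost the same direction as the query, while "об" is orthogonal to it.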
94
+
95
+ ## 🧩 Handling OOV (Out-of-Vocabulary) Words
96
+
97
+ FastText builds vectors for unknown words from subword units (character n-grams). Some examples:
98
+
99
+ | Unknown Word | Closest Matches (FastText) |
100
+ |--------------|----------------------------|
101
+ | кӯдакона | кӯдаконаи(0.82), кӯдаконат(0.81), кӯдаконае(0.81) |
102
+ | меҳмонамон | меҳмон(0.77), меҳмонҳо(0.77), меҳмонам(0.76) |
103
+ | муаллимон | муаллимони(0.89), муаллимоне(0.88), муаллимону(0.83) |
104
+ | деҳоти | дарҷамоати(0.79), чамоати(0.74), ҷамоати(0.81) |
105
+ | саводнок | саводнокӣ(0.88), саводнокиву(0.85), саводнокии(0.84) |
106
+
107
+ ---
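To illustrate why an unseen word can land near its known relatives, here is a minimal sketch of FastText-style character n-gram extraction. The README does not state the n-gram range used in training, so the `min_n=3` and `max_n=6` defaults below (gensim's defaults) are an assumption, and both helper names are hypothetical. The real model also hashes n-grams into buckets and combines their vectors, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Split a word into FastText-style character n-grams.

    The word is first wrapped in '<' and '>' boundary markers, so that
    prefixes and suffixes yield n-grams distinct from word-internal ones.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, min(max_n, len(wrapped)) + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

def shared_ngrams(a, b, min_n=3, max_n=6):
    """Return the n-grams that two words have in common."""
    return set(char_ngrams(a, min_n, max_n)) & set(char_ngrams(b, min_n, max_n))

# An OOV example from the table above and an in-vocabulary relative.
oov_word = "кӯдакона"
known_word = "кӯдак"
overlap = shared_ngrams(oov_word, known_word)
```

The two words share prefix n-grams such as `<кӯд`, which is why a vector assembled from the n-grams of "кӯдакона" ends up close to forms of "кӯдак".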
108
+
109
+ ## 📌 Features for Tajik Language
110
+
111
+ Our model performs well on:
112
+ - **Semantic similarity**: e.g., "мард" ↔ "зан", "китоб" ↔ "китобгуна"
113
+ - **Morphological variants**: e.g., "кӯдак" → "кӯдаку", "кӯдаки"
114
+ - **Rare/compound words**: e.g., "саводнок", "деҳоти", handled through subword representations
115
+
116
+ ---
117
 
118
+ ## 💡 Usage Example
119
 
  ```python
  from gensim.models import FastText
 
  model = FastText.load("tajik_fasttext.model")
+ vector = model.wv["падар"] # Get vector for a word
  similar_words = model.wv.most_similar("модар") # Find similar words
  ```
127
 
128
+ ---
129
+
130
+ ## 🗂️ Files Included
131
+
132
+ | File | Description |
133
+ |--------------------|----------------------------------------------|
134
+ | `tajik_fasttext.model` | Gensim FastText model file |
135
+ | `*.npy` files | Supporting NumPy arrays for vectors |
136
+
137
+ ---
138
+
139
+ ## 📚 Citation
140
+
141
  If you use this model, please cite:
142
+
143
  ```bibtex
144
  @misc{ArabovMK_Tajik_FastText,
145
  author = {ArabovMK},