yasserrmd committed · verified · Commit 806bdbc · 1 Parent(s): f1e7009

Update README.md

Files changed (1):
  1. README.md +140 -8

README.md CHANGED
@@ -7,17 +7,149 @@ tags:
  - transformers
  - unsloth
  - lfm2
- license: apache-2.0
  language:
- - en
  ---

- # Uploaded finetuned model

- - **Developed by:** yasserrmd
- - **License:** apache-2.0
- - **Finetuned from model :** unsloth/LFM2-1.2B

- This lfm2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

  - transformers
  - unsloth
  - lfm2
+ - arabic
+ - dialect
+ - emirati
+ - conversational
+ - causal-lm
+ - instruction-tuned
+ - trl
+ license: cc-by-nc-4.0
  language:
+ - ar
  ---

+ # kallamni-1.2b-v1m
+
+ **Kallamni 1.2B v1m** is a **1.2B-parameter Arabic conversational model** fine-tuned specifically for **spoken Emirati Arabic (اللهجة الإماراتية المحكية)**.
+ It is designed to generate **natural, fluent, and culturally relevant** responses for daily-life conversations, rather than formal Modern Standard Arabic (MSA).
+
+ ---
+
+ ## Model Summary
+
+ * **Model type:** Causal LM, instruction-tuned for chat.
+ * **Language:** Emirati Arabic dialect (spoken style).
+ * **Fine-tuning:** 3 epochs with LoRA adapters.
+ * **Frameworks:** [Unsloth](https://github.com/unslothai/unsloth) + [TRL](https://github.com/huggingface/trl).
+ * **Dataset:** 12,324 synthetic Emirati Arabic Q&A pairs generated using **GPT-5** and **GPT-4o**.
+
+ ---
+
+ ## Dataset
+
+ * **Size:** 12,324 examples.
+ * **Source:** Synthetic Q&A pairs created via GPT-5 + GPT-4o, filtered for Emirati dialect.
+ * **Domains covered:**
+   * Daily-life conversations (shopping, weather, greetings, family, transport).
+   * Social and cultural events (Eid, weddings, gatherings).
+   * Household and personal routines.
+ * **Format:** Chat-style examples with `<|im_start|>user` / `<|im_start|>assistant` tokens, e.g.:
+
+ ```text
+ <|startoftext|><|im_start|>user
+ شو تسوي إذا انقطع الإنترنت في البيت؟<|im_end|>
+ <|im_start|>assistant
+ أول شي أتصل بالشركة، وإذا ما ردوا أستخدم داتا التلفون لين يرجع النت.<|im_end|>
+ ```
+
+ *(English: user: "What do you do if the internet goes out at home?" / assistant: "First thing, I call the company, and if they don't answer I use my phone's data until the internet comes back.")*
+
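+ As a minimal sketch (not the card's own tooling), one way such a pair can be rendered into this chat format is via the tokenizer's built-in chat template, assuming the released tokenizer ships the matching `<|im_start|>` template:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("yasserrmd/kallamni-1.2b-v1m")
+
+ # One Q&A pair, using the example above.
+ example = [
+     {"role": "user", "content": "شو تسوي إذا انقطع الإنترنت في البيت؟"},
+     {"role": "assistant", "content": "أول شي أتصل بالشركة، وإذا ما ردوا أستخدم داتا التلفون لين يرجع النت."},
+ ]
+
+ # tokenize=False returns the formatted string rather than token ids,
+ # which is convenient for inspecting or exporting training examples.
+ print(tokenizer.apply_chat_template(example, tokenize=False))
+ ```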
+ ---
+
+ ## ⚙️ Training
+
+ * **Frameworks:**
+   * **Unsloth** → optimized fine-tuning, memory efficiency, ~2× faster training.
+   * **TRL (SFTTrainer)** → supervised fine-tuning with instruction alignment.
+ * **Base model:** Lightweight 1.2B causal LM (unsloth/LFM2-1.2B).
+ * **Epochs:** 3 full passes over the dataset.
+ * **Fine-tuning strategy:**
+   * LoRA adapters on attention + MLP layers.
+   * Chat template applied consistently with TRL.
+
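+ The full training script is not published with this card. The snippet below is only a rough sketch of the setup described above (Unsloth-loaded base, LoRA on attention + MLP projections, 3 epochs with TRL's `SFTTrainer`); all hyperparameters except the epoch count are placeholders, the dataset path is hypothetical, and argument names vary across TRL versions:
+
+ ```python
+ from unsloth import FastLanguageModel
+ from trl import SFTTrainer
+ from transformers import TrainingArguments
+ from datasets import load_dataset
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="unsloth/LFM2-1.2B",
+     max_seq_length=2048,
+ )
+
+ # LoRA adapters on attention + MLP layers, as stated above.
+ # Target-module names follow the common convention and may
+ # differ for LFM2's actual architecture.
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r=16,
+     lora_alpha=16,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],
+ )
+
+ # Expects one chat-template-formatted string per example in a "text" field.
+ dataset = load_dataset("json", data_files="kallamni_sft.jsonl")["train"]
+
+ trainer = SFTTrainer(
+     model=model,
+     tokenizer=tokenizer,
+     train_dataset=dataset,
+     dataset_text_field="text",
+     args=TrainingArguments(
+         num_train_epochs=3,  # 3 full passes, per the card
+         per_device_train_batch_size=2,
+         learning_rate=2e-4,
+         output_dir="outputs",
+     ),
+ )
+ trainer.train()
+ ```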
+ ---
+
+ ## Usage
+
+ You can load and run the model with `transformers`:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load model and tokenizer
+ model_id = "yasserrmd/kallamni-1.2b-v1m"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype="bfloat16",
+     # attn_implementation="flash_attention_2",  # uncomment if your GPU supports it
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Build the chat-formatted prompt.
+ # The question asks: "What do you do if the internet goes out at home?"
+ prompt = "شو تسوي إذا انقطع الإنترنت في البيت؟"
+ input_ids = tokenizer.apply_chat_template(
+     [{"role": "user", "content": prompt}],
+     add_generation_prompt=True,
+     return_tensors="pt",
+     tokenize=True,
+ ).to(model.device)
+
+ # Sample a response; low temperature keeps answers focused while
+ # still allowing natural variation.
+ output = model.generate(
+     input_ids,
+     do_sample=True,
+     temperature=0.3,
+     min_p=0.15,
+     repetition_penalty=1.05,
+     max_new_tokens=256,
+ )
+
+ print(tokenizer.decode(output[0], skip_special_tokens=False))
+
+ # Example output:
+ # <|startoftext|><|im_start|>user
+ # شو تسوي إذا انقطع الإنترنت في البيت؟<|im_end|>
+ # <|im_start|>assistant
+ # أول شي أتصل بالشركة، وإذا ما ردوا أستخدم داتا التلفون لين يرجع النت.<|im_end|>
+ ```
+
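+ For interactive use you may prefer to stream tokens as they are generated rather than wait for the full completion. A small variant of the call above using `transformers`' `TextStreamer`, reusing `model`, `tokenizer`, and `input_ids` from the snippet:
+
+ ```python
+ from transformers import TextStreamer
+
+ # Prints tokens to stdout as they are generated; skip_prompt hides
+ # the echoed input, and the decode kwarg strips special tokens.
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+ model.generate(
+     input_ids,
+     streamer=streamer,
+     do_sample=True,
+     temperature=0.3,
+     max_new_tokens=256,
+ )
+ ```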
+ ---
+
+ ## Performance
+
+ * **Dialect accuracy:** ~85% Emirati consistency.
+ * **Answer relevance:** ~90% rated good or semi-good.
+ * **Weak cases:** occasional semi-formal phrasing or generic filler.
+ * **Strengths:**
+   * Culturally aligned Emirati expressions.
+   * Natural conversational length (at least 8–15 words).
+   * Balanced coverage of family, work, travel, and social contexts.
+
+ ---
+
+ ## Intended Use
+
+ * **Chatbots & voice assistants** for Emirati Arabic (see the sketch below).
+ * **Language-learning tools** for practicing the dialect.
+ * **Dataset building block** for Gulf Arabic LLM research.
+
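+ As a sketch of the chatbot use case, here is a minimal multi-turn loop that keeps the running conversation in chat form; it reuses `model` and `tokenizer` from the Usage section and the same sampling settings:
+
+ ```python
+ # Minimal multi-turn REPL sketch; the history list grows each turn
+ # and is re-rendered through the chat template on every request.
+ history = []
+ while True:
+     user_msg = input("user> ")
+     history.append({"role": "user", "content": user_msg})
+     input_ids = tokenizer.apply_chat_template(
+         history,
+         add_generation_prompt=True,
+         return_tensors="pt",
+     ).to(model.device)
+     output = model.generate(
+         input_ids,
+         do_sample=True,
+         temperature=0.3,
+         max_new_tokens=256,
+     )
+     # Decode only the newly generated tokens.
+     reply = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
+     print("assistant>", reply)
+     history.append({"role": "assistant", "content": reply})
+ ```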
141
+ ---
142
+
143
+ ## Limitations
144
+
145
+ * May mix in some MSA or generic Arabic in rare cases.
146
+ * Not suitable for factual QA outside daily conversations.
147
+ * Not designed for professional/legal/medical contexts.
148
+
149
+ ---
150
+
151
+ ## Acknowledgements
152
+
153
+ * **Unsloth** team for efficient fine-tuning tooling.
154
+ * **TRL** from Hugging Face for alignment training.
155
+ * Synthetic dataset generation powered by **GPT-5** and **GPT-4o**.