yasserrmd committed · verified · Commit 806bdbc · 1 Parent(s): f1e7009

Update README.md

Files changed (1):
  1. README.md +140 -8

README.md CHANGED
@@ -7,17 +7,149 @@ tags:
  - transformers
  - unsloth
  - lfm2
- license: apache-2.0
  language:
- - en
  ---

- # Uploaded finetuned model

- - **Developed by:** yasserrmd
- - **License:** apache-2.0
- - **Finetuned from model :** unsloth/LFM2-1.2B

- This lfm2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

  - transformers
  - unsloth
  - lfm2
+ - arabic
+ - dialect
+ - emirati
+ - conversational
+ - causal-lm
+ - instruction-tuned
+ - trl
+ license: cc-by-nc-4.0
  language:
+ - ar
  ---

+ # kallamni-1.2b-v1m
+
+ **Kallamni 1.2B v1m** is a **1.2B-parameter Arabic conversational model** fine-tuned specifically for **spoken Emirati Arabic (اللهجة الإماراتية المحكية)**.
+ It is designed to generate **natural, fluent, and culturally relevant** responses for daily-life conversations, rather than formal Modern Standard Arabic (MSA).
+
+ ---
+
+ ## Model Summary
+
+ * **Model type:** Causal LM, instruction-tuned for chat.
+ * **Language:** Emirati Arabic dialect (spoken style).
+ * **Fine-tuning:** 3 epochs with LoRA adapters.
+ * **Frameworks:** [Unsloth](https://github.com/unslothai/unsloth) + [TRL](https://github.com/huggingface/trl).
+ * **Dataset:** 12,324 synthetic Emirati Arabic Q&A pairs generated using **GPT-5** and **GPT-4o**.
+
+ ---
+
+ ## Dataset
+
+ * **Size:** 12,324 examples.
+ * **Source:** Synthetic Q&A pairs created via GPT-5 + GPT-4o, filtered for Emirati dialect.
+ * **Domains covered:**
+   * Daily-life conversations (shopping, weather, greetings, family, transport).
+   * Social and cultural events (Eid, weddings, gatherings).
+   * Household and personal routines.
+ * **Format:** Chat-style examples with `<|im_start|>user` / `<|im_start|>assistant` tokens, e.g.:
+
+ ```text
+ <|startoftext|><|im_start|>user
+ شو تسوي إذا انقطع الإنترنت في البيت؟<|im_end|>
+ <|im_start|>assistant
+ أول شي أتصل بالشركة، وإذا ما ردوا أستخدم داتا التلفون لين يرجع النت.<|im_end|>
+ ```
+
+ *(English: user: "What do you do if the internet goes out at home?" / assistant: "First thing, I call the company, and if they don't answer I use my phone's data until the internet comes back.")*
+
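+ As a minimal sketch (not the card's own tooling), one way such a pair can be rendered into this chat format is via the tokenizer's built-in chat template, assuming the released tokenizer ships the matching `<|im_start|>` template:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("yasserrmd/kallamni-1.2b-v1m")
+
+ # One Q&A pair, using the example above.
+ example = [
+     {"role": "user", "content": "شو تسوي إذا انقطع الإنترنت في البيت؟"},
+     {"role": "assistant", "content": "أول شي أتصل بالشركة، وإذا ما ردوا أستخدم داتا التلفون لين يرجع النت."},
+ ]
+
+ # tokenize=False returns the formatted string rather than token ids,
+ # which is convenient for inspecting or exporting training examples.
+ print(tokenizer.apply_chat_template(example, tokenize=False))
+ ```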
+ ---
+
+ ## ⚙️ Training
+
+ * **Frameworks:**
+   * **Unsloth** → optimized fine-tuning, memory efficiency, ~2× faster training.
+   * **TRL (SFTTrainer)** → supervised fine-tuning with instruction alignment.
+ * **Base model:** Lightweight 1.2B causal LM (unsloth/LFM2-1.2B).
+ * **Epochs:** 3 full passes over the dataset.
+ * **Fine-tuning strategy:**
+   * LoRA adapters on attention + MLP layers.
+   * Chat template applied consistently with TRL.
+
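+ The full training script is not published with this card. The snippet below is only a rough sketch of the setup described above (Unsloth-loaded base, LoRA on attention + MLP projections, 3 epochs with TRL's `SFTTrainer`); all hyperparameters except the epoch count are placeholders, the dataset path is hypothetical, and argument names vary across TRL versions:
+
+ ```python
+ from unsloth import FastLanguageModel
+ from trl import SFTTrainer
+ from transformers import TrainingArguments
+ from datasets import load_dataset
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="unsloth/LFM2-1.2B",
+     max_seq_length=2048,
+ )
+
+ # LoRA adapters on attention + MLP layers, as stated above.
+ # Target-module names follow the common convention and may
+ # differ for LFM2's actual architecture.
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r=16,
+     lora_alpha=16,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],
+ )
+
+ # Expects one chat-template-formatted string per example in a "text" field.
+ dataset = load_dataset("json", data_files="kallamni_sft.jsonl")["train"]
+
+ trainer = SFTTrainer(
+     model=model,
+     tokenizer=tokenizer,
+     train_dataset=dataset,
+     dataset_text_field="text",
+     args=TrainingArguments(
+         num_train_epochs=3,  # 3 full passes, per the card
+         per_device_train_batch_size=2,
+         learning_rate=2e-4,
+         output_dir="outputs",
+     ),
+ )
+ trainer.train()
+ ```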
+ ---
+
+ ## Usage
+
+ You can load and run the model with `transformers`:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load model and tokenizer
+ model_id = "yasserrmd/kallamni-1.2b-v1m"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype="bfloat16",
+     # attn_implementation="flash_attention_2",  # uncomment if your GPU supports it
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Build the chat-formatted prompt.
+ # The question asks: "What do you do if the internet goes out at home?"
+ prompt = "شو تسوي إذا انقطع الإنترنت في البيت؟"
+ input_ids = tokenizer.apply_chat_template(
+     [{"role": "user", "content": prompt}],
+     add_generation_prompt=True,
+     return_tensors="pt",
+     tokenize=True,
+ ).to(model.device)
+
+ # Sample a response; low temperature keeps answers focused while
+ # still allowing natural variation.
+ output = model.generate(
+     input_ids,
+     do_sample=True,
+     temperature=0.3,
+     min_p=0.15,
+     repetition_penalty=1.05,
+     max_new_tokens=256,
+ )
+
+ print(tokenizer.decode(output[0], skip_special_tokens=False))
+
+ # Example output:
+ # <|startoftext|><|im_start|>user
+ # شو تسوي إذا انقطع الإنترنت في البيت؟<|im_end|>
+ # <|im_start|>assistant
+ # أول شي أتصل بالشركة، وإذا ما ردوا أستخدم داتا التلفون لين يرجع النت.<|im_end|>
+ ```
+
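+ For interactive use you may prefer to stream tokens as they are generated rather than wait for the full completion. A small variant of the call above using `transformers`' `TextStreamer`, reusing `model`, `tokenizer`, and `input_ids` from the snippet:
+
+ ```python
+ from transformers import TextStreamer
+
+ # Prints tokens to stdout as they are generated; skip_prompt hides
+ # the echoed input, and the decode kwarg strips special tokens.
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+ model.generate(
+     input_ids,
+     streamer=streamer,
+     do_sample=True,
+     temperature=0.3,
+     max_new_tokens=256,
+ )
+ ```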
+ ---
+
+ ## Performance
+
+ * **Dialect accuracy:** ~85% Emirati consistency.
+ * **Answer relevance:** ~90% rated good or semi-good.
+ * **Weak cases:** occasional semi-formal phrasing or generic filler.
+ * **Strengths:**
+   * Culturally aligned Emirati expressions.
+   * Natural conversational length (at least 8–15 words).
+   * Balanced coverage of family, work, travel, and social contexts.
+
+ ---
+
+ ## Intended Use
+
+ * **Chatbots & voice assistants** for Emirati Arabic (see the sketch below).
+ * **Language-learning tools** for practicing the dialect.
+ * **Dataset building block** for Gulf Arabic LLM research.
+
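+ As a sketch of the chatbot use case, here is a minimal multi-turn loop that keeps the running conversation in chat form; it reuses `model` and `tokenizer` from the Usage section and the same sampling settings:
+
+ ```python
+ # Minimal multi-turn REPL sketch; the history list grows each turn
+ # and is re-rendered through the chat template on every request.
+ history = []
+ while True:
+     user_msg = input("user> ")
+     history.append({"role": "user", "content": user_msg})
+     input_ids = tokenizer.apply_chat_template(
+         history,
+         add_generation_prompt=True,
+         return_tensors="pt",
+     ).to(model.device)
+     output = model.generate(
+         input_ids,
+         do_sample=True,
+         temperature=0.3,
+         max_new_tokens=256,
+     )
+     # Decode only the newly generated tokens.
+     reply = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
+     print("assistant>", reply)
+     history.append({"role": "assistant", "content": reply})
+ ```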
141
+ ---
142
+
143
+ ## Limitations
144
+
145
+ * May mix in some MSA or generic Arabic in rare cases.
146
+ * Not suitable for factual QA outside daily conversations.
147
+ * Not designed for professional/legal/medical contexts.
148
+
149
+ ---
150
+
151
+ ## Acknowledgements
152
+
153
+ * **Unsloth** team for efficient fine-tuning tooling.
154
+ * **TRL** from Hugging Face for alignment training.
155
+ * Synthetic dataset generation powered by **GPT-5** and **GPT-4o**.