---
base_model: LiquidAI/LFM2-350M
tags:
- text-generation-inference
- transformers
- unsloth
- lfm2
- trl
- sft
- arabic
license: apache-2.0
language:
- ar
datasets:
- arbml/tashkeela
---

# Tashkeel-350M

**Arabic Diacritization Model** | **ู†ูŽู…ููˆุฐูŽุฌู ุชูŽุดู’ูƒููŠู„ู ุงู„ู†ูู‘ุตููˆุตู ุงู„ู’ุนูŽุฑูŽุจููŠูŽู‘ุฉู**

A 350M-parameter model for diacritizing Arabic text (tashkeel), built by fine-tuning `LiquidAI/LFM2-350M` on the `arbml/tashkeela` dataset; a sketch of a comparable training setup follows the links below.

- **Base Model:** [LiquidAI/LFM2-350M](https://huggingface.co/LiquidAI/LFM2-350M)
- **Dataset:** [arbml/tashkeela](https://huggingface.co/datasets/arbml/tashkeela)
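
The exact training script isn't part of this card, but the `trl`, `sft`, and `unsloth` tags point to a standard supervised fine-tune. The sketch below shows one plausible setup with TRL's `SFTTrainer`; the `text` and `diacritized` column names are hypothetical stand-ins for the dataset's actual fields, and the hyperparameters are illustrative only.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_id = "LiquidAI/LFM2-350M"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def to_chat(example):
    # Undiacritized text goes in the user turn; diacritized text is the target.
    # "text" and "diacritized" are hypothetical column names - check the dataset card.
    return {"messages": [
        {"role": "system", "content": "ุฃู†ุช ู†ู…ูˆุฐุฌ ู„ุชุดูƒูŠู„ ุงู„ู†ุตูˆุต ุงู„ุนุฑุจูŠุฉ."},
        {"role": "user", "content": example["text"]},
        {"role": "assistant", "content": example["diacritized"]},
    ]}

dataset = load_dataset("arbml/tashkeela", split="train").map(to_chat)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(output_dir="tashkeel-350m-sft", num_train_epochs=1),
)
trainer.train()
```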

### ูƒูŠููŠุฉ ุงู„ุงุณุชุฎุฏุงู…

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_id = "Etherll/Tashkeel-350M"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# ุฅุถุงูุฉ ุงู„ุชุดูƒูŠู„
prompt = "ุงู„ุณู„ุงู… ุนู„ูŠูƒู…" 
input_ids = tokenizer.apply_chat_template(
    [
      # System prompt, meaning: "You are a model for diacritizing Arabic text."
      {"role": "system", "content": "ุฃู†ุช ู†ู…ูˆุฐุฌ ู„ุชุดูƒูŠู„ ุงู„ู†ุตูˆุต ุงู„ุนุฑุจูŠุฉ."},
      {"role": "user", "content": prompt}
    ],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=256,  # raise for longer inputs; the output is the input plus diacritics
    do_sample=False,  # greedy decoding: diacritization has a single expected answer
)

print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```
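
For quick experiments, the high-level `pipeline` API is an equivalent, shorter route (a sketch assuming a recent `transformers` release, which accepts chat-style message lists directly):

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="Etherll/Tashkeel-350M", device_map="auto")

messages = [
    {"role": "system", "content": "ุฃู†ุช ู†ู…ูˆุฐุฌ ู„ุชุดูƒูŠู„ ุงู„ู†ุตูˆุต ุงู„ุนุฑุจูŠุฉ."},
    {"role": "user", "content": "ุงู„ุณู„ุงู… ุนู„ูŠูƒู…"},
]

result = pipe(messages, max_new_tokens=256, do_sample=False)
print(result[0]["generated_text"][-1]["content"])  # the assistant turn with diacritics added
```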

### Example
*   **Input:** `ุงู„ุณู„ุงู… ุนู„ูŠูƒู…`
*   **Output:** `ุงูŽู„ุณูŽู„ูŽุงู…ู ุนูŽู„ูŽูŠู’ูƒูู…ู’`
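
Diacritization should only add marks, never alter the base letters, so the output is easy to sanity-check: stripping the marks must recover the input exactly. A minimal helper, assuming the standard diacritic range U+064B–U+0652 (fathatan through sukun) is all the model emits:

```python
import re

# Arabic diacritics: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun
TASHKEEL = re.compile(r"[\u064B-\u0652]")

def strip_tashkeel(text: str) -> str:
    """Remove diacritic marks, leaving only the base letters."""
    return TASHKEEL.sub("", text)

# Round-trip check on the example above
assert strip_tashkeel("ุงูŽู„ุณูŽู„ูŽุงู…ู ุนูŽู„ูŽูŠู’ูƒูู…ู’") == "ุงู„ุณู„ุงู… ุนู„ูŠูƒู…"
```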

---

This LFM2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)