---
base_model:
- mistralai/Mistral-Nemo-Instruct-2407
language:
- ku
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- mistral
datasets:
- nazimali/kurdish-wikipedia-articles
library_name: transformers
---

Continued pre-training of `mistralai/Mistral-Nemo-Instruct-2407` on the Kurdish Wikipedia dataset using `unsloth`.
The goal of the pre-training was to improve Kurdish language understanding, so this model should be fine-tuned further before use.
The model is quantized with `bitsandbytes` to reduce memory usage. See the [bitsandbytes documentation](https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes#bitsandbytes).

There isn't a standard, or even a good, Kurdish evaluation metric (that I could find).
Creating one will be my next project, so that there's a reproducible baseline for Kurdish models.


I'll also look into a multi-GPU training setup so results don't take all day. I'd like to train on both Kurmanji and Sorani.


### Use

This model should be fine-tuned further for a specific task. See the instruction fine-tuned model [nazimali/Mistral-Nemo-Kurdish-Instruct](https://huggingface.co/nazimali/Mistral-Nemo-Kurdish-Instruct).
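
A minimal sketch of loading the model in 4-bit for further fine-tuning (the repo id and the specific 4-bit settings are assumptions, not taken from this card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit config mirroring the bitsandbytes quantization mentioned above
# (exact settings are an assumption)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="bfloat16",
)

model_id = "nazimali/Mistral-Nemo-Kurdish"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

From here the model can be wrapped with a PEFT/LoRA setup or loaded through `unsloth` for the actual fine-tuning run.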

### Training

- Transformers `4.44.2`
- 1× NVIDIA A100 80GB PCIe
- Duration: 6h 31m 4s

```json
{
  "total_flos": 4121524790259794000,
  "train/epoch": 1,
  "train/global_step": 1960,
  "train/grad_norm": 3.1958093643188477,
  "train/learning_rate": 0,
  "train/loss": 1.2108,
  "train_loss": 1.256846008738693,
  "train_runtime": 23227.1752,
  "train_samples_per_second": 2.7,
  "train_steps_per_second": 0.084
}
```
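
As a quick sanity check, the reported throughput numbers above are internally consistent and match one epoch over the filtered dataset:

```python
# Reported training stats (copied from the JSON above)
train_runtime = 23227.1752   # seconds
global_step = 1960
samples_per_second = 2.7

# Steps per second matches the reported 0.084
steps_per_second = global_step / train_runtime
print(round(steps_per_second, 3))  # 0.084

# Implied effective batch size per optimizer step
print(round(samples_per_second / steps_per_second))  # 32

# Total samples seen ≈ the 62,720 training rows × 1 epoch
# (slightly off because samples_per_second is rounded to 2.7)
print(round(samples_per_second * train_runtime))  # 62713
```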

#### Pre-training data:

- `nazimali/kurdish-wikipedia-articles`
    - Dataset rows: 63,076
    - Filtered on the `title` and `text` columns
      - Each must have at least 1 character
- Number of rows used for training: 62,720
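
A minimal sketch of the filter described above, using hypothetical rows (the real dataset would be loaded and filtered via the `datasets` library):

```python
# Hypothetical rows standing in for nazimali/kurdish-wikipedia-articles
rows = [
    {"title": "Kurdistan", "text": "Kurdistan herêmek e ..."},
    {"title": "", "text": "gotara bê sernav"},  # dropped: empty title
    {"title": "Sernav", "text": ""},            # dropped: empty text
]

# Keep rows where both title and text have at least 1 character
kept = [r for r in rows if len(r["title"]) >= 1 and len(r["text"]) >= 1]
print(len(kept))  # 1
```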

#### Training prompt format:

```python
# Kurdish template; in English: "Wikipedia article / ### Title: ... / ### Article: ..."
training_prompt = """Gotara Wikipedia
### Sernav: {}

### Gotar:
{}"""

# Fill the placeholders with an article's title and body, e.g.:
# example = training_prompt.format(row["title"], row["text"])
```