---
library_name: transformers
tags: []
---

# SabiYarn

Test the model's full generation capabilities here: https://huggingface.co/spaces/BeardedMonster/SabiYarn_125M

A model pretrained on Nigerian languages, as well as English, using a multi-task causal language modeling (CLM) objective.

## Model Details

### Model Description

SabiYarn-125M is the first in a series of transformer models (adapted from nanoGPT and inspired by GPT-J's architecture) pretrained on a large corpus of Nigerian language data in a self-supervised fashion. This means it was pretrained on raw text only, with no human labeling of any kind (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.

Concretely, inputs are sequences of continuous text of a fixed length, and the targets are the same sequences shifted one token (a word or piece of a word) to the right. The model internally uses a masking mechanism to ensure that the prediction for token i uses only the inputs from 1 to i and never the future tokens. It also ensures that attention is not computed across document boundaries.
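
As an illustration, here is a minimal sketch of this input/target construction (a toy example, not the actual training code):

```python
import torch

# Toy sketch of causal-LM data construction: the target sequence is the
# input sequence shifted one position to the right.
tokens = torch.tensor([15, 42, 7, 99, 3, 12])  # toy token ids

inputs = tokens[:-1]   # [15, 42, 7, 99, 3]
targets = tokens[1:]   # [42, 7, 99, 3, 12] -- token i's label is token i+1

# A lower-triangular causal mask ensures position i attends only to
# positions <= i, never to future tokens.
seq_len = inputs.size(0)
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
```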

This way, the model learns an inner representation of the languages that can then be used to extract features useful for downstream tasks. The model is, however, best at what it was pretrained for: generating coherent text.

This is the smallest version, with 125M parameters. 

- **Developed by:** Aletheia.ai Research Lab
- **Funded by [optional]:** Personal
- **Shared by [optional]:** Jeffreypaul
- **Model type:** GPTJX (adapted from nanoGPT)
- **Language(s) (NLP):** Mainly English, Yoruba, Hausa, Igbo, and Nigerian Pidgin, plus some others: Fulah/Fulfulde, Efik, and Urhobo.


### Model Sources [optional]

- **Demo:** https://huggingface.co/spaces/BeardedMonster/SabiYarn_125M

## Uses

You can use the raw model for text generation or fine-tune it for a downstream task.

## Bias, Risks, and Limitations

The training data used for this model is mostly an aggregation of datasets available on Hugging Face for Nigerian languages. We know it contains a lot of unfiltered content from the internet, which is far from neutral.

Because large-scale language models of this size do not distinguish fact from fiction, we do not support use cases that require the generated text to be true.

Additionally, language models often reflect the biases inherent in the systems they were trained on, so we do not recommend deploying them in systems that interact with humans unless the deployers first carry out a study of the biases relevant to the intended use case.


### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.


## How to Get Started with the Model

Use the code below to get started with the model.
**Use transformers version 4.41.2 for correct text generation.**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_length=100,            # Maximum length of the generated sequence
    num_beams=5,               # Number of beams for beam search
    do_sample=True,            # Whether to use sampling instead of greedy decoding
    temperature=0.9,           # Sampling temperature
    top_k=50,                  # Top-K sampling
    top_p=0.95,                # Top-P (nucleus) sampling
    repetition_penalty=2.0,    # Repetition penalty to reduce repetitive outputs
    length_penalty=1.7,        # Length penalty to favor longer sequences
    early_stopping=True        # Stop early when all beams have finished
)

repo_name = "BeardedMonster/SabiYarn-125M"
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)

# Test on Urhobo
input_ids = tokenizer("Eshare nana ri vwo ẹguọnọ rẹ iyono rẹ Aristotle vẹ Plato na,", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

# Output
""" ọ da tobọ dianẹ ayen rhọnvwe kerhọ-ọ. Ọtiọyena, e de ruiruo aghwoghwo ọkieje. (1 Kọr. 7:9; 1 Kọr. 12:2) Vwọrẹ uyota"""

# Test on Efik
input_ids = tokenizer("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

# Output
""". Edi ediwak nditọ Israel ẹtịn̄ ẹnọ nnyịn mîkemeke ndinam n̄kpọ Abasi.|end_of_text|Ebe foto si, Getty Images Ebe foto si, Getty Images Nkọwa foto, Ndị"""

input_ids = tokenizer("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk mme Jew oro esịt okobụn̄ọde ke ntak idiọkido ke Israel, oro ẹkenyụn̄ ẹdude ke mfụhọ ke itie-ufụn mme nsunsu ido edinam Ido Ukpono Mme Jew eke akpa isua ikie.", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

# Output
"""Kûsịn idem nnyịme ndifiọk nditọete nnyịn inemesịt onyụn̄ anam nnyịn ikpọn̄utom nnyịn. (Matt. 26:31; Luke 22:42"""

# Test on English
input_ids = tokenizer("How are you?", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

# Output
"""I'm doing alright, thanks for asking. How about you? I'm doing well too. Thanks for asking. So, what have you been up to lately? Not much, just hanging out with friends and family. You know how it is. Yeah,"""

# Test on Yoruba
input_ids = tokenizer("Awọn eeyan Cairo, ni Egypt ti bẹrẹ si n to lawọn ileesẹ to n ṣe burẹdi bayii.", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

# Output
"""|end_of_text|Ti o ba fẹ lati wa ni irú si rẹ awọn olumulo, o le se amọnà wọn taara sinu wa àwárí ojúewé. Yi ni asopọ, iwọ yoo wa wa julọ gbajumo reluwe ipa- -- https://www.saveatrain.com/rout"""

# Test on Igbo
input_ids = tokenizer("N'ala Igbo, ọtụtụ ndị mmadụ kwenyere na e nwere mmiri ara na elu-ilu", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

# Output
""". Ọ bụ ezie na ọ bụ otu n'ime ihe ndị kasị dị ịrịba ama na nke kachasị ewu ewu na Africa, a na-elekarị ya anya dị ka otu n'ime ndị kasị baa ọgaranya n'ụwa.
Nkọwapụta 
Ebe nrụọrụ weebụ na-ahụ maka gburugburu ebe"""

# Test on FulFulde/Fulah
input_ids = tokenizer("Jos un peeta gallure nɗer ɗi woyla caaka ɓanngeere lardu Naajeeriya. Gelle ɗen haa e ɗuuɗiri ɗun kamano", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

# Output
"""jogiiji maɓɓe nder lesdi Naajeeriya. |end_o|end_of_text|** Muhammadu_Buhari ** Muhammadu Buhari ko leydi e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukum"""

input_ids = tokenizer("Si hooreejo leydi on (himo wi’ee kadi persidan) accitii laamu, ko woote waɗetee, ɓurɗo jogaade yimɓe on halfinee laamu yeru happu.", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

# Output
"""|end_of_text|So en nganndii e hitaande 2010, o wiyi : “ko ñalawma hannde golle pulaar walla mbiyen jogiiɗo”. Eɗen mbaawi wiyde «u2008"""

# Test on Hausa
input_ids = tokenizer("Ministan ya ƙara da cewa dole ne Mista Netanyahu ya sanya ranar da", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

# Output
"""za a rantsar da shi a matsayin shugaban ƙasar Isra'ila.|end_of_text|Home > Products > Kamarar Tsaro Ta Cctv (Lambobin 24 Kamarar Tsaro Ta Cctv)
Kamarar Tsaro Ta Cctv - ma'aikata, ma'aikata, mai sayarwa daga Sin
Mu masu sana'a ne Kam"""

# Test on Pidgin
input_ids = tokenizer('Di protesters wey dey wear black and red shirt tok say "enough be enough"', return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

# Output
"""for di protest.|end_of_text|Wia dis foto come from, AFP/Getty Images Wetin we call dis foto, Some of di people wey dem arrest on top social media na one of di main reasons why some of di protesters enta street to protest against"""

```

Other tasks (e.g. translation, classification) typically use two tags: the first signifies the type of task, and the second marks the end of the input, prompting the model to begin generation. The tags are as follows:
- Translation
  ```python
  <translate> <yor>, <translate> .... <ibo>, <translate> ... <hau>
  ```
- Instruction following
  ```python
   <prompt><response>
  ```
- Sentiment Analysis
  ```python
   <classify> .... <sentiment>
  ```
- Topic Classification
  ```python
   <classify> .... <topic>
  ```
- Text summarization
  ```python
  <summarize> ... <summary>
  ```
- Headline Generation
  ```python
  <topic>... <headline>
  ```
- Text Diacritization
  ```python
   <diacritize>.... <yor>
  ```
- Question answering
  ```python
  <qa> <context>..... <question> .... <options>...<answer>  or  <qa> <context> .... <answer>
  # The formats below were noted to work better:
  <prompt> Context:... Question:... <response>  or  <prompt> Context:... Question:... Option A. Option B. ... <response>  or  <prompt> Context_question_options here <response>
  ```
- Named Entity Recognition
  ```python
   <NER>.... <tag>
  ```
- Text cleaning
   ```python
   <clean>...<correct>
  ```

You should typically place the user's input between these two tags, as shown in the sketch below. Note that the model currently does not perform very well on NER due to the scarcity of data for that task.
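
For example, a translation prompt wraps the user's input between the task tag and the target-language tag. A minimal sketch, reusing the loading code from the example above (the exact spacing around the tags is an assumption; follow the formats listed here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_name = "BeardedMonster/SabiYarn-125M"
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)

# English -> Yoruba translation: the user's input sits between the two tags.
prompt = "<translate> How are you? <yor>"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0][len(input_ids[0]):]))

# Sentiment analysis follows the same two-tag pattern:
prompt = "<classify> I am very happy with this product. <sentiment>"
```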

## Training Details

### Training Data

We wanted to train this model on as large a corpus as possible. To build it, we collated all relevant datasets on Hugging Face and additionally scraped a few websites. The resulting dataset weighs 43 GB of text before cleaning and 28 GB after cleaning; it has not been publicly released.


### Training Procedure

#### Preprocessing [optional]

The texts are tokenized using the BLOOM tokenizer, retrained on our dataset with a vocabulary size of 52,050. The inputs are sequences of 1,024 consecutive tokens.
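
A minimal sketch of this step (the corpus below is a placeholder; the retrained tokenizer ships with the model repository):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BeardedMonster/SabiYarn-125M", trust_remote_code=True)
print(tokenizer.vocab_size)  # 52,050, per the description above

# Placeholder corpus: concatenate tokenized documents, then slice into
# fixed-length training sequences of 1,024 tokens.
block_size = 1024
corpus = ["document one ...", "document two ..."]
ids = []
for doc in corpus:
    ids.extend(tokenizer(doc)["input_ids"])
sequences = [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]
```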

#### Training Hyperparameters

- **Training regime:** The model was trained on a single GPU with an effective batch size of 409,600 tokens per update (400 sequences of 1,024 tokens) for over 800 steps.
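
A sketch of how such an effective batch size might be reached on a single GPU via gradient accumulation (the micro-batch size here is a hypothetical illustration, not the reported setting):

```python
# Effective tokens per optimizer update, as reported above.
effective_tokens = 409_600
block_size = 1024
sequences_per_update = effective_tokens // block_size   # 400

# Hypothetical micro-batch that fits on one GPU; gradients would be
# accumulated over this many forward/backward passes per optimizer step.
micro_batch = 8
grad_accum_steps = sequences_per_update // micro_batch  # 50
print(sequences_per_update, grad_accum_steps)
```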

## Evaluation

The model has not yet been evaluated.

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

### Model Architecture and Objective
The architecture is very similar to that of GPT-J.

## Model Card Authors [optional]

Jeffreypaul (BeardedMonster)