rsepulvedat committed
Commit 291abf6 · verified · 1 Parent(s): 7e01f5a

Upload 8 files
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,300 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - ca
+ metrics:
+ - accuracy
+ - bleu
+ base_model:
+ - BSC-LT/salamandra-2b
+ ---
+ # Aitana-2B-S
+
+ ## Table of Contents
+ <details>
+ <summary>Click to expand</summary>
+
+ - [Model description](#model-description)
+ - [Intended uses and limitations](#intended-uses-and-limitations)
+ - [How to use](#how-to-use)
+ - [Training](#training)
+ - [Evaluation](#evaluation)
+ - [Additional information](#additional-information)
+
+ </details>
+
+ ## Model description
+
+ Aitana-2B-S is a generative language model with a decoder-only architecture. It was continuously pre-trained from Salamandra-2B on data in Valencian, a minority language closely related to Catalan, in order to give that language greater representation. Training ran for two epochs and processed 2.12 billion tokens in total. Because of the data sources used, the political and administrative domains are strongly represented in the model's register. The data was anonymised during pre-processing to avoid training on information that could violate people's privacy.
+
+ The model uses Salamandra-2B as its base and keeps the same tokenizer.
+
+ ## Intended uses and limitations
+
+ Aitana-2B-S is a base model for causal language modeling. It can be used as-is for text generation, although fine-tuning or instruction-tuning on specific tasks is recommended for final use.
+
+ The model was trained on data in a formal register, mainly from the administrative and political domains, so its output in text-generation tasks can be expected to follow the same style.
+
+ ## How to use
+ ```python
+ import torch
+ from transformers import pipeline, AutoTokenizer
+
+ input_text = "Les corts valencianes han pres la decisió de"
+
+ model_id = "gplsi/Aitana-2B-S"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ generator = pipeline(
+     "text-generation",
+     model=model_id,
+     tokenizer=tokenizer,
+     torch_dtype=torch.bfloat16,
+     trust_remote_code=True,
+     device_map="auto",
+ )
+ generation = generator(
+     input_text,
+     do_sample=True,
+     top_k=10,
+     eos_token_id=tokenizer.eos_token_id,
+ )
+ print(f"Result: {generation[0]['generated_text']}")
+ ```
+
+ ## Training
+
+ ### Training data
+
+ The training corpus was obtained by web scraping public data from sources such as the [Official Gazette of the University of Alicante (BOUA)](https://www.boua.ua.es/ca) and [the Official Gazette of the Generalitat Valenciana (DOGV)](https://dogv.gva.es/va), together with data provided directly by [the Valencian Parliament (DSCV and DSCCV)](https://www.cortsvalencianes.es/ca-va/), giving a total of 1.304 million tokens, as broken down in the following table.
+
+ | Dataset | Language | Total Sentences | Total Words | Total Numbers | Other Symbols | Unique Words | Total Tokens | Average Sentence Length | Average Word Length |
+ |---------|----------|-----------------|-------------|---------------|---------------|--------------|--------------|-------------------------|---------------------|
+ | **BOUA** | va | 0.606M | 12.355M | 0.488M | 0.055M | 0.211M | 12.899M | 21.27 | 4.89 |
+ | **DOGCV** | va | 4.569M | 50.566M | 6.339M | 0.613M | 17.436M | 57.517M | 12.59 | 4.68 |
+ | **DOGV** | va | 18.598M | 311.380M | 24.138M | 2.731M | 11.416M | 338.250M | 18.19 | 4.88 |
+ | **DSCCV** | va | 2.353M | 46.116M | 0.554M | 2.352M | 5.031M | 46.672M | 19.84 | 4.56 |
+ | **DSCV** | va | 1.646M | 32.496M | 0.433M | 1.427M | 3.796M | 32.930M | 20.01 | 4.65 |
+ | **UN** | va | 0.394M | 12.289M | 0.253M | 0.015M | 0.533M | 12.556M | 31.86 | 4.86 |
+ | **VJ** | va | 0.913M | 23.594M | 0.466M | 23.314M | 0.849M | 24.084M | 26.39 | 4.57 |
+
+ Several of the scraped sources had already been used to train the base model (Salamandra-2B), so the data-collection date of that model was taken into account and those web pages were scraped only from that date onwards.
+
+ Information on the datasets used for training is given below:
+
+ - Official Bulletin of the University of Alicante (BOUA): documents issued by the University of Alicante concerning grants, regulations, and resolutions of laws, published periodically; specifically, the Valencian version.
+
+ - Legacy Official Journal of the Generalitat Valenciana (DOGCV): historical documents issued by the Valencian Community. They were initially recorded on paper and digitised when the digital format became standard. They cover the same subject matter as the DOGV documents but were produced between 1980 and 1997.
+
+ - Official Journal of the Generalitat Valenciana (DOGV): official communications of the Valencian Community, mainly laws, legal measures, and public-sector communication, issued from 1998 to 2023.
+
+ - Valencian Parliament Diary Dataset (DSCCV): records from the various committee meetings held in the parliament, with each meeting documented in a separate text file.
+
+ - Journal of the Valencian Parliament (DSCV): transcripts of the meetings held in the parliament's plenary sessions, covering 1999 to 2022.
+
+ - University news (UN): news in a colloquial register from universities where Valencian is an official language, including the universities of Valencia, Alicante, Jaume I, and the Polytechnic University of Valencia.
+
+ - Valencian Journals (VJ): a set of 10 different Valencian journals in a colloquial register, complementing the legal and bureaucratic documents of the other sources.
+
+ ### Training parameters
+
+ To preserve a large context window at generation time, an input size of 2048 tokens was used, with a minimum context window of 512 tokens when input sequences had to be truncated. As the table below shows, 95% of the data was used for the training stage and 5% for the evaluation stage. The parameters used during training are summarised in the following table:
+
+ | Parameter | Value |
+ |---------------------------------|-------|
+ | Epochs | 2 |
+ | Learning Rate | 2e-5 |
+ | Warmup Steps | 0 |
+ | Precision | bf16 |
+ | Weight Decay | 1e-1 |
+ | Training Fraction | 0.95 |
+ | Evaluation Fraction | 0.05 |
+ | Input Size (tokens) | 2048 |
+ | Minimum Context Window (tokens) | 512 |
+
+ ### Distributed Training Strategy
+
+ Training used Fully Sharded Data Parallel (FSDP): the model was sharded across the 4 A100 GPUs available for training, with a per-device mini-batch size of 1 and 64 gradient-accumulation steps.
+
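
The hyper-parameters above and the FSDP strategy map naturally onto the Hugging Face `Trainer`. A minimal sketch of such a configuration, assuming the built-in FSDP integration of `transformers` was used (the card does not say which training framework drove the run); `output_dir` and the sharding policy string are illustrative placeholders:

```python
from transformers import TrainingArguments

# Hypothetical configuration mirroring the reported hyper-parameters.
training_args = TrainingArguments(
    output_dir="aitana-2b-s",        # placeholder
    num_train_epochs=2,              # Epochs
    learning_rate=2e-5,              # Learning Rate
    warmup_steps=0,                  # Warmup Steps
    bf16=True,                       # Precision
    weight_decay=0.1,                # Weight Decay (1e-1)
    per_device_train_batch_size=1,   # mini-batch size of 1
    gradient_accumulation_steps=64,  # 64 gradient-accumulation steps
    fsdp="full_shard auto_wrap",     # shard the model across the GPUs
)
```

Launched with 4 processes (e.g., via `torchrun --nproc_per_node 4`), this gives the effective batch of 4 GPUs x 1 sequence x 64 accumulation steps described above.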
+ ### Languages
+
+ In addition to the data already used to train the base model (Salamandra-2B), data entirely in Valencian from the sources described in the previous section was used.
+
+ ## Evaluation
+
+ The following tables compare results on several benchmarks with those of the base model used for continual pre-training. All results were obtained from the pre-trained model as-is; no instruction tuning or fine-tuning of any kind was performed.
+
+ ### Valencian
+
+ #### Classification Benchmarks
+
+ | Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S |
+ |---------|-------|------|--------|---------------|-------------|
+ | XNLI | va | Natural Language Inference | acc | **0.475** | 0.473 |
+
+ #### Generation Benchmarks
+
+ | Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S |
+ |---------|-------|------|--------|---------------|-------------|
+ | Cocoteros | va | Reading Comprehension | bleu | **6.32** | 5.76 |
+ | Phrases ca-va | ca-va | Translation - Adaptation | bleu | 79.82 | **81.92** |
+ | Phrases va-ca | va-ca | Translation - Adaptation | bleu | **78.05** | 76.53 |
+ | Phrases va-es | va-es | Translation | bleu | **76.04** | 75.99 |
+ | Phrases es-va | es-va | Translation | bleu | 58.86 | **61.51** |
+
+ ### Catalan
+
+ #### Classification Benchmarks
+
+ | Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S |
+ |---------|-------|------|--------|---------------|-------------|
+ | Belebele Cat_latn | ca | Reading Comprehension | acc | 0.231 | **0.257** |
+ | COPA | ca | Commonsense Reasoning | acc | 0.700 | **0.712** |
+ | XStoryCloze | ca | Commonsense Reasoning | acc | 0.655 | **0.657** |
+ | OpenBookQA | ca | Question Answering | acc | **0.294** | 0.282 |
+ | PAWS | ca | Paraphrasing | acc | **0.556** | 0.551 |
+ | PiQA | ca | Question Answering | acc | 0.643 | **0.646** |
+ | SiQA | ca | Question Answering | acc | **0.434** | 0.432 |
+ | ARC Easy | ca | Question Answering | acc | **0.551** | 0.549 |
+ | ARC Challenge | ca | Question Answering | acc | **0.290** | 0.288 |
+ | XNLI | ca | Natural Language Inference | acc | 0.473 | **0.480** |
+ | Teca | ca | Natural Language Inference | acc | **0.465** | 0.459 |
+ | WNLI | ca | Natural Language Inference | acc | **0.577** | 0.563 |
+ | Catcola | ca | Linguistic Acceptability | acc | **0.543** | 0.525 |
+ | Catcola | ca | Linguistic Acceptability | mcc | **0.046** | 0.023 |
+ | Catalanqa | ca | Question Answering | F1 | **0.668** | 0.655 |
+ | MGSM Direct | ca | Math | exact match | 0.024 | **0.028** |
+ | Catalanqa | ca | Question Answering | exact match | **0.437** | 0.415 |
+ | Xquad | ca | Question Answering | exact match | **0.371** | 0.354 |
+ | Xquad | ca | Question Answering | F1 | **0.579** | 0.566 |
+
+ #### Generation Benchmarks
+
+ | Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S |
+ |---------|-------|------|--------|---------------|-------------|
+ | Cabreu abstractive | ca | Summarization | bleu | 5.78 | **6.24** |
+ | Cabreu extractive | ca | Summarization | bleu | **42.89** | 41.19 |
+ | Cabreu extreme | ca | Summarization | bleu | 3.29 | **3.81** |
+
+ ### Spanish
+
+ #### Classification Benchmarks
+
+ | Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S |
+ |---------|-------|------|--------|---------------|-------------|
+ | Belebele Spa_latn | es | Reading Comprehension | acc | **0.228** | 0.224 |
+ | PAWS | es | Paraphrasing | acc | **0.561** | 0.543 |
+ | XNLI | es | Natural Language Inference | acc | **0.439** | 0.422 |
+ | WNLI | es | Natural Language Inference | acc | 0.563 | 0.563 |
+ | XStoryCloze | es | Commonsense Reasoning | acc | **0.653** | 0.652 |
+ | Escola | es | Linguistic Acceptability | acc | **0.593** | 0.536 |
+ | Escola | es | Linguistic Acceptability | mcc | **0.031** | 0.010 |
+ | OpenBookQA | es | Question Answering | acc | 0.308 | **0.314** |
+ | MGSM Direct | es | Math | exact match | 0.020 | 0.020 |
+ | XQUAD | es | Question Answering | exact match | **0.377** | 0.373 |
+ | XQUAD | es | Question Answering | F1 | **0.584** | 0.583 |
+
+ #### Generation Benchmarks
+
+ | Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S |
+ |---------|-------|------|--------|---------------|-------------|
+ | Cocoteros | es | Reading Comprehension | bleu | **8.46** | 7.35 |
+ | XLSum | es | Summarization | bleu | **0.801** | 0.434 |
+
+ ### English
+
+ #### Classification Benchmarks
+
+ | Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S |
+ |---------|-------|------|--------|---------------|-------------|
+ | ARC Challenge | en | Question Answering | acc | 0.370 | **0.374** |
+ | ARC Easy | en | Question Answering | acc | **0.722** | 0.719 |
+ | Belebele Eng_latn | en | Reading Comprehension | acc | 0.216 | **0.229** |
+ | PAWS | en | Paraphrasing | acc | 0.561 | **0.562** |
+ | XNLI | en | Natural Language Inference | acc | **0.462** | 0.446 |
+ | XStoryCloze | en | Commonsense Reasoning | acc | 0.711 | **0.713** |
+ | OpenBookQA | en | Question Answering | acc | 0.300 | **0.308** |
+ | PiQA | en | Question Answering | acc | 0.737 | **0.743** |
+ | Social IQA | en | Question Answering | acc | **0.454** | 0.451 |
+ | WNLI | en | Natural Language Inference | acc | 0.465 | **0.578** |
+ | MGSM Direct | en | Math | exact match | 0.064 | 0.064 |
+ | TriviaQA | en | Question Answering | exact match | -0.019 | **0.015** |
+
+ ## Additional information
+
+ ### Author
+
+ Language and Information Systems Group (GPLSI)
+
+ ### Contact
+
+ For further information, please send an email to GPLSI.
+
+ ### Copyright
+
+ Copyright (c) 2025 by [GPLSI](https://gplsi.dlsi.ua.es/).
+
+ ### License
+
+ [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+ ### Funding
+
+ This work was funded by the [ILENIA](https://proyectoilenia.es/)-[VIVES](https://vives.gplsi.es/) project (2022/TL22/00215334).
+
+ ### Disclaimer
+
+ The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.
+
+ Be aware that the model may have biases or other undesirable distortions.
+
+ When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it), or become users of the model themselves, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including those regarding the use of Artificial Intelligence.
+
+ In no event shall the owner and creator of the model (GPLSI) be liable for any results arising from the use made by third parties.
config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "_name_or_path": "BSC-LT/salamandra-2b",
+   "architectures": [
+     "LlamaForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": 5440,
+   "max_position_embeddings": 8192,
+   "mlp_bias": false,
+   "model_type": "llama",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "num_key_value_heads": 16,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": null,
+   "rope_theta": 10000.0,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.44.2",
+   "use_cache": false,
+   "vocab_size": 256000
+ }
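
These values confirm the architecture inherited from Salamandra-2B and can be inspected without downloading the weights; a minimal sketch using `transformers`:

```python
from transformers import AutoConfig

# Fetches only config.json from the Hub.
config = AutoConfig.from_pretrained("gplsi/Aitana-2B-S")
print(config.model_type)         # llama
print(config.hidden_size)        # 2048
print(config.num_hidden_layers)  # 24
print(config.vocab_size)         # 256000
```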
generation_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "do_sample": true,
+   "temperature": 0.1,
+   "top_p": 0.95,
+   "max_new_tokens": 40,
+   "repetition_penalty": 1.2,
+   "transformers_version": "4.44.2"
+ }
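
These values become the model's default sampling settings: `transformers` loads generation_config.json automatically with the model, and any argument passed to `generate()` overrides the stored default. A minimal sketch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gplsi/Aitana-2B-S"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Les corts valencianes han pres la decisió de", return_tensors="pt")

# Uses the stored defaults (do_sample=True, temperature=0.1, top_p=0.95,
# repetition_penalty=1.2) except where overridden, as with max_new_tokens here.
output = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```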
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2161af68419ee601de3bf72f0fe1fb532cee2deabb71f2e745a78e75b2e22a1b
+ size 4507005744
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "</s>",
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c205db342a8a7321a4147629d806caac1c70a8cb1cebf4c0f3636dec7a3452e0
+ size 19092405
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ab94ddf46d14f0279254858d53770c5319c5129d47291ee2bada530271cb1292
+ size 4813276
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "add_prefix_space": true,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "5": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "bos_token": "<s>",
+   "chat_template": "{%- if not date_string is defined %}{%- set date_string = \"2025-06-30\" %}{%- endif %}{%- set system_message = messages[0].content if messages[0].role == \"system\" else \"I am Aitana, experimental model developed at the University of Alicante by the Language and Information Systems Group (GPLSI). My knowledge base was last updated on May 2025. Today Date: \"+ date_string +\"\nSoy Aitana, un modelo experimental desarrollado en la Universidad de Alicante por el Grupo de Procesamiento del Lenguaje y Sistemas de Información (GPLSI). Mi base de conocimiento se actualizó por última vez en mayo de 2025.\nSoc Aitana, un model experimental desenvolupat a la Universitat d'Alacant pel Grup de Processament del Llenguatge i Sistemes d'Informació (GPLSI). La meua base de coneixement es va actualitzar per última vegada en maig del 2025.\" -%}{%- if messages[0].role == \"system\" -%}{%- set messages = messages[1:] -%}{%- endif -%}{{ \"<|im_start|>system\n\" + system_message + \"<|im_end|>\n\" }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": true,
+   "model_max_length": 8192,
+   "pad_token": "<unk>",
+   "padding_side": "right",
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }
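
The tokenizer ships a ChatML-style `chat_template` with a trilingual default system message. Since Aitana-2B-S is a base model, the template is mainly relevant after instruction tuning, but it can be rendered as follows; a minimal sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gplsi/Aitana-2B-S")

messages = [{"role": "user", "content": "Qui eres?"}]

# Renders the template without tokenizing; add_generation_prompt appends the
# opening <|im_start|>assistant tag. Because no system message is supplied,
# the default trilingual system message from the template is injected.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```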