ilacunza committed
Commit 4a4887c · verified · 1 Parent(s): 11b3804

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,323 @@
1
+ ---
2
+ license: apache-2.0
3
+ library_name: transformers
4
+ pipeline_tag: translation
5
+ base_model:
6
+ - BSC-LT/salamandra-2b-instruct
7
+ ---
8
+
9
+
10
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/633b489acbdbadd99c0b75ef/MhsW4ODhK6ofYq8DnpyKc.png)
11
+
12
+ # SalamandraTA-2B-academic Model Card
13
+
14
+ This repository contains the SalamandraTA-2B-academic model, obtained by fine-tuning [Salamandra2B-Instruct](https://huggingface.co/BSC-LT/salamandraTA-2b-instruct) for machine translation.
15
+ This model has been obtained following the procedures shown in **CITE PAPER AS SOON AS AVAILABLE**.
16
+
17
+
18
+ > [!WARNING]
19
+ > **DISCLAIMER:** This version of Salamandra is tailored exclusively for translation tasks. Although it was obtained by fine-tuning an instructed model, its chat capabilities have not been tested; for chat use, please refer to the original [instructed version](https://huggingface.co/BSC-LT/salamandraTA-2b-instruct).
20
+
21
+
22
+ ---
23
+
24
+ ## Model Details
25
+
26
+ ### Architecture
27
+
28
+ | | |
29
+ |-------------------------|:--------------|
30
+ | Total Parameters | 2,253,490,176 |
31
+ | Embedding Parameters | 524,288,000 |
32
+ | Layers | 24 |
33
+ | Hidden size | 2,048 |
34
+ | Attention heads | 16 |
35
+ | Context length | 8,192 |
36
+ | Vocabulary size | 256,000 |
37
+ | Precision | bfloat16 |
38
+ | Embedding type | RoPE |
39
+ | Activation Function | SwiGLU |
40
+ | Layer normalization | RMS Norm |
41
+ | Flash attention | ✅ |
42
+ | Grouped Query Attention | ❌ |
43
+ | Num. query groups | N/A |
44
+
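+ The total and embedding parameter counts in the table can be verified directly from the released checkpoint. The following is a minimal sketch (loading on CPU in bfloat16; the expected values are the ones reported above):
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "LangTech-MT/salamandraTA-2B-academic",
+     torch_dtype=torch.bfloat16,
+ )
+
+ # Embedding parameters: vocabulary size (256,000) x hidden size (2,048) = 524,288,000
+ embedding_params = model.get_input_embeddings().weight.numel()
+ # Total parameters, including the untied output head
+ total_params = sum(p.numel() for p in model.parameters())
+
+ print(f"{total_params:,} total / {embedding_params:,} embedding")
+ # Expected per the table above: 2,253,490,176 total / 524,288,000 embedding
+ ```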
45
+ ---
46
+
47
+ ## Intended Use
48
+
49
+ ### Direct Use
50
+
51
+ The model is intended for both research and commercial use in any of the languages included in the training data for general machine translation tasks.
52
+
53
+ ### Out-of-scope Use
54
+
55
+ The model is not intended for malicious activities, such as harming others or violating human rights.
56
+ Any downstream application must comply with current laws and regulations.
57
+ Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.
58
+
59
+ ---
60
+
61
+ ## Hardware and Software
62
+
63
+ ### Training Framework
64
+
65
+ SalamandraTA-2B-academic was instruction-tuned using [FastChat](https://github.com/lm-sys/FastChat).
66
+
67
+ ### Compute Infrastructure
68
+
69
+ All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and
70
+ operated by Barcelona Supercomputing Center.
71
+
72
+ The accelerated partition is composed of 1,120 nodes with the following specifications:
73
+ - 4x Nvidia Hopper GPUs with 64GB HBM2 memory
74
+ - 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz with 32 cores each (64 cores total)
75
+ - 4x NDR200 (BW per node 800Gb/s)
76
+ - 512 GB of Main memory (DDR5)
77
+ - 460 GB of NVMe storage
78
+
79
+ ---
80
+
81
+
82
+ ## How to use
83
+
84
+ SalamandraTA-2B-academic was fine-tuned on the ACAD-Train dataset, which focuses on pairs involving English, Iberian Peninsula languages, and several Central European languages, namely: Asturian (ast), Catalan (ca), German (de), Greek (el), Spanish (es), English (en), Basque (eu), French (fr), Galician (gl), Italian (it), Dutch (nl) and Portuguese (pt). The dataset includes 48 unique language pairs. Since each pair is used for translation in both directions (e.g., English to Spanish and Spanish to English), this results in 96 supported translation directions. The most frequent language pairs, accounting for 96.5% of the dataset, are:
85
+
86
+
87
+ - English - Spanish (en-es)
88
+ - English - French (en-fr)
89
+ - English - Catalan (en-ca)
90
+ - Catalan - Spanish (ca-es)
91
+ - Spanish - French (es-fr)
92
+ - English - Portuguese (en-pt)
93
+
94
+ A comprehensive list of all language pairs can be found in the [ACAD-Train dataset](https://huggingface.co/datasets/LangTech-MT/ACAData) card.
95
+
96
+ The instruction-following model uses the commonly adopted ChatML template:
97
+
98
+ ```
99
+ <|im_start|>system
100
+ {SYSTEM PROMPT}<|im_end|>
101
+ <|im_start|>user
102
+ {USER PROMPT}<|im_end|>
103
+ <|im_start|>assistant
104
+ {MODEL RESPONSE}<|im_end|>
105
+ <|im_start|>user
106
+ [...]
107
+ ```
108
+
109
+ The easiest way to apply it is by using the tokenizer's built-in functions, as shown in the following snippet.
110
+
111
+
112
+ ```python
113
+ from datetime import datetime
114
+ from transformers import AutoTokenizer, AutoModelForCausalLM
115
+ import transformers
116
+ import torch
117
+
118
+ model_id = "LangTech-MT/salamandraTA-2B-academic"
119
+
120
+ # Input parameters
121
+ source = 'English'
122
+ target = 'Spanish'
123
+ sentence = "With the purpose of analyzing women’s perceptions and classifying their modes of understanding a positive human papillomavirus (HPV+) test, we conducted 38 in‑depth interviews with women who had received an HPV diagnosis (normal and abnormal Pap smear), screened in Jujuy’s public health system in 2016. A typology based on women’s understandings of the result was developed: 1) understanding; 2) lack of understanding; a) underestimation; b) overestimation; c) confusion. The interviewees who experienced confusion over the results reported contradictory perceptions in relation to a positive HPV test and its severity; those who underestimated it tended to mention the absence of symptoms and expressed little concern over the result; while those who overestimated it considered themselves sick and described concern, narrating a biographical disruption and physical pain. These findings confirm the need to improve the delivery of results and the provision of information in order to decrease psychosocial impact and increase follow‑up adherence in HPV‑positive women."
124
+
125
+ text = f"Translate the following text from {source} into {target}.\n{source}: {sentence} \n{target}:"
126
+
127
+ # Load tokenizer and model
128
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
129
+
130
+ model = AutoModelForCausalLM.from_pretrained(
131
+ model_id,
132
+ device_map="auto",
133
+ torch_dtype=torch.bfloat16
134
+ )
135
+
136
+ # Construct prompt using chat template
137
+ message = [ { "role": "user", "content": text } ]
138
+ date_string = datetime.today().strftime('%Y-%m-%d')
139
+
140
+ prompt = tokenizer.apply_chat_template(
141
+ message,
142
+ tokenize=False,
143
+ add_generation_prompt=True,
144
+ date_string=date_string
145
+ )
146
+
147
+ inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
148
+ input_length = inputs.shape[1]
149
+
150
+ # Generate output
151
+ outputs = model.generate(
152
+ input_ids=inputs.to(model.device),
153
+ max_new_tokens=400,
154
+ early_stopping=True,
155
+ num_beams=5
156
+ )
157
+
158
+ # Decode and print output
159
+ print(tokenizer.decode(outputs[0, input_length:], skip_special_tokens=True))
160
+ # Con el propósito de analizar las percepciones de las mujeres y clasificar sus modos de comprensión de un resultado positivo de virus del papiloma humano (VPH+), en 2016 realizamos 38 entrevistas en profundidad a mujeres con diagnóstico de VPH (citología normal y anormal) detectado en el sistema público de salud de Jujuy. Se elaboró una tipología basada en la comprensión del resultado por parte de las mujeres: 1) comprensión; 2) falta de comprensión; a) subestimación; b) sobreestimación; c) confusión. Las entrevistadas que experimentaron confusión informaron percepciones contradictorias sobre el VPH+ y su gravedad; quienes lo subestimaron tendían a mencionar la ausencia de síntomas y mostraron poca preocupación; mientras que aquellas que lo sobreestimaron se consideraban enfermas, describían preocupación, narrando una ruptura biográfica y dolor físico. Estos hallazgos confirman la necesidad de mejorar la entrega de resultados y la provisión de información para disminuir el impacto psicosocial y aumentar la adherencia al seguimiento en mujeres con VPH positivo.
161
+ ```
162
+
163
+ Using this template, each turn is preceded by a `<|im_start|>` delimiter and the role of the entity
164
+ (either `user`, for content supplied by the user, or `assistant` for LLM responses), and finished with the `<|im_end|>` token.
165
+
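+ For a quick sanity check, the applied template can be rendered as plain text before tokenization. A minimal sketch (reloading the tokenizer so the snippet is self-contained):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("LangTech-MT/salamandraTA-2B-academic")
+
+ rendered = tokenizer.apply_chat_template(
+     [{"role": "user", "content": "Translate the following text from English into Spanish.\nEnglish: Good morning. \nSpanish:"}],
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+ print(rendered)
+ # <|im_start|>user
+ # Translate the following text from English into Spanish.
+ # English: Good morning.
+ # Spanish:<|im_end|>
+ # <|im_start|>assistant
+ ```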
166
+ #### Machine Translation Prompt
167
+
168
+ The following prompt template is recommended, since it is the one used during training:
169
+
170
+ ```
171
+ Translate the following text from {source} into {target}.
172
+ {source}: {source sentence}
173
+ {target}:
174
+ ```
175
+ <details>
176
+ <summary>Show an example</summary>
177
+
178
+ ```python
179
+ source = 'English'
180
+ target = 'Spanish'
181
+ source_sentence = "With the purpose of analyzing women’s perceptions and classifying their modes of understanding a positive human papillomavirus (HPV+) test, we conducted 38 in‑depth interviews with women who had received an HPV diagnosis (normal and abnormal Pap smear), screened in Jujuy’s public health system in 2016. A typology based on women’s understandings of the result was developed: 1) understanding; 2) lack of understanding; a) underestimation; b) overestimation; c) confusion. The interviewees who experienced confusion over the results reported contradictory perceptions in relation to a positive HPV test and its severity; those who underestimated it tended to mention the absence of symptoms and expressed little concern over the result; while those who overestimated it considered themselves sick and described concern, narrating a biographical disruption and physical pain. These findings confirm the need to improve the delivery of results and the provision of information in order to decrease psychosocial impact and increase follow‑up adherence in HPV‑positive women."
182
+
183
+ text = f"Translate the following text from {source} into {target}.\n{source}: {source_sentence} \n{target}:"
184
+ # Con el propósito de analizar las percepciones de las mujeres y clasificar sus modos de comprensión de un resultado positivo de virus del papiloma humano (VPH+), en 2016 realizamos 38 entrevistas en profundidad a mujeres con diagnóstico de VPH (citología normal y anormal) detectado en el sistema público de salud de Jujuy. Se elaboró una tipología basada en la comprensión del resultado por parte de las mujeres: 1) comprensión; 2) falta de comprensión; a) subestimación; b) sobreestimación; c) confusión. Las entrevistadas que experimentaron confusión informaron percepciones contradictorias sobre el VPH+ y su gravedad; quienes lo subestimaron tendían a mencionar la ausencia de síntomas y mostraron poca preocupación; mientras que aquellas que lo sobreestimaron se consideraban enfermas, describían preocupación, narrando una ruptura biográfica y dolor físico. Estos hallazgos confirman la necesidad de mejorar la entrega de resultados y la provisión de información para disminuir el impacto psicosocial y aumentar la adherencia al seguimiento en mujeres con VPH positivo.
185
+ ```
186
+
187
+ </details>
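+ As an alternative to the explicit `generate` call shown above, the same prompt can be run through the `transformers` text-generation pipeline, which applies the chat template automatically when given a list of messages. This is a minimal sketch; the example sentence and generation settings are illustrative:
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ pipe = pipeline(
+     "text-generation",
+     model="LangTech-MT/salamandraTA-2B-academic",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ source, target = "English", "Catalan"
+ sentence = "Machine translation of academic abstracts remains a challenging task."  # illustrative input
+ prompt = f"Translate the following text from {source} into {target}.\n{source}: {sentence} \n{target}:"
+
+ outputs = pipe(
+     [{"role": "user", "content": prompt}],
+     max_new_tokens=256,
+     num_beams=5,
+     early_stopping=True,
+ )
+ print(outputs[0]["generated_text"][-1]["content"])
+ ```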
188
+
189
+
190
+ ### Instruction Tuning Data
191
+
192
+ The corpus used for the instruction tuning is [ACAData](https://huggingface.co/datasets/LangTech-MT/ACAData).
193
+ For more details about the corpus construction, you can refer to the [Paper](*add link to paper).
194
+
195
+
196
+ ## Evaluation
197
+
198
+
199
+ Aggregated results for the xx ↔ en and xx ↔ es translation directions on the ACAD-Bench dataset. Baselines are grouped into **large-scale proprietary general models**, **medium- to small-sized open-weights models** and **dedicated MMNMT models**. For each metric, the top-scoring system is shown in **bold**. For a more detailed discussion of the evaluation, please refer to the paper.
200
+
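+ As a reference for how scores of this kind can be computed, the snippet below is an illustrative sketch using `sacrebleu` (corpus BLEU and its brevity penalty) and Unbabel's `comet` package with an assumed public checkpoint. It is not the exact pipeline used for the reported results (d-BLEU is computed over document-level concatenations, and the paper describes the full setup):
+
+ ```python
+ import sacrebleu
+ from comet import download_model, load_from_checkpoint
+
+ sources    = ["Translate me."]   # placeholder source sentences
+ hypotheses = ["Tradúceme."]      # placeholder system outputs
+ references = ["Tradúceme."]      # placeholder reference translations
+
+ # Corpus-level BLEU and brevity penalty (BP)
+ bleu = sacrebleu.corpus_bleu(hypotheses, [references])
+ print(f"BLEU: {bleu.score:.2f}  BP: {bleu.bp:.2f}")
+
+ # Reference-based COMET (checkpoint name is an assumption)
+ comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
+ data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
+ print(f"COMET: {comet_model.predict(data, batch_size=8, gpus=0).system_score:.2f}")
+ ```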
201
+
202
+ <details>
203
+ <summary>xx → en</summary>
204
+
205
+ | Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
206
+ | :--- | :--- | :---: | :---: | :---: | :---: | :---: |
207
+ | xx → en | GPT-mini | 46.03 | **1.00** | 0.60 | **0.84** | 0.77 |
208
+ | | GPT-nano | 41.30 | 0.97 | 0.55 | **0.84** | **0.78** |
209
+ | | Gemini-2 | 48.65 | **1.00** | 0.61 | **0.84** | 0.77 |
210
+ | | Gemini-2.5 | 45.10 | 0.98 | 0.58 | **0.84** | 0.77 |
211
+ | | Llama-3-8B | 43.12 | 0.99 | 0.56 | 0.83 | 0.76 |
212
+ | | Gemma-3-27B | 46.37 | 0.98 | 0.59 | **0.84** | 0.77 |
213
+ | | MADLAD-7B | 38.69 | 0.86 | 0.51 | 0.81 | 0.77 |
214
+ | | Salamandra-2B | 37.09 | 0.92 | 0.52 | 0.82 | 0.75 |
215
+ | | &nbsp;&nbsp;+ ACADTRAIN | 48.45 | **1.00** | 0.61 | 0.83 | 0.76 |
216
+ | | Salamandra-7B | 45.87 | 0.99 | 0.59 | 0.83 | 0.76 |
217
+ | | &nbsp;&nbsp;+ ACADTRAIN | **50.07** | **1.00** | **0.62** | **0.84** | 0.76 |
218
+
219
+ </details>
220
+
221
+
222
+ <details>
223
+ <summary>en → xx</summary>
224
+
225
+ | Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
226
+ | :--- | :--- | :---: | :---: | :---: | :---: | :---: |
227
+ | en → xx | GPT-mini | 45.01 | 0.99 | - | 0.86 | **0.82** |
228
+ | | GPT-nano | 43.78 | **1.00** | - | 0.86 | **0.82** |
229
+ | | Gemini-2 | 48.00 | 0.99 | - | **0.87** | **0.82** |
230
+ | | Gemini-2.5 | 47.75 | 0.99 | - | **0.87** | **0.82** |
231
+ | | Llama-3-8B | 39.87 | 0.99 | - | 0.85 | 0.81 |
232
+ | | Gemma-3-27B | 46.29 | 0.99 | - | 0.86 | **0.82** |
233
+ | | MADLAD-7B | 36.08 | 0.82 | - | 0.83 | 0.80 |
234
+ | | Salamandra-2B | 32.91 | 0.90 | - | 0.83 | 0.78 |
235
+ | | &nbsp;&nbsp;+ ACADTRAIN | 46.86 | 0.98 | - | 0.86 | 0.81 |
236
+ | | Salamandra-7B | 42.55 | 0.98 | - | 0.86 | 0.81 |
237
+ | | &nbsp;&nbsp;+ ACADTRAIN | **49.20** | 0.98 | - | 0.86 | 0.81 |
238
+
239
+ </details>
240
+
241
+
242
+ <details>
243
+ <summary>xx → es</summary>
244
+
245
+ | Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
246
+ | :--- | :--- | :---: | :---: | :---: | :---: | :---: |
247
+ | xx → es | GPT-mini | 60.60 | 0.98 | - | 0.86 | **0.82** |
248
+ | | GPT-nano | 57.88 | **0.99** | - | 0.86 | **0.82** |
249
+ | | Gemini-2 | 62.02 | 0.99 | - | 0.86 | **0.82** |
250
+ | | Gemini-2.5 | 61.43 | 0.98 | - | **0.87** | **0.82** |
251
+ | | Llama-3-8B | 55.4 | 0.98 | - | 0.86 | 0.81 |
252
+ | | Gemma-3-27B | 60.71 | 0.98 | - | 0.86 | **0.82** |
253
+ | | MADLAD-7B | 43.44 | 0.76 | - | 0.83 | 0.81 |
254
+ | | Salamandra-2B | 50.09 | 0.92 | - | 0.85 | 0.80 |
255
+ | | &nbsp;&nbsp;+ ACADTRAIN | 61.97 | 0.98 | - | 0.86 | **0.82** |
256
+ | | Salamandra-7B | 57.55 | 0.98 | - | 0.86 | **0.82** |
257
+ | | &nbsp;&nbsp;+ ACADTRAIN | **63.60** | 0.98 | - | 0.86 | **0.82** |
258
+
259
+ </details>
260
+
261
+
262
+ <details>
263
+ <summary>es → xx</summary>
264
+
265
+ | Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
266
+ | :--- | :--- | :---: | :---: | :---: | :---: | :---: |
267
+ | es → xx | GPT-mini | 54.19 | **0.99** | - | **0.86** | **0.81** |
268
+ | | GPT-nano | 51.95 | **0.99** | - | **0.86** | **0.81** |
269
+ | | Gemini-2 | 60.28 | **0.99** | - | **0.86** | **0.81** |
270
+ | | Gemini-2.5 | 57.61 | **0.99** | - | **0.86** | **0.81** |
271
+ | | Llama-3-8B | 52.12 | **0.99** | - | 0.85 | 0.80 |
272
+ | | Gemma-3-27B | 57.31 | **0.99** | - | **0.86** | **0.81** |
273
+ | | MADLAD-7B | 40.13 | 0.79 | - | 0.83 | **0.81** |
274
+ | | Salamandra-2B | 47.84 | 0.94 | - | 0.84 | 0.80 |
275
+ | | &nbsp;&nbsp;+ ACADTRAIN | 60.09 | **0.99** | - | **0.86** | **0.81** |
276
+ | | Salamandra-7B | 55.65 | 0.98 | - | **0.86** | 0.80 |
277
+ | | &nbsp;&nbsp;+ ACADTRAIN | **61.61** | **0.99** | - | **0.86** | **0.81** |
278
+
279
+ </details>
280
+
281
+
282
+ ## Ethical Considerations and Limitations
283
+
284
+ Detailed information on the work done to examine the presence of unwanted social and cognitive biases in the base model can be found
285
+ at [Salamandra-2B model card](https://huggingface.co/BSC-LT/salamandra-2b).
286
+ No specific analysis has yet been carried out in order to evaluate potential biases or limitations in translation accuracy across different languages, dialects, or domains. However, we recognize the importance of identifying and addressing any harmful stereotypes, cultural inaccuracies, or systematic performance discrepancies that may arise in Machine Translation. As such, we plan to continue performing more analyses as we implement the necessary metrics and methods within our evaluation framework [MT-Lens](https://github.com/langtech-bsc/mt-evaluation).
287
+ Note that the model has only undergone preliminary instruction tuning.
288
+ We urge developers to consider potential limitations and conduct safety testing and tuning tailored to their specific applications.
289
+
290
+ ## Additional information
291
+
292
+ ### Author
293
+ The Language Technologies Unit from Barcelona Supercomputing Center.
294
+
295
+ ### Contact
296
+ For further information, please send an email to <[email protected]>.
297
+
298
+ ### Copyright
299
+ Copyright(c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center.
300
+
301
+ ### Funding
302
+ This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.
303
+
304
+ This work has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
305
+
306
+ This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
307
+
308
+
309
+ ### Disclaimer
310
+ Be aware that the model may contain biases or other unintended distortions.
311
+ When third parties deploy systems or provide services based on this model, or use the model themselves,
312
+ they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations,
313
+ including those governing the use of Artificial Intelligence.
314
+
315
+ The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
316
+
317
+ ### Citation
318
+ ```
319
+ *ADD PAPER CITATION*
320
+ ```
321
+
322
+ ### License
323
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
config.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "_name_or_path": "/gpfs/projects/bsc88/text/models/instruction-tuning/models/out_instructed_models/salamandra_v1.0_december2024/00_out-of-ft-pipeline/salamandra2b_v0.2_100%_annx1_instruct_ca-en-es-eu-gl-pt_v1.0",
3
+ "architectures": [
4
+ "LlamaForCausalLM"
5
+ ],
6
+ "attention_bias": false,
7
+ "attention_dropout": 0.0,
8
+ "bos_token_id": 1,
9
+ "eos_token_id": 2,
10
+ "head_dim": 128,
11
+ "hidden_act": "silu",
12
+ "hidden_size": 2048,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 5440,
15
+ "max_position_embeddings": 8192,
16
+ "mlp_bias": false,
17
+ "model_type": "llama",
18
+ "num_attention_heads": 16,
19
+ "num_hidden_layers": 24,
20
+ "num_key_value_heads": 16,
21
+ "pretraining_tp": 1,
22
+ "rms_norm_eps": 1e-05,
23
+ "rope_scaling": null,
24
+ "rope_theta": 10000.0,
25
+ "tie_word_embeddings": false,
26
+ "torch_dtype": "bfloat16",
27
+ "transformers_version": "4.40.2",
28
+ "use_cache": true,
29
+ "vocab_size": 256000
30
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "transformers_version": "4.40.2"
6
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:280c48db0061ad8b8a57a41522d203efd1cf6ccf288c27765235218ee09038b8
3
+ size 4507005744
special_tokens_map.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>"
5
+ ],
6
+ "bos_token": {
7
+ "content": "<s>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false
12
+ },
13
+ "eos_token": {
14
+ "content": "</s>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false
19
+ },
20
+ "pad_token": {
21
+ "content": "<unk>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false
26
+ },
27
+ "unk_token": {
28
+ "content": "<unk>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false
33
+ }
34
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:139de51e6bbe12b772a255e157829f43bd67b63a4d55f1fe0e3abce37b2d8c9a
3
+ size 19066993
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fa490e57cebce5cb1a0a5b1a5d3fa4de05aee53dc3a44791f1c3401db44d802d
3
+ size 4813274
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "add_prefix_space": true,
5
+ "added_tokens_decoder": {
6
+ "0": {
7
+ "content": "<unk>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "1": {
15
+ "content": "<s>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "2": {
23
+ "content": "</s>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "4": {
31
+ "content": "<|im_start|>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": true
37
+ },
38
+ "5": {
39
+ "content": "<|im_end|>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": true
45
+ }
46
+ },
47
+ "additional_special_tokens": [
48
+ "<|im_start|>",
49
+ "<|im_end|>"
50
+ ],
51
+ "bos_token": "<s>",
52
+ "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
53
+ "clean_up_tokenization_spaces": false,
54
+ "eos_token": "</s>",
55
+ "legacy": true,
56
+ "model_max_length": 8192,
57
+ "pad_token": "<unk>",
58
+ "padding_side": "right",
59
+ "sp_model_kwargs": {},
60
+ "spaces_between_special_tokens": false,
61
+ "tokenizer_class": "LlamaTokenizer",
62
+ "unk_token": "<unk>",
63
+ "use_default_system_prompt": false
64
+ }