ilacunza committed
Commit 4a4887c · verified · 1 Parent(s): 11b3804

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,323 @@
1
+ ---
2
+ license: apache-2.0
3
+ library_name: transformers
4
+ pipeline_tag: translation
5
+ base_model:
6
+ - BSC-LT/salamandra-2b-instruct
7
+ ---
8
+
9
+
10
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/633b489acbdbadd99c0b75ef/MhsW4ODhK6ofYq8DnpyKc.png)
11
+
12
+ # SalamandraTA-2B-academic Model Card
13
+
14
+ This repository contains the SalamandraTA-2B-academic model, obtained by fine-tuning [Salamandra2B-Instruct](https://huggingface.co/BSC-LT/salamandraTA-2b-instruct) for machine translation.
15
+ This model has been obtained following the procedures shown in **CITE PAPER AS SOON AS AVAILABLE**.
16
+
17
+
18
+ > [!WARNING]
19
+ > **DISCLAIMER:** This version of Salamandra is tailored exclusively for translation tasks. Although it was obtained by fine-tuning an instructed model, its chat capabilities have not been tested; for chat use, please refer to the original [instructed version](https://huggingface.co/BSC-LT/salamandraTA-2b-instruct).
20
+
21
+
22
+ ---
23
+
24
+ ## Model Details
25
+
26
+ ### Architecture
27
+
28
+ | | |
29
+ |-------------------------|:--------------|
30
+ | Total Parameters | 2,253,490,176 |
31
+ | Embedding Parameters | 524,288,000 |
32
+ | Layers | 24 |
33
+ | Hidden size | 2,048 |
34
+ | Attention heads | 16 |
35
+ | Context length | 8,192 |
36
+ | Vocabulary size | 256,000 |
37
+ | Precision | bfloat16 |
38
+ | Embedding type | RoPE |
39
+ | Activation Function | SwiGLU |
40
+ | Layer normalization | RMS Norm |
41
+ | Flash attention | ✅ |
42
+ | Grouped Query Attention | ❌ |
43
+ | Num. query groups | N/A |
44
+
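+ The total and embedding parameter counts in the table can be verified directly from the released checkpoint. The following is a minimal sketch (loading on CPU in bfloat16; the expected values are the ones reported above):
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "LangTech-MT/salamandraTA-2B-academic",
+     torch_dtype=torch.bfloat16,
+ )
+
+ # Embedding parameters: vocabulary size (256,000) x hidden size (2,048) = 524,288,000
+ embedding_params = model.get_input_embeddings().weight.numel()
+ # Total parameters, including the untied output head
+ total_params = sum(p.numel() for p in model.parameters())
+
+ print(f"{total_params:,} total / {embedding_params:,} embedding")
+ # Expected per the table above: 2,253,490,176 total / 524,288,000 embedding
+ ```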
45
+ ---
46
+
47
+ ## Intended Use
48
+
49
+ ### Direct Use
50
+
51
+ The model is intended for both research and commercial use in any of the languages included in the training data for general machine translation tasks.
52
+
53
+ ### Out-of-scope Use
54
+
55
+ The model is not intended for malicious activities, such as harming others or violating human rights.
56
+ Any downstream application must comply with current laws and regulations.
57
+ Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.
58
+
59
+ ---
60
+
61
+ ## Hardware and Software
62
+
63
+ ### Training Framework
64
+
65
+ SalamandraTA-2B-academic was instruction-tuned using [FastChat](https://github.com/lm-sys/FastChat).
66
+
67
+ ### Compute Infrastructure
68
+
69
+ All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and
70
+ operated by Barcelona Supercomputing Center.
71
+
72
+ The accelerated partition is composed of 1,120 nodes with the following specifications:
73
+ - 4x Nvidia Hopper GPUs with 64GB HBM2 memory
74
+ - 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz with 32 cores each (64 cores total)
75
+ - 4x NDR200 (BW per node 800Gb/s)
76
+ - 512 GB of Main memory (DDR5)
77
+ - 460 GB of NVMe storage
78
+
79
+ ---
80
+
81
+
82
+ ## How to use
83
+
84
+ SalamandraTA-2B-academic was fine-tuned on the ACAD-Train dataset, which focuses on pairs involving English, Iberian Peninsula languages, and several Central European languages, namely: Asturian (ast), Catalan (ca), German (de), Greek (el), Spanish (es), English (en), Basque (eu), French (fr), Galician (gl), Italian (it), Dutch (nl) and Portuguese (pt). The dataset includes 48 unique language pairs. Since each pair is used for translation in both directions (e.g., English to Spanish and Spanish to English), this results in 96 supported translation directions. The most frequent language pairs, accounting for 96.5% of the dataset, are:
85
+
86
+
87
+ - English - Spanish (en-es)
88
+ - English - French (en-fr)
89
+ - English - Catalan (en-ca)
90
+ - Catalan - Spanish (ca-es)
91
+ - Spanish - French (es-fr)
92
+ - English - Portuguese (en-pt)
93
+
94
+ A comprehensive list of all language pairs can be found in the [ACAD-Train dataset](https://huggingface.co/datasets/LangTech-MT/ACAData) card.
95
+
96
+ The instruction-following model uses the commonly adopted ChatML template:
97
+
98
+ ```
99
+ <|im_start|>system
100
+ {SYSTEM PROMPT}<|im_end|>
101
+ <|im_start|>user
102
+ {USER PROMPT}<|im_end|>
103
+ <|im_start|>assistant
104
+ {MODEL RESPONSE}<|im_end|>
105
+ <|im_start|>user
106
+ [...]
107
+ ```
108
+
109
+ The easiest way to apply it is by using the tokenizer's built-in functions, as shown in the following snippet.
110
+
111
+
112
+ ```python
113
+ from datetime import datetime
114
+ from transformers import AutoTokenizer, AutoModelForCausalLM
115
+ import transformers
116
+ import torch
117
+
118
+ model_id = "LangTech-MT/salamandraTA-2B-academic"
119
+
120
+ # Input parameters
121
+ source = 'English'
122
+ target = 'Spanish'
123
+ sentence = "With the purpose of analyzing women’s perceptions and classifying their modes of understanding a positive human papillomavirus (HPV+) test, we conducted 38 in‑depth interviews with women who had received an HPV diagnosis (normal and abnormal Pap smear), screened in Jujuy’s public health system in 2016. A typology based on women’s understandings of the result was developed: 1) understanding; 2) lack of understanding; a) underestimation; b) overestimation; c) confusion. The interviewees who experienced confusion over the results reported contradictory perceptions in relation to a positive HPV test and its severity; those who underestimated it tended to mention the absence of symptoms and expressed little concern over the result; while those who overestimated it considered themselves sick and described concern, narrating a biographical disruption and physical pain. These findings confirm the need to improve the delivery of results and the provision of information in order to decrease psychosocial impact and increase follow‑up adherence in HPV‑positive women."
124
+
125
+ text = f"Translate the following text from {source} into {target}.\n{source}: {sentence} \n{target}:"
126
+
127
+ # Load tokenizer and model
128
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
129
+
130
+ model = AutoModelForCausalLM.from_pretrained(
131
+ model_id,
132
+ device_map="auto",
133
+ torch_dtype=torch.bfloat16
134
+ )
135
+
136
+ # Construct prompt using chat template
137
+ message = [ { "role": "user", "content": text } ]
138
+ date_string = datetime.today().strftime('%Y-%m-%d')
139
+
140
+ prompt = tokenizer.apply_chat_template(
141
+ message,
142
+ tokenize=False,
143
+ add_generation_prompt=True,
144
+ date_string=date_string
145
+ )
146
+
147
+ inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
148
+ input_length = inputs.shape[1]
149
+
150
+ # Generate output
151
+ outputs = model.generate(
152
+ input_ids=inputs.to(model.device),
153
+ max_new_tokens=400,
154
+ early_stopping=True,
155
+ num_beams=5
156
+ )
157
+
158
+ # Decode and print output
159
+ print(tokenizer.decode(outputs[0, input_length:], skip_special_tokens=True))
160
+ # Con el propósito de analizar las percepciones de las mujeres y clasificar sus modos de comprensión de un resultado positivo de virus del papiloma humano (VPH+), en 2016 realizamos 38 entrevistas en profundidad a mujeres con diagnóstico de VPH (citología normal y anormal) detectado en el sistema público de salud de Jujuy. Se elaboró una tipología basada en la comprensión del resultado por parte de las mujeres: 1) comprensión; 2) falta de comprensión; a) subestimación; b) sobreestimación; c) confusión. Las entrevistadas que experimentaron confusión informaron percepciones contradictorias sobre el VPH+ y su gravedad; quienes lo subestimaron tendían a mencionar la ausencia de síntomas y mostraron poca preocupación; mientras que aquellas que lo sobreestimaron se consideraban enfermas, describían preocupación, narrando una ruptura biográfica y dolor físico. Estos hallazgos confirman la necesidad de mejorar la entrega de resultados y la provisión de información para disminuir el impacto psicosocial y aumentar la adherencia al seguimiento en mujeres con VPH positivo.
161
+ ```
162
+
163
+ Using this template, each turn is preceded by a `<|im_start|>` delimiter and the role of the entity
164
+ (either `user`, for content supplied by the user, or `assistant` for LLM responses), and finished with the `<|im_end|>` token.
165
+
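+ For a quick sanity check, the applied template can be rendered as plain text before tokenization. A minimal sketch (reloading the tokenizer so the snippet is self-contained):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("LangTech-MT/salamandraTA-2B-academic")
+
+ rendered = tokenizer.apply_chat_template(
+     [{"role": "user", "content": "Translate the following text from English into Spanish.\nEnglish: Good morning. \nSpanish:"}],
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+ print(rendered)
+ # <|im_start|>user
+ # Translate the following text from English into Spanish.
+ # English: Good morning.
+ # Spanish:<|im_end|>
+ # <|im_start|>assistant
+ ```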
166
+ #### Machine Translation Prompt
167
+
168
+ The following prompt template is recommended, since it is the one used during training:
169
+
170
+ ```
171
+ Translate the following text from {source} into {target}.
172
+ {source}: {source sentence}
173
+ {target}:
174
+ ```
175
+ <details>
176
+ <summary>Show an example</summary>
177
+
178
+ ```python
179
+ source = 'English'
180
+ target = 'Spanish'
181
+ source_sentence = "With the purpose of analyzing women’s perceptions and classifying their modes of understanding a positive human papillomavirus (HPV+) test, we conducted 38 in‑depth interviews with women who had received an HPV diagnosis (normal and abnormal Pap smear), screened in Jujuy’s public health system in 2016. A typology based on women’s understandings of the result was developed: 1) understanding; 2) lack of understanding; a) underestimation; b) overestimation; c) confusion. The interviewees who experienced confusion over the results reported contradictory perceptions in relation to a positive HPV test and its severity; those who underestimated it tended to mention the absence of symptoms and expressed little concern over the result; while those who overestimated it considered themselves sick and described concern, narrating a biographical disruption and physical pain. These findings confirm the need to improve the delivery of results and the provision of information in order to decrease psychosocial impact and increase follow‑up adherence in HPV‑positive women."
182
+
183
+ text = f"Translate the following text from {source} into {target}.\n{source}: {source_sentence} \n{target}:"
184
+ # Con el propósito de analizar las percepciones de las mujeres y clasificar sus modos de comprensión de un resultado positivo de virus del papiloma humano (VPH+), en 2016 realizamos 38 entrevistas en profundidad a mujeres con diagnóstico de VPH (citología normal y anormal) detectado en el sistema público de salud de Jujuy. Se elaboró una tipología basada en la comprensión del resultado por parte de las mujeres: 1) comprensión; 2) falta de comprensión; a) subestimación; b) sobreestimación; c) confusión. Las entrevistadas que experimentaron confusión informaron percepciones contradictorias sobre el VPH+ y su gravedad; quienes lo subestimaron tendían a mencionar la ausencia de síntomas y mostraron poca preocupación; mientras que aquellas que lo sobreestimaron se consideraban enfermas, describían preocupación, narrando una ruptura biográfica y dolor físico. Estos hallazgos confirman la necesidad de mejorar la entrega de resultados y la provisión de información para disminuir el impacto psicosocial y aumentar la adherencia al seguimiento en mujeres con VPH positivo.
185
+ ```
186
+
187
+ </details>
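+ As an alternative to the explicit `generate` call shown above, the same prompt can be run through the `transformers` text-generation pipeline, which applies the chat template automatically when given a list of messages. This is a minimal sketch; the example sentence and generation settings are illustrative:
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ pipe = pipeline(
+     "text-generation",
+     model="LangTech-MT/salamandraTA-2B-academic",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ source, target = "English", "Catalan"
+ sentence = "Machine translation of academic abstracts remains a challenging task."  # illustrative input
+ prompt = f"Translate the following text from {source} into {target}.\n{source}: {sentence} \n{target}:"
+
+ outputs = pipe(
+     [{"role": "user", "content": prompt}],
+     max_new_tokens=256,
+     num_beams=5,
+     early_stopping=True,
+ )
+ print(outputs[0]["generated_text"][-1]["content"])
+ ```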
188
+
189
+
190
+ ### Instruction Tuning Data
191
+
192
+ The corpus used for the instruction tuning is [ACAData](https://huggingface.co/datasets/LangTech-MT/ACAData).
193
+ For more details about the corpus construction, you can refer to the [Paper](*add link to paper).
194
+
195
+
196
+ ## Evaluation
197
+
198
+
199
+ Aggregated results for the xx ↔ en and xx ↔ es translation directions on the ACAD-Bench dataset. Baselines are grouped into **large-scale proprietary general models**, **medium- to small-sized open-weights models** and **dedicated MMNMT models**. For each metric, the top-scoring system is shown in **bold**. For a more detailed discussion of the evaluation, please refer to the paper.
200
+
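+ As a reference for how scores of this kind can be computed, the snippet below is an illustrative sketch using `sacrebleu` (corpus BLEU and its brevity penalty) and Unbabel's `comet` package with an assumed public checkpoint. It is not the exact pipeline used for the reported results (d-BLEU is computed over document-level concatenations, and the paper describes the full setup):
+
+ ```python
+ import sacrebleu
+ from comet import download_model, load_from_checkpoint
+
+ sources    = ["Translate me."]   # placeholder source sentences
+ hypotheses = ["Tradúceme."]      # placeholder system outputs
+ references = ["Tradúceme."]      # placeholder reference translations
+
+ # Corpus-level BLEU and brevity penalty (BP)
+ bleu = sacrebleu.corpus_bleu(hypotheses, [references])
+ print(f"BLEU: {bleu.score:.2f}  BP: {bleu.bp:.2f}")
+
+ # Reference-based COMET (checkpoint name is an assumption)
+ comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
+ data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
+ print(f"COMET: {comet_model.predict(data, batch_size=8, gpus=0).system_score:.2f}")
+ ```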
201
+
202
+ <details>
203
+ <summary>xx → en</summary>
204
+
205
+ | Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
206
+ | :--- | :--- | :---: | :---: | :---: | :---: | :---: |
207
+ | xx → en | GPT-mini | 46.03 | **1.00** | 0.60 | **0.84** | 0.77 |
208
+ | | GPT-nano | 41.30 | 0.97 | 0.55 | **0.84** | **0.78** |
209
+ | | Gemini-2 | 48.65 | **1.00** | 0.61 | **0.84** | 0.77 |
210
+ | | Gemini-2.5 | 45.10 | 0.98 | 0.58 | **0.84** | 0.77 |
211
+ | | Llama-3-8B | 43.12 | 0.99 | 0.56 | 0.83 | 0.76 |
212
+ | | Gemma-3-27B | 46.37 | 0.98 | 0.59 | **0.84** | 0.77 |
213
+ | | MADLAD-7B | 38.69 | 0.86 | 0.51 | 0.81 | 0.77 |
214
+ | | Salamandra-2B | 37.09 | 0.92 | 0.52 | 0.82 | 0.75 |
215
+ | | &nbsp;&nbsp;+ ACADTRAIN | 48.45 | **1.00** | 0.61 | 0.83 | 0.76 |
216
+ | | Salamandra-7B | 45.87 | 0.99 | 0.59 | 0.83 | 0.76 |
217
+ | | &nbsp;&nbsp;+ ACADTRAIN | **50.07** | **1.00** | **0.62** | **0.84** | 0.76 |
218
+
219
+ </details>
220
+
221
+
222
+ <details>
223
+ <summary>en → xx</summary>
224
+
225
+ | Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
226
+ | :--- | :--- | :---: | :---: | :---: | :---: | :---: |
227
+ | en → xx | GPT-mini | 45.01 | 0.99 | - | 0.86 | **0.82** |
228
+ | | GPT-nano | 43.78 | **1.00** | - | 0.86 | **0.82** |
229
+ | | Gemini-2 | 48.00 | 0.99 | - | **0.87** | **0.82** |
230
+ | | Gemini-2.5 | 47.75 | 0.99 | - | **0.87** | **0.82** |
231
+ | | Llama-3-8B | 39.87 | 0.99 | - | 0.85 | 0.81 |
232
+ | | Gemma-3-27B | 46.29 | 0.99 | - | 0.86 | **0.82** |
233
+ | | MADLAD-7B | 36.08 | 0.82 | - | 0.83 | 0.80 |
234
+ | | Salamandra-2B | 32.91 | 0.90 | - | 0.83 | 0.78 |
235
+ | | &nbsp;&nbsp;+ ACADTRAIN | 46.86 | 0.98 | - | 0.86 | 0.81 |
236
+ | | Salamandra-7B | 42.55 | 0.98 | - | 0.86 | 0.81 |
237
+ | | &nbsp;&nbsp;+ ACADTRAIN | **49.20** | 0.98 | - | 0.86 | 0.81 |
238
+
239
+ </details>
240
+
241
+
242
+ <details>
243
+ <summary>xx → es</summary>
244
+
245
+ | Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
246
+ | :--- | :--- | :---: | :---: | :---: | :---: | :---: |
247
+ | xx → es | GPT-mini | 60.60 | 0.98 | - | 0.86 | **0.82** |
248
+ | | GPT-nano | 57.88 | **0.99** | - | 0.86 | **0.82** |
249
+ | | Gemini-2 | 62.02 | 0.99 | - | 0.86 | **0.82** |
250
+ | | Gemini-2.5 | 61.43 | 0.98 | - | **0.87** | **0.82** |
251
+ | | Llama-3-8B | 55.4 | 0.98 | - | 0.86 | 0.81 |
252
+ | | Gemma-3-27B | 60.71 | 0.98 | - | 0.86 | **0.82** |
253
+ | | MADLAD-7B | 43.44 | 0.76 | - | 0.83 | 0.81 |
254
+ | | Salamandra-2B | 50.09 | 0.92 | - | 0.85 | 0.80 |
255
+ | | &nbsp;&nbsp;+ ACADTRAIN | 61.97 | 0.98 | - | 0.86 | **0.82** |
256
+ | | Salamandra-7B | 57.55 | 0.98 | - | 0.86 | **0.82** |
257
+ | | &nbsp;&nbsp;+ ACADTRAIN | **63.60** | 0.98 | - | 0.86 | **0.82** |
258
+
259
+ </details>
260
+
261
+
262
+ <details>
263
+ <summary>es → xx</summary>
264
+
265
+ | Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
266
+ | :--- | :--- | :---: | :---: | :---: | :---: | :---: |
267
+ | es → xx | GPT-mini | 54.19 | **0.99** | - | **0.86** | **0.81** |
268
+ | | GPT-nano | 51.95 | **0.99** | - | **0.86** | **0.81** |
269
+ | | Gemini-2 | 60.28 | **0.99** | - | **0.86** | **0.81** |
270
+ | | Gemini-2.5 | 57.61 | **0.99** | - | **0.86** | **0.81** |
271
+ | | Llama-3-8B | 52.12 | **0.99** | - | 0.85 | 0.80 |
272
+ | | Gemma-3-27B | 57.31 | **0.99** | - | **0.86** | **0.81** |
273
+ | | MADLAD-7B | 40.13 | 0.79 | - | 0.83 | **0.81** |
274
+ | | Salamandra-2B | 47.84 | 0.94 | - | 0.84 | 0.80 |
275
+ | | &nbsp;&nbsp;+ ACADTRAIN | 60.09 | **0.99** | - | **0.86** | **0.81** |
276
+ | | Salamandra-7B | 55.65 | 0.98 | - | **0.86** | 0.80 |
277
+ | | &nbsp;&nbsp;+ ACADTRAIN | **61.61** | **0.99** | - | **0.86** | **0.81** |
278
+
279
+ </details>
280
+
281
+
282
+ ## Ethical Considerations and Limitations
283
+
284
+ Detailed information on the work done to examine the presence of unwanted social and cognitive biases in the base model can be found
285
+ at [Salamandra-2B model card](https://huggingface.co/BSC-LT/salamandra-2b).
286
+ No specific analysis has yet been carried out in order to evaluate potential biases or limitations in translation accuracy across different languages, dialects, or domains. However, we recognize the importance of identifying and addressing any harmful stereotypes, cultural inaccuracies, or systematic performance discrepancies that may arise in Machine Translation. As such, we plan to continue performing more analyses as we implement the necessary metrics and methods within our evaluation framework [MT-Lens](https://github.com/langtech-bsc/mt-evaluation).
287
+ Note that the model has only undergone preliminary instruction tuning.
288
+ We urge developers to consider potential limitations and conduct safety testing and tuning tailored to their specific applications.
289
+
290
+ ## Additional information
291
+
292
+ ### Author
293
+ The Language Technologies Unit from Barcelona Supercomputing Center.
294
+
295
+ ### Contact
296
+ For further information, please send an email to <[email protected]>.
297
+
298
+ ### Copyright
299
+ Copyright(c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center.
300
+
301
+ ### Funding
302
+ This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.
303
+
304
+ This work has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
305
+
306
+ This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
307
+
308
+
309
+ ### Disclaimer
310
+ Be aware that the model may contain biases or other unintended distortions.
311
+ When third parties deploy systems or provide services based on this model, or use the model themselves,
312
+ they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations,
313
+ including those governing the use of Artificial Intelligence.
314
+
315
+ The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
316
+
317
+ ### Citation
318
+ ```
319
+ *ADD PAPER CITATION*
320
+ ```
321
+
322
+ ### License
323
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
config.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "_name_or_path": "/gpfs/projects/bsc88/text/models/instruction-tuning/models/out_instructed_models/salamandra_v1.0_december2024/00_out-of-ft-pipeline/salamandra2b_v0.2_100%_annx1_instruct_ca-en-es-eu-gl-pt_v1.0",
3
+ "architectures": [
4
+ "LlamaForCausalLM"
5
+ ],
6
+ "attention_bias": false,
7
+ "attention_dropout": 0.0,
8
+ "bos_token_id": 1,
9
+ "eos_token_id": 2,
10
+ "head_dim": 128,
11
+ "hidden_act": "silu",
12
+ "hidden_size": 2048,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 5440,
15
+ "max_position_embeddings": 8192,
16
+ "mlp_bias": false,
17
+ "model_type": "llama",
18
+ "num_attention_heads": 16,
19
+ "num_hidden_layers": 24,
20
+ "num_key_value_heads": 16,
21
+ "pretraining_tp": 1,
22
+ "rms_norm_eps": 1e-05,
23
+ "rope_scaling": null,
24
+ "rope_theta": 10000.0,
25
+ "tie_word_embeddings": false,
26
+ "torch_dtype": "bfloat16",
27
+ "transformers_version": "4.40.2",
28
+ "use_cache": true,
29
+ "vocab_size": 256000
30
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "transformers_version": "4.40.2"
6
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:280c48db0061ad8b8a57a41522d203efd1cf6ccf288c27765235218ee09038b8
3
+ size 4507005744
special_tokens_map.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>"
5
+ ],
6
+ "bos_token": {
7
+ "content": "<s>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false
12
+ },
13
+ "eos_token": {
14
+ "content": "</s>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false
19
+ },
20
+ "pad_token": {
21
+ "content": "<unk>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false
26
+ },
27
+ "unk_token": {
28
+ "content": "<unk>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false
33
+ }
34
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:139de51e6bbe12b772a255e157829f43bd67b63a4d55f1fe0e3abce37b2d8c9a
3
+ size 19066993
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fa490e57cebce5cb1a0a5b1a5d3fa4de05aee53dc3a44791f1c3401db44d802d
3
+ size 4813274
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "add_prefix_space": true,
5
+ "added_tokens_decoder": {
6
+ "0": {
7
+ "content": "<unk>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "1": {
15
+ "content": "<s>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "2": {
23
+ "content": "</s>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "4": {
31
+ "content": "<|im_start|>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": true
37
+ },
38
+ "5": {
39
+ "content": "<|im_end|>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": true
45
+ }
46
+ },
47
+ "additional_special_tokens": [
48
+ "<|im_start|>",
49
+ "<|im_end|>"
50
+ ],
51
+ "bos_token": "<s>",
52
+ "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
53
+ "clean_up_tokenization_spaces": false,
54
+ "eos_token": "</s>",
55
+ "legacy": true,
56
+ "model_max_length": 8192,
57
+ "pad_token": "<unk>",
58
+ "padding_side": "right",
59
+ "sp_model_kwargs": {},
60
+ "spaces_between_special_tokens": false,
61
+ "tokenizer_class": "LlamaTokenizer",
62
+ "unk_token": "<unk>",
63
+ "use_default_system_prompt": false
64
+ }