SalamandraTA-2B-academic Model Card
This repository contains SalamandraTA-2B-academic, a machine translation fine-tuning of Salamandra-2B-Instruct. The model was obtained following the procedure described in ACADATA: Parallel Dataset of Academic Data for Machine Translation.
DISCLAIMER: This version of Salamandra is tailored exclusively for translation tasks. Although it was obtained by fine-tuning an instruction-tuned model, its chat capabilities have not been tested. For chat use, refer to the instruction-tuned model it is based on.
Model Details
Architecture
Property | Value |
---|---|
Total Parameters | 2,253,490,176 |
Embedding Parameters | 524,288,000 |
Layers | 24 |
Hidden size | 2,048 |
Attention heads | 16 |
Context length | 8,192 |
Vocabulary size | 256,000 |
Precision | bfloat16 |
Embedding type | RoPE |
Activation Function | SwiGLU |
Layer normalization | RMS Norm |
Flash attention | ✅ |
Grouped Query Attention | ❌ |
Num. query groups | N/A |
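The figures in the table can be cross-checked with a little arithmetic. The sketch below assumes a single embedding matrix of shape vocabulary size × hidden size (an assumption, although it is consistent with the reported embedding count) and two bytes per bfloat16 weight. The variable names are illustrative and are not part of any library.

```python
# Values copied from the architecture table above.
TOTAL_PARAMS = 2_253_490_176
VOCAB_SIZE = 256_000
HIDDEN_SIZE = 2_048
ATTENTION_HEADS = 16
BYTES_PER_BF16 = 2  # bfloat16 stores each weight in 16 bits

# Assumption: one embedding matrix of shape (vocabulary size, hidden size).
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# Parameters that live outside that single matrix.
remaining_params = TOTAL_PARAMS - embedding_params

# Per-head dimension when the hidden size is split evenly across heads.
head_dim = HIDDEN_SIZE // ATTENTION_HEADS

# Rough size of the weights alone, excluding activations and the KV cache.
weights_gib = TOTAL_PARAMS * BYTES_PER_BF16 / 2**30

print(f"embedding parameters: {embedding_params:,}")
print(f"remaining parameters: {remaining_params:,}")
print(f"head dimension:       {head_dim}")
print(f"bf16 weights:         {weights_gib:.2f} GiB")
```

The weight size is only a lower bound on the memory needed for inference; beam search and long inputs add to it.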
Intended Use
Direct Use
The model is intended for both research and commercial use in any of the languages included in the training data for general machine translation tasks.
Out-of-scope Use
The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.
Hardware and Software
Training Framework
SalamandraTA-2B-academic was instruction-tuned with FastChat.
Compute Infrastructure
All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.
The accelerated partition is composed of 1,120 nodes with the following specifications:
- 4x Nvidia Hopper GPUs with 64 GB HBM2 memory
- 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz with 32 cores each (64 cores per node)
- 4x NDR200 (bandwidth per node: 800 Gb/s)
- 512 GB of main memory (DDR5)
- 460 GB of NVMe storage
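For a sense of scale, the per-node figures above can be multiplied out to partition totals. This is a small sketch using only the numbers listed; it describes the whole accelerated partition, not the share of it used to train this model, which the card does not state.

```python
# Per-node specification of the accelerated partition, as listed above.
NODES = 1_120
GPUS_PER_NODE = 4
HBM_PER_GPU_GB = 64
CORES_PER_NODE = 2 * 32  # two CPUs with 32 cores each

# Totals across the whole partition.
total_gpus = NODES * GPUS_PER_NODE
total_hbm_gb = total_gpus * HBM_PER_GPU_GB
total_cores = NODES * CORES_PER_NODE

print(f"GPUs in partition:  {total_gpus:,}")
print(f"GPU memory (GB):    {total_hbm_gb:,}")
print(f"CPU cores:          {total_cores:,}")
```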
How to use
SalamandraTA-2B-academic was fine-tuned on the ACAD-Train dataset, which focuses on pairs involving English, Iberian Peninsula languages, and several Central European languages, namely: Asturian (ast), Catalan (ca), German (de), Greek (el), Spanish (es), English (en), Basque (eu), French (fr), Galician (gl), Italian (it), Dutch (nl) and Portuguese (pt). The dataset includes 48 unique language pairs. Since each pair is used for translation in both directions (e.g., English to Spanish and Spanish to English), this results in 96 supported translation directions. The most frequent language pairs, accounting for 96.5% of the dataset, are:
- English - Spanish (en-es)
- English - French (en-fr)
- English - Catalan (en-ca)
- Catalan - Spanish (ca-es)
- Spanish - French (es-fr)
- English - Portuguese (en-pt)
A comprehensive list of all language pairs included in the ACAD-Train dataset.
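The relationship between language pairs and translation directions can be sketched in a few lines. The mapping below covers the twelve languages named above, and the pair list contains only the six most frequent pairs, not the full set of 48. The helper names (`LANGUAGES`, `expand_directions`) are hypothetical and introduced here for illustration.

```python
# ISO codes and English names of the languages listed above.
# The full names are what the translation prompt expects.
LANGUAGES = {
    "ast": "Asturian", "ca": "Catalan", "de": "German", "el": "Greek",
    "es": "Spanish", "en": "English", "eu": "Basque", "fr": "French",
    "gl": "Galician", "it": "Italian", "nl": "Dutch", "pt": "Portuguese",
}

# Only the six most frequent pairs; the complete dataset has 48.
MOST_FREQUENT_PAIRS = [
    ("en", "es"), ("en", "fr"), ("en", "ca"),
    ("ca", "es"), ("es", "fr"), ("en", "pt"),
]


def expand_directions(pairs):
    """Turn unordered pairs into (source, target) tuples for both directions."""
    directions = []
    for first, second in pairs:
        directions.append((first, second))
        directions.append((second, first))
    return directions


for src, tgt in expand_directions(MOST_FREQUENT_PAIRS):
    print(f"{LANGUAGES[src]} -> {LANGUAGES[tgt]}")

# Every pair contributes two directions, so 48 pairs give 96 directions.
TOTAL_PAIRS = 48
print(TOTAL_PAIRS * 2)
```

Twelve languages could form at most 66 unordered pairs, so the 48 pairs in the dataset do not cover every possible combination.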
The instruction-following model uses the commonly adopted ChatML template:
<|im_start|>system
{SYSTEM PROMPT}<|im_end|>
<|im_start|>user
{USER PROMPT}<|im_end|>
<|im_start|>assistant
{MODEL RESPONSE}<|im_end|>
<|im_start|>user
[...]
The easiest way to apply it is by using the tokenizer's built-in functions, as shown in the following snippet.
from datetime import datetime
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "LangTech-MT/salamandraTA-2B-academic"
# Input parameters
source = 'English'
target = 'Spanish'
sentence = "With the purpose of analyzing women’s perceptions and classifying their modes of understanding a positive human papillomavirus (HPV+) test, we conducted 38 in‑depth interviews with women who had received an HPV diagnosis (normal and abnormal Pap smear), screened in Jujuy’s public health system in 2016. A typology based on women’s understandings of the result was developed: 1) understanding; 2) lack of understanding; a) underestimation; b) overestimation; c) confusion. The interviewees who experienced confusion over the results reported contradictory perceptions in relation to a positive HPV test and its severity; those who underestimated it tended to mention the absence of symptoms and expressed little concern over the result; while those who overestimated it considered themselves sick and described concern, narrating a biographical disruption and physical pain. These findings confirm the need to improve the delivery of results and the provision of information in order to decrease psychosocial impact and increase follow‑up adherence in HPV‑positive women."
text = f"Translate the following text from {source} into {target}.\n{source}: {sentence} \n{target}:"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
# Construct prompt using chat template
message = [ { "role": "user", "content": text } ]
date_string = datetime.today().strftime('%Y-%m-%d')
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string
)
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
input_length = inputs.shape[1]
# Generate output
outputs = model.generate(
    input_ids=inputs.to(model.device),
    max_new_tokens=400,
    early_stopping=True,
    num_beams=5
)
# Decode and print output
print(tokenizer.decode(outputs[0, input_length:], skip_special_tokens=True))
# Con el propósito de analizar las percepciones de las mujeres y clasificar sus modos de comprensión de un resultado positivo de virus del papiloma humano (VPH+), en 2016 realizamos 38 entrevistas en profundidad a mujeres con diagnóstico de VPH (citología normal y anormal) detectado en el sistema público de salud de Jujuy. Se elaboró una tipología basada en la comprensión del resultado por parte de las mujeres: 1) comprensión; 2) falta de comprensión; a) subestimación; b) sobreestimación; c) confusión. Las entrevistadas que experimentaron confusión informaron percepciones contradictorias sobre el VPH+ y su gravedad; quienes lo subestimaron tendían a mencionar la ausencia de síntomas y mostraron poca preocupación; mientras que aquellas que lo sobreestimaron se consideraban enfermas, describían preocupación, narrando una ruptura biográfica y dolor físico. Estos hallazgos confirman la necesidad de mejorar la entrega de resultados y la provisión de información para disminuir el impacto psicosocial y aumentar la adherencia al seguimiento en mujeres con VPH positivo.
Using this template, each turn is preceded by a <|im_start|> delimiter and the role of the entity (either user, for content supplied by the user, or assistant, for LLM responses), and finished with the <|im_end|> token.
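To make the turn structure concrete, here is a minimal sketch that assembles a ChatML string by hand. The function `format_chatml` is hypothetical and written only for illustration. The tokenizer's own chat template remains the authoritative formatter and may add content that this sketch does not, such as a default system turn.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string.

    Illustrative only: prefer tokenizer.apply_chat_template in real use.
    """
    parts = []
    for message in messages:
        # Each turn: start delimiter, role, newline, content, end token.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = format_chatml([{"role": "user", "content": "Hello"}])
print(example)
```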
Machine Translation Prompt
The following prompt template is recommended, since it is the one used during training:
Translate the following text from {source} into {target}.
{source}: {source sentence}
{target}:
Example:
source = 'English'
target = 'Spanish'
source_sentence = "With the purpose of analyzing women’s perceptions and classifying their modes of understanding a positive human papillomavirus (HPV+) test, we conducted 38 in‑depth interviews with women who had received an HPV diagnosis (normal and abnormal Pap smear), screened in Jujuy’s public health system in 2016. A typology based on women’s understandings of the result was developed: 1) understanding; 2) lack of understanding; a) underestimation; b) overestimation; c) confusion. The interviewees who experienced confusion over the results reported contradictory perceptions in relation to a positive HPV test and its severity; those who underestimated it tended to mention the absence of symptoms and expressed little concern over the result; while those who overestimated it considered themselves sick and described concern, narrating a biographical disruption and physical pain. These findings confirm the need to improve the delivery of results and the provision of information in order to decrease psychosocial impact and increase follow‑up adherence in HPV‑positive women."
text = f"Translate the following text from {source} into {target}.\n{source}: {source_sentence} \n{target}:"
# Con el propósito de analizar las percepciones de las mujeres y clasificar sus modos de comprensión de un resultado positivo de virus del papiloma humano (VPH+), en 2016 realizamos 38 entrevistas en profundidad a mujeres con diagnóstico de VPH (citología normal y anormal) detectado en el sistema público de salud de Jujuy. Se elaboró una tipología basada en la comprensión del resultado por parte de las mujeres: 1) comprensión; 2) falta de comprensión; a) subestimación; b) sobreestimación; c) confusión. Las entrevistadas que experimentaron confusión informaron percepciones contradictorias sobre el VPH+ y su gravedad; quienes lo subestimaron tendían a mencionar la ausencia de síntomas y mostraron poca preocupación; mientras que aquellas que lo sobreestimaron se consideraban enfermas, describían preocupación, narrando una ruptura biográfica y dolor físico. Estos hallazgos confirman la necesidad de mejorar la entrega de resultados y la provisión de información para disminuir el impacto psicosocial y aumentar la adherencia al seguimiento en mujeres con VPH positivo.
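When translating many texts, it can help to wrap the prompt construction in a small function. The sketch below reproduces the f-string from the example above character for character, including the space before the second line break. The helper names `build_prompt` and `build_messages` are hypothetical.

```python
def build_prompt(source, target, sentence):
    """Build the translation prompt using the same layout as the example above."""
    return (
        f"Translate the following text from {source} into {target}.\n"
        f"{source}: {sentence} \n"
        f"{target}:"
    )


def build_messages(source, target, sentence):
    """Wrap the prompt as a single user turn for a chat template."""
    return [{"role": "user", "content": build_prompt(source, target, sentence)}]


# Build prompts for one sentence in two target languages.
for target_language in ("Spanish", "Catalan"):
    print(build_prompt("English", target_language, "The results are significant."))
    print("---")
```

The list returned by `build_messages` has the same shape as the `message` variable passed to `tokenizer.apply_chat_template` in the usage snippet.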
Instruction Tuning Data
The corpus used for instruction tuning is ACADATA. For more details about the corpus construction, refer to the paper listed in the Citation section.
Evaluation
Aggregated results for the xx ↔ en and xx ↔ es translation directions in the ACAD-Bench dataset. Baselines are grouped into large-scale proprietary general models, medium- to small-sized open-weights models and dedicated MMNMT models. For every metric, the top-scoring system is shown in bold. For a more detailed discussion of the evaluation, please refer to the paper.
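The BP column in the tables is not defined in this card. Assuming it is the standard BLEU brevity penalty, it can be computed as in the sketch below, where lengths are token counts of the system output and the reference. The function name is illustrative.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """Standard BLEU brevity penalty (assumed to be what the BP column reports).

    Returns 1.0 when the output is at least as long as the reference and
    decays exponentially as the output gets shorter.
    """
    if candidate_len <= 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


# An output 20% shorter than its reference is penalised; a longer one is not.
print(round(brevity_penalty(80, 100), 3))
print(brevity_penalty(120, 100))
```

Under this reading, a BP well below 1.00 indicates that a system tends to produce translations shorter than the references.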
xx → en
Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
---|---|---|---|---|---|---|
xx → en | GPT-mini | 46.03 | 1.00 | 0.60 | 0.84 | 0.77 |
xx → en | GPT-nano | 41.30 | 0.97 | 0.55 | 0.84 | 0.78 |
xx → en | Gemini-2 | 48.65 | 1.00 | 0.61 | 0.84 | 0.77 |
xx → en | Gemini-2.5 | 45.10 | 0.98 | 0.58 | 0.84 | 0.77 |
xx → en | Llama-3-8B | 43.12 | 0.99 | 0.56 | 0.83 | 0.76 |
xx → en | Gemma-3-27B | 46.37 | 0.98 | 0.59 | 0.84 | 0.77 |
xx → en | MADLAD-7B | 38.69 | 0.86 | 0.51 | 0.81 | 0.77 |
xx → en | Salamandra-2B | 37.09 | 0.92 | 0.52 | 0.82 | 0.75 |
xx → en | Salamandra-2B + ACADTRAIN | 48.45 | 1.00 | 0.61 | 0.83 | 0.76 |
xx → en | Salamandra-7B | 45.87 | 0.99 | 0.59 | 0.83 | 0.76 |
xx → en | Salamandra-7B + ACADTRAIN | 50.07 | 1.00 | 0.62 | 0.84 | 0.76 |
en → xx
Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
---|---|---|---|---|---|---|
en → xx | GPT-mini | 45.01 | 0.99 | - | 0.86 | 0.82 |
en → xx | GPT-nano | 43.78 | 1.00 | - | 0.86 | 0.82 |
en → xx | Gemini-2 | 48.00 | 0.99 | - | 0.87 | 0.82 |
en → xx | Gemini-2.5 | 47.75 | 0.99 | - | 0.87 | 0.82 |
en → xx | Llama-3-8B | 39.87 | 0.99 | - | 0.85 | 0.81 |
en → xx | Gemma-3-27B | 46.29 | 0.99 | - | 0.86 | 0.82 |
en → xx | MADLAD-7B | 36.08 | 0.82 | - | 0.83 | 0.80 |
en → xx | Salamandra-2B | 32.91 | 0.90 | - | 0.83 | 0.78 |
en → xx | Salamandra-2B + ACADTRAIN | 46.86 | 0.98 | - | 0.86 | 0.81 |
en → xx | Salamandra-7B | 42.55 | 0.98 | - | 0.86 | 0.81 |
en → xx | Salamandra-7B + ACADTRAIN | 49.20 | 0.98 | - | 0.86 | 0.81 |
xx → es
Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
---|---|---|---|---|---|---|
xx → es | GPT-mini | 60.60 | 0.98 | - | 0.86 | 0.82 |
xx → es | GPT-nano | 57.88 | 0.99 | - | 0.86 | 0.82 |
xx → es | Gemini-2 | 62.02 | 0.99 | - | 0.86 | 0.82 |
xx → es | Gemini-2.5 | 61.43 | 0.98 | - | 0.87 | 0.82 |
xx → es | Llama-3-8B | 55.40 | 0.98 | - | 0.86 | 0.81 |
xx → es | Gemma-3-27B | 60.71 | 0.98 | - | 0.86 | 0.82 |
xx → es | MADLAD-7B | 43.44 | 0.76 | - | 0.83 | 0.81 |
xx → es | Salamandra-2B | 50.09 | 0.92 | - | 0.85 | 0.80 |
xx → es | Salamandra-2B + ACADTRAIN | 61.97 | 0.98 | - | 0.86 | 0.82 |
xx → es | Salamandra-7B | 57.55 | 0.98 | - | 0.86 | 0.82 |
xx → es | Salamandra-7B + ACADTRAIN | 63.60 | 0.98 | - | 0.86 | 0.82 |
es → xx
Direction | Model | d-BLEU | BP | Blonde | Comet | Comet-Kiwi |
---|---|---|---|---|---|---|
es → xx | GPT-mini | 54.19 | 0.99 | - | 0.86 | 0.81 |
es → xx | GPT-nano | 51.95 | 0.99 | - | 0.86 | 0.81 |
es → xx | Gemini-2 | 60.28 | 0.99 | - | 0.86 | 0.81 |
es → xx | Gemini-2.5 | 57.61 | 0.99 | - | 0.86 | 0.81 |
es → xx | Llama-3-8B | 52.12 | 0.99 | - | 0.85 | 0.80 |
es → xx | Gemma-3-27B | 57.31 | 0.99 | - | 0.86 | 0.81 |
es → xx | MADLAD-7B | 40.13 | 0.79 | - | 0.83 | 0.81 |
es → xx | Salamandra-2B | 47.84 | 0.94 | - | 0.84 | 0.80 |
es → xx | Salamandra-2B + ACADTRAIN | 60.09 | 0.99 | - | 0.86 | 0.81 |
es → xx | Salamandra-7B | 55.65 | 0.98 | - | 0.86 | 0.80 |
es → xx | Salamandra-7B + ACADTRAIN | 61.61 | 0.99 | - | 0.86 | 0.81 |
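The effect of fine-tuning on the 2B model can be read off the four tables by comparing the Salamandra-2B row with the + ACADTRAIN row that follows it. The sketch below copies the d-BLEU values from those rows and computes the gain per direction. The dictionary layout is an illustrative choice.

```python
# d-BLEU of Salamandra-2B before and after fine-tuning on ACAD-Train,
# copied from the four tables above as (base, fine-tuned).
D_BLEU_2B = {
    "xx → en": (37.09, 48.45),
    "en → xx": (32.91, 46.86),
    "xx → es": (50.09, 61.97),
    "es → xx": (47.84, 60.09),
}

# Absolute gain in d-BLEU points for each direction.
gains = {
    direction: round(tuned - base, 2)
    for direction, (base, tuned) in D_BLEU_2B.items()
}
mean_gain = round(sum(gains.values()) / len(gains), 2)

for direction, gain in gains.items():
    print(f"{direction}: +{gain:.2f} d-BLEU")
print(f"mean: +{mean_gain:.2f} d-BLEU")
```

The gain is between roughly 11 and 14 d-BLEU points in every direction, with the largest improvement in en → xx.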
Ethical Considerations and Limitations
Detailed information on the work done to examine the presence of unwanted social and cognitive biases in the base model can be found in the Salamandra-2B model card. No specific analysis has yet been carried out to evaluate potential biases or limitations in translation accuracy across different languages, dialects, or domains. However, we recognize the importance of identifying and addressing any harmful stereotypes, cultural inaccuracies, or systematic performance discrepancies that may arise in Machine Translation. As such, we plan to continue performing more analyses as we implement the necessary metrics and methods within our evaluation framework MT-Lens. Note that the model has only undergone preliminary instruction tuning. We urge developers to consider potential limitations and conduct safety testing and tuning tailored to their specific applications.
Additional information
Author
The Language Technologies Unit from Barcelona Supercomputing Center.
Contact
For further information, please send an email to [email protected].
Copyright
Copyright (c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center.
Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.
This work has been promoted and financed by the Government of Catalonia through the Aina project.
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
Disclaimer
Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.
The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
Citation
@misc{lacunza2025acadataparalleldatasetacademic,
title={ACADATA: Parallel Dataset of Academic Data for Machine Translation},
author={Iñaki Lacunza and Javier Garcia Gilabert and Francesca De Luca Fornaciari and Javier Aula-Blasco and Aitor Gonzalez-Agirre and Maite Melero and Marta Villegas},
year={2025},
eprint={2510.12621},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.12621},
}
License