Aitana-2B-S

Model description

Aitana-2B-S is a generative language model with a decoder-only architecture. It has been trained from Salamandra-2B using data in Valencian, in order to achieve greater representation of this minority language, which is very closely related to Catalan. The model has been continuously pre-trained for two epochs, processing 2.12 billion tokens over the course of training. Due to the data sources used, the political and administrative domains are strongly represented in the model's register. The data was also anonymised during pre-processing to avoid training on content that could violate people's privacy.

The model uses Salamandra-2B as its starting point for training and keeps the same tokenizer.
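
Since the tokenizer is inherited unchanged from the base model, a quick check such as the minimal sketch below can confirm that both checkpoints tokenize text identically; the Hugging Face id BSC-LT/salamandra-2b for the base model is an assumption here, not taken from this card.

from transformers import AutoTokenizer

# Assumed base-model repository id; the official Salamandra-2B repo name may differ.
base_id = "BSC-LT/salamandra-2b"
aitana_id = "gplsi/Aitana-2B-S"

tok_base = AutoTokenizer.from_pretrained(base_id)
tok_aitana = AutoTokenizer.from_pretrained(aitana_id)

sample = "Les corts valencianes han pres la decisió de"
assert tok_base(sample)["input_ids"] == tok_aitana(sample)["input_ids"]
print("Vocabulary size:", tok_aitana.vocab_size)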

Intended uses and limitations

Aitana-2B-S is a base model for causal language modeling. It can be used as-is for text generation, although fine-tuning or instruction-tuning on specific tasks is recommended before final use.

The model has been trained on data in a formal register, mainly from the administrative and political domains, so text generated with it is expected to follow the same register.

How to use

import torch
from transformers import pipeline, AutoTokenizer

input_text = "Les corts valencianes han pres la decisió de"
model_id = "gplsi/Aitana-2B-S"

# Load the tokenizer and build a text-generation pipeline on top of the model.
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Sample a continuation of the prompt.
generation = generator(
    input_text,
    do_sample=True,
    top_k=10,
    eos_token_id=tokenizer.eos_token_id,
)
print(f"Result: {generation[0]['generated_text']}")

Training

Training data

The training corpus was obtained by web scraping public data from different sources, such as the Official Gazette of the University of Alicante (BOUA), the Official Journal of the Generalitat Valenciana (DOGV) and data provided by the Valencian Parliament (DSCV and DSCCV), giving a total of 1.304 million tokens, as detailed in the following table.

Dataset | Language | Total Sentences | Total Words | Total Numbers | Other Symbols | Unique Words | Total Tokens | Average Sentence Length | Average Word Length
BOUA | va | 0.606M | 12.355M | 0.488M | 0.055M | 0.211M | 12.899M | 21.27 | 4.89
DOGCV | va | 4.569M | 50.566M | 6.339M | 0.613M | 17.436M | 57.517M | 12.59 | 4.68
DOGV | va | 18.598M | 311.380M | 24.138M | 2.731M | 11.416M | 338.250M | 18.19 | 4.88
DSCCV | va | 2.353M | 46.116M | 0.554M | 2.352m | 5.031M | 46.672M | 19.84 | 4.56
DSCV | va | 1.646M | 32.496M | 0.433M | 1.427m | 3.796M | 32.930M | 20.01 | 4.65
UN | va | 0.394M | 12.289M | 0.253M | 0.015M | 0.533M | 12.556M | 31.86 | 4.86
VJ | va | 0.913M | 23.594M | 0.466M | 23.314m | 0.849M | 24.084M | 26.39 | 4.57
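
Statistics of this kind can be approximated with a short script. The sketch below is illustrative only (it is not the authors' preprocessing code) and uses the model's own tokenizer for the token counts; in practice it would be applied to each full dataset file.

import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gplsi/Aitana-2B-S")

def corpus_stats(text):
    """Compute rough corpus statistics for one dataset loaded as a single string."""
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)    # alphabetic word tokens
    numbers = re.findall(r"\d+", text)         # numeric tokens
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return {
        "sentences": len(sentences),
        "words": len(words),
        "numbers": len(numbers),
        "unique_words": len({w.lower() for w in words}),
        "tokens": len(token_ids),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

# In practice this would be run over each dataset (BOUA, DOGV, DSCV, ...).
print(corpus_stats("Les corts valencianes han pres la decisió de convocar el ple. La sessió serà demà."))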

Several of the downloaded sources had already been used in the training of Meta-Llama-3-8B, so the data-collection date of that earlier model was taken into account and those web pages were scraped only from that date onwards.

Information on the datasets used for training is shown below:

  • Official Bulletin of the University of Alicante (BOUA): Documents issued periodically by the University of Alicante concerning grants, regulations and legal resolutions, specifically in their Valencian version.

  • Legacy Official Journal of the Generalitat Valenciana (DOGCV): This journal contains historical documents issued by the Valencian Community. These documents were initially recorded on paper and digitised with the standardisation of the digital format. They have the same subject matter as the DOGV documents but were generated between 1980 and 1997.

  • Official Journal of the Generalitat Valenciana (DOGV): These documents contain official communications of the Valencian Community. They mainly deal with issuing laws, legal measures, and public sector communication. These journals were issued from 1998 to 2023.

  • Valencian Parliament Diary Dataset (DSCCV): Records of the various committee meetings held in the parliament, with each meeting documented in a separate text file.

  • Journal of the Valencian Parliament (DSCV): Transcripts of the meetings held in the parliament's plenary sessions, with data from 1999 to 2022.

  • University news (UN): News in a colloquial register from universities that have Valencian as an official language, including the universities of Valencia, Alicante and Jaume I, and the Polytechnic University of Valencia.

  • Valencian Journals (VJ): A set of 10 different Valencian journals written in a colloquial register, included to provide everyday language alongside the legal and bureaucratic register of the previous sources.

Training parameters

A large context window was desired for text generation, so an input size of 2048 tokens was used during training, with a minimum context window of 512 tokens whenever input sequences had to be truncated. Most of the data was used for the training stage and the rest for evaluation; the exact split and the other parameters used during training are summarised in the following table:

Parameter | Value
Epochs | 2
Learning Rate | 2e-5
Warmup Steps | 0
Precision | bf16
Weight Decay | 1e-1
Training Fraction | 0.95
Evaluation Fraction | 0.05
Input Size (tokens) | 2048
Minimum Context Window (tokens) | 512
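
As a rough illustration of the input preparation described above (not the authors' actual preprocessing code), documents could be tokenized and packed into 2048-token training sequences, discarding trailing chunks shorter than the 512-token minimum context window:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gplsi/Aitana-2B-S")

MAX_LEN = 2048      # input size in tokens
MIN_CONTEXT = 512   # minimum context window kept after truncation

def chunk_document(text):
    """Split a document into training sequences of at most MAX_LEN tokens,
    dropping any final chunk shorter than MIN_CONTEXT tokens."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + MAX_LEN] for i in range(0, len(ids), MAX_LEN)]
    return [c for c in chunks if len(c) >= MIN_CONTEXT]

example = "Les corts valencianes han pres la decisió de convocar el ple. " * 200
print([len(c) for c in chunk_document(example)])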

Distributed Training Strategy

Training used the Fully Sharded Data Parallel (FSDP) distributed strategy, sharding the model across the 4 A100 GPUs available for training, with a mini-batch size of 1 per device and 64 gradient accumulation steps.
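
A minimal sketch of how such a run could be configured with the Hugging Face Trainer is shown below. It is not the authors' training script: the base-model id, the tiny toy dataset and the output path are placeholders, and only the hyperparameters listed in the tables above are taken from this card.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_id = "BSC-LT/salamandra-2b"  # assumed base-model repository id
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Toy stand-in for the real corpus of pre-tokenized Valencian text chunks.
texts = ["Les corts valencianes han pres la decisió de convocar el ple."] * 20
dataset = Dataset.from_dict(dict(tokenizer(texts))).train_test_split(test_size=0.05)

args = TrainingArguments(
    output_dir="aitana-2b-s",            # placeholder path
    num_train_epochs=2,
    learning_rate=2e-5,
    weight_decay=1e-1,
    warmup_steps=0,
    bf16=True,
    per_device_train_batch_size=1,       # mini-batch of 1 per A100
    gradient_accumulation_steps=64,
    fsdp="full_shard auto_wrap",         # Fully Sharded Data Parallel
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# FSDP requires a distributed launch, e.g.: torchrun --nproc_per_node=4 train.py
trainer.train()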

Languages

In addition to the data already used to train Meta-Llama-3-8B, data entirely in Valencian from the sources described in the previous section has been used.

Evaluation

The following tables show the results obtained on different benchmarks, compared with the model used as the starting point for continuous pre-training. All results are for the pre-trained model; no instruction tuning or fine-tuning of any kind has been performed.
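
One common way to reproduce zero-shot numbers of this kind is the EleutherAI lm-evaluation-harness. The snippet below is only a sketch of that approach; the task names are illustrative placeholders, not the exact configurations used by the authors.

import lm_eval

# Illustrative only: task names and model arguments are assumptions.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gplsi/Aitana-2B-S,dtype=bfloat16",
    tasks=["xnli_ca", "arc_ca_easy"],   # hypothetical Catalan task names
    batch_size=8,
)
print(results["results"])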

Valencian

Classification Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
XNLI | va | Natural Language Inference | acc | 0.475 | 0.473

Generation Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
Cocoteros | va | Reading Comprehension | bleu | 6.32 | 5.76
Phrases ca-va | va-ca | Translation - Adaptation | bleu | 79.82 | 81.92
Phrases va-ca | va-ca | Translation - Adaptation | bleu | 78.05 | 76.53
Phrases va-es | va-es | Translation | bleu | 76.04 | 75.99
Phrases es-va | es-va | Translation | bleu | 58.86 | 61.51

Catalan

Classification Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
Belebele Cat_latn | ca | Reading Comprehension | acc | 0.231 | 0.257
COPA | ca | Commonsense Reasoning | acc | 0.700 | 0.712
XStoryCloze | ca | Commonsense Reasoning | acc | 0.655 | 0.657
OpenBookQA | ca | Question Answering | acc | 0.294 | 0.282
PAWS | ca | Paraphrasing | acc | 0.556 | 0.551
PiQA | ca | Question Answering | acc | 0.643 | 0.646
SiQA | ca | Question Answering | acc | 0.434 | 0.432
ARC Easy | ca | Question Answering | acc | 0.551 | 0.549
ARC Challenge | ca | Question Answering | acc | 0.290 | 0.288
XNLI | ca | Natural Language Inference | acc | 0.473 | 0.480
Teca | ca | Natural Language Inference | acc | 0.465 | 0.459
WNLI | ca | Natural Language Inference | acc | 0.577 | 0.563
CatCoLA | ca | Linguistic Acceptability | acc | 0.543 | 0.525
CatCoLA | ca | Linguistic Acceptability | mcc | 0.046 | 0.023
CatalanQA | ca | Question Answering | F1 | 0.668 | 0.655
MGSM Direct | ca | Math | exact match | 0.024 | 0.028
CatalanQA | ca | Question Answering | exact match | 0.437 | 0.415
XQuAD | ca | Question Answering | exact match | 0.371 | 0.354
XQuAD | ca | Question Answering | F1 | 0.579 | 0.566

Generation Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
Cabreu abstractive | ca | Summarization | bleu | 5.78 | 6.24
Cabreu extractive | ca | Summarization | bleu | 42.89 | 41.19
Cabreu extreme | ca | Summarization | bleu | 3.29 | 3.81

Spanish

Classification Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
Belebele Cat_latn | es | Reading Comprehension | acc | 0.228 | 0.224
PAWS | es | Paraphrasing | acc | 0.561 | 0.543
XNLI | es | Natural Language Inference | acc | 0.439 | 0.422
WNLI | es | Natural Language Inference | acc | 0.563 | 0.563
XStoryCloze | es | Commonsense Reasoning | acc | 0.653 | 0.652
Escola | es | Linguistic Acceptability | acc | 0.593 | 0.536
Escola | es | Linguistic Acceptability | mcc | 0.031 | 0.010
OpenBookQA | es | Question Answering | acc | 0.308 | 0.314
MGSM Direct | es | Math | exact match | 0.020 | 0.020
XQuAD | es | Question Answering | exact match | 0.377 | 0.373
XQuAD | es | Question Answering | F1 | 0.584 | 0.583

Generation Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
Cocoteros | es | Reading Comprehension | bleu | 8.46 | 7.35
XLSum | es | Summarization | bleu | 0.801 | 0.434

English

Classification Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
ARC Challenge | en | Question Answering | acc | 0.370 | 0.374
ARC Easy | en | Question Answering | acc | 0.722 | 0.719
Belebele Eng_latn | en | Reading Comprehension | acc | 0.216 | 0.229
PAWS | en | Paraphrasing | acc | 0.561 | 0.562
XNLI | en | Natural Language Inference | acc | 0.462 | 0.446
XStoryCloze | en | Commonsense Reasoning | acc | 0.711 | 0.713
OpenBookQA | en | Question Answering | acc | 0.300 | 0.308
PiQA | en | Question Answering | acc | 0.737 | 0.743
Social IQa | en | Question Answering | acc | 0.454 | 0.451
WNLI | en | Natural Language Inference | acc | 0.465 | 0.578
MGSM Direct | en | Math | exact match | 0.064 | 0.064
TriviaQA | en | Question Answering | exact match | -0.019 | 0.015

Additional information

Author

Language and Information System Group GPLSI

Contact

For further information, please send an email to GPLSI

Copyright

Copyright (c) 2025 by GPLSI (https://gplsi.dlsi.ua.es/).

License

Apache License 2.0

Funding

This work was funded by the ILENIA-VIVES project (2022/TL22/00215334).

Disclaimer

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (GPLSI) be liable for any results arising from the use made by third parties.
