Aitana-2B-S

Model description

Aitana-2B-S is a generative language model with a decoder-only architecture. It has been trained from Salamandra-2B using data in Valencian, in order to achieve greater representation of this minority language, which is very closely related to Catalan. The model has been continuously pre-trained for two epochs, processing 2.12 billion tokens over the course of training. Due to the data sources used, the political and administrative domains are strongly represented in the model's register. The data was also anonymised during pre-processing to avoid training on content that could violate people's privacy.

The model uses Salamandra-2B as its starting point for training and keeps the same tokenizer.
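
Since the tokenizer is inherited unchanged from the base model, a quick check such as the minimal sketch below can confirm that both checkpoints tokenize text identically; the Hugging Face id BSC-LT/salamandra-2b for the base model is an assumption here, not taken from this card.

from transformers import AutoTokenizer

# Assumed base-model repository id; the official Salamandra-2B repo name may differ.
base_id = "BSC-LT/salamandra-2b"
aitana_id = "gplsi/Aitana-2B-S"

tok_base = AutoTokenizer.from_pretrained(base_id)
tok_aitana = AutoTokenizer.from_pretrained(aitana_id)

sample = "Les corts valencianes han pres la decisió de"
assert tok_base(sample)["input_ids"] == tok_aitana(sample)["input_ids"]
print("Vocabulary size:", tok_aitana.vocab_size)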

Intended uses and limitations

Aitana-2B-S is a base model for causal language modeling. It can be used as-is for text generation, although fine-tuning or instruction-tuning on specific tasks is recommended before final use.

The model has been trained on data in a formal register, mainly from the administrative and political domains, so text generated with it is expected to follow the same register.

How to use

import torch
from transformers import pipeline, AutoTokenizer

input_text = "Les corts valencianes han pres la decisió de"
model_id = "gplsi/Aitana-2B-S"

# Load the tokenizer and build a text-generation pipeline on top of the model.
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Sample a continuation of the prompt.
generation = generator(
    input_text,
    do_sample=True,
    top_k=10,
    eos_token_id=tokenizer.eos_token_id,
)
print(f"Result: {generation[0]['generated_text']}")

Training

Training data

The training corpus was obtained by web scraping public data from different sources, such as the Official Gazette of the University of Alicante (BOUA), the Official Journal of the Generalitat Valenciana (DOGV) and data provided by the Valencian Parliament (DSCV and DSCCV), giving a total of 1.304 million tokens, as detailed in the following table.

Dataset | Language | Total Sentences | Total Words | Total Numbers | Other Symbols | Unique Words | Total Tokens | Average Sentence Length | Average Word Length
BOUA | va | 0.606M | 12.355M | 0.488M | 0.055M | 0.211M | 12.899M | 21.27 | 4.89
DOGCV | va | 4.569M | 50.566M | 6.339M | 0.613M | 17.436M | 57.517M | 12.59 | 4.68
DOGV | va | 18.598M | 311.380M | 24.138M | 2.731M | 11.416M | 338.250M | 18.19 | 4.88
DSCCV | va | 2.353M | 46.116M | 0.554M | 2.352m | 5.031M | 46.672M | 19.84 | 4.56
DSCV | va | 1.646M | 32.496M | 0.433M | 1.427m | 3.796M | 32.930M | 20.01 | 4.65
UN | va | 0.394M | 12.289M | 0.253M | 0.015M | 0.533M | 12.556M | 31.86 | 4.86
VJ | va | 0.913M | 23.594M | 0.466M | 23.314m | 0.849M | 24.084M | 26.39 | 4.57
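
Statistics of this kind can be approximated with a short script. The sketch below is illustrative only (it is not the authors' preprocessing code) and uses the model's own tokenizer for the token counts; in practice it would be applied to each full dataset file.

import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gplsi/Aitana-2B-S")

def corpus_stats(text):
    """Compute rough corpus statistics for one dataset loaded as a single string."""
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)    # alphabetic word tokens
    numbers = re.findall(r"\d+", text)         # numeric tokens
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return {
        "sentences": len(sentences),
        "words": len(words),
        "numbers": len(numbers),
        "unique_words": len({w.lower() for w in words}),
        "tokens": len(token_ids),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

# In practice this would be run over each dataset (BOUA, DOGV, DSCV, ...).
print(corpus_stats("Les corts valencianes han pres la decisió de convocar el ple. La sessió serà demà."))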

Several of the downloaded sources had already been used in the training of Meta-Llama-3-8B, so the data-collection date of that earlier model was taken into account and those web pages were scraped only from that date onwards.

Information on the datasets used for training is shown below:

  • Official Bulletin of the University of Alicante (BOUA): Documents issued periodically by the University of Alicante concerning grants, regulations and legal resolutions, specifically in their Valencian version.

  • Legacy Official Journal of the Generalitat Valenciana (DOGCV): This journal contains historical documents issued by the Valencian Community. These documents were initially recorded on paper and digitised with the standardisation of the digital format. They have the same subject matter as the DOGV documents but were generated between 1980 and 1997.

  • Official Journal of the Generalitat Valenciana (DOGV): These documents contain official communications of the Valencian Community. They mainly deal with issuing laws, legal measures, and public sector communication. These journals were issued from 1998 to 2023.

  • Valencian Parliament Diary Dataset (DSCCV): Records of the various committee meetings held in the parliament, with each meeting documented in a separate text file.

  • Journal of the Valencian Parliament (DSCV): Transcripts of the meetings held in the parliament's plenary sessions, with data from 1999 to 2022.

  • University news (UN): News in a colloquial register from universities that have Valencian as an official language, including the universities of Valencia, Alicante and Jaume I, and the Polytechnic University of Valencia.

  • Valencian Journals (VJ): A set of 10 different Valencian journals written in a colloquial register, included to provide everyday language alongside the legal and bureaucratic register of the previous sources.

Training parameters

A large context window was desired for text generation, so an input size of 2048 tokens was used during training, with a minimum context window of 512 tokens whenever input sequences had to be truncated. Most of the data was used for the training stage and the rest for evaluation; the exact split and the other parameters used during training are summarised in the following table:

Parameter | Value
Epochs | 2
Learning Rate | 2e-5
Warmup Steps | 0
Precision | bf16
Weight Decay | 1e-1
Training Fraction | 0.95
Evaluation Fraction | 0.05
Input Size (tokens) | 2048
Minimum Context Window (tokens) | 512
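
As a rough illustration of the input preparation described above (not the authors' actual preprocessing code), documents could be tokenized and packed into 2048-token training sequences, discarding trailing chunks shorter than the 512-token minimum context window:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gplsi/Aitana-2B-S")

MAX_LEN = 2048      # input size in tokens
MIN_CONTEXT = 512   # minimum context window kept after truncation

def chunk_document(text):
    """Split a document into training sequences of at most MAX_LEN tokens,
    dropping any final chunk shorter than MIN_CONTEXT tokens."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + MAX_LEN] for i in range(0, len(ids), MAX_LEN)]
    return [c for c in chunks if len(c) >= MIN_CONTEXT]

example = "Les corts valencianes han pres la decisió de convocar el ple. " * 200
print([len(c) for c in chunk_document(example)])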

Distributed Training Strategy

Training used the Fully Sharded Data Parallel (FSDP) distributed strategy, sharding the model across the 4 A100 GPUs available for training, with a mini-batch size of 1 per device and 64 gradient accumulation steps.
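
A minimal sketch of how such a run could be configured with the Hugging Face Trainer is shown below. It is not the authors' training script: the base-model id, the tiny toy dataset and the output path are placeholders, and only the hyperparameters listed in the tables above are taken from this card.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_id = "BSC-LT/salamandra-2b"  # assumed base-model repository id
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Toy stand-in for the real corpus of pre-tokenized Valencian text chunks.
texts = ["Les corts valencianes han pres la decisió de convocar el ple."] * 20
dataset = Dataset.from_dict(dict(tokenizer(texts))).train_test_split(test_size=0.05)

args = TrainingArguments(
    output_dir="aitana-2b-s",            # placeholder path
    num_train_epochs=2,
    learning_rate=2e-5,
    weight_decay=1e-1,
    warmup_steps=0,
    bf16=True,
    per_device_train_batch_size=1,       # mini-batch of 1 per A100
    gradient_accumulation_steps=64,
    fsdp="full_shard auto_wrap",         # Fully Sharded Data Parallel
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# FSDP requires a distributed launch, e.g.: torchrun --nproc_per_node=4 train.py
trainer.train()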

Languages

In addition to the data already used to train Meta-Llama-3-8B, data entirely in Valencian from the sources described in the previous section has been used.

Evaluation

The following tables show the results obtained on different benchmarks, compared with the model used as the starting point for continuous pre-training. All results are for the pre-trained model; no instruction tuning or fine-tuning of any kind has been performed.
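
One common way to reproduce zero-shot numbers of this kind is the EleutherAI lm-evaluation-harness. The snippet below is only a sketch of that approach; the task names are illustrative placeholders, not the exact configurations used by the authors.

import lm_eval

# Illustrative only: task names and model arguments are assumptions.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gplsi/Aitana-2B-S,dtype=bfloat16",
    tasks=["xnli_ca", "arc_ca_easy"],   # hypothetical Catalan task names
    batch_size=8,
)
print(results["results"])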

Valencian

Classification Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
XNLI | va | Natural Language Inference | acc | 0.475 | 0.473

Generation Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
Cocoteros | va | Reading Comprehension | bleu | 6.32 | 5.76
Phrases ca-va | va-ca | Translation - Adaptation | bleu | 79.82 | 81.92
Phrases va-ca | va-ca | Translation - Adaptation | bleu | 78.05 | 76.53
Phrases va-es | va-es | Translation | bleu | 76.04 | 75.99
Phrases es-va | es-va | Translation | bleu | 58.86 | 61.51

Catalan

Classification Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
Belebele Cat_latn | ca | Reading Comprehension | acc | 0.231 | 0.257
COPA | ca | Commonsense Reasoning | acc | 0.700 | 0.712
XStoryCloze | ca | Commonsense Reasoning | acc | 0.655 | 0.657
OpenBookQA | ca | Question Answering | acc | 0.294 | 0.282
PAWS | ca | Paraphrasing | acc | 0.556 | 0.551
PiQA | ca | Question Answering | acc | 0.643 | 0.646
SiQA | ca | Question Answering | acc | 0.434 | 0.432
ARC Easy | ca | Question Answering | acc | 0.551 | 0.549
ARC Challenge | ca | Question Answering | acc | 0.290 | 0.288
XNLI | ca | Natural Language Inference | acc | 0.473 | 0.480
Teca | ca | Natural Language Inference | acc | 0.465 | 0.459
WNLI | ca | Natural Language Inference | acc | 0.577 | 0.563
CatCoLA | ca | Linguistic Acceptability | acc | 0.543 | 0.525
CatCoLA | ca | Linguistic Acceptability | mcc | 0.046 | 0.023
CatalanQA | ca | Question Answering | F1 | 0.668 | 0.655
MGSM Direct | ca | Math | exact match | 0.024 | 0.028
CatalanQA | ca | Question Answering | exact match | 0.437 | 0.415
XQuAD | ca | Question Answering | exact match | 0.371 | 0.354
XQuAD | ca | Question Answering | F1 | 0.579 | 0.566

Generation Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
Cabreu abstractive | ca | Summarization | bleu | 5.78 | 6.24
Cabreu extractive | ca | Summarization | bleu | 42.89 | 41.19
Cabreu extreme | ca | Summarization | bleu | 3.29 | 3.81

Spanish

Classification Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
Belebele Cat_latn | es | Reading Comprehension | acc | 0.228 | 0.224
PAWS | es | Paraphrasing | acc | 0.561 | 0.543
XNLI | es | Natural Language Inference | acc | 0.439 | 0.422
WNLI | es | Natural Language Inference | acc | 0.563 | 0.563
XStoryCloze | es | Commonsense Reasoning | acc | 0.653 | 0.652
Escola | es | Linguistic Acceptability | acc | 0.593 | 0.536
Escola | es | Linguistic Acceptability | mcc | 0.031 | 0.010
OpenBookQA | es | Question Answering | acc | 0.308 | 0.314
MGSM Direct | es | Math | exact match | 0.020 | 0.020
XQuAD | es | Question Answering | exact match | 0.377 | 0.373
XQuAD | es | Question Answering | F1 | 0.584 | 0.583

Generation Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
Cocoteros | es | Reading Comprehension | bleu | 8.46 | 7.35
XLSum | es | Summarization | bleu | 0.801 | 0.434

English

Classification Benchmarks

Dataset | Lang. | Task | Metric | Salamandra-2B | Aitana-2B-S
ARC Challenge | en | Question Answering | acc | 0.370 | 0.374
ARC Easy | en | Question Answering | acc | 0.722 | 0.719
Belebele Eng_latn | en | Reading Comprehension | acc | 0.216 | 0.229
PAWS | en | Paraphrasing | acc | 0.561 | 0.562
XNLI | en | Natural Language Inference | acc | 0.462 | 0.446
XStoryCloze | en | Commonsense Reasoning | acc | 0.711 | 0.713
OpenBookQA | en | Question Answering | acc | 0.300 | 0.308
PiQA | en | Question Answering | acc | 0.737 | 0.743
Social IQa | en | Question Answering | acc | 0.454 | 0.451
WNLI | en | Natural Language Inference | acc | 0.465 | 0.578
MGSM Direct | en | Math | exact match | 0.064 | 0.064
TriviaQA | en | Question Answering | exact match | -0.019 | 0.015

Additional information

Author

Language and Information System Group GPLSI

Contact

For further information, please send an email to GPLSI

Copyright

Copyright (c) 2025 by GPLSI (https://gplsi.dlsi.ua.es/).

License

Apache License 2.0

Funding

This work was funded by the ILENIA-VIVES project (2022/TL22/00215334).

Disclaimer

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (GPLSI) be liable for any results arising from the use made by third parties.
