This model was created as a research project to study the impact of high-quality data from the CoRoLa corpus on a small model. The model is intended only for research and may not be suitable for production use.

This model is the result of continuous pre-training of the Llama-3.2-1B model on selected data from the Representative Corpus of Contemporary Romanian Language (CoRoLa). The purpose of the experiments was to evaluate the impact of a small, high-quality corpus of Romanian on a small model, so we focused on only a small part of the CoRoLa corpus. We filtered the documents by the CoRoLa metadata attributes DocumentType (Book, inBook, inCollection) and DocumentTextDomain (Science), which yielded the 7,568 documents used in this research.
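The metadata filter described above can be sketched as a simple selection over per-document attribute records. This is an illustrative sketch only: the actual CoRoLa corpus format, field names, and filtering pipeline are assumptions, not the authors' published code.

```python
# Hypothetical sketch of the document selection step.
# The dict-based record format and field names mirror the metadata
# attributes named in the text (DocumentType, DocumentTextDomain),
# but the real CoRoLa storage format is an assumption.
ALLOWED_TYPES = {"Book", "inBook", "inCollection"}
ALLOWED_DOMAINS = {"Science"}

def select_documents(docs):
    """Keep only documents whose metadata matches the filter criteria."""
    return [
        d for d in docs
        if d.get("DocumentType") in ALLOWED_TYPES
        and d.get("DocumentTextDomain") in ALLOWED_DOMAINS
    ]

# Toy example with made-up records:
docs = [
    {"DocumentType": "Book", "DocumentTextDomain": "Science", "text": "..."},
    {"DocumentType": "Article", "DocumentTextDomain": "Science", "text": "..."},
    {"DocumentType": "inBook", "DocumentTextDomain": "Law", "text": "..."},
]
selected = select_documents(docs)
print(len(selected))  # only the first record passes both filters
```

In the actual experiments this selection produced 7,568 documents; the toy records here exist only to show the filter's shape.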

A paper detailing the results was submitted to the 20th International Conference on Linguistic Resources and Tools for Natural Language Processing (CONSILR 2025).

Model size: 1.24B parameters (BF16, Safetensors).