DistilRoBERTa-base-ca
Model description
This model is a distilled version of projecte-aina/roberta-base-ca-v2.
It follows the same training procedure as DistilBERT, using the Knowledge Distillation implementation from the paper's official repository.
The resulting architecture consists of 6 layers, 768-dimensional embeddings, and 12 attention heads, for a total of 82M parameters, considerably fewer than the 125M of a standard RoBERTa-base model. This makes the model lighter and faster than the original, at the cost of slightly lower performance.
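As a quick, hedged illustration of how a masked-language model like this one can be used with the Hugging Face transformers library, the sketch below runs the fill-mask pipeline; the Hub identifier is an assumption and should be replaced with this model's actual ID:

```python
# Minimal usage sketch with the transformers library.
# MODEL_ID is an assumption; substitute the actual Hub identifier of this model.
from transformers import pipeline

MODEL_ID = "projecte-aina/distilroberta-base-ca-v2"  # hypothetical Hub ID

# Fill-mask pipeline: the model predicts the token hidden behind <mask>.
unmasker = pipeline("fill-mask", model=MODEL_ID)
predictions = unmasker("Barcelona és la capital de <mask>.")

for p in predictions:
    print(f"{p['token_str']}\t{p['score']:.4f}")
```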
Training
Training procedure
This model has been trained using Knowledge Distillation, a technique for shrinking networks to a reasonable size while minimizing the loss in performance.
It consists of distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
In this "teacher-student learning" setup, a relatively small student model is trained to mimic the behavior of a larger teacher model. As a result, the student has lower inference time and can run on commodity hardware.
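As a rough illustration of the idea (not the exact DistilBERT recipe, which also combines a masked-language-modelling loss and a cosine embedding loss), the sketch below mixes a soft KL-divergence term between teacher and student logits with a hard-label cross-entropy term; all names and hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Illustrative teacher-student loss: soft KL term + hard cross-entropy term.

    student_logits, teacher_logits: (batch, vocab_size) tensors
    labels: (batch,) ground-truth token ids
    """
    # Soften both distributions with a temperature, then match them with KL divergence.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Standard supervised loss on the hard labels (e.g. the masked tokens).
    ce_loss = F.cross_entropy(student_logits, labels)

    # Weighted combination of the two objectives.
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```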
Training data
The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
Corpus | Size (GB) |
---|---|
Catalan Crawling | 13.00 |
RacoCatalá | 8.10 |
Catalan Oscar | 4.00 |
CaWaC | 3.60 |
Cat. General Crawling | 2.50 |
Wikipedia | 1.10 |
DOGC | 0.78 |
Padicat | 0.63 |
ACN | 0.42 |
Nació Digital | 0.42 |
Cat. Government Crawling | 0.24 |
Vilaweb | 0.06 |
Catalan Open Subtitles | 0.02 |
Tweets | 0.02 |
Evaluation
Evaluation benchmark
This model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB), which includes the following datasets (a fine-tuning sketch follows the table):
Dataset | Task | Total | Train | Dev | Test |
---|---|---|---|---|---|
AnCora | NER | 13,581 | 10,628 | 1,427 | 1,526 |
AnCora | POS | 16,678 | 13,123 | 1,709 | 1,846 |
STS-ca | STS | 3,073 | 2,073 | 500 | 500 |
TeCla | TC | 137,775 | 110,203 | 13,786 | 13,786 |
TE-ca | RTE | 21,163 | 16,930 | 2,116 | 2,117 |
CatalanQA | QA | 21,427 | 17,135 | 2,157 | 2,135 |
XQuAD-ca | QA | - | - | - | 1,189 |
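To give a sense of how the model is adapted to one of these tasks, the sketch below attaches a sequence-classification head (as for TeCla-style text classification) using the standard transformers fine-tuning API; the model ID, label count, and dataset handling are assumptions, not the exact setup used for the reported results:

```python
# Hypothetical fine-tuning skeleton for a CLUB classification task (e.g. TeCla).
# MODEL_ID and NUM_LABELS are assumptions; dataset loading is left schematic.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

MODEL_ID = "projecte-aina/distilroberta-base-ca-v2"  # hypothetical Hub ID
NUM_LABELS = 19  # illustrative number of topic classes

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=NUM_LABELS)

def tokenize(batch):
    # Truncate/pad the raw sentences to the model's maximum length.
    # "text" is an assumed column name for the input sentences.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

args = TrainingArguments(output_dir="distilroberta-ca-tecla",
                         per_device_train_batch_size=16,
                         num_train_epochs=3)

# `train_dataset` / `eval_dataset` would be the tokenized CLUB splits
# (e.g. loaded with the `datasets` library); omitted here for brevity.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset,
#                   tokenizer=tokenizer)
# trainer.train()
```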
Evaluation results
The table below compares the distilled model to its teacher when both are fine-tuned on the aforementioned downstream tasks:
Model \ Task | NER (F1) | POS (F1) | STS-ca (Comb.) | TeCla (Acc.) | TE-ca (Acc.) | CatalanQA (F1/EM) | XQuAD-ca¹ (F1/EM) |
---|---|---|---|---|---|---|---|
RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 89.50/76.63 | 73.64/55.42 |
DistilRoBERTa-base-ca | 87.88 | 98.83 | 77.26 | 73.20 | 76.00 | 84.07/70.77 | 62.93/45.08 |
¹ Trained on CatalanQA, tested on XQuAD-ca (no train set).