DistilRoBERTa-base-ca
Model description
This model is a distilled version of projecte-aina/roberta-base-ca-v2.
It follows the same training procedure as DistilBERT, using the Knowledge Distillation implementation from the paper's official repository.
The resulting architecture consists of 6 layers, 768-dimensional embeddings, and 12 attention heads, for a total of 82M parameters, considerably fewer than the 125M of a standard RoBERTa-base model. This makes the model lighter and faster than the original, at the cost of slightly lower performance.
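As a quick, hedged illustration of how a masked-language model like this one can be used with the Hugging Face transformers library, the sketch below runs the fill-mask pipeline; the Hub identifier is an assumption and should be replaced with this model's actual ID:

```python
# Minimal usage sketch with the transformers library.
# MODEL_ID is an assumption; substitute the actual Hub identifier of this model.
from transformers import pipeline

MODEL_ID = "projecte-aina/distilroberta-base-ca-v2"  # hypothetical Hub ID

# Fill-mask pipeline: the model predicts the token hidden behind <mask>.
unmasker = pipeline("fill-mask", model=MODEL_ID)
predictions = unmasker("Barcelona és la capital de <mask>.")

for p in predictions:
    print(f"{p['token_str']}\t{p['score']:.4f}")
```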
Training
Training procedure
This model has been trained using Knowledge Distillation, a technique for shrinking networks to a reasonable size while minimizing the loss in performance.
It consists of distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).
In this "teacher-student learning" setup, a relatively small student model is trained to mimic the behavior of a larger teacher model. As a result, the student has lower inference time and can run on commodity hardware.
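As a rough illustration of the idea (not the exact DistilBERT recipe, which also combines a masked-language-modelling loss and a cosine embedding loss), the sketch below mixes a soft KL-divergence term between teacher and student logits with a hard-label cross-entropy term; all names and hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Illustrative teacher-student loss: soft KL term + hard cross-entropy term.

    student_logits, teacher_logits: (batch, vocab_size) tensors
    labels: (batch,) ground-truth token ids
    """
    # Soften both distributions with a temperature, then match them with KL divergence.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Standard supervised loss on the hard labels (e.g. the masked tokens).
    ce_loss = F.cross_entropy(student_logits, labels)

    # Weighted combination of the two objectives.
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```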
Training data
The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
Corpus | Size (GB) |
---|---|
Catalan Crawling | 13.00 |
RacoCatalá | 8.10 |
Catalan Oscar | 4.00 |
CaWaC | 3.60 |
Cat. General Crawling | 2.50 |
Wikipedia | 1.10 |
DOGC | 0.78 |
Padicat | 0.63 |
ACN | 0.42 |
Nació Digital | 0.42 |
Cat. Government Crawling | 0.24 |
Vilaweb | 0.06 |
Catalan Open Subtitles | 0.02 |
Tweets | 0.02 |
Evaluation
Evaluation benchmark
This model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB), which includes the following datasets (a fine-tuning sketch follows the table):
Dataset | Task | Total | Train | Dev | Test |
---|---|---|---|---|---|
AnCora | NER | 13,581 | 10,628 | 1,427 | 1,526 |
AnCora | POS | 16,678 | 13,123 | 1,709 | 1,846 |
STS-ca | STS | 3,073 | 2,073 | 500 | 500 |
TeCla | TC | 137,775 | 110,203 | 13,786 | 13,786 |
TE-ca | RTE | 21,163 | 16,930 | 2,116 | 2,117 |
CatalanQA | QA | 21,427 | 17,135 | 2,157 | 2,135 |
XQuAD-ca | QA | - | - | - | 1,189 |
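To give a sense of how the model is adapted to one of these tasks, the sketch below attaches a sequence-classification head (as for TeCla-style text classification) using the standard transformers fine-tuning API; the model ID, label count, and dataset handling are assumptions, not the exact setup used for the reported results:

```python
# Hypothetical fine-tuning skeleton for a CLUB classification task (e.g. TeCla).
# MODEL_ID and NUM_LABELS are assumptions; dataset loading is left schematic.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

MODEL_ID = "projecte-aina/distilroberta-base-ca-v2"  # hypothetical Hub ID
NUM_LABELS = 19  # illustrative number of topic classes

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=NUM_LABELS)

def tokenize(batch):
    # Truncate/pad the raw sentences to the model's maximum length.
    # "text" is an assumed column name for the input sentences.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

args = TrainingArguments(output_dir="distilroberta-ca-tecla",
                         per_device_train_batch_size=16,
                         num_train_epochs=3)

# `train_dataset` / `eval_dataset` would be the tokenized CLUB splits
# (e.g. loaded with the `datasets` library); omitted here for brevity.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset,
#                   tokenizer=tokenizer)
# trainer.train()
```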
Evaluation results
The table below compares the distilled model to its teacher when both are fine-tuned on the aforementioned downstream tasks:
Model \ Task | NER (F1) | POS (F1) | STS-ca (Comb.) | TeCla (Acc.) | TE-ca (Acc.) | CatalanQA (F1/EM) | XQuAD-ca¹ (F1/EM) |
---|---|---|---|---|---|---|---|
RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 89.50/76.63 | 73.64/55.42 |
DistilRoBERTa-base-ca | 87.88 | 98.83 | 77.26 | 73.20 | 76.00 | 84.07/70.77 | 62.93/45.08 |
¹ Trained on CatalanQA, tested on XQuAD-ca (no train set).