XtremeDistilTransformers for Distilling Massive Neural Networks

XtremeDistilTransformers is a distilled, task-agnostic transformer model that uses task transfer to learn a small universal model applicable to arbitrary tasks and languages, as outlined in the paper "XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation".

We combine task transfer with multi-task distillation techniques from the papers "XtremeDistil: Multi-stage Distillation for Massive Multilingual Models" and "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers"; the accompanying code is available on GitHub.

This l6-h256 checkpoint, with 6 layers and a hidden size of 256, corresponds to 13 million parameters with an 8.7x speedup over BERT-base.
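As a rough sanity check on the parameter counts quoted in this card, the sketch below estimates the size of a BERT-style encoder from its layer count and hidden size. It rests on several assumptions that the card does not state: a 30,522-token uncased vocabulary, 512 position embeddings, 2 token-type embeddings, a feed-forward width of 4x the hidden size, and a pooler layer. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512,
                     type_vocab=2, ffn_mult=4, with_pooler=True):
    """Estimate the parameter count of a BERT-style encoder.

    All defaults are assumptions (standard BERT-base uncased settings),
    not values taken from the model card.
    """
    ffn = ffn_mult * hidden
    # Word, position and token-type embeddings, plus one LayerNorm (scale + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Q, K, V and output projections (weights + biases), plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two dense layers (hidden -> ffn -> hidden), plus one LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + 2 * hidden
    # Optional pooler: a single dense layer over the first token.
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + layers * (attention + feed_forward) + pooler


for name, layers, hidden in [("l6-h256", 6, 256),
                             ("l6-h384", 6, 384),
                             ("l12-h384", 12, 384)]:
    millions = bert_param_count(layers, hidden) / 1e6
    print(f"{name}: about {millions:.1f}M parameters")
```

Under these assumptions the estimates land close to the 13, 22 and 33 million figures reported for the three XtremeDistil checkpoints; most of the budget in the smallest model goes to the embedding table rather than the transformer layers.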

Other available checkpoints: xtremedistil-l6-h384-uncased and xtremedistil-l12-h384-uncased

The following table shows results on the GLUE dev set and SQuAD-v2.

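| Models | #Params (M) | Speedup | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | SQuAD-v2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 109 | 1x | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 76.8 | 84.8 |
| DistilBERT | 66 | 2x | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 70.7 | 81.3 |
| TinyBERT | 66 | 2x | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 73.1 | 84.3 |
| MiniLM | 66 | 2x | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 76.4 | 84.9 |
| MiniLM | 22 | 5.3x | 82.8 | 90.3 | 90.6 | 68.9 | 91.3 | 86.6 | 72.9 | 83.3 |
| XtremeDistil-l6-h256 | 13 | 8.7x | 83.9 | 89.5 | 90.6 | 80.1 | 91.2 | 90.0 | 74.1 | 85.6 |
| XtremeDistil-l6-h384 | 22 | 5.3x | 85.4 | 90.3 | 91.0 | 80.9 | 92.3 | 90.0 | 76.6 | 86.6 |
| XtremeDistil-l12-h384 | 33 | 2.7x | 87.2 | 91.9 | 91.3 | 85.6 | 93.1 | 90.4 | 80.2 | 88.5 |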
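The Avg column appears to be the unweighted mean of the seven task scores in each row, rounded to one decimal place. The short sketch below reproduces it for a few rows copied from the table; the helper name `average_score` is hypothetical.

```python
def average_score(scores):
    """Unweighted mean of the per-task scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


# Scores in table order: MNLI, QNLI, QQP, RTE, SST, MRPC, SQuAD-v2.
rows = {
    "BERT": [84.5, 91.7, 91.3, 68.6, 93.2, 87.3, 76.8],
    "XtremeDistil-l6-h256": [83.9, 89.5, 90.6, 80.1, 91.2, 90.0, 74.1],
    "XtremeDistil-l6-h384": [85.4, 90.3, 91.0, 80.9, 92.3, 90.0, 76.6],
    "XtremeDistil-l12-h384": [87.2, 91.9, 91.3, 85.6, 93.1, 90.4, 80.2],
}

for model, scores in rows.items():
    print(f"{model}: {average_score(scores)}")
```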

Tested with tensorflow 2.3.1, transformers 4.1.1, torch 1.6.0
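A minimal sketch of loading the checkpoint with the `transformers` library and the PyTorch backend might look like the following. It assumes the Hub ID `microsoft/xtremedistil-l6-h256-uncased` and uses the generic `AutoTokenizer` and `AutoModel` classes; the helper names `load_encoder` and `encode` are hypothetical, and the imports sit inside the functions so that nothing is downloaded until they are called.

```python
MODEL_ID = "microsoft/xtremedistil-l6-h256-uncased"  # assumed Hub ID


def load_encoder(model_id=MODEL_ID):
    """Download (or load from cache) the tokenizer and encoder weights."""
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def encode(texts, tokenizer, model):
    """Return the last-layer hidden states for a list of strings."""
    import torch

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Tensor of shape (batch, sequence_length, hidden_size).
    return outputs.last_hidden_state
```

Calling `tokenizer, model = load_encoder()` followed by `encode(["An example sentence."], tokenizer, model)` returns contextual embeddings. Since the model is task-agnostic, it would normally be fine-tuned with a task-specific head before being used for classification or question answering.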

If you use this checkpoint in your work, please cite:

@misc{mukherjee2021xtremedistiltransformers,
      title={XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation}, 
      author={Subhabrata Mukherjee and Ahmed Hassan Awadallah and Jianfeng Gao},
      year={2021},
      eprint={2106.04563},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
