Abstract

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, making it difficult to run them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the abundant knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT. We then introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and the task-specific knowledge in BERT. TinyBERT4, with 4 layers, is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT4 is also significantly better than 4-layer state-of-the-art baselines for BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT6, with 6 layers, performs on par with its teacher BERT-Base.
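
As described above, the Transformer distillation objective matches the student to the teacher at the attention, hidden-state, and prediction layers, with a learned linear projection bridging the student's smaller hidden size. The following is a minimal, illustrative PyTorch sketch of those losses, not the authors' released code; the function names, tensor shapes, and dummy inputs are assumptions chosen for the example, while the uniform layer mapping and the 312/768 hidden sizes reflect the paper's TinyBERT4/BERT-Base setting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def transformer_distillation_loss(student_attn, teacher_attn,
                                  student_hidden, teacher_hidden,
                                  proj, layer_map):
    """Layer-wise distillation: MSE on attention matrices plus MSE on
    hidden states, with the student hidden states projected up to the
    teacher's width by a learned linear map (the W_h of the paper)."""
    loss = torch.tensor(0.0)
    for m, g in enumerate(layer_map):  # student layer m learns from teacher layer g
        loss = loss + F.mse_loss(student_attn[m], teacher_attn[g])
        loss = loss + F.mse_loss(proj(student_hidden[m]), teacher_hidden[g])
    return loss


def prediction_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Prediction-layer distillation: soft cross-entropy between the
    teacher's and the student's temperature-scaled logits."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()


# Dummy example: a 4-layer, 312-dim student distilled from a 12-layer,
# 768-dim teacher, using the uniform layer mapping g(m) = 3 * (m + 1).
batch, heads, seq = 2, 12, 16
d_student, d_teacher, n_student, n_teacher = 312, 768, 4, 12
layer_map = [3 * (m + 1) - 1 for m in range(n_student)]  # teacher indices [2, 5, 8, 11]
proj = nn.Linear(d_student, d_teacher)

student_attn = [torch.rand(batch, heads, seq, seq) for _ in range(n_student)]
teacher_attn = [torch.rand(batch, heads, seq, seq) for _ in range(n_teacher)]
student_hidden = [torch.rand(batch, seq, d_student) for _ in range(n_student)]
teacher_hidden = [torch.rand(batch, seq, d_teacher) for _ in range(n_teacher)]

layer_loss = transformer_distillation_loss(
    student_attn, teacher_attn, student_hidden, teacher_hidden, proj, layer_map)
pred_loss = prediction_distillation_loss(
    torch.randn(batch, 2), torch.randn(batch, 2), temperature=1.0)
print(float(layer_loss), float(pred_loss))
```

In the two-stage framework, losses of this form are applied first on general-domain text (general distillation) and then on task-specific data (task-specific distillation), where prediction-layer distillation is also added.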

Keywords

Computer science, Transformer, Distillation, Inference, Benchmark (surveying), Language model, Artificial intelligence, Natural language understanding, Task (project management), Machine learning, Natural language, Natural language processing, Voltage

Publication Info

Year
2020
Type
article
Citations
1491
Access
Closed

Citation Metrics

OpenAlex
1491
Influential
295
CrossRef
666

Cite This

Xiaoqi Jiao, Yichun Yin, Lifeng Shang et al. (2020). TinyBERT: Distilling BERT for Natural Language Understanding. Findings of the Association for Computational Linguistics: EMNLP 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.372

Identifiers

DOI
10.18653/v1/2020.findings-emnlp.372
arXiv
1909.10351

Data Quality

Data completeness: 84%