Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
Deep Learning Explainer
This video explains how to train a very competitive compact model using pre-training and distillation. With these two techniques, it's possible to train a model 2-3 times smaller with minimal or no loss in performance.
0:00 - Intro
2:29 - Pre-training + distillation + fine-tuning
3:58 - Knowledge distillation
6:04 - Objective function for distillation
7:26 - Teachers & students
10:37 - Data types
12:10 - Truncating deep models
13:37 - Comparison to other work
16:49 - Analysis tasks
19:10 - Is it enough to pre-train word embeddings?
21:35 - Is it worse to truncate deep pre-trained models?
23:24 - What is the best student for a fixed parameter budget?
34:13 - Robustness to domain shift
35:15 - Corpus similarity measurement
41:19 - Conclusion
Related video: Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (paper explained) https://youtu.be/AKCPPvaz8tU
Transformer Architecture Explained https://youtu.be/ELTGIye424E
Paper: Well-Read Students Learn Better: On the Importance of Pre-training Compact Models https://arxiv.org/abs/1908.08962
Abstract
Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to down-stream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that they have a compound effect even when sequentially applied on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.
https://www.youtube.com/watch?v=LoyyKVJgHKo
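As a rough illustration of the standard knowledge distillation step inside Pre-trained Distillation, the sketch below trains a compact student against a fine-tuned teacher's soft predictions. The `student`/`teacher` modules, the `temperature` value, and the training-step wrapper are illustrative assumptions for this sketch, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between the teacher's predictive distribution
    and the student's (standard knowledge distillation).

    temperature > 1 softens both distributions; the default here is an
    illustrative choice, not a value from the paper."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Sketch of the three-stage Pre-trained Distillation recipe:
#   1) pre-train the compact student with masked language modeling,
#   2) distill: train the student on unlabeled task data against the
#      fine-tuned teacher's logits (the step below),
#   3) optionally fine-tune the student on labeled task data.
def distillation_step(student, teacher, batch, optimizer):
    # Assumes student/teacher map a batch of tensors to classification logits.
    student.train()
    with torch.no_grad():  # the teacher is frozen during distillation
        teacher_logits = teacher(**batch)
    student_logits = student(**batch)
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```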