Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
Deep Learning Explainer
This video explains how to train a very competitive compact model using pre-training and distillation. With these two techniques, it's possible to train a model 2-3 times smaller with minimal or no loss in performance.
0:00 - Intro
2:29 - Pre-training + distillation + fine-tuning
3:58 - Knowledge distillation
6:04 - Objective function for distillation
7:26 - Teachers & students
10:37 - Data types
12:10 - Truncating deep models
13:37 - Comparison to other work
16:49 - Analysis tasks
19:10 - Is it enough to pre-train word embeddings?
21:35 - Is it worse to truncate deep pre-trained models?
23:24 - What is the best student for a fixed parameter budget?
34:13 - Robustness to domain shift
35:15 - Corpus similarity measurement
41:19 - Conclusion
Related video: Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (paper explained) https://youtu.be/AKCPPvaz8tU
Transformer Architecture Explained https://youtu.be/ELTGIye424E
Paper: Well-Read Students Learn Better: On the Importance of Pre-training Compact Models https://arxiv.org/abs/1908.08962
Abstract
Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to down-stream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that they have a compound effect even when sequentially applied on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.
https://www.youtube.com/watch?v=LoyyKVJgHKo
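As a rough illustration of the standard knowledge distillation step inside Pre-trained Distillation, the sketch below trains a compact student against a fine-tuned teacher's soft predictions. The `student`/`teacher` modules, the `temperature` value, and the training-step wrapper are illustrative assumptions for this sketch, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between the teacher's predictive distribution
    and the student's (standard knowledge distillation).

    temperature > 1 softens both distributions; the default here is an
    illustrative choice, not a value from the paper."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Sketch of the three-stage Pre-trained Distillation recipe:
#   1) pre-train the compact student with masked language modeling,
#   2) distill: train the student on unlabeled task data against the
#      fine-tuned teacher's logits (the step below),
#   3) optionally fine-tune the student on labeled task data.
def distillation_step(student, teacher, batch, optimizer):
    # Assumes student/teacher map a batch of tensors to classification logits.
    student.train()
    with torch.no_grad():  # the teacher is frozen during distillation
        teacher_logits = teacher(**batch)
    student_logits = student(**batch)
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```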