Sandwich Transformer: Improving Transformer Models by Reordering their Sublayers
Deep Learning Explainer
Is the transformer architecture the optimal way to model language? This video explains how reordering the transformer's sublayers can significantly improve its performance, and how you can use these insights to design a better transformer.
0:00 - Intro
2:17 - Transformer block components
3:04 - Interleaved vs. non-interleaved
5:03 - Is interleaving optimal?
7:02 - Sublayer reordering results
8:53 - Are balanced architectures better?
11:40 - Attention first, feedforward later
13:08 - Designing a better transformer
15:35 - Sandwich Transformer on WikiText-103
16:58 - Sandwich coefficient
19:00 - Toronto Books Corpus
20:12 - Machine translation
25:12 - Discussion on the NMT case
27:00 - Summary
Transformer Architecture Explained https://youtu.be/ELTGIye424E
Paper: Improving Transformer Models by Reordering their Sublayers https://arxiv.org/abs/1911.03864
Code: https://github.com/ofirpress/sandwich_transformer
Abstract: Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with the language modeling objective. We observe that some of these models are able to achieve better performance than the interleaved baseline, and that those successful variants tend to have more self-attention at the bottom and more feedforward sublayers at the top. We propose a new transformer pattern that adheres to this property, the sandwich transformer, and show that it improves perplexity on multiple word-level and character-level language modeling benchmarks, at no cost in parameters, memory, or training time. However, the sandwich reordering pattern does not guarantee performance gains across every task, as we demonstrate on machine translation models. Instead, we suggest that further exploration of task-specific sublayer reorderings is needed in order to unlock additional gains.
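The sandwich ordering itself is simple to write down. Below is a minimal Python sketch (not the authors' code; the function name and the "s"/"f" string encoding are illustrative assumptions, see the official repo linked above for the real implementation): with sandwich coefficient k, a model with 2n sublayers becomes k self-attention sublayers at the bottom, then n - k interleaved attention/feedforward pairs, then k feedforward sublayers at the top.

def sandwich_pattern(n: int, k: int) -> str:
    """Sublayer order s^k (sf)^(n-k) f^k.

    's' marks a self-attention sublayer, 'f' a feedforward sublayer.
    k = 0 recovers the standard interleaved transformer, (sf)^n.
    Every k yields the same sublayer count (2n), so parameter count
    and memory footprint are unchanged, matching the abstract's claim.
    """
    assert 0 <= k <= n, "sandwich coefficient must satisfy 0 <= k <= n"
    return "s" * k + "sf" * (n - k) + "f" * k

# Interleaved baseline vs. a sandwich ordering for a 16-block stack:
print(sandwich_pattern(16, 0))  # sfsfsf...sf (baseline)
print(sandwich_pattern(16, 6))  # ssssss sfsf...sf ffffff (k = 6, as discussed in the video)

The attention-heavy bottom and feedforward-heavy top are exactly the trend the paper observed in the successful randomly ordered models.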