Rethinking Attention with Performers (Paper Explained)
Yannic Kilcher
#ai #research #attention
Transformers have huge memory and compute requirements because they construct an attention matrix, which grows quadratically with the input length. The Performer is a model that uses random positive orthogonal features to construct an unbiased estimator of the attention matrix and obtains an arbitrarily good approximation in linear time! The method generalizes beyond attention and opens the door to the next generation of deep learning architectures.
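To make the positive-random-feature idea concrete, here is a minimal numpy sketch of how such features can approximate the (unnormalized) softmax kernel exp(q·k). All names are mine, and I use plain i.i.d. Gaussian projections; the paper additionally orthogonalizes them to reduce variance:

```python
import numpy as np

def positive_random_features(x, omega):
    # x: (n, d) queries or keys; omega: (d, m) Gaussian projections.
    # phi(x) = exp(omega^T x - ||x||^2 / 2) / sqrt(m), so that
    # E[phi(q) . phi(k)] = exp(q . k), the unnormalized softmax kernel.
    m = omega.shape[1]
    projection = x @ omega                           # (n, m)
    norm = np.sum(x ** 2, axis=-1, keepdims=True) / 2.0
    return np.exp(projection - norm) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m, n = 16, 256, 8
omega = rng.standard_normal((d, m))                  # i.i.d. features (orthogonalization omitted)
q = rng.standard_normal((n, d)) / d ** 0.25          # 1/d^(1/4) scaling absorbs the 1/sqrt(d)
k = rng.standard_normal((n, d)) / d ** 0.25          # temperature of standard softmax attention

exact = np.exp(q @ k.T)                              # true kernel values
approx = positive_random_features(q, omega) @ positive_random_features(k, omega).T
print(np.max(np.abs(exact - approx)))                # shrinks as m grows
```

Because every feature is a positive exponential, the approximation stays positive everywhere, which is exactly the property the video's "Better Approximation via Positive Features" segment emphasizes.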
OUTLINE:
0:00 - Intro & Outline
6:15 - Quadratic Bottleneck in Attention Mechanisms
10:00 - Decomposing the Attention Matrix
15:30 - Approximating the Softmax Kernel
24:45 - Different Choices, Different Kernels
28:00 - Why the Naive Approach does not work!
31:30 - Better Approximation via Positive Features
36:55 - Positive Features are Infinitely Better
40:10 - Orthogonal Features are Even Better
43:25 - Experiments
49:20 - Broader Impact Statement
50:00 - Causal Attention via Prefix Sums
52:10 - Code
53:50 - Final Remarks & Conclusion
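As a rough illustration of the "Decomposing the Attention Matrix" step from the outline above: once attention is written through feature maps as phi(Q) phi(K)^T, associativity lets us multiply phi(K)^T into V first and never materialize the n-by-n matrix. A hypothetical sketch (function and variable names are mine):

```python
import numpy as np

def linear_attention(phi_q, phi_k, v):
    # Bidirectional attention from feature maps, without the (n, n) matrix:
    # phi_q @ (phi_k^T @ v) costs O(n * m * d) instead of O(n^2 * d).
    kv = phi_k.T @ v                          # (m, d_v) summary of keys and values
    normalizer = phi_q @ phi_k.sum(axis=0)    # row sums of the implicit attention matrix
    return (phi_q @ kv) / normalizer[:, None]

# Toy check: any positive features keep the normalizer well-defined.
rng = np.random.default_rng(0)
n, m, d_v = 8, 32, 4
phi_q = rng.random((n, m)) + 1e-6             # stand-ins for phi(Q) and phi(K)
phi_k = rng.random((n, m)) + 1e-6
v = rng.standard_normal((n, d_v))
print(linear_attention(phi_q, phi_k, v).shape)  # (8, 4)
```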
Paper: https://arxiv.org/abs/2009.14794
Code: https://github.com/google-research/google-research/tree/master/performer
Blog: https://ai.googleblog.com/2020/10/rethinking-attention-with-performers.html
Kernels on ML Street Talk: https://www.youtube.com/watch?v=y_RjsDHl5Y4
My Video on Linformer: https://www.youtube.com/watch?v=-_2AF9Lhweo
My Video on Reformer: https://www.youtube.com/watch?v=i4H0kjxrias
My Video on Attention: https://www.youtube.com/watch?v=iDulhoQ2pro
Abstract: We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
https://www.youtube.com/watch?v=xJrKIPwVwGM
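For the unidirectional (causal) case covered at 50:00 in the outline, the same factorization works with prefix sums in place of a mask: position i only ever sees running totals over positions up to i. A simple loop-based sketch under the same assumed feature maps as above (the actual implementation in the linked repository is vectorized and more efficient):

```python
import numpy as np

def causal_linear_attention(phi_q, phi_k, v):
    # Causal attention via prefix sums: maintain running totals of
    # phi(k_j) v_j^T and phi(k_j), and read them out at each position i,
    # so position i attends only to j <= i without an explicit mask.
    n, m = phi_q.shape
    d_v = v.shape[1]
    kv_sum = np.zeros((m, d_v))   # prefix sum of outer products phi(k_j) v_j^T
    k_sum = np.zeros(m)           # prefix sum of phi(k_j), for normalization
    out = np.zeros((n, d_v))
    for i in range(n):
        kv_sum += np.outer(phi_k[i], v[i])
        k_sum += phi_k[i]
        out[i] = (phi_q[i] @ kv_sum) / (phi_q[i] @ k_sum)
    return out
```

Each step does O(m * d_v) work, so the whole causal pass stays linear in sequence length.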