An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)
Deep Learning Explainer
This paper applies a pure transformer-based model (Vision Transformer) to sequences of image patches for image recognition. It shows that when the Transformer is pre-trained on a large dataset, it starts outperforming CNN-based models and achieves state-of-the-art results on multiple image classification benchmarks. More importantly, the Vision Transformer takes much less time to pre-train than comparable models.
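As a quick illustration of the "sequence of image patches" idea (covered in the patch embedding, [class] token, and positional embedding chapters below), here is a minimal, hypothetical PyTorch sketch of the ViT input pipeline. Module names, hyperparameters, and dimensions are illustrative and not taken from the paper's official code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative sketch: flatten 16x16 patches, project them, prepend a
    [class] token, and add learnable positional embeddings."""
    def __init__(self, image_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2   # e.g. 14*14 = 196 patches
        patch_dim = channels * patch_size * patch_size  # e.g. 3*16*16 = 768 values per patch
        self.patch_size = patch_size
        # Linear projection of flattened patches ("an image is worth 16x16 words")
        self.proj = nn.Linear(patch_dim, dim)
        # Learnable [class] token, prepended to the patch sequence
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable 1D positional embeddings (one per patch, plus one for [class])
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, img):                                   # img: (B, C, H, W)
        b, c, _, _ = img.shape
        p = self.patch_size
        # Split the image into non-overlapping p x p patches and flatten each one
        patches = img.unfold(2, p, p).unfold(3, p, p)         # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.proj(patches)                           # (B, N, dim)
        cls = self.cls_token.expand(b, -1, -1)                # (B, 1, dim)
        tokens = torch.cat([cls, tokens], dim=1)              # prepend [class] token
        return tokens + self.pos_embed                        # add position information

embed = PatchEmbedding()
x = embed(torch.randn(2, 3, 224, 224))  # (2, 197, 768): 196 patch tokens + 1 [class] token
```

The resulting token sequence is then fed to a standard Transformer encoder, and the final [class] token representation is used for classification.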
0:00 - How many words is an image worth
2:17 - What's special about this paper
4:32 - Self-attention to images
7:05 - How it works
8:06 - Vision Transformer (ViT)
10:30 - Patch embedding
15:50 - [class] token
17:04 - Positional embedding
23:10 - Different ways to embed position info
25:49 - Model architecture
28:35 - Hybrid architecture
30:28 - Pre-training & fine-tuning
31:37 - Fine-tuning on higher resolution images
34:20 - Datasets
34:27 - Model variants
34:55 - Comparison to state-of-the-art
39:05 - Model size vs. data size
43:36 - Scaling study
46:32 - Attention heads
47:47 - Attention distance over layers
48:11 - Attention pattern analysis
51:14 - Self-supervised pre-training
54:16 - Summary
Related videos: Transformer Architecture Explained https://youtu.be/ELTGIye424E
Quantifying Attention Flow In Transformers https://youtu.be/3Q0ZXqVaQPo
Paper https://openreview.net/pdf?id=YicbFdNTTy
Code https://github.com/lucidrains/vit-pytorch
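The linked repository can be used roughly as shown below. This is a sketch following its documented usage; the constructor arguments and values are illustrative, so check the repository README for the current API.

```python
import torch
from vit_pytorch import ViT  # from the repository linked above

# Illustrative hyperparameters; see the repository README for the full argument list.
model = ViT(
    image_size=256,
    patch_size=32,
    num_classes=1000,
    dim=1024,
    depth=6,
    heads=16,
    mlp_dim=2048,
)

img = torch.randn(1, 3, 256, 256)  # a single random RGB image
logits = model(img)                # (1, 1000) class logits
```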
Abstract: While the Transformer architecture has become the de facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. ...
https://www.youtube.com/watch?v=Gl48KciWZp0