Vokenization: Improving Language Understanding with Visual-Grounded Supervision (Paper Explained)
Deep Learning Explainer
It's a super cool paper that invents "vokenization" to generate large visually-grounded language datasets and to train visually-supervised language models on them.
Most language models are trained on pure text data. Although this has achieved significant success in recent years, it is not how humans acquire language. That raises an interesting question: "Can language models reach a high level of language understanding by reading text alone?" The answer is probably "no".
To push the boundary of language models, adding other learning signals to the training process is key, and the first that comes to mind is vision (visual cues). However, existing visually-grounded datasets are orders of magnitude smaller than pure-text corpora. This paper proposes the "vokenization" method to overcome this problem and uses the newly generated data to train visually-supervised language models.
More importantly, the visually-supervised models show significant improvements over text-only models.
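To make the token-image matching idea concrete, here is a minimal sketch (my own illustration, not the paper's released code) of how each token in a sentence could be assigned a "voken": contextual token embeddings are scored against a pool of image embeddings, and the highest-scoring image is retrieved for each token. The function and variable names (vokenize, image_bank) and the random features standing in for real encoders are assumptions for illustration.

```python
# Minimal sketch of the vokenization idea (illustrative, not the authors' code):
# score every token against a bank of image features and take the best match
# ("voken") per token.
import torch

def vokenize(token_embeddings: torch.Tensor, image_bank: torch.Tensor) -> torch.Tensor:
    """token_embeddings: (seq_len, dim) contextual token features.
    image_bank: (num_images, dim) image features in the same space.
    Returns the index of the nearest image (the voken) for each token."""
    # Normalize so the dot product acts as a cosine-similarity relevance score.
    tok = torch.nn.functional.normalize(token_embeddings, dim=-1)
    img = torch.nn.functional.normalize(image_bank, dim=-1)
    scores = tok @ img.T            # (seq_len, num_images) relevance scores
    return scores.argmax(dim=-1)    # one voken id per token

# Toy usage with random features standing in for real text/image encoders.
vokens = vokenize(torch.randn(8, 64), torch.randn(1000, 64))
print(vokens.shape)  # torch.Size([8])
```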
0:00 - How did you learn your first language
1:00 - What's special about this paper
2:56 - How humans learn a language
5:23 - Visual pointing
6:07 - Challenges of visually-grounded supervision
9:58 - Token-image matching
11:53 - Vokenization
18:40 - Vokenizer training
23:39 - Visually-supervised language models
25:48 - Voken classification task
27:24 - Loss function
28:37 - Implications of voken classification
31:56 - Fine-tuning results
35:22 - Conventional visually-grounded corpora are very different
37:51 - Sentence-level vs. token-level
41:45 - Summary
Paper https://arxiv.org/abs/2010.06775
Code https://github.com/airsplay/vokenization
Abstract Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG.
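For a concrete picture of the voken classification objective mentioned in the abstract, here is a rough sketch (my own, not the released implementation) of a token-level classification head over a fixed voken vocabulary, trained jointly with masked language modeling via cross-entropy. The class names, the ignore-index convention, and the weighting factor alpha are illustrative assumptions.

```python
# Sketch of visually-supervised pre-training: masked language modeling plus a
# token-level voken classification loss (illustrative, not the official code).
import torch
import torch.nn as nn

class VokenClassificationHead(nn.Module):
    def __init__(self, hidden_dim: int, num_vokens: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_vokens)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the language model.
        return self.proj(hidden_states)  # (batch, seq_len, num_vokens)

def joint_loss(mlm_logits, mlm_labels, voken_logits, voken_labels, alpha=1.0):
    """Cross-entropy MLM loss plus a weighted voken-classification loss.
    Positions labeled -100 are ignored (e.g. unmasked tokens or tokens
    without an assigned voken)."""
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    mlm = ce(mlm_logits.flatten(0, 1), mlm_labels.flatten())
    vok = ce(voken_logits.flatten(0, 1), voken_labels.flatten())
    return mlm + alpha * vok
```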