Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning (Paper Explained)
Deep Learning Explainer
This paper proposes a zero-shot method that achieves strong results on commonsense reasoning tasks without any fine-tuning. With this novel scoring mechanism, RoBERTa-large (355M parameters) performs surprisingly well in a zero-shot setup.
0:00 - Intro
3:08 - Commonsense reasoning
5:15 - Proposed method
8:29 - Sequence scoring method
12:59 - SSM-based fine-tuning
15:28 - Task probing
19:23 - Experiment results
24:58 - Future work
25:31 - Takeaways
Paper: https://arxiv.org/abs/2004.14074
Abstract: Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks. Most of the existing approaches rely on a randomly initialized classifier on top of such networks. We argue that this fine-tuning procedure is sub-optimal as the pre-trained model has no prior on the specific classifier labels, while it might have already learned an intrinsic textual representation of the task. In this paper, we introduce a new scoring method that casts a plausibility ranking task in a full-text format and leverages the masked language modeling head tuned during the pre-training phase. We study commonsense reasoning tasks where the model must rank a set of hypotheses given a premise, focusing on the COPA, Swag, HellaSwag and CommonsenseQA datasets. By exploiting our scoring method without fine-tuning, we are able to produce strong baselines (e.g. 80% test accuracy on COPA) that are comparable to supervised approaches. Moreover, when fine-tuning directly on the proposed scoring function, we show that our method provides a much more stable training phase across random restarts (e.g. 10× standard deviation reduction on COPA test accuracy) and requires less annotated data than the standard classifier approach to reach equivalent performance.
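The scoring idea in the abstract can be illustrated with a short masked-LM pseudo-log-likelihood sketch: each hypothesis token is masked in turn, its log-probability given the premise and the rest of the hypothesis is summed, and the highest-scoring hypothesis is chosen. This is a minimal sketch assuming the HuggingFace transformers library; the concatenation format, example sentences, and function names are illustrative, not the paper's released code.

```python
# Minimal sketch of masked-LM sequence scoring for zero-shot plausibility ranking.
# Assumes HuggingFace transformers; details are illustrative, not the paper's code.
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForMaskedLM.from_pretrained("roberta-large")
model.eval()

def sequence_score(premise: str, hypothesis: str) -> float:
    """Sum log P(token | context) over hypothesis tokens, masking one token at a time."""
    premise_ids = tokenizer.encode(premise, add_special_tokens=False)
    hypo_ids = tokenizer.encode(" " + hypothesis, add_special_tokens=False)
    # Layout: <s> premise hypothesis </s>
    input_ids = [tokenizer.cls_token_id] + premise_ids + hypo_ids + [tokenizer.sep_token_id]
    hypo_start = 1 + len(premise_ids)

    total = 0.0
    with torch.no_grad():
        for i, token_id in enumerate(hypo_ids):
            masked = list(input_ids)
            masked[hypo_start + i] = tokenizer.mask_token_id  # mask the i-th hypothesis token
            logits = model(torch.tensor([masked])).logits
            log_probs = torch.log_softmax(logits[0, hypo_start + i], dim=-1)
            total += log_probs[token_id].item()  # log-probability of the true token
    return total

# Zero-shot ranking: pick the hypothesis the masked LM finds most plausible.
premise = "The man broke his toe."
hypotheses = ["He got a hole in his sock.", "He dropped a hammer on his foot."]
best = max(hypotheses, key=lambda h: sequence_score(premise, h))
print(best)
```

Because the ranking uses only the pre-trained masked language modeling head, no randomly initialized classifier is added, which is what allows the zero-shot baselines reported in the abstract.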
Deep Learning Explainer Twitter: https://twitter.com/DeepExplainer ... https://www.youtube.com/watch?v=Ijrdm0Nb_k0