How to Teach Computers to Understand Videos and Text without Labeled Data - VideoCLIP
Deep Learning Explainer
A groundbreaking way to do self-supervision on videos and text. I would say it's the BERT moment for video-text understanding.
#videoclip #contrastivelearning #videotransformer
0:00 - Intro
3:31 - Retrieval augmented training
5:07 - Video and text encoding
8:48 - Contrastive loss
12:09 - Zero-shot transfer to end tasks
14:05 - Experiment results
18:09 - What did we learn
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
https://arxiv.org/abs/2109.14084
Connect
Twitter: https://twitter.com/home
LinkedIn: https://www.linkedin.com/in/xue-yong-fu-955723a6/
Email: edwindeeplearning@gmail.com
Abstract
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. ...
https://www.youtube.com/watch?v=vqMZjsIKUoQ
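To make the abstract's objective concrete, here is a minimal sketch (not the authors' code) of a symmetric video-text contrastive loss: clips that overlap in time form positive pairs, while the other clips in the batch, which in VideoCLIP include retrieved hard negatives, act as negatives. The function name and temperature value below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) embeddings of temporally
    overlapping video/text clips; row i of each forms a positive pair."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity of every video clip with every text clip in the batch.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: video-to-text and text-to-video directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```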