Autoregressive Diffusion Models (Machine Learning Research Paper Explained)
Yannic Kilcher
#machinelearning #ardm #generativemodels
Diffusion models have made large advances in recent months as a new type of generative models. This paper introduces Autoregressive Diffusion Models (ARDMs), which are a mix between autoregressive generative models and diffusion models. ARDMs are trained to be agnostic to the order of autoregressive decoding and give the user a dynamic tradeoff between speed and performance at decoding time. This paper applies ARDMs to both text and image data, and as an extension, the models can also be used to perform lossless compression.
OUTLINE: 0:00 - Intro & Overview 3:15 - Decoding Order in Autoregressive Models 6:15 - Autoregressive Diffusion Models 8:35 - Dependent and Independent Sampling 14:25 - Application to Character-Level Language Models 18:15 - How Sampling & Training Works 26:05 - Extension 1: Parallel Sampling 29:20 - Extension 2: Depth Upscaling 33:10 - Conclusion & Comments
Paper: https://arxiv.org/abs/2110.02037
Abstract: We introduce Autoregressive Diffusion Models (ARDMs), a model class encompassing and generalizing order-agnostic autoregressive models (Uria et al., 2014) and absorbing discrete diffusion (Austin et al., 2021), which we show are special cases of ARDMs under mild assumptions. ARDMs are simple to implement and easy to train. Unlike standard ARMs, they do not require causal masking of model representations, and can be trained using an efficient objective similar to modern probabilistic diffusion models that scales favourably to highly-dimensional data. At test time, ARDMs support parallel generation which can be adapted to fit any given generation budget. We find that ARDMs require significantly fewer steps than discrete diffusion models to attain the same performance. Finally, we apply ARDMs to lossless compression, and show that they are uniquely suited to this task. Contrary to existing approaches based on bits-back coding, ARDMs obtain compelling results not only on complete datasets, but also on compressing single data points. Moreover, this can be done using a modest number of network calls for (de)compression due to the model's adaptable parallel generation.
Authors: Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans
Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yannic-kilcher Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/ykilcher BiliBili: https://space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content :)
292862999 Bytes