OpenAI's DALL·E: Creating Images from Text - Explained
Deep Learning Explainer
Overview
Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another. A token is any symbol from a discrete vocabulary; for humans, each English letter is a token from a 26-letter alphabet. DALL·E's vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192.
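To make the token budget concrete, here is a minimal sketch of how a caption and an image could be packed into one 1280-token stream. The sizes come from the overview above; everything else is an assumption for illustration: the function name `pack_stream`, padding the caption to a fixed 256 slots, the placeholder `pad_id`, and shifting image token ids past the text vocabulary so the two ranges do not collide. The actual model may lay out its vocabulary differently.

```python
TEXT_VOCAB = 16384       # BPE vocabulary size for captions (from the overview)
IMAGE_VOCAB = 8192       # image-token vocabulary size (from the overview)
MAX_TEXT_TOKENS = 256    # maximum caption length in tokens
IMAGE_TOKENS = 1024      # 32 x 32 grid of image tokens


def pack_stream(text_ids, image_ids, pad_id=0):
    """Concatenate caption and image tokens into a single sequence.

    Hypothetical layout: the caption is padded to MAX_TEXT_TOKENS with
    pad_id (a placeholder, not the model's real padding scheme), and
    image ids are offset by TEXT_VOCAB so both share one id space.
    """
    if len(text_ids) > MAX_TEXT_TOKENS:
        raise ValueError("caption exceeds 256 tokens")
    if len(image_ids) != IMAGE_TOKENS:
        raise ValueError("image must be exactly 1024 tokens")
    if any(not 0 <= t < TEXT_VOCAB for t in text_ids):
        raise ValueError("text token id out of range")
    if any(not 0 <= i < IMAGE_VOCAB for i in image_ids):
        raise ValueError("image token id out of range")

    padded_text = list(text_ids) + [pad_id] * (MAX_TEXT_TOKENS - len(text_ids))
    shifted_image = [TEXT_VOCAB + i for i in image_ids]
    return padded_text + shifted_image


# Usage with made-up token ids: a 3-token caption and a dummy image.
stream = pack_stream([17, 4242, 9], [i % IMAGE_VOCAB for i in range(IMAGE_TOKENS)])
print(len(stream))  # 256 + 1024 tokens
```

Under this assumed layout the transformer sees one sequence of ids drawn from a combined space of 16384 + 8192 = 24576 symbols, and maximum-likelihood training predicts each id from the ones before it.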
The images are preprocessed to 256x256 resolution during training. Similar to VQVAE [14, 15], each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE [10, 11] that we pretrained using a continuous relaxation [12, 13]. We found that training using the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead code revival, and can scale up to large vocabulary sizes.
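The numbers in this paragraph can be checked with a little arithmetic, and the idea of a continuous relaxation can be sketched in a few lines. The block below is only an illustration: it assumes a Gumbel-softmax style relaxation, which is one common way to relax a categorical choice, and the function names, the temperature value, and the toy 8-way vocabulary are made up for the example. It is not the discrete VAE's actual training code.

```python
import math
import random

IMAGE_SIZE = 256    # pixels per side after preprocessing
GRID_SIZE = 32      # latent codes per side
IMAGE_VOCAB = 8192  # number of possible codes per grid cell


def grid_summary():
    """Return (pixels per side covered by one code, codes per image, bits per code)."""
    patch = IMAGE_SIZE // GRID_SIZE
    n_tokens = GRID_SIZE * GRID_SIZE
    bits_per_token = math.log2(IMAGE_VOCAB)
    return patch, n_tokens, bits_per_token


def gumbel_softmax(logits, temperature, rng):
    """Soft, differentiable-style sample over codes (assumed relaxation).

    Adds Gumbel(0, 1) noise to each logit, then applies a softmax with
    the given temperature. Lower temperatures push the result toward a
    one-hot vector, i.e. toward a hard discrete code.
    """
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


print(grid_summary())  # each code covers an 8x8 pixel patch; 1024 codes of 13 bits
rng = random.Random(0)
probs = gumbel_softmax([0.1 * k for k in range(8)], temperature=0.5, rng=rng)
print(max(range(8), key=probs.__getitem__))  # index of the most likely toy code
```

Because the encoder's output stays a smooth probability vector during training, gradients can flow through the code selection, which is consistent with the claim above that no explicit codebook update is needed.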
DALL·E https://openai.com/blog/dall-e/#rf15
Generating Diverse High-Fidelity Images with VQ-VAE-2 https://arxiv.org/abs/1906.00446 ... https://www.youtube.com/watch?v=UfAE-1vdj_E