Turing-NLG, DeepSpeed and the ZeRO optimizer
Yannic Kilcher
Microsoft has trained a 17-billion-parameter language model that achieves state-of-the-art perplexity. This video takes a look at the ZeRO optimizer that enabled this breakthrough. ZeRO lets you combine model- and data-parallelism without a large drop in training speed by partitioning optimizer states, gradients, and parameters across data-parallel workers instead of replicating them on every device.
https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
https://github.com/microsoft/DeepSpeed
https://arxiv.org/abs/1910.02054
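For reference, here is a minimal sketch (not shown in the video) of how ZeRO can be switched on through a DeepSpeed config. The toy model, batch size, learning rate, and ZeRO stage are illustrative placeholders, the exact deepspeed.initialize signature may differ slightly between DeepSpeed versions, and a script like this is normally started with the deepspeed launcher so the distributed environment is set up.

```python
# Sketch: enabling ZeRO via DeepSpeed's JSON-style config (values are placeholders).
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a large Transformer

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    # "stage" controls how much state is partitioned across data-parallel ranks:
    # 1 = optimizer states, 2 = + gradients, 3 = + parameters.
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model and builds the partitioned optimizer;
# the training loop then calls engine.backward(loss) and engine.step()
# instead of the usual PyTorch loss.backward() / optimizer.step().
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```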
Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
...
https://www.youtube.com/watch?v=tC01FRB0M7w