AI Supremacy


What is Microsoft & Nvidia's Megatron-Turing?

"Transformers - More than meets the eye."

Michael Spencer
Dec 24, 2021

Everyone seems to want to compare themselves to OpenAI's GPT-3: China's Wu Dao 2.0, DeepMind's models, Microsoft and Nvidia's Megatron-Turing. We're bigger; no, we're bigger. Who cares! What's important is why GPT-3-like technologies matter.

It has thus been suggested that some version of GPT-4 could be out early next year or in 2023. That could be a game-changer. Or would it be?

In October 2021, Microsoft claimed that the DeepSpeed- and Megatron-powered Megatron-Turing Natural Language Generation model (MT-NLG) was the largest and most powerful monolithic transformer language model trained to date, with 530 billion parameters.

Megatron is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA, based on work by Google. In June 2021, the Chinese government-backed Beijing Academy of Artificial Intelligence (BAAI) introduced Wu Dao 2.0, the largest language model to date, with 1.75 trillion parameters.

This technology will change the internet. Transformers have become one of the most popular approaches in deep learning, especially large-scale transformer models like GPT-2, GPT-3, BERT, Turing NLG, Megatron-LM, XLNet, and RoBERTa. These models have the potential to find real-world applications such as machine translation, time series prediction, and video understanding, among others.

With Megatron-Turing, Microsoft may have wanted some of that limelight.

An amalgamation of the DeepSpeed and Megatron training stacks, MT-NLG has roughly three times as many parameters as the largest comparable existing model, GPT-3 (175 billion parameters), and dwarfs earlier efforts such as Turing NLG (17 billion parameters) and Megatron-LM (8 billion parameters).
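As a sanity check on these headline numbers, a dense transformer's parameter count can be approximated from its depth and hidden size with the common ~12·layers·hidden² rule of thumb (attention plus feed-forward weights, ignoring embeddings). A minimal sketch, using the published configurations (MT-NLG: 105 layers, hidden size 20,480; GPT-3: 96 layers, hidden size 12,288):

```python
# Rough parameter count for a dense transformer: ~12 * layers * hidden^2
# (4*h^2 for the attention projections + 8*h^2 for the feed-forward block),
# ignoring embedding and layer-norm parameters.

def approx_params(layers: int, hidden: int) -> int:
    return 12 * layers * hidden ** 2

# MT-NLG's published configuration: 105 layers, hidden size 20,480.
print(f"{approx_params(105, 20_480) / 1e9:.0f}B")  # ≈ 528B, close to the quoted 530B

# GPT-3 for comparison: 96 layers, hidden size 12,288.
print(f"{approx_params(96, 12_288) / 1e9:.0f}B")   # ≈ 174B vs. the quoted 175B
```

The small gap to the quoted figures is the embedding table and other terms the rule of thumb ignores.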

The 105-layer, transformer-based MT-NLG improved on prior state-of-the-art models in zero-, one-, and few-shot settings and set a new standard for large-scale language models in both model scale and quality. Microsoft claimed the technology could be used for:

  • Completion prediction

  • Reading comprehension

  • Commonsense reasoning

  • Natural language inference

  • Word sense disambiguation
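Zero-, one-, and few-shot here refer to how many worked examples are placed in the prompt before the model is asked to complete a new one; no weights are updated. A minimal sketch of how such prompts are typically assembled (the task and examples are illustrative, not taken from the MT-NLG work):

```python
# Build a k-shot prompt for an autoregressive LM: k solved examples,
# then the query left open for the model to complete.

def make_prompt(task: str, examples: list, query: str, k: int) -> str:
    lines = [task]
    for text, label in examples[:k]:          # k = 0 gives a zero-shot prompt
        lines.append(f"Input: {text}\nOutput: {label}")
    lines.append(f"Input: {query}\nOutput:")  # the model continues from here
    return "\n\n".join(lines)

examples = [("The movie was wonderful.", "positive"),
            ("I want my money back.", "negative")]
prompt = make_prompt("Classify the sentiment.", examples, "A total bore.", k=2)
print(prompt)
```

The same scaffolding covers the tasks listed above; only the instruction and the example pairs change.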

In contrast, other large-scale language models, including BAAI's Wu Dao 2.0 (1.75 trillion parameters) and Google's Switch Transformer (1.6 trillion parameters), surpass MT-NLG in raw parameter count, though both are sparse mixture-of-experts models rather than monolithic transformers.

GPT-3-like technologies are basically AI that is better at creating content with a language structure, human or machine language, than anything that has come before. How developers, startups, and existing platforms will use this tech is still somewhat speculative.

Transformer-based language models in natural language processing (NLP) have driven rapid progress in recent years, fueled by computation at scale, large datasets, and advanced algorithms and software for training these models. Yet the deployment of GPT-3 at scale could warp the internet.

OpenAI claims GPT-3 will lead to a new generation of apps.

The model, in this case, is a neural network program based on the "Transformer" approach that has become widely popular in deep learning. Megatron-Turing can produce realistic-seeming text and also performs well on various language tests such as sentence completion.

However, are we reaching the limits of compute? The large number of compute operations required can result in unrealistically long training times unless special attention is paid to optimizing the algorithms, software, and hardware stack together. This creates interesting bottlenecks, and fuels this race to be the biggest.
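The scale of that bottleneck is easy to estimate with the standard ~6·N·D rule of thumb for training FLOPs (N parameters, D training tokens). A back-of-the-envelope sketch using MT-NLG's reported 530 billion parameters and 270 billion training tokens; the cluster size and per-GPU throughput are illustrative assumptions, not reported figures:

```python
# Back-of-the-envelope training cost: ~6 FLOPs per parameter per token.
N = 530e9          # parameters (reported for MT-NLG)
D = 270e9          # training tokens (reported for MT-NLG)
flops = 6 * N * D  # ≈ 8.6e23 FLOPs

gpus = 2000        # assumed cluster size (illustrative)
per_gpu = 100e12   # assumed sustained FLOP/s per GPU (illustrative)
seconds = flops / (gpus * per_gpu)
print(f"total: {flops:.1e} FLOPs, ~{seconds / 86400:.0f} days on this cluster")
```

Even with thousands of GPUs the wall-clock time is measured in weeks, which is why the software stack matters as much as the hardware.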

So, will NVIDIA and Microsoft 'train to convergence' an actual one-trillion-parameter model? Does it even matter? OpenAI said in March 2021 that over 300 apps were using GPT-3 across varying categories and industries, from productivity and education to creativity and games. Nearly a year later, I'm not sure that sounds like a revolution.

Training MT-NLG was made feasible by numerous innovations and breakthroughs along all AI axes. For example, working closely together, NVIDIA and Microsoft achieved an unprecedented training efficiency by converging a state-of-the-art GPU-accelerated training infrastructure with a cutting-edge distributed learning software stack. It’s collaboration like this that is perhaps even more interesting for the future scalability of AI.
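Part of that software stack is Megatron-style tensor (model) parallelism, in which a single layer's weight matrix is split across GPUs, each device computes its slice, and the results are stitched back together. A minimal NumPy sketch of the column-parallel case with two simulated "devices" (real implementations use NCCL collectives across GPUs rather than an in-process concatenate):

```python
import numpy as np

# Column-parallel linear layer: the weight matrix W is split column-wise
# across devices; each device multiplies the same input by its shard.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))       # a batch of activations
W = rng.standard_normal((8, 6))       # the full weight matrix

shards = np.split(W, 2, axis=1)       # each "device" holds half the columns
partials = [x @ w for w in shards]    # each device computes its output slice
y = np.concatenate(partials, axis=1)  # an all-gather stitches slices back

assert np.allclose(y, x @ W)          # identical to the unsharded computation
```

The design choice is that no single device ever needs to hold the full weight matrix, which is what makes a 530-billion-parameter model fit on hardware at all.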

It's getting harder for the layperson to keep up with the transformer hype. Recent work on language models (LMs) has demonstrated that a strong pretrained model can often perform competitively on a wide range of NLP tasks without fine-tuning. But while giant language models are advancing the state of the art in language generation, they also suffer from issues such as bias and toxicity.

Microsoft and Nvidia have shown that they can deploy such a large model across parallelized infrastructure.

  • We live in a time where AI advancements are far outpacing Moore’s law.

  • Newer generations of GPUs, interconnected at lightning speed, are combining with the hyperscaling of AI models, leading to better performance with seemingly no end in sight.

For this reason, GPT-4 and projects like Megatron-Turing are taking AI somewhere new in the 2020s.

Currently, MT-NLG is not a commercial product. It is a research project between NVIDIA and Microsoft.

© 2022 Michael Spencer