# What is Microsoft Research's µ-Parametrization?

### µTransfer: A technique for hyperparameter tuning of enormous neural networks

Microsoft Research has done it again! This is pretty exciting for OpenAI. There’s some suggestion that Microsoft and OpenAI may have solved a fundamental AI bottleneck. So let’s get into it.

The research was published on Tuesday, March 8th, 2022, on the Microsoft Research blog.

The lead on the study is Edward Hu, a PhD student at Mila, where he studies deep learning under the supervision of Yoshua Bengio. Mila is the Quebec A.I. Institute in Montreal, Canada.

He’s interested in building useful AI systems. When it comes to building large-scale AI systems, fundamental research provides the theoretical insights that drastically reduce the amount of trial and error necessary, which can prove very cost-effective.

In this research, Microsoft Research claims, for the first time, to be able to tune enormous neural networks that are too expensive to train more than once. They achieved this by showing that a particular parameterization preserves optimal hyperparameters across different model sizes.

### What is µ-Parametrization (“myu-P”)?

This is the µ-Parametrization (or *µP*, pronounced “myu-P”) that the team actually introduced in a previous paper, where they showed that it uniquely enables maximal feature learning in the infinite-width limit. In collaboration with researchers at OpenAI, they verified its practical advantage on a range of realistic scenarios, which they describe in their new paper, “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer.”

In this sense Microsoft Research has done their collab partners at OpenAI a huge favor. µ-Parametrization could be the key to tuning hyperparameters for massive AI models!

By greatly reducing the need to guess which training hyperparameters to use, this technique can accelerate research on enormous neural networks, such as GPT-3 and potentially larger successors in the future. They also released a PyTorch package that facilitates the integration of the technique into existing models, available on the project GitHub page or by simply running `pip install mup`.

Greg Yang, a senior researcher at Microsoft Research, has also been central to this work.

Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters.

They show that, in the recently discovered Maximal Update Parametrization (μP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm they call *μTransfer*: parametrize the target model in μP, tune the HPs indirectly on a smaller model, and *zero-shot transfer* them to the full-sized model, i.e., without directly tuning the latter at all.

They verify μTransfer on Transformer and ResNet. For example: 1) by transferring pretraining HPs from a model of 13M parameters, they outperform published numbers for BERT-large (350M parameters), with a total tuning cost equivalent to pretraining BERT-large once; 2) by transferring from 40M parameters, they outperform published numbers for the 6.7B GPT-3 model, with a tuning cost of only 7% of the total pretraining cost. A PyTorch implementation of their technique can be found at github.com/microsoft/mup and installed via `pip install mup`.
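To make the µTransfer recipe concrete, here is a minimal, purely illustrative sketch of the workflow in plain Python. The `train_and_eval` function is a hypothetical stand-in for real training, not anything from the paper or the mup package; its loss surface is constructed so that the optimal learning rate does not depend on width, which is exactly the property µP is designed to guarantee.

```python
import math

def train_and_eval(width, lr):
    """Hypothetical stand-in for 'train a muP-parametrized model of this
    width with this learning rate and return its validation loss'.
    Its optimum (lr = 2**-7) is width-independent, mimicking the
    stability of optimal hyperparameters under muP."""
    return (math.log2(lr) + 7.0) ** 2 + 1.0 / width

# Step 1: tune the learning rate on a cheap, narrow proxy model.
proxy_width = 256
lr_grid = [2 ** -e for e in range(1, 14)]
best_lr = min(lr_grid, key=lambda lr: train_and_eval(proxy_width, lr))

# Step 2: zero-shot transfer the tuned LR to the expensive, wide target,
# without ever tuning the target directly.
target_width = 8192
target_loss = train_and_eval(target_width, best_lr)
print(best_lr, target_loss)
```

In a real application, `train_and_eval` would be an actual training run of a µP-parametrized proxy model, and the transferred hyperparameters could include the learning rate, initialization scale, and other optima that µP keeps stable across width.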

### Compute Optimization for Large Scale AI Models

Essentially, the blog post published by Microsoft Research describes a technique called µ-Parametrization (or µP), which builds on the discovery of similarities between the behaviour of small- and large-scale AI models to minimize the compute resources required for hyperparameter optimization.

“µP provides an impressive step toward removing some of the black magic from scaling up neural networks. It also provides a theoretically backed explanation of some tricks used by past work, like the T5 model. I believe both practitioners and researchers alike will find this work valuable.”

— Colin Raffel, Assistant Professor of Computer Science, University of North Carolina at Chapel Hill and co-creator of T5

## Scaling the initialization is easy, but scaling training is hard

Large neural networks are hard to train partly because we don’t understand how their behavior changes as their size increases. Early work on deep learning, such as by Glorot & Bengio and He et al., generated useful heuristics that deep learning practitioners widely use today.

So, with µ-Parametrization, it **will be cheaper and simpler to develop larger-scale AI models** capable of yielding far superior performance to those available today. This, I presume, is the kind of research from Microsoft Research that will make GPT-4 a greater success at OpenAI.

It has thus been suggested that some version of GPT-4 could be out sometime in 2022 or 2023. And it is widely expected to be a game-changer.

The Team’s goal was to obtain a similar consistency, so that as model width increases, the change in activation scales during training stays consistent and similar to initialization, avoiding numerical overflow and underflow. Their solution, µP, achieves this goal, as seen on the right in Figure 1, which shows the stability of network activation scales for the first few steps of training across increasing model width.

##### Do you enjoy A.I. articles at the intersection of breaking news and research? Then help me continue to write on the subject. I cannot continue to write without support. Grateful for all tips, patronage and community contributions.

Their parameterization, which maintains this consistency during training, follows from two crucial insights.

1. First, gradient updates behave differently from random weights when the width is large. This is because gradient updates are derived from data and contain correlations, whereas random initializations do not. Therefore, they need to be scaled differently.

2. Second, parameters of different shapes also behave differently when the width is large. While we typically divide parameters into weights and biases, with the former being matrices and the latter vectors, some weights behave like vectors in the large-width setting. For example, the embedding matrix in a language model is of size *vocabsize × width*. As the width tends to infinity, *vocabsize* stays constant and finite. During matrix multiplication, summing along a finite dimension behaves very differently from summing along an infinite one.
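The first insight can be illustrated numerically with a toy example of my own (not code from the paper). A coordinate of *Wx* sums *width*-many terms: when the row of *W* is random and independent of *x*, the sum grows like √width, but when the row is correlated with *x* (as a gradient-derived update is), it grows like width, so the two need different scaling rules:

```python
import random

random.seed(0)
width = 10_000

x = [random.gauss(0.0, 1.0) for _ in range(width)]
w_random = [random.gauss(0.0, 1.0) for _ in range(width)]  # like a random-init row
w_grad_like = list(x)  # correlated with x, like a data-derived gradient update

# One coordinate of W @ x for each kind of weight row:
iid_sum = sum(w * xi for w, xi in zip(w_random, x))      # typical size ~ sqrt(width)
corr_sum = sum(w * xi for w, xi in zip(w_grad_like, x))  # typical size ~ width

print(abs(iid_sum), corr_sum)
```

The correlated sum comes out larger by a factor of roughly √width, which is why gradient-derived quantities must be scaled differently from random initializations as width grows.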

##### So why is this important?

µ-Parametrization offers a route to tuning large-scale models at much lower costs and much greater efficiency, by capitalizing on the insight that neural networks of varying sizes share the same optimal hyperparameters (HPs) in some conditions.

Essentially, this means a small-scale tuning process can be **extrapolated outwards and mapped onto a much larger model**, instead of tuning an entire multi-billion-parameter model directly.

This is a continuation of the Team’s research. These insights, which they discuss in detail in a previous blog post, motivated them to develop µP. In fact, beyond just keeping the activation scale consistent throughout training, µP ensures that neural networks of different and sufficiently large widths behave similarly during training, such that they *converge to* a desirable limit, which they call *the feature learning limit*.

# On infinitely wide neural networks that exhibit feature learning

There are times I can literally sense Edward’s philosophy background in the way he writes.

## A theory-guided approach to scaling width

Their theory of scaling enables a procedure to transfer training hyperparameters across model sizes.

If, as discussed above, µP networks of different widths share similar training dynamics, they likely also share similar optimal hyperparameters.

Consequently, one can simply apply the optimal hyperparameters of a small model directly to a scaled-up version. They call this practical procedure *µTransfer*. If the hypothesis is correct, the training loss–hyperparameter curves for µP models of different widths will share a similar minimum.

Conversely, their reasoning suggests that no scaling rule of initialization and learning rate other than µP can achieve the same result. This is supported by the animation below, in which the parameterization is varied by interpolating the initialization scaling and the learning rate scaling between the PyTorch default and µP. As shown, µP is the only parameterization that preserves the optimal learning rate across width, achieves the best performance for the model with width 2¹³ = 8192, and ensures that wider models always do better for a given learning rate; that is, graphically, the curves don’t intersect.

Not sure if this GIF is visible for you.


Building on the theoretical foundation of Tensor Programs, µTransfer works automatically for advanced architectures, such as Transformer and ResNet.

It can also simultaneously transfer a wide range of hyperparameters. Using Transformer as an example, they demonstrate in Figure 3 how the optima of key hyperparameters are stable across widths.

“I am excited about µP advancing our understanding of large models. µP’s principled way of parameterizing the model and selecting the learning rate make it easier for anybody to scale the training of deep neural networks. Such an elegant combination of beautiful theory and practical impact.”

— Johannes Gehrke, Technical Fellow, Lab Director of Research at Redmond, and CTO and Head of Machine Learning for the Intelligent Communications and Conversations Cloud (IC3)

Johannes Gehrke is a German computer scientist and the director of Microsoft Research in Redmond and CTO and Head of Machine Learning for the Microsoft Teams Backend.

### How Does this Relate to OpenAI?

To put the theory into practice, Microsoft worked with OpenAI to unleash µ-Parametrization on GPT-3, a natural language model whose largest iteration is made up of 175 billion parameters.

## A glimpse of the future: µP + GPT-3

Before this work, the larger a model was, the less well-tuned they expected it to be, due to the high cost of tuning. They therefore expected that the largest models could benefit the most from µTransfer, which is why Microsoft partnered with OpenAI to evaluate it on GPT-3.

After parameterizing a version of GPT-3 with relative attention in µP, they tuned a small proxy model with 40 million parameters before copying the best hyperparameter combination to the 6.7-billion-parameter variant of GPT-3, as prescribed by µTransfer.

The total compute used during this tuning stage was only 7 percent of the compute used in the pretraining of the final 6.7-billion model. This µTransferred model outperformed the model of the same size (with absolute attention) in the original GPT-3 paper. In fact, it performs similarly to the model (with absolute attention) with double the parameter count from the same paper, as shown in Figure 6.

### Future Implications of this Research

#### Implications for deep learning theory

As shown previously, µP gives a scaling rule which uniquely preserves the optimal hyperparameter combination across models of different widths in terms of training loss. Conversely, other scaling rules, like the default in PyTorch or the NTK parameterization studied in the theoretical literature, are looking at regions in the hyperparameter space farther and farther from the optimum as the network gets wider.
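For hidden, matrix-like weights trained with Adam, the µP scaling rule is often summarized as: initialization variance ∝ 1/fan_in and learning rate ∝ 1/width. The helper below is my own simplified illustration of that summary, not the mup package’s API; the paper’s tables (and the mup package itself) are the authoritative statement of the rules, and other parameter types (embeddings, output layers, biases) scale differently.

```python
import math

def mup_hidden_hparams(base_width, width, base_lr, base_init_std):
    """Illustrative only: rescale the Adam learning rate and init std
    of hidden, matrix-like weights from a tuned base width to a wider
    target width (lr ~ 1/width, init variance ~ 1/fan_in)."""
    ratio = base_width / width
    return {
        "lr": base_lr * ratio,                         # LR shrinks like 1/width
        "init_std": base_init_std * math.sqrt(ratio),  # variance shrinks like 1/width
    }

# Tune at width 256, then scale the hidden-layer HPs to width 8192:
hp = mup_hidden_hparams(base_width=256, width=8192, base_lr=1e-3, base_init_std=0.02)
print(hp)
```

In practice, the mup package applies these rescalings automatically once the model’s base shapes are declared, so users never compute them by hand.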

In that regard, the team believes that the feature learning limit of µP, rather than the NTK limit, is the most natural limit to study if the goal is to derive insights applicable to the feature-learning neural networks used in practice. As a result, more advanced theories of overparameterized neural networks should reproduce the feature learning limit of µP in the large-width setting.


Applied to the underlying graphs for neural network initialization, training, and inference, the TP technique yields fundamental theoretical results, such as the architectural universality of the Neural Network-Gaussian Process correspondence and the Dynamical Dichotomy theorem, in addition to deriving µP and the feature learning limit that led to µTransfer.

### Future Prospects

Looking ahead, the team believes that extensions of TP theory to depth, batch size, and other scale dimensions hold the key to the reliable scaling of large models beyond width.

Think about it: the results were quite startling. The collaborators managed to create an even more performant version of GPT-3, with a tuning cost of just 7% of the compute consumed in pretraining the 6.7-billion-parameter model.

#### Applying µTransfer to your own models

They created the mup package to enable practitioners to easily implement µP in their own PyTorch models, just as frameworks like PyTorch, TensorFlow, and JAX have enabled us to take autograd for granted. Please note that µTransfer works for models of any size, not just those with billions of parameters.

#### The journey has just begun

While their theory explains why models of different widths behave differently, more investigation is needed to build a theoretical understanding of the scaling of network depth and other scale dimensions.

They go on to state that another high-impact domain to which µP and µTransfer have not yet been applied is fine-tuning a pretrained model. While feature learning is crucial in that domain, the need for regularization and finite-width effects pose interesting challenges.

The Team firmly believes in fundamental research as a cost-effective complement to trial and error, and plans to continue this work to derive more principled approaches to large-scale machine learning. To learn about their other deep learning projects, or about opportunities to work with them and even help expand µP, visit their Deep Learning Group page.

As such at AiSupremacy we’ve really been taking a tour of Microsoft Research of late.

The research was by Edward Hu (PhD Student), Greg Yang (Senior Researcher), and Jianfeng Gao (Distinguished Scientist & Vice President).

Incredible work happening at Microsoft Research. You can also subscribe to their Newsletter. I have no affiliation with Microsoft Research (or anyone for that matter).
