What is Google AI's V-MoE?
Towards A New Architecture For Computer Vision Based On A Sparse Mixture Of Experts
Sometimes I like to call out breaking news in A.I. that is more academic in nature. While I most enjoy writing Op-Eds about the future, it’s important for me to track what big companies like Microsoft, Google, Facebook and others are doing in AI research. I’ve also noticed that there is a lack of coverage of these areas on the web.
Over the past few decades, advances in deep learning have produced outstanding results on a wide range of tasks, including image classification, machine translation, and protein folding prediction.
Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing.
According to the researchers, future deep learning models will be able to learn without human assistance and will adapt to changes in their environment.
The utilization of huge models and datasets, on the other hand, comes at the cost of enormous computational resources.
According to recent research, large model sizes may be required for solid generalization and robustness. As a result, it’s become critical to train huge models while keeping resource needs low.
One promising approach involves the use of conditional computation: rather than activating the whole network for every single input, different parts of the model are activated for different inputs.
On January 14th, 2022, Google AI presented Vision MoE (V-MoE), a sparse version of the Vision Transformer that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks while requiring as little as half of the compute at inference time.
You can read more about it on Google AI’s blog here.
I first learned about it on Marketechpost.com here.
In “Scaling Vision with Sparse Mixture of Experts”, Google AI presents V-MoE, a new vision architecture based on a sparse mixture of experts, which they then use to train the largest vision model to date.
They transfer V-MoE to ImageNet and demonstrate matching state-of-the-art accuracy while using about 50% fewer resources than models of comparable performance.
Vision Mixture of Experts (V-MoEs)
Vision Transformers (ViT) have emerged as one of the best architectures for vision tasks. ViT first partitions an image into equally-sized square patches. These are called tokens, a term inherited from language models.
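To make the tokenization step concrete, here is a minimal sketch in Python/NumPy of splitting an image into equally-sized square patches and flattening each one into a token vector. The function name and shapes are illustrative assumptions, not taken from the V-MoE code; in a real ViT each flattened patch is additionally passed through a learned linear embedding.

```python
import numpy as np

def image_to_tokens(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an image of shape (H, W, C) into flattened square patches ("tokens").

    Assumes H and W are divisible by patch_size, as in ViT.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Reshape into a grid of patches, then flatten each patch into a vector.
    patches = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (grid_h, grid_w, patch, patch, c)
    return patches.reshape(-1, patch_size * patch_size * c)

# Example: a 224x224 RGB image with 16x16 patches yields 196 tokens of dimension 768.
tokens = image_to_tokens(np.zeros((224, 224, 3)), patch_size=16)
print(tokens.shape)  # (196, 768)
```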
Some of the ViT architecture’s dense feedforward layers (FFNs) are replaced with a sparse mixture of independent FFNs, called experts. For each token, a learnable router layer chooses which experts to use and how they should be weighted.
Similar to GShard-M4 and GLaM, Google AI replaces the feedforward network of every other Transformer layer with a Mixture-of-Experts (MoE) layer that consists of multiple identical feedforward networks, the “experts”. A routing network, trained along with the rest of the model, assigns each input token to a small number of experts per layer (two in this case), so only that sparse subnetwork is activated for the token.
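Below is a small, self-contained sketch of this per-token top-2 routing idea, again in plain NumPy. It is not the actual V-MoE implementation (which is written in JAX and adds expert capacity limits and load-balancing losses); the names `moe_layer`, `router_w`, and the toy experts are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:   (num_tokens, d)
    router_w: (d, num_experts) router weights
    experts:  list of callables, each mapping (d,) -> (d,)
    """
    gates = softmax(tokens @ router_w)             # (num_tokens, num_experts)
    out = np.zeros_like(tokens)
    for t, (token, gate) in enumerate(zip(tokens, gates)):
        top_k = np.argsort(gate)[-k:]              # indices of the k largest gates
        weights = gate[top_k] / gate[top_k].sum()  # renormalise over chosen experts
        # Only the selected experts run for this token (conditional computation).
        out[t] = sum(w * experts[e](token) for w, e in zip(weights, top_k))
    return out

# Toy usage: 4 experts, each a small dense FFN stub with its own random weights.
rng = np.random.default_rng(0)
d, num_experts = 8, 4
expert_ws = [rng.normal(size=(d, d)) for _ in range(num_experts)]
experts = [lambda x, w=w: np.tanh(x @ w) for w in expert_ws]
tokens = rng.normal(size=(5, d))
router_w = rng.normal(size=(d, num_experts))
print(moe_layer(tokens, router_w, experts).shape)  # (5, 8)
```

The point of the sketch is the conditional computation: each token only ever runs through two of the experts, so the total parameter count can grow with the number of experts while the per-token compute stays roughly constant.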
To test the limits of vision models, a 15-billion parameter model with 24 MoE layers was trained on an expanded version of JFT-300M. After fine-tuning, this large model reached 90.35 percent test accuracy on ImageNet, close to the current state of the art.
Google AI believe this is just the beginning of conditional computation at scale for computer vision; extensions include multi-modal and multi-task models, scaling up the expert count, and improving transfer of the representations produced by sparse models.
For a deeper technical understanding, you can read the paper here.
For more articles on cutting-edge AI research, I like to read Synced. However, for the most part, the intended audience of this Newsletter is the lay person and mainstream reader.
Next Steps for AI Researchers
The quality improvements often seen when scaling machine learning models have incentivized the research community to work toward advancing scaling technology to enable efficient training of large models.
The emerging need to train models capable of generalizing to multiple tasks and modalities only increases the need for scaling models even further.
However, the practicality of serving these large models remains a major challenge. Efficiently deploying large models is an important direction of research, and Google’s researchers believe approaches like TaskMoE are a promising step towards more inference-friendly algorithms that retain the quality gains of scaling.
For breaking news in AI, alerts on Reddit remain the fastest way to keep up on the web. I noticed this particular news on the ComputerVision and Machine Learning News subreddits.
I’m writing about the future in a number of Newsletters on Substack. However, I cannot continue to write without community support. If you appreciate my articles, please consider supporting my work. You can also help me by sharing this article.
AI Supremacy is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Hope you are having a wonderful weekend.
Thanks for reading!