Colossal-AI Seamlessly Accelerates Large Models At Low Costs with Hugging Face
I’m super interested in the A.I. startups that have huddled around Hugging Face in recent years. The BLOOM collaboration, coordinated by Hugging Face, is already one of the biggest moments in the democratization of A.I. we’ve seen in years.
So where are we today in 2022?
The Transformer architecture has improved the performance of deep learning models in domains such as Computer Vision and Natural Language Processing.
Together with better performance come larger model sizes, which run up against the memory wall of current accelerator hardware such as GPUs. Training large models such as Vision Transformer, BERT, and GPT on a single GPU, or even a single machine, is far from ideal.
There is an urgent demand to train models in a distributed environment. However, distributed training, especially model parallelism, often requires domain expertise in computer systems and architecture. It remains a challenge for AI researchers to implement complex distributed training solutions for their models.
Tl;dr: Colossal-AI is a unified parallel training system designed to seamlessly integrate different paradigms of parallelization, including data parallelism, pipeline parallelism, multiple forms of tensor parallelism, and sequence parallelism.
It comes from researchers at HPC-AI Technology Inc. and the National University of Singapore (NUS).
As large-scale AI models continue to deliver superior performance across different domains, they are enabling distinctive, efficient AI applications the industry has never seen before.
Training Larger AI Models
According to Synced, existing deep learning frameworks like PyTorch and TensorFlow may not offer a satisfactory solution for very large AI models. Furthermore, advanced knowledge of AI systems is typically required for sophisticated configuration and optimization of specific models.
Therefore, many AI users, such as engineers from small and medium-sized enterprises, can’t help but feel overwhelmed by the emergence of large AI models.
If you think about it, the overarching trend that all other A.I. trends serve is the increased scale of artificial intelligence in organizations. Colossal-AI, the unified deep learning system (paper here), is quite interesting in that light.
In fact, the core reason large AI models are so costly to train is GPU memory: it is restrictive, and a single device simply cannot accommodate models of this size.
Colossal-AI aims to let the AI community write distributed models the same way they write ordinary models.
This allows researchers to focus on developing the model architecture and separates the concerns of distributed training from the development process. The documentation can be found at https://www.colossalai.org/ and the source code at https://github.com/hpcaitech/ColossalAI.
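To give a flavor of that workflow, here is a minimal sketch based on the getting-started examples in the Colossal-AI documentation. The toy model and dataset are placeholders of mine, and the API names (launch_from_torch, initialize, the engine object) are taken from the docs at the time of writing and may differ across versions:

```python
# Minimal sketch of the Colossal-AI training workflow (API may vary by version).
import torch
import colossalai
from torch.utils.data import DataLoader, TensorDataset

# Set up the distributed environment (e.g. under torchrun) and read the
# parallelism/precision settings from a plain Python config file.
colossalai.launch_from_torch(config='./config.py')

# Ordinary PyTorch objects — placeholders for illustration only.
model = torch.nn.Linear(32, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
train_dataloader = DataLoader(dataset, batch_size=32)

# colossalai.initialize wraps everything according to the config file;
# the loop below stays the same as plain single-GPU PyTorch.
engine, train_dataloader, _, _ = colossalai.initialize(
    model, optimizer, criterion, train_dataloader)

engine.train()
for data, label in train_dataloader:
    engine.zero_grad()
    output = engine(data)
    loss = engine.criterion(output, label)
    engine.backward(loss)
    engine.step()
```

Note how the training loop is untouched; all the distribution decisions live in the config file.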
In response to all of this, the Colossal-AI team developed the Gemini module, which efficiently manages and utilizes the heterogeneous memory of GPU and CPU and is expected to help solve the bottlenecks above. Best of all, it is completely open-source and requires only minimal modifications for existing deep learning projects to train much larger models on a single consumer-grade graphics card. In particular, it makes downstream tasks and application deployments, such as fine-tuning and inference of large AI models, much easier. It even grants the convenience of training AI models at home!
HPC-AI Tech’s flagship open-source and large-scale AI system, Colossal-AI, now allows Hugging Face users to seamlessly develop their ML models in a distributed and easy manner.
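To make “seamless” concrete: the sketch below, modeled on the OPT example in the Colossal-AI repository, loads a Hugging Face model inside Colossal-AI’s sharded initialization context so the full weights never have to materialize on a single device. ZeroInitContext, its import path, and its arguments come from that example and may have changed in later versions:

```python
# Sketch: loading a Hugging Face model under Colossal-AI's ZeRO initialization.
import torch
from transformers import OPTForCausalLM
from colossalai.zero.init_ctx import ZeroInitContext
from colossalai.zero.shard_utils import TensorShardStrategy

# Parameters are sharded across devices as they are created, so the
# from_pretrained call itself stays exactly the same as in plain transformers.
with ZeroInitContext(target_device=torch.device('cuda'),
                     shard_strategy=TensorShardStrategy(),
                     shard_param=True):
    model = OPTForCausalLM.from_pretrained('facebook/opt-1.3b')
```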
I like to be spoon-fed the nitty-gritty; this video is a bit old, but it gives you a good intro.
Configure with Colossal-AI
It is very simple to use the powerful features of Colossal-AI. Users only need a simple configuration file and are not required to alter their training logic to equip models with their desired features (e.g. mixed-precision training, gradient accumulation, multi-dimensional parallel training, and memory redundancy elimination).
Suppose we intend to train OPT (Meta’s Open Pre-trained Transformer) on a single GPU. We can accomplish this by leveraging Colossal-AI’s heterogeneous training, which only requires users to add the relevant items to the configuration file. Each placement strategy has its distinct advantages:
cuda: keeps all model parameters on the GPU; suitable for traditional scenarios where training proceeds without offloading weights;
cpu: keeps all model parameters in CPU memory, holding in GPU memory only the weights that participate in the current computation step; suitable for training giant models that exceed GPU capacity;
auto: determines the number of parameters to keep on the GPU by monitoring the current memory status; it maximizes GPU memory usage and minimizes expensive data transfers between GPU and CPU.
Most users can simply select the auto strategy, which maximizes training efficiency by dynamically adapting the heterogeneous placement to the current memory state; a configuration sketch follows below.
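For a concrete picture, here is a sketch of what such a configuration file might contain, modeled on the OPT example in the Colossal-AI repository; the field names (TensorShardStrategy, tensor_placement_policy, gpu_margin_mem_ratio) come from that example and may change between versions:

```python
# config.py — sketch of the heterogeneous (Gemini/ZeRO) training settings.
from colossalai.zero.shard_utils import TensorShardStrategy

zero = dict(
    model_config=dict(
        shard_strategy=TensorShardStrategy(),
        # one of "cuda", "cpu", or "auto" — the strategies described above
        tensor_placement_policy="auto",
    ),
    optimizer_config=dict(
        # fraction of spare GPU memory the optimizer may use for its states
        gpu_margin_mem_ratio=0.8,
    ),
)
```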
Colossal-AI allows users to set up combinations of data, pipeline, sequence, and multiple tensor parallelism.
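As an illustration, a combined setup might look like the following in the same configuration file, using the parallel dictionary convention from the Colossal-AI documentation (exact fields may vary by version):

```python
# Sketch: combining parallelism modes in a Colossal-AI config file.
parallel = dict(
    pipeline=2,                      # split the model into 2 pipeline stages
    tensor=dict(size=4, mode='2d'),  # 4-way tensor parallelism in 2D mode
)
# On 16 GPUs, the data-parallel degree is inferred automatically:
# 16 / (2 pipeline stages x 4 tensor shards) = 2-way data parallelism.
```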
Remarkable Performance from Colossal-AI
On a single GPU, Colossal-AI’s automatic strategy delivers remarkable gains over Microsoft DeepSpeed’s ZeRO Offloading strategy, with up to a 40% speedup across a variety of model scales. A traditional deep learning framework like PyTorch, by contrast, can no longer train models at this scale on a single GPU at all.
According to Sameer Maskey, founder and CEO at Fusemachines and an adjunct associate professor at Columbia University, the move toward scaling AI is made possible by more data, prioritizing data strategy and cheaper compute power.
Hugging Face is really making it easier for global researchers to collaborate and effectively improve R&D.
Behind the Scenes
Such remarkable improvements come from Gemini, Colossal-AI’s efficient heterogeneous memory management system. To put it simply, Gemini uses a few warmup steps during model training to collect memory-usage information from PyTorch computational graphs; after warmup, it uses that information to reserve enough GPU memory for peak computation and moves the remaining model data to CPU memory, fetching it back just before it is needed.
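To illustrate the idea (this is a toy of mine, not Colossal-AI’s actual code): profile a few warmup steps, subtract the static parameter footprint from the observed peak, and you get an estimate of how much GPU memory the non-model data needs, which a placement policy can then budget around:

```python
# Toy sketch of warmup-based memory profiling in the spirit of Gemini.
import torch

def estimate_non_model_peak(model, train_step, batches, warmup_steps=3):
    """Run a few warmup steps and estimate the peak GPU memory used by
    non-model data (activations, temporaries) by subtracting the static
    parameter footprint from the overall observed peak."""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    peak = 0
    for step, batch in enumerate(batches):
        if step >= warmup_steps:
            break
        torch.cuda.reset_peak_memory_stats()
        train_step(batch)  # one ordinary forward/backward/update step
        peak = max(peak, torch.cuda.max_memory_allocated())
    return peak - param_bytes

# An "auto"-style policy could then keep model data on the GPU only up to
# (total GPU memory - this estimate - a safety margin), offloading the rest
# to CPU and prefetching it just before the operators that need it.
```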
The results are pretty impressive.
Colossal-AI provides an easy way to set up a combination of data, pipeline, sequence, and multiple tensor parallelism. With friendly APIs, users can construct a distributed model using tensor parallelism just as they would a single-GPU model.
Thanks for reading!