Please note that both Megatron-LM and DeepSpeed have Pipeline Parallelism and BF16 Optimizer implementations, but we used the ones from DeepSpeed as they are integrated with ZeRO.

Megatron-DeepSpeed implements 3D Parallelism, which allows huge models to be trained in a very efficient way. Let's briefly discuss the 3D components.

1. **DataParallel (DP)** - the same setup is replicated multiple times, and each replica is fed a slice of the data. The processing is done in parallel and all replicas are synchronized at the end of each training step.
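The data-parallel idea above can be illustrated with a minimal, framework-free sketch. This is not how Megatron-DeepSpeed implements DP (which uses distributed all-reduce across GPUs); it is a toy single-process simulation, with all names (`local_gradient`, `dp_step`) being illustrative inventions, assuming a one-parameter linear model trained with squared loss.

```python
import numpy as np

# Toy simulation of DataParallel (DP): the model (a single weight w)
# is replicated, each replica sees a slice of the batch, and the
# gradients are averaged ("all-reduce") so all replicas stay in sync.

def local_gradient(w, x_shard, y_shard):
    # Each replica computes the mean-squared-error gradient
    # on its own slice of the batch only.
    pred = w * x_shard
    return np.mean(2 * (pred - y_shard) * x_shard)

def dp_step(w, x, y, n_replicas=4, lr=0.1):
    # 1. The full batch is sliced across the replicas.
    x_shards = np.array_split(x, n_replicas)
    y_shards = np.array_split(y, n_replicas)
    # 2. Every replica holds identical weights and computes a local gradient.
    grads = [local_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    # 3. The gradients are averaged, standing in for an all-reduce;
    #    every replica then applies the same update and remains synchronized.
    g = np.mean(grads)
    return w - lr * g

x = np.linspace(-1, 1, 16)
y = 3.0 * x          # the target weight is 3.0
w = 0.0
for _ in range(200):
    w = dp_step(w, x, y)
print(round(w, 2))   # converges toward the target weight 3.0
```

Because every replica averages in the gradients of all the others before updating, the result after each step is identical to training on the full batch with one replica, which is why DP scales throughput without changing the optimization trajectory (batch-size effects aside).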