The primary motivation for this project is to make it easy to take a single-GPU training script and successfully scale it to train across many GPUs in parallel. This has two aspects:

1. How much modification does one have to make to a program to make it distributed, and how easy is it to run?
2. How much faster would it run in distributed mode?

Internally at Uber we found the MPI model to be much more straightforward and to require far less code change than previous solutions such as Distributed TensorFlow with parameter servers.
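To make the first question concrete, below is a minimal sketch of the handful of changes Horovod typically asks for in a TF1-style training script; the `build_model` helper and the learning rate are hypothetical placeholders, not part of the library:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # initialize the Horovod/MPI runtime

# Pin each process to a single GPU based on its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

loss = build_model()  # hypothetical model-construction helper

# Scale the learning rate by the number of workers, then wrap the
# optimizer so gradients are averaged across workers via allreduce.
opt = tf.train.AdamOptimizer(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variable states from rank 0 to all other processes.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

The same script then runs unchanged on one GPU or many, launched with, for example, `horovodrun -np 4 python train.py`.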