We’ve scaled Kubernetes clusters to 7,500 nodes, producing a scalable infrastructure for large models like GPT-3, CLIP, and DALL·E, but also for rapid small-scale iterative research such as Scaling Laws for Neural Language Models. Scaling a single Kubernetes cluster to this size is rarely done and requires some special care, but the upside is a simple infrastructure that allows our machine learnin
![Scaling Kubernetes to 7,500 nodes](https://cdn-ak-scissors.b.st-hatena.com/image/square/4df66050961f5916303b90bc765b668a8685b4d6/height=288;version=1;width=512/https%3A%2F%2Fimages.openai.com%2Fblob%2F84745f0a-d786-4066-9907-4ce230afd73c%2Fscaling-kubernetes-to-7-500-nodes.png%3Ftrim%3D0%252C5%252C696%252C5%26width%3D1000%26quality%3D80)