We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions, and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high-fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
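To make the "spacetime patches" idea concrete, here is a minimal sketch of how a video latent tensor could be chopped into flattened spacetime patch tokens for a transformer. The function name, patch sizes, and latent shape are hypothetical illustrations; the report does not specify these details.

```python
import numpy as np

def spacetime_patchify(latent, t_patch=2, p=4):
    """Split a video latent of shape (T, H, W, C) into flattened
    spacetime patches of size (t_patch, p, p), one token per patch.

    Shapes and patch sizes here are illustrative assumptions, not
    the values used by Sora.
    """
    T, H, W, C = latent.shape
    assert T % t_patch == 0 and H % p == 0 and W % p == 0
    # Carve each axis into (num_patches, patch_size) pairs.
    x = latent.reshape(T // t_patch, t_patch, H // p, p, W // p, p, C)
    # Bring the three patch-content axes next to each other.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten: rows are patch tokens, columns are patch contents.
    tokens = x.reshape(-1, t_patch * p * p * C)
    return tokens

# Example: an 8-frame, 32x32, 4-channel latent.
latent = np.zeros((8, 32, 32, 4), dtype=np.float32)
tokens = spacetime_patchify(latent)
print(tokens.shape)  # (256, 128): 4*8*8 patches, each 2*4*4*4 values
```

Because both images (T = t_patch, i.e. a single temporal patch) and videos of any duration, resolution, and aspect ratio reduce to a variable-length sequence of identical tokens, one transformer can train on all of them jointly.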
*Video generation models as world simulators*