Abstract: We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high-definition text-to-video model including de
![Imagen Video](https://cdn-ak-scissors.b.st-hatena.com/image/square/234e64257a744fdc9ef4778dc9cf8f375fb97998/height=288;version=1;width=512/https%3A%2F%2Fimagen.research.google%2Fvideo%2Fleaves.png)
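
The cascade described in the abstract (a base model followed by interleaved spatial and temporal super-resolution stages) can be pictured as a chain of shape transformations. The sketch below is only an illustration under stated assumptions: it tracks video dimensions and contains no diffusion model. The names `VideoShape`, `spatial_sr`, `temporal_sr`, and `run_cascade`, along with every resolution, frame count, and upsampling factor, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    """Dimensions of a video clip (hypothetical helper, not from the paper)."""
    frames: int
    height: int
    width: int


def spatial_sr(shape: VideoShape, factor: int) -> VideoShape:
    # A spatial super-resolution stage enlarges each frame
    # and leaves the number of frames unchanged.
    return VideoShape(shape.frames, shape.height * factor, shape.width * factor)


def temporal_sr(shape: VideoShape, factor: int) -> VideoShape:
    # A temporal super-resolution stage increases the number of frames
    # and leaves the per-frame resolution unchanged.
    return VideoShape(shape.frames * factor, shape.height, shape.width)


def run_cascade(base: VideoShape, stages: list[tuple[str, int]]) -> VideoShape:
    # Apply the stages in order, starting from the base model's output shape.
    ops = {"spatial": spatial_sr, "temporal": temporal_sr}
    shape = base
    for kind, factor in stages:
        shape = ops[kind](shape, factor)
    return shape


# Illustrative values only: a small base clip passed through an
# interleaved spatial / temporal / spatial sequence.
base = VideoShape(frames=4, height=8, width=8)
stages = [("spatial", 2), ("temporal", 2), ("spatial", 2)]
print(run_cascade(base, stages))
```

In a real system each stage would be a separate diffusion model conditioned on the previous stage's output and on the text prompt; here the stages only multiply dimensions, to show how interleaving raises spatial and temporal resolution step by step.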