**Abstract**

We capitalize on large amounts of unlabeled video to learn a model of scene dynamics for both video recognition tasks (e.g., action classification) and video generation tasks (e.g., future prediction). We propose a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background. Experiments suggest this model can generate tiny videos up to a second at full frame rate better than simple baselines, and we show its utility at predicting plausible futures of static images. Moreover, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning.
![Network architecture (Generating Videos with Scene Dynamics, MIT)](http://web.mit.edu/vondrick/tinyvideo/network.png)
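The key architectural idea is a two-stream generator: a foreground stream produces a moving foreground video and a soft mask, while a background stream produces a single static image, and the two are composited per pixel. A minimal numpy sketch of that composition step (function name and array shapes are illustrative assumptions, not the paper's code):

```python
import numpy as np

def compose_video(foreground, mask, background):
    """Combine a moving foreground with a static background.

    foreground: (T, H, W, C) video from the foreground stream
    mask:       (T, H, W, 1) values in [0, 1] selecting foreground pixels
    background: (H, W, C)    one static image from the background stream

    Shapes are hypothetical; in the paper these tensors come from
    spatio-temporal (3D) and spatial (2D) up-convolutions respectively.
    """
    # Broadcasting replicates the static background across all T frames.
    return mask * foreground + (1.0 - mask) * background

# Toy example with random tensors standing in for network outputs.
T, H, W, C = 4, 8, 8, 3
fg = np.random.rand(T, H, W, C)
m = np.random.rand(T, H, W, 1)
bg = np.random.rand(H, W, C)
video = compose_video(fg, m, bg)
print(video.shape)  # (4, 8, 8, 3)
```

Where the mask is 1 the generated frame shows the foreground stream's output; where it is 0 the static background shows through, which is what lets the model keep the background fixed while only the foreground moves.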