This document summarizes several datasets for image captioning, video classification, action recognition, and temporal localization. For each dataset it describes the purpose, collection process, annotation format, examples, and references. The datasets covered include MS COCO, Visual Genome, Flickr8K/30K, Kinetics, Charades, AVA, STAIR Captions, and STAIR Actions. The datasets vary in scale from thousands to millions of images/
![The Frontier of Image Captioning and Action Recognition: Focusing on Datasets (17th STAIR Lab AI Seminar)](https://cdn-ak-scissors.b.st-hatena.com/image/square/c8321063ee9111b996d82b1f1cf46c72e5b54875/height=288;version=1;width=512/https%3A%2F%2Fcdn.slidesharecdn.com%2Fss_thumbnails%2Fyoshikawastairlabseminarv20180802-180802091719-thumbnail.jpg%3Fwidth%3D640%26height%3D640%26fit%3Dbounds)