This document summarizes recent research on applying the self-attention mechanism of Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which split an image into patches and feed the resulting patch sequence to a Transformer. It also covers more general attention-based architectures such as the Perceiver, which aims to be domain-agnostic. Finally, it