This document summarizes recent research on applying self-attention mechanisms from Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to sequences of divided image patches. It also covers more general attention modules such as the Perceiver, which aims to be domain-agnostic.
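
The core idea shared by ViT-style models is to split an image into fixed-size patches, embed each patch as a token, and run a standard Transformer encoder over the resulting token sequence. The sketch below is a minimal, illustrative implementation of that idea (not code from the summarized paper); the class name `MinimalViT` and all hyperparameters are hypothetical choices, and it assumes PyTorch 1.9+ for `batch_first` Transformer layers.

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Minimal sketch of the ViT idea: patchify an image, embed the patches
    as tokens, and apply a Transformer encoder with self-attention."""

    def __init__(self, image_size=224, patch_size=16, dim=192,
                 depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening non-overlapping
        # patches and projecting each one with a linear layer.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                              # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                   # (B, dim, H/ps, W/ps)
        tokens = tokens.flatten(2).transpose(1, 2)     # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)                  # self-attention over patch tokens
        return self.head(tokens[:, 0])                 # classify from the [CLS] token

model = MinimalViT()
logits = model(torch.randn(2, 3, 224, 224))            # -> shape (2, 1000)
```

DeiT keeps essentially this architecture but adds a distillation token and training recipe, while T2T replaces the single patch-projection step with a progressive tokens-to-token aggregation.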