The transformer decoder is built mainly from attention layers. It uses self-attention to process the sequence being generated, and cross-attention to attend to the image. By inspecting the attention weights of the cross-attention layers, you can see which parts of the image the model is looking at as it generates words.

This notebook is an end-to-end example. When you run the notebook, it downloads a dataset, extracts and caches the image features, and trains a decoder model.
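To make the structure concrete, here is a minimal sketch of one decoder block, assuming Keras-style layers. The class name `DecoderBlock` and the arguments `units` and `num_heads` are illustrative, not identifiers from this notebook; the key point is that `MultiHeadAttention` can return its attention scores, which is what makes the cross-attention weights inspectable.

```python
import tensorflow as tf

class DecoderBlock(tf.keras.layers.Layer):
    """Illustrative decoder block: causal self-attention + cross-attention."""

    def __init__(self, units, num_heads):
        super().__init__()
        # Self-attention over the tokens generated so far.
        self.self_attn = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=units)
        # Cross-attention: token queries attend over image-patch features.
        self.cross_attn = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=units)
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, tokens, image_features):
        # Causal mask: each position attends only to earlier tokens.
        x = self.self_attn(query=tokens, value=tokens, use_causal_mask=True)
        x = self.norm1(tokens + x)
        # return_attention_scores=True also yields the attention weights,
        # shaped (batch, heads, token_positions, image_patches).
        y, attn_scores = self.cross_attn(
            query=x, value=image_features, return_attention_scores=True)
        # Cache the weights so they can be visualized after generation.
        self.last_attention_scores = attn_scores
        return self.norm2(x + y)
```

After a forward pass, `block.last_attention_scores` holds one weight per (token, image patch) pair for each head; averaging over heads and reshaping the patch axis back to a 2D grid gives a per-word heatmap over the image.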