In theory, Attention is All You Need. In practice, however, we also need optimized attention implementations like FlashAttention. Although these fused attention implementations have substantially improved performance and enabled long contexts, this efficiency has come with a loss of flexibility: you can no longer try out a new attention variant by writing a few PyTorch operators; you often need to write a new custom kernel.
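To make the trade-off concrete, here is a minimal sketch of what "a few PyTorch operators" looks like: a naive attention implementation where any variant (here, causal masking) is just an extra tensor operation on the score matrix. The function and parameter names (`naive_attention`, `score_mod`, `causal_mask`) are illustrative, not from any library. This is the flexibility that a monolithic fused kernel gives up.

```python
import torch

def naive_attention(q, k, v, score_mod=None):
    # Raw attention scores: (batch, heads, q_len, kv_len)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if score_mod is not None:
        # Any variant (bias, masking, soft-capping, ...) is just more tensor ops
        scores = score_mod(scores)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Example variant: causal masking, expressed as a score modification
def causal_mask(scores):
    q_len, kv_len = scores.shape[-2:]
    keep = torch.tril(torch.ones(q_len, kv_len, dtype=torch.bool, device=scores.device))
    return scores.masked_fill(~keep, float("-inf"))

q = k = v = torch.randn(2, 8, 128, 64)
out = naive_attention(q, k, v, score_mod=causal_mask)  # (2, 8, 128, 64)
```

The catch is that this version materializes the full `q_len x kv_len` score matrix, which is exactly the memory and bandwidth cost that fused kernels like FlashAttention avoid.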