Micro-benchmark of the multi-head attention, run-time in µs. Flash-Decoding achieves almost constant run-time as the sequence length scales up to 64k. The up-to-8x end-to-end speedup measured earlier is possible because the attention itself is up to 50x faster than FlashAttention. Up until sequence length 32k, the attention time is roughly constant, because Flash-Decoding manages to fully utilize the GPU.
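
To give a sense of how such a micro-benchmark can be reproduced, here is a minimal timing sketch in PyTorch. The batch size, head count, head dimension, and iteration counts are illustrative assumptions, not the settings used for the figure, and `F.scaled_dot_product_attention` stands in for whichever attention kernel is being measured (Flash-Decoding itself ships in the FlashAttention package starting at version 2.2, and in xFormers starting at 0.0.22).

```python
import time
import torch
import torch.nn.functional as F

# Illustrative decoding micro-benchmark: time a single decoding step
# (query length 1) against a KV cache of growing sequence length.
# Swap in a Flash-Decoding kernel in place of scaled_dot_product_attention
# to compare back-ends. Shapes below are assumptions, not the post's config.
batch, heads, head_dim = 1, 32, 128

def time_decode_attention(seq_len: int, iters: int = 50) -> float:
    """Return the mean run-time of one attention call, in microseconds."""
    q = torch.randn(batch, heads, 1, head_dim, device="cuda", dtype=torch.float16)
    k = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
    v = torch.randn_like(k)
    for _ in range(10):  # warm-up so one-time setup costs are excluded
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e6

for seq_len in (1024, 4096, 16384, 32768, 65536):
    print(f"seq_len={seq_len:>6}: {time_decode_attention(seq_len):8.1f} us")
```

With a kernel that does not parallelize across the keys/values, the printed time grows with `seq_len`; a Flash-Decoding back-end should stay roughly flat until the GPU is saturated, matching the behavior described in the caption above.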