# Flash-Decoding for long-context inference

by Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov

## Motivation

Large language models (LLMs) such as ChatGPT or Llama have received unprecedented attention lately. However, they remain massively expensive to run. Even though generating a single response can cost about $0.01 (a few seconds of an 8xA100 instance on AWS), the costs quickly add up when scaling to billions of users, who could have multiple daily interactions with such LLMs.