Micro-benchmark of the multi-head attention, run-time in µs. Flash-Decoding achieves almost constant run-time as the sequence length scales up to 64k. The up-to-8x end-to-end speedup measured earlier is possible because the attention itself is up to 50x faster than FlashAttention. Up until sequence length 32k, the attention time is roughly constant, because Flash-Decoding manages to fully utilize the GPU.
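
To give a sense of how such a micro-benchmark can be reproduced, here is a minimal timing sketch in PyTorch. The batch size, head count, head dimension, and iteration counts are illustrative assumptions, not the settings used for the figure, and `F.scaled_dot_product_attention` stands in for whichever attention kernel is being measured (Flash-Decoding itself ships in the FlashAttention package starting at version 2.2, and in xFormers starting at 0.0.22).

```python
import time
import torch
import torch.nn.functional as F

# Illustrative decoding micro-benchmark: time a single decoding step
# (query length 1) against a KV cache of growing sequence length.
# Swap in a Flash-Decoding kernel in place of scaled_dot_product_attention
# to compare back-ends. Shapes below are assumptions, not the post's config.
batch, heads, head_dim = 1, 32, 128

def time_decode_attention(seq_len: int, iters: int = 50) -> float:
    """Return the mean run-time of one attention call, in microseconds."""
    q = torch.randn(batch, heads, 1, head_dim, device="cuda", dtype=torch.float16)
    k = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
    v = torch.randn_like(k)
    for _ in range(10):  # warm-up so one-time setup costs are excluded
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e6

for seq_len in (1024, 4096, 16384, 32768, 65536):
    print(f"seq_len={seq_len:>6}: {time_decode_attention(seq_len):8.1f} us")
```

With a kernel that does not parallelize across the keys/values, the printed time grows with `seq_len`; a Flash-Decoding back-end should stay roughly flat until the GPU is saturated, matching the behavior described in the caption above.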