Parallel reduction is a common building block for many parallel algorithms. A 2007 presentation by Mark Harris provided a detailed strategy for implementing parallel reductions on GPUs, but that six-year-old document bears updating. In this post I will show you two features of the Kepler GPU architecture that make reductions even faster: the shuffle (SHFL) instruction and fast device memory atomic operations.
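To preview the first of these features, here is a minimal sketch of a warp-level sum reduction built on the shuffle instruction. Details (the `warpReduceSum` name, the loop structure) are illustrative, not a definitive implementation; `__shfl_down` lets each lane read a register value from a lane a fixed offset higher in the same warp, with no shared memory or synchronization required.

```cuda
// Sketch: sum a value across the 32 lanes of a warp using SHFL.
// After the loop, lane 0 holds the total; other lanes hold partial sums.
// Uses the Kepler-era __shfl_down intrinsic (later CUDA versions
// replace it with __shfl_down_sync, which takes a lane mask).
__inline__ __device__ int warpReduceSum(int val) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down(val, offset);  // add the value from lane (laneId + offset)
  return val;
}
```

Each iteration halves the number of lanes holding distinct partial sums (16, 8, 4, 2, 1), so the whole warp reduces in five steps without touching shared memory.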