## Kernel 1: Naive Implementation

In the CUDA programming model, computation is ordered in a three-level hierarchy. Each invocation of a CUDA kernel creates a new grid, which consists of multiple blocks. Each block consists of up to 1024 individual threads (these limits can be looked up in the CUDA Programming Guide). Threads that are in the same block have access to the same shared memory region (SMEM).
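To make this hierarchy concrete, here is a minimal sketch of a naive SGEMM kernel in which each thread computes one element of the output matrix. The kernel name, the row-major layouts, and the 32×32 block shape are illustrative assumptions, not a prescribed implementation:

```cuda
// Naive SGEMM sketch: each thread computes one element of
// C = alpha * (A @ B) + beta * C, with row-major A (MxK), B (KxN), C (MxN).
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
  // Global position of this thread within the grid:
  // block index * block size + thread index within the block.
  const int x = blockIdx.x * blockDim.x + threadIdx.x;
  const int y = blockIdx.y * blockDim.y + threadIdx.y;

  // Guard against threads that fall outside the matrix bounds.
  if (x < M && y < N) {
    float tmp = 0.0f;
    for (int i = 0; i < K; ++i) {
      tmp += A[x * K + i] * B[i * N + y];
    }
    C[x * N + y] = alpha * tmp + beta * C[x * N + y];
  }
}
```

A launch of this kernel creates one grid, tiled into blocks of 32×32 = 1024 threads each (the per-block maximum mentioned above), e.g.:

```cuda
// CEIL_DIV(a, b) = (a + b - 1) / b, an assumed helper macro.
dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32));
dim3 blockDim(32, 32);
sgemm_naive<<<gridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);
```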