TL;DR: The code from the tutorial is available at matmul.c. This blog post is the result of my attempt to implement high-performance fp32 matrix multiplication (SGEMM) on CPU while keeping the code simple and scalable. The implementation follows the BLIS design, works for arbitrary matrix sizes, and outperforms OpenBLAS, achieving over 1 TFLOPS across a wide range of matrix sizes on an AMD Ryzen 7700.
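For readers new to the term, here is a minimal sketch of what SGEMM computes and how a FLOPS figure is usually derived from it. This is not the code from matmul.c: the names `sgemm_naive` and `sgemm_flops` are hypothetical, the row-major layout is an assumption made for illustration, and the triple loop is the plain textbook baseline rather than the BLIS-style design.

```c
#include <assert.h>
#include <stddef.h>

/* Textbook SGEMM baseline: C = A * B in single precision.
 * A is M x K, B is K x N, C is M x N.
 * Row-major storage is an assumption made for this sketch. */
void sgemm_naive(const float *A, const float *B, float *C,
                 size_t M, size_t N, size_t K) {
    assert(A && B && C);
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (size_t p = 0; p < K; p++)
                acc += A[i * K + p] * B[p * N + j];
            C[i * N + j] = acc;
        }
    }
}

/* Each inner iteration does one multiply and one add,
 * so a full product costs 2 * M * N * K floating-point operations. */
double sgemm_flops(size_t M, size_t N, size_t K) {
    return 2.0 * (double)M * (double)N * (double)K;
}
```

Dividing `sgemm_flops(M, N, K)` by the measured wall-clock time in seconds gives the achieved FLOPS. At 1 TFLOPS, for example, a product with M = N = K = 1000 (2 × 10^9 operations) would take about 2 ms.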