I just wrote 84 new matrix multiplication kernels for llamafile that enable it to read prompts and images faster. Compared to llama.cpp, prompt evaluation with llamafile should be anywhere from 30% to 500% faster when using F16 and Q8_0 weights on CPU. The improvements are most dramatic on ARMv8.2+ (e.g. RPI 5), Intel (e.g. Alderlake), and AVX-512 (e.g. Zen 4) computers. My kernels go 2x faster