I've been writing a lot of AArch64 assembly, for reasons. I recently came up with a "clever" idea to eliminate one jump from an inner loop, and was surprised to find that it slowed things down. Allow me to explain my terrible error, so that you don't fall victim to it in the future.

A toy model of the relevant code looks something like this:

float run(const float* data, size_t n) { float g = 0.0; while