Imagine you wrote a program for a pleasingly parallel problem, where each thread does its own independent piece of work, and the threads don’t need to coordinate except joining the results at the end. Obviously you’d expect the more cores it runs on, the faster it is. You benchmark it on a laptop first and indeed you find out it scales nearly perfectly on all of the 4 available cores. Then you run