Support BLS12_377 pairing. With AVX-512 IFMA, mulVec (resp. mulEach) is 1.52 (resp. 3.26) times faster than without it. Add {Fp,Fr,Fp2}::squareRoot. Improve the performance of squareRoot. Add batch inversion for Fr and Fp elements, and batch normalization for G1 and G2 points. mulVec is slightly faster. mulEach with AVX-512 IFMA is slightly faster and is 2.8 times faster than G1::mul on BLS12-381.
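
For readers unfamiliar with batch inversion, here is a minimal sketch of the usual technique (Montgomery's trick), which replaces n field inversions with one inversion and roughly 3n multiplications. It is not the library's implementation: the toy modulus `P` and the helpers `mulMod`, `powMod`, `invMod`, and `batchInverse` are hypothetical names made up for this example, and a small 30-bit prime stands in for the real Fr and Fp fields. Batch normalization of points commonly builds on the same idea, since converting each point to affine form needs the inverse of its Z coordinate, though whether the library does it exactly this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime field for illustration only; NOT the library's Fr or Fp.
constexpr std::uint64_t P = 1000000007ULL;

// Modular multiplication. Inputs are reduced below P < 2^30,
// so the product fits comfortably in 64 bits.
std::uint64_t mulMod(std::uint64_t a, std::uint64_t b) {
    return ((a % P) * (b % P)) % P;
}

// Modular exponentiation by repeated squaring.
std::uint64_t powMod(std::uint64_t base, std::uint64_t exp) {
    std::uint64_t result = 1;
    base %= P;
    while (exp > 0) {
        if (exp & 1) result = mulMod(result, base);
        base = mulMod(base, base);
        exp >>= 1;
    }
    return result;
}

// Single inversion via Fermat's little theorem: a^(P-2) = a^(-1) mod P.
std::uint64_t invMod(std::uint64_t a) {
    assert(a % P != 0);  // zero has no inverse
    return powMod(a, P - 2);
}

// Inverts every element of xs in place with exactly one call to invMod.
// Precondition (assumed by this sketch): no element is zero mod P.
void batchInverse(std::vector<std::uint64_t>& xs) {
    const std::size_t n = xs.size();
    if (n == 0) return;

    // prefix[i] holds the product xs[0] * ... * xs[i-1] (1 for i == 0).
    std::vector<std::uint64_t> prefix(n);
    std::uint64_t acc = 1;
    for (std::size_t i = 0; i < n; ++i) {
        assert(xs[i] % P != 0);
        prefix[i] = acc;
        acc = mulMod(acc, xs[i]);
    }

    // inv starts as the inverse of the product of all elements.
    std::uint64_t inv = invMod(acc);

    // Walk backwards, peeling off one factor at a time.
    for (std::size_t i = n; i-- > 0;) {
        const std::uint64_t xi = xs[i];
        xs[i] = mulMod(inv, prefix[i]);  // (x0..xi)^(-1) * (x0..x(i-1)) = xi^(-1)
        inv = mulMod(inv, xi);           // now the inverse of x0..x(i-1)
    }
}
```

The trade-off is that a single zero input makes the whole product zero, so a real implementation has to skip or special-case zeros; this sketch simply asserts that none are present.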