VEX-encoded vxorps %xmm0, %xmm0, %xmm0 does the same thing as EVEX-encoded vxorps %zmm0, %zmm0, %zmm0: it zeroes the full-width vector and breaks dependencies on the old value of the architectural register. The VEX version is also two bytes shorter than the EVEX version (4 bytes vs. 6, since xor-zeroing of low registers can use the 2-byte VEX prefix while the EVEX prefix is always 4 bytes).

#include <immintrin.h>
__m512 zerovec() { return _mm512_setzero_ps(); }

currently compiles to (https://godbolt.org/g/i5LA9Y):

  gcc8 and clang5.0.0 (trunk 301766):  vxorps %zmm0, %zmm0, %zmm0
  ICC17:                               vpxord %zmm0, %zmm0, %zmm0
  MSVC:                                vxorps xmm0, xmm0, xmm0

Always using 128b zeroing instructions wouldn't hurt for AVX/AVX2 either. I'm not sure whether CPUs like Jaguar or the Bulldozer family (which crack 256b instructions into two 128b ops) handle xor-zeroing specially and need only one internal operation for vxorps %ymm,%ymm,%ymm zeroing. If not, using 128b would save execution throughput. (Maybe they crack instructions at decode, but independence detection happens later? Unlikely, because it probably has to decode to a special zeroing micro-op.)

One possible downside is that vxorps %ymm0,%ymm0,%ymm0 warms up the 256b execution units on Intel CPUs like Skylake, but vxorps %xmm0,%xmm0,%xmm0 doesn't. As Agner Fog describes (in http://agner.org/optimize/microarchitecture.pdf), a program can run a single 256b AVX instruction at least ~56,000 clock cycles ahead of a critical 256b loop to start the warm-up process early. IDK if any existing code uses something like this function to achieve that:

__attribute__((noinline)) __m256 warmup_avx256(void) { return _mm256_setzero_ps(); }

If vxorps %xmm0,%xmm0,%xmm0 is faster on Bulldozer or Jaguar, we should probably always use it; people will just have to use something else for their warmup function, maybe all-ones. OTOH, during the warm-up period (e.g. at the start of executing an AVX 256b function when the 256b execution units were asleep), vxorps %xmm0,%xmm0,%xmm0 may be faster anyway.
---- AFAIK, there are no problems with mixing VEX and EVEX vector instructions on any existing AVX512 hardware (KNL and Skylake-AVX512). In asm syntax, I'm not sure there's even a way to request the EVEX encoding of vaddps %ymm, %ymm, %ymm with no masking or broadcast-load source operand. You could easily have that as part of a horizontal-add of a zmm vector, and there aren't EVEX versions of
If anyone has Bulldozer-family, Jaguar, or Zen hardware they can test on, I posted an SO question with an asm loop that might be helpful in testing with perf counters. http://stackoverflow.com/questions/43713273/is-vxorps-zeroing-on-amd-bulldozer-or-jaguar-faster-with-xmm-register-than-ymm
Agner Fog confirms that always using AVX-128 zeroing is a good idea for all current and likely future CPUs. http://stackoverflow.com/a/43751783/224132. AMD Ryzen decodes vxorps %ymm0, %ymm0, %ymm0 to two micro-ops, but only one for the xmm version, so there is a real performance gain from this on real hardware.
Ayman, can this be easily fixed by adding a new entry to EVEX2VEX?
It's not just EVEX vxorps - if possible all V_SET0 should use VEX vxorps xmm on AVX1+ targets.
EVEX2VEX should fix it.
Sorry for the late reply (I didn't notice the previous comments). The EVEX2VEX pass replaces only 128-bit and 256-bit EVEX instructions which have a parallel VEX instruction. In this case, the EVEX instruction is 512-bit, so the EVEX2VEX pass just ignores it. The bottom line is that the EVEX2VEX pass with its current logic cannot handle these cases.
D35839/rL309298 fixed the VEX ymm->xmm zero case, and D35965/rL309926 fixed the EVEX zmm/ymm->xmm zero cases. There is still an issue with duplicate xmm zero registers that is mentioned in [Bug #26018].