VEX-encoded vxorps %xmm0, %xmm0, %xmm0 does the same thing as EVEX-encoded vxorps %zmm0, %zmm0, %zmm0: it zeroes the full-width vector and breaks dependencies on the old value of the architectural register. The VEX version is also two bytes shorter than the EVEX version (4 bytes vs. 6, since xor-zeroing of low registers can use the 2-byte VEX prefix while the EVEX prefix is always 4 bytes).

#include <immintrin.h>
__m512 zerovec() { return _mm512_setzero_ps(); }

currently compiles to (https://godbolt.org/g/i5LA9Y):

  gcc8 and clang5.0.0 (trunk 301766):  vxorps %zmm0, %zmm0, %zmm0
  ICC17:                               vpxord %zmm0, %zmm0, %zmm0
  MSVC:                                vxorps xmm0, xmm0, xmm0

Always using 128b zeroing instructions wouldn't hurt for AVX/AVX2 either. I'm not sure whether CPUs like Jaguar or the Bulldozer family (which crack 256b instructions into two 128b ops) handle xor-zeroing specially and need only one internal operation for vxorps %ymm,%ymm,%ymm zeroing. If not, using 128b would save execution throughput. (Maybe they crack instructions at decode, but independence detection happens later? Unlikely, because it probably has to decode to a special zeroing micro-op.)

One possible downside is that vxorps %ymm0,%ymm0,%ymm0 warms up the 256b execution units on Intel CPUs like Skylake, but vxorps %xmm0,%xmm0,%xmm0 doesn't. As Agner Fog describes (in http://agner.org/optimize/microarchitecture.pdf), a program can run a single 256b AVX instruction at least ~56,000 clock cycles ahead of a critical 256b loop to start the warm-up process early. IDK if any existing code uses something like this function to achieve that:

__attribute__((noinline)) __m256 warmup_avx256(void) { return _mm256_setzero_ps(); }

If vxorps %xmm0,%xmm0,%xmm0 is faster on Bulldozer or Jaguar, we should probably always use it; people will just have to use something else for their warmup function, maybe all-ones. OTOH, during the warm-up period (e.g. at the start of executing an AVX 256b function when the 256b execution units were asleep), vxorps %xmm0,%xmm0,%xmm0 may be faster anyway.
---- AFAIK, there are no problems with mixing VEX and EVEX vector instructions on any existing AVX512 hardware (KNL and Skylake-AVX512). In asm syntax, I'm not sure there's even a way to request the EVEX encoding of vaddps %ymm, %ymm, %ymm with no masking or broadcast-load source operand. You could easily have that as part of a horizontal-add of a zmm vector, and there aren't EVEX versions of
If anyone has Bulldozer-family, Jaguar, or Zen hardware they can test on, I posted an SO question with an asm loop that might be helpful in testing with perf counters. http://stackoverflow.com/questions/43713273/is-vxorps-zeroing-on-amd-bulldozer-or-jaguar-faster-with-xmm-register-than-ymm
Agner Fog confirms that always using AVX-128 zeroing is a good idea for all current and likely future CPUs. http://stackoverflow.com/a/43751783/224132. AMD Ryzen decodes vxorps %ymm0, %ymm0, %ymm0 to two micro-ops, but only one for the xmm version, so there is a real performance gain from this on real hardware.
Ayman, can this be easily fixed by adding a new entry to EVEX2VEX?
It's not just EVEX vxorps - if possible all V_SET0 should use VEX vxorps xmm on AVX1+ targets.
EVEX2VEX should fix it.
Sorry for the late reply (I didn't notice the previous comments). The EVEX2VEX pass replaces only 128-bit and 256-bit EVEX instructions which have a parallel VEX instruction. In this case, the EVEX instruction is 512-bit, so the EVEX2VEX pass just ignores it. The bottom line is that the EVEX2VEX pass with its current logic cannot handle these cases.
D35839/rL309298 fixed the VEX ymm->xmm zero case, and D35965/rL309926 fixed the EVEX zmm/ymm->xmm zero cases. There is still an issue with duplicate xmm zero registers that is mentioned in [Bug #26018].