opened on Sep 10, 2024
Based on the description in #91227, I thought each of the following might compile down to a single vpternlogd:
static Vector512<int> Exp1(Vector512<int> a, Vector512<int> b, Vector512<int> c) =>
Vector512.ConditionalSelect(a, b & c, b | c);
static Vector512<int> Exp2(Vector512<int> a, Vector512<int> b, Vector512<int> c) =>
(a & (b & c)) | (~a & (b | c));
but they don't today. The first does emit a vpternlogd, but it's the standard one that ConditionalSelect uses to choose between the two results, so the and and the or are still computed separately:
vmovups zmm0, zmmword ptr [r8]
vmovups zmm1, zmmword ptr [r9]
vpandd zmm2, zmm1, zmm0
vpord zmm0, zmm1, zmm0
vpternlogd zmm0, zmm2, zmmword ptr [rdx], -40
The second results in two vpternlogds that are then or'd together:
vmovups zmm0, zmmword ptr [rdx]
vmovups zmm1, zmmword ptr [r8]
vmovups zmm2, zmmword ptr [r9]
vmovaps zmm3, zmm0
vpternlogd zmm3, zmm2, zmm1, -128
vpternlogd zmm2, zmm1, zmm0, 84
vpord zmm0, zmm2, zmm3
rather than a single vpternlogd that handles the whole bitwise operation.
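For reference, here is a minimal sketch of how the control byte for that single vpternlogd could be derived: evaluate the expression on the truth-table constants 0xF0, 0xCC and 0xAA, one per operand. It assumes a, b and c map to the first, second and third operand in that order; `TernaryLogicControl`, `From`, `Exp` and `Apply` are hypothetical names used only for illustration, not .NET APIs or anything the JIT does today.

```csharp
using System;

// Sketch: derive a vpternlogd control byte by evaluating a bitwise function
// of three inputs on the standard truth-table constants.
static class TernaryLogicControl
{
    // Truth-table columns for the first, second and third operand.
    const byte A = 0xF0, B = 0xCC, C = 0xAA;

    // Hypothetical helper: f is the desired bitwise function of three inputs.
    public static byte From(Func<int, int, int, int> f) => (byte)f(A, B, C);

    // Control byte for Exp1/Exp2, assuming operand order (a, b, c).
    public static byte Exp() => From((a, b, c) => (a & (b & c)) | (~a & (b | c)));

    // Scalar reference: applies a control byte one bit position at a time.
    // The bits of a, b and c at position i form a 3-bit index into the control byte.
    public static int Apply(int a, int b, int c, byte control)
    {
        int result = 0;
        for (int i = 0; i < 32; i++)
        {
            int index = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1);
            result |= ((control >> index) & 1) << i;
        }
        return result;
    }
}
```

Under that assumption `TernaryLogicControl.Exp()` evaluates to 0x8E, which a disassembly would print as -114. The same helper reproduces the immediates in the listings above: `From((x, y, m) => (m & y) | (~m & x))` gives 0xD8 (-40), the select with the mask as the third operand, and `From((x, y, z) => x & y & z)` gives 0x80 (-128). If I have the API right, the manual equivalent would be `Avx512F.TernaryLogic(a, b, c, 0x8E)`.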
Is this just a further optimization opportunity, or is there something preventing it?
cc: @tannergooding, @EgorBo