Open
Description
Hello,
I am trying to port code that uses VNNI / Neon Dotprod, and I am having trouble detecting support for these operations under dynamic dispatch.
I modified skeleton.cc / skeleton.h by adding the following lines to CodepathDemo:
#ifdef HWY_NATIVE_U8_I8_SUMOFMULQUADACCUMULATE
const char* gather2 = "Has VNNI";
#else
const char* gather2 = "No VNNI";
#endif
#ifdef HWY_NATIVE_U8_I8_SATWIDENMULPAIRWISEADD
const char* gather3 = "Has Fallback";
#else
const char* gather3 = "No Fallback";
#endif
printf("Target %15s: %15s %15s %15s %d\n",
hwy::TargetName(HWY_TARGET),
gather, gather2, gather3,
HWY_TARGET <= HWY_AVX3_DL);
Then I replaced dynamic dispatch with:
#define VISITOR(TARGET, NAMESPACE) NAMESPACE::CodepathDemo();
HWY_VISIT_TARGETS(VISITOR)
(I also tried a SetSupportedTargetsForTest + HWY_DYNAMIC_DISPATCH loop, with exactly the same results.)
The program was compiled with gcc at -O3, with no -march flags enabled.
The output shows unexpected toggling of the macros:
Target AVX2: Has int64 Has VNNI Has Fallback 0
Target AVX3: Has int64 No VNNI No Fallback 0
Target AVX3_DL: Has int64 Has VNNI Has Fallback 1
Target AVX3_SPR: Has int64 Has VNNI Has Fallback 1
Target AVX3_ZEN4: Has int64 No VNNI No Fallback 1
Target SSE2: Has int64 No VNNI No Fallback 0
Target SSE2: Has int64 No VNNI No Fallback 0
Target SSE4: Has int64 No VNNI No Fallback 0
Target SSSE3: Has int64 Has VNNI Has Fallback 0
It seems that the HWY_NATIVE_* macros are toggled on/off at each iteration of the dynamic dispatch codegen process:
- SSE2 -> OFF
- SSSE3 -> ON
- SSE4 -> OFF
- AVX2 -> ON
- AVX3 -> OFF
- AVX3_DL -> ON
- AVX3_ZEN4 -> OFF
- AVX3_SPR -> ON
However, testing other macros such as HWY_NATIVE_FMA, which are defined in set_macros-inl.h, works perfectly!
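To illustrate what I suspect is happening, here is a small self-contained sketch with no Highway dependency. It assumes the macro is flipped once per target pass, so a plain #ifdef only sees the parity of the pass rather than a capability. DEMO_NATIVE_OP and PassStates are hypothetical names of mine, and this is not Highway's actual code:

```cpp
#include <cassert>
#include <string>

// DEMO_NATIVE_OP is a hypothetical stand-in for a HWY_NATIVE_* macro.
// Each "pass" mimics one per-target compilation pass: flip the macro,
// then test it with a plain #ifdef, as in my CodepathDemo change.
std::string PassStates() {
  std::string states;

  // Pass 1: the macro starts undefined, so the flip defines it.
#ifdef DEMO_NATIVE_OP
#undef DEMO_NATIVE_OP
#else
#define DEMO_NATIVE_OP
#endif
#ifdef DEMO_NATIVE_OP
  states += '1';
#else
  states += '0';
#endif

  // Pass 2: the same flip now undefines it.
#ifdef DEMO_NATIVE_OP
#undef DEMO_NATIVE_OP
#else
#define DEMO_NATIVE_OP
#endif
#ifdef DEMO_NATIVE_OP
  states += '1';
#else
  states += '0';
#endif

  // Pass 3: defined again.
#ifdef DEMO_NATIVE_OP
#undef DEMO_NATIVE_OP
#else
#define DEMO_NATIVE_OP
#endif
#ifdef DEMO_NATIVE_OP
  states += '1';
#else
  states += '0';
#endif

  // Pass 4: undefined again.
#ifdef DEMO_NATIVE_OP
#undef DEMO_NATIVE_OP
#else
#define DEMO_NATIVE_OP
#endif
#ifdef DEMO_NATIVE_OP
  states += '1';
#else
  states += '0';
#endif

  assert(states.size() == 4);  // one character per pass
  return states;
}
```

PassStates() returns "1010": the #ifdef result alternates on every pass regardless of what the pass represents, which is the same kind of pattern as in my output above.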
Questions:
- Is this meant to work and did I mess something up, or are most HWY_NATIVE_* macros intended only for static dispatch?
- If I want to detect VNNI / Dotprod support reliably in a dynamically dispatched function, should I:
  - Use something like HWY_TARGET <= HWY_AVX3_DL?
  - Build static targets separately, aggregate via CMake, and probe the dispatch function once to get the correct function table?
  - Should I not be trying to manually route kernels, and instead let auto-tuning decide for itself what to do? For instance, generate 3 or 4 kernels per target, even the ones using non-native operations that will be very slow, then probe the target function pointers at startup and tune to find the most efficient one.
  - Am I way off topic, and is this not the way to go at all?
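To make the auto-tuning option concrete, this is roughly what I have in mind: time every candidate kernel once at startup and keep the fastest. It is only a sketch under my own assumptions. DotKernel, ScalarDot and PickFastest are names I made up, and in the real thing the candidates would be the per-target function pointers rather than a scalar loop:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical kernel signature: u8 x i8 dot product accumulated into i32.
using DotKernel =
    std::function<int32_t(const uint8_t*, const int8_t*, size_t)>;

// Plain scalar reference kernel, standing in for one candidate.
int32_t ScalarDot(const uint8_t* a, const int8_t* b, size_t n) {
  int32_t sum = 0;
  for (size_t i = 0; i < n; ++i) {
    sum += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  }
  return sum;
}

// Times every candidate on the same input and returns the index of the one
// with the best single-run time. Returns candidates.size() (that is, 0) if
// the list is empty.
size_t PickFastest(const std::vector<DotKernel>& candidates,
                   const uint8_t* a, const int8_t* b, size_t n,
                   int reps = 5) {
  assert(reps > 0 && "need at least one timing run");
  using Clock = std::chrono::steady_clock;
  size_t best_index = candidates.size();
  auto best_time = Clock::duration::max();
  volatile int32_t sink = 0;  // keeps the calls from being optimized away
  for (size_t k = 0; k < candidates.size(); ++k) {
    for (int r = 0; r < reps; ++r) {
      const auto t0 = Clock::now();
      sink = candidates[k](a, b, n);
      const auto elapsed = Clock::now() - t0;
      if (elapsed < best_time) {
        best_time = elapsed;
        best_index = k;
      }
    }
  }
  (void)sink;
  return best_index;
}
```

The returned index would be cached and used for all later calls. The obvious cost is that every candidate, including the slow emulated ones, has to be compiled and run at least once per process.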
Thank you in advance!