Skip to content

Avoid use of _mm512_set_epi16 for reversing 16-bit vectors #198

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 22, 2025

Conversation

sterrettm2
Copy link
Contributor

To help with older GCC versions we should avoid using _mm512_set_epi16, and currently some are used for reversing vectors. Hopefully this fixes GCC 8 builds and thus #196.
There is no meaningful performance change from this patch; here are benchmarks to show the lack of regression

Benchmarks
Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                -0.0038         -0.0038           372           371           375           374
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0004         +0.0005           848           849           851           851
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0006         +0.0007          1427          1428          1430          1431
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                -0.0016         -0.0019          3033          3028          3037          3031
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                -0.0026         -0.0027         17372         17326         17379         17332
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0007         +0.0008        542548        542944        542538        542948
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                -0.0003         -0.0004       1494079       1493635       1494006       1493472
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0048         +0.0048       6177541       6207248       6177238       6206961
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                -0.0004         -0.0001      83381017      83344051      83347728      83342490
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0060         +0.0056           371           373           373           375
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0021         +0.0018           813           814           815           817
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0012         +0.0017          1536          1538          1539          1541
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0105         +0.0104          2893          2923          2895          2925
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                -0.0082         -0.0082         17019         16879         17021         16882
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0002         +0.0002        544503        544623        544440        544575
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0056         +0.0056       1907694       1918321       1907299       1917984
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0041         +0.0041       6315616       6341351       6315245       6340965
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0020         +0.0019         35652         35724         35657         35723
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                +0.0097         +0.0032           633           639           631           633
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]                -0.0014         -0.0015         37039         36987         37041         36986
[simdsort/.*/_Float16 vs. simdsort/.*/_Float16]_pvalue          0.9461          0.9246      U Test, Repetitions: 20 vs 20
OVERALL_GEOMEAN                                                +0.0015         +0.0011             0             0             0             0

Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0046         +0.0047           334           336           336           338
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0063         +0.0062           446           449           448           451
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0003         +0.0023           709           710           710           712
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0018         -0.0049          1767          1764          1778          1769
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0049         -0.0051         11002         10948         11007         10951
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0005         +0.0005        261175        261295        261171        261303
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0024         -0.0027        897326        895148        897015        894612
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0026         +0.0026       7619747       7639234       7619279       7639067
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0608         +0.0610      83122688      88176807      83104235      88173763
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0015         -0.0016           336           336           338           337
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0005         -0.0002           449           449           451           451
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0006         -0.0029           710           710           713           711
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0042         -0.0052          1774          1767          1781          1772
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0024         -0.0020          8889          8867          8892          8875
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0018         +0.0017         67681         67802         67662         67777
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0024         +0.0027        365596        366482        365094        366087
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0070         -0.0067       3462076       3437960       3461054       3437710
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0013         +0.0016         22272         22301         22270         22305
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0027         +0.0064           594           595           591           595
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0038         -0.0035         23012         22925         23010         22930
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0000         +0.0001           337           337           339           339
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0003         +0.0001           450           450           451           451
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0015         +0.0002           709           710           712           712
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0025         -0.0022          1768          1763          1776          1772
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0001         +0.0001         10921         10920         10927         10928
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0002         +0.0000        260264        260215        260207        260212
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0010         -0.0010        894294        893415        893799        892941
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0007         +0.0010       7607092       7612503       7604347       7612095
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0095         +0.0097      84498970      85301690      84473524      85291403
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0047         -0.0041           336           335           338           337
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0049         -0.0025           450           448           452           450
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0008         +0.0003           710           710           711           711
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0002         +0.0005          1771          1771          1779          1780
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0026         +0.0028          8836          8859          8840          8865
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0004         +0.0005         67851         67882         67840         67873
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0039         -0.0035        370442        369016        369922        368620
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0069         -0.0069       3466781       3442719       3466550       3442519
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0205         +0.0207         22158         22613         22159         22618
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                +0.0009         +0.0080           594           594           589           593
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]                -0.0039         -0.0035         22895         22807         22892         22811
[simdsort/.*/*int16_t vs. simdsort/.*/*int16_t]_pvalue          0.9962          0.9885      U Test, Repetitions: 40 vs 40
OVERALL_GEOMEAN                                                +0.0015         +0.0018             0             0             0             0

Copy link
Contributor

@r-devulap r-devulap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @sterrettm2 for fixing it!

@r-devulap r-devulap merged commit eafc913 into intel:main Apr 22, 2025
13 checks passed
@r-devulap r-devulap requested a review from Copilot May 6, 2025 04:28
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR replaces the usage of _mm512_set_epi16 with a constexpr array and _mm512_loadu_si512 to reverse 16‐bit vectors, addressing build issues on older GCC versions. Key changes include updating the NETWORK_REVERSE_32LANES macro to list indices in reversed order and modifying the vector reversal functions in both avx512fp16-16bit-qsort.hpp and avx512-16bit-qsort.hpp.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
src/xss-common-includes.h Updated NETWORK_REVERSE_32LANES macro to emit reversed index order for vector reversal.
src/avx512fp16-16bit-qsort.hpp Replaced _mm512_set_epi16 with loading a constexpr array for reversing vectors.
src/avx512-16bit-qsort.hpp Similarly replaced _mm512_set_epi16 with a constexpr array load to reverse vectors.
Comments suppressed due to low confidence (3)

src/xss-common-includes.h:82

  • Ensure that the revised ordering in the NETWORK_REVERSE_32LANES macro correctly matches the intended reversal behavior for all consumers of this macro.
31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0

src/avx512fp16-16bit-qsort.hpp:142

  • Verify that using _mm512_loadu_si512 on the static array 'arr' preserves the correct vector reversal behavior and that the array size exactly meets the requirements of the intrinsic.
constexpr static uint16_t arr[] = {NETWORK_REVERSE_32LANES};

src/avx512-16bit-qsort.hpp:178

  • Confirm that replacing _mm512_set_epi16 with a constexpr array and _mm512_loadu_si512 in the reverse functions maintains the intended reversal logic, particularly regarding the reversed index order.
constexpr static uint16_t arr[] = {NETWORK_REVERSE_32LANES};

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants