Implement fallback to smaller vector size for swizzle_dyn #433

Open

@cvijdea-bd commented Aug 25, 2024

This PR adds a fallback implementation so that, e.g., u8x64::swizzle_dyn can be reasonably efficient even when compiled with only 128-bit SSSE3.

A "downgraded" swizzle_dyn op on N lanes emits 4 swizzle_dyn ops on N/2 lanes. If the optimizer can deduce that all index values are below N/2 (or a smaller bound), the result will generally be more efficient.
For example, u8x64::swizzle_dyn emits only 4 pshufb instructions on SSSE3, instead of 16 in the general case, if the optimizer can prove that index values are always <16 (this is generally achieved by preceding the swizzle with a 0xf mask).
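To illustrate the decomposition, here is a scalar sketch on plain byte slices. It is not the code in this PR: the function names are made up for the example, and it only models the documented behaviour that out-of-range indices produce 0. One full-width swizzle is rebuilt from four half-width ones by looking each half of the index vector up in both halves of the table and OR-ing the two partial results.

```rust
/// Scalar model of `swizzle_dyn` (hypothetical helper): out-of-range indices yield 0.
fn swizzle_ref(table: &[u8], idx: &[u8]) -> Vec<u8> {
    idx.iter()
        .map(|&i| table.get(i as usize).copied().unwrap_or(0))
        .collect()
}

/// The same result, built from four half-width swizzles (hypothetical helper).
fn swizzle_halved(table: &[u8], idx: &[u8]) -> Vec<u8> {
    assert_eq!(table.len(), idx.len());
    let n = table.len();
    assert!(n >= 2 && n % 2 == 0 && n <= 256);
    let h = n / 2;
    let (t_lo, t_hi) = table.split_at(h);

    let mut out = Vec::with_capacity(n);
    for idx_half in idx.chunks(h) {
        // Rebase indices onto the upper half of the table. Indices below `h`
        // wrap around to a value >= `h`, so they are out of range there.
        let shifted: Vec<u8> = idx_half.iter().map(|&i| i.wrapping_sub(h as u8)).collect();
        let from_lo = swizzle_ref(t_lo, idx_half); // half-width swizzle
        let from_hi = swizzle_ref(t_hi, &shifted); // half-width swizzle
        // At most one of the two lookups is in range for any lane, so OR merges them.
        out.extend(from_lo.iter().zip(&from_hi).map(|(a, b)| a | b));
    }
    out
}
```

The loop runs twice with two half-width swizzles each, which gives the 4 ops mentioned above. When every index is known to be below N/2, `from_hi` is always zero and the swizzles against the upper half of the table can be dropped.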

Additionally, for non-power-of-two N values, this PR adds a fallback implementation that zero-extends to the next power-of-two size.
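A scalar sketch of the zero-extension idea, again on plain slices with made-up function names rather than the PR's actual code: pad both the table and the indices with zeros up to the next power of two, swizzle at that width, then drop the extra lanes.

```rust
/// Scalar model of `swizzle_dyn` (hypothetical helper): out-of-range indices yield 0.
fn swizzle_scalar(table: &[u8], idx: &[u8]) -> Vec<u8> {
    idx.iter()
        .map(|&i| table.get(i as usize).copied().unwrap_or(0))
        .collect()
}

/// Swizzle a non-power-of-two width by zero-extending to the next power of two.
fn swizzle_padded(table: &[u8], idx: &[u8]) -> Vec<u8> {
    assert_eq!(table.len(), idx.len());
    let n = table.len();
    let p = n.next_power_of_two();

    let mut wide_table = table.to_vec();
    wide_table.resize(p, 0); // padding lanes hold 0
    let mut wide_idx = idx.to_vec();
    wide_idx.resize(p, 0); // extra index lanes are discarded below

    let mut out = swizzle_scalar(&wide_table, &wide_idx);
    out.truncate(n);
    out
}
```

Indices in the range N..P now land in the zero padding instead of being out of range, but both cases produce 0, so the result matches the N-lane swizzle.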

Benchmarks

Below are benchmark results for the following code, on 5 target-cpu levels:

    // Bytes processed per iteration; the harness uses this to report MB/s.
    b.bytes = (lookups.len() * std::mem::size_of::<Simd<u8, N>>()) as u64;
    b.iter(|| {
        for (lookup, index) in black_box(&lookups).iter().zip(black_box(&indexes).iter()) {
            let lookup = *lookup;
            // MASK bounds every index (e.g. 0xf keeps them below 16).
            let index = *index & Simd::splat(MASK);
            black_box(Simd::<u8, N>::swizzle_dyn(lookup, index));
        }
    });

  • x86-64 - baseline, no vectorized shuffles
  • x86-64-v2 - ssse3, adds pshufb on u8x16
  • x86-64-v3 - avx2, adds vpshufb on u8x32 (vpshufb is not a true extension of pshufb to 256-bit; it is more like 2 pshufb ops run in parallel, each with 4-bit indices into its own 128-bit lane)
  • x86-64-v4 - avx512, adds vpshufb on u8x64 (again really just 4x pshufb in parallel)
  • icelake-server - avx512vbmi, adds vpermb (this is a true 256/512-bit shuffle with 5/6-bit indices)
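To make the vpshufb remark concrete, here is a scalar model of the two instructions as documented by Intel (a sketch for illustration, not code from this PR; the function names are made up). pshufb uses the low 4 bits of each index byte and zeroes the lane when bit 7 is set. The 256-bit vpshufb applies that same rule separately inside each 128-bit lane.

```rust
/// Scalar model of 128-bit `pshufb`.
fn pshufb(table: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for k in 0..16 {
        // Bit 7 set: output 0. Otherwise only the low 4 bits select a byte.
        if idx[k] & 0x80 == 0 {
            out[k] = table[(idx[k] & 0x0f) as usize];
        }
    }
    out
}

/// Scalar model of 256-bit `vpshufb`: two independent 16-byte shuffles.
fn vpshufb256(table: [u8; 32], idx: [u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for lane in 0..2 {
        let base = lane * 16;
        for k in 0..16 {
            let i = idx[base + k];
            if i & 0x80 == 0 {
                // The lookup never leaves the current 128-bit lane.
                out[base + k] = table[base + (i & 0x0f) as usize];
            }
        }
    }
    out
}
```

With a table holding the bytes 0..32, index 17 yields 1 in the low lane and 17 in the high lane, which is why a 32-lane swizzle_dyn cannot be a single vpshufb in the general case.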

N.B. the code used to benchmark includes #431, and removes the src/masks/bitmask.rs avx512f mask implementation to work around the problem discussed here.

The main thing to look for is the performance of swizzle_dyn_64* on non-vbmi targets (v2, v3, v4), and the performance of swizzle_dyn_32* on x86-64-v2. With the previous implementation, sizes without a native vector instruction fall back to the scalar implementation, while with this PR they fall back to a smaller vector size, for ~4-10x the performance of the scalar version.

Benchmark data for old version

-Ctarget-cpu=x86-64

test bench_swizzle_dyn_16        ... bench:   1,758,136.93 ns/iter (+/- 620,537.28) = 1192 MB/s
test bench_swizzle_dyn_16_mask16 ... bench:     884,830.30 ns/iter (+/- 26,690.78) = 2370 MB/s
test bench_swizzle_dyn_32        ... bench:   2,852,244.70 ns/iter (+/- 981,864.91) = 735 MB/s
test bench_swizzle_dyn_32_mask16 ... bench:   1,218,294.30 ns/iter (+/- 15,292.08) = 1721 MB/s
test bench_swizzle_dyn_32_mask32 ... bench:   1,218,707.50 ns/iter (+/- 13,197.90) = 1720 MB/s
test bench_swizzle_dyn_64        ... bench:   5,415,598.00 ns/iter (+/- 173,006.27) = 387 MB/s
test bench_swizzle_dyn_64_mask16 ... bench:   1,274,801.70 ns/iter (+/- 21,257.84) = 1645 MB/s
test bench_swizzle_dyn_64_mask32 ... bench:     818,246.50 ns/iter (+/- 31,324.95) = 2562 MB/s
test bench_swizzle_dyn_64_mask64 ... bench:     825,826.30 ns/iter (+/- 42,892.83) = 2539 MB/s
test bench_swizzle_dyn_24        ... bench:   2,546,501.90 ns/iter (+/- 78,041.70) = 823 MB/s
test bench_swizzle_dyn_24_mask16 ... bench:   1,132,914.90 ns/iter (+/- 33,921.66) = 1851 MB/s
test bench_swizzle_dyn_43        ... bench:   5,561,422.50 ns/iter (+/- 1,310,875.16) = 377 MB/s
test bench_swizzle_dyn_43_mask16 ... bench:     888,501.18 ns/iter (+/- 36,721.91) = 2360 MB/s
test bench_swizzle_dyn_5_mask_8  ... bench:   7,800,009.70 ns/iter (+/- 101,916.37) = 268 MB/s

-Ctarget-cpu=x86-64-v2

test bench_swizzle_dyn_16        ... bench:     129,001.85 ns/iter (+/- 2,045.71) = 16256 MB/s
test bench_swizzle_dyn_16_mask16 ... bench:     126,409.63 ns/iter (+/- 1,106.15) = 16590 MB/s
test bench_swizzle_dyn_32        ... bench:   2,648,847.38 ns/iter (+/- 72,835.19) = 791 MB/s
test bench_swizzle_dyn_32_mask16 ... bench:     638,609.40 ns/iter (+/- 1,205.20) = 3283 MB/s
test bench_swizzle_dyn_32_mask32 ... bench:     638,810.50 ns/iter (+/- 1,368.94) = 3282 MB/s
test bench_swizzle_dyn_64        ... bench:   5,125,435.65 ns/iter (+/- 332,486.48) = 409 MB/s
test bench_swizzle_dyn_64_mask16 ... bench:     817,608.81 ns/iter (+/- 41,425.93) = 2564 MB/s
test bench_swizzle_dyn_64_mask32 ... bench:     814,272.70 ns/iter (+/- 39,830.58) = 2575 MB/s
test bench_swizzle_dyn_64_mask64 ... bench:     817,215.23 ns/iter (+/- 30,396.90) = 2566 MB/s
test bench_swizzle_dyn_24        ... bench:   2,402,568.90 ns/iter (+/- 61,893.97) = 872 MB/s
test bench_swizzle_dyn_24_mask16 ... bench:     610,669.90 ns/iter (+/- 18,419.01) = 3434 MB/s
test bench_swizzle_dyn_43        ... bench:   4,149,922.60 ns/iter (+/- 162,629.27) = 505 MB/s
test bench_swizzle_dyn_43_mask16 ... bench:     868,863.50 ns/iter (+/- 27,605.20) = 2413 MB/s
test bench_swizzle_dyn_5_mask_8  ... bench:   7,755,508.70 ns/iter (+/- 164,755.79) = 270 MB/s

-Ctarget-cpu=x86-64-v3

test bench_swizzle_dyn_16        ... bench:     141,146.06 ns/iter (+/- 207.07) = 14858 MB/s
test bench_swizzle_dyn_16_mask16 ... bench:     138,518.46 ns/iter (+/- 191.52) = 15139 MB/s
test bench_swizzle_dyn_32        ... bench:     121,160.02 ns/iter (+/- 162.73) = 17308 MB/s
test bench_swizzle_dyn_32_mask16 ... bench:     118,747.46 ns/iter (+/- 420.25) = 17660 MB/s
test bench_swizzle_dyn_32_mask32 ... bench:     120,251.52 ns/iter (+/- 186.12) = 17439 MB/s
test bench_swizzle_dyn_64        ... bench:   4,981,829.60 ns/iter (+/- 169,819.79) = 420 MB/s
test bench_swizzle_dyn_64_mask16 ... bench:     734,519.20 ns/iter (+/- 26,480.60) = 2855 MB/s
test bench_swizzle_dyn_64_mask32 ... bench:     755,231.40 ns/iter (+/- 25,220.19) = 2776 MB/s
test bench_swizzle_dyn_64_mask64 ... bench:     756,587.60 ns/iter (+/- 9,162.56) = 2771 MB/s
test bench_swizzle_dyn_24        ... bench:   2,525,430.65 ns/iter (+/- 80,467.50) = 830 MB/s
test bench_swizzle_dyn_24_mask16 ... bench:     618,569.60 ns/iter (+/- 15,202.54) = 3390 MB/s
test bench_swizzle_dyn_43        ... bench:   3,806,743.75 ns/iter (+/- 22,935.83) = 550 MB/s
test bench_swizzle_dyn_43_mask16 ... bench:     885,843.70 ns/iter (+/- 656.71) = 2367 MB/s
test bench_swizzle_dyn_5_mask_8  ... bench:   7,792,672.30 ns/iter (+/- 702,420.66) = 269 MB/s

-Ctarget-cpu=x86-64-v4

test bench_swizzle_dyn_16        ... bench:     144,257.45 ns/iter (+/- 39,790.03) = 14537 MB/s
test bench_swizzle_dyn_16_mask16 ... bench:     143,098.64 ns/iter (+/- 305.36) = 14655 MB/s
test bench_swizzle_dyn_32        ... bench:     140,321.57 ns/iter (+/- 14,479.64) = 14945 MB/s
test bench_swizzle_dyn_32_mask16 ... bench:     122,695.31 ns/iter (+/- 169.55) = 17092 MB/s
test bench_swizzle_dyn_32_mask32 ... bench:     125,273.56 ns/iter (+/- 212.20) = 16740 MB/s
test bench_swizzle_dyn_64        ... bench:   5,133,064.60 ns/iter (+/- 71,121.29) = 408 MB/s
test bench_swizzle_dyn_64_mask16 ... bench:     778,564.40 ns/iter (+/- 3,198.50) = 2693 MB/s
test bench_swizzle_dyn_64_mask32 ... bench:     778,808.50 ns/iter (+/- 3,380.96) = 2692 MB/s
test bench_swizzle_dyn_64_mask64 ... bench:     842,318.40 ns/iter (+/- 34,180.06) = 2489 MB/s
test bench_swizzle_dyn_24        ... bench:   3,409,837.20 ns/iter (+/- 5,918.96) = 615 MB/s
test bench_swizzle_dyn_24_mask16 ... bench:     831,782.30 ns/iter (+/- 1,396.30) = 2521 MB/s
test bench_swizzle_dyn_43        ... bench:   3,845,018.10 ns/iter (+/- 62,559.08) = 545 MB/s
test bench_swizzle_dyn_43_mask16 ... bench:     925,849.20 ns/iter (+/- 1,840.68) = 2265 MB/s
test bench_swizzle_dyn_5_mask_8  ... bench:   7,792,497.60 ns/iter (+/- 169,610.01) = 269 MB/s

-Ctarget-cpu=icelake-server (avx512vbmi)

test bench_swizzle_dyn_16        ... bench:     141,509.67 ns/iter (+/- 431.88) = 14819 MB/s
test bench_swizzle_dyn_16_mask16 ... bench:     140,317.70 ns/iter (+/- 291.18) = 14945 MB/s
test bench_swizzle_dyn_32        ... bench:     120,201.04 ns/iter (+/- 127.38) = 17447 MB/s
test bench_swizzle_dyn_32_mask16 ... bench:     120,402.70 ns/iter (+/- 1,684.98) = 17417 MB/s
test bench_swizzle_dyn_32_mask32 ... bench:     120,711.00 ns/iter (+/- 1,595.57) = 17373 MB/s
test bench_swizzle_dyn_64        ... bench:     127,307.53 ns/iter (+/- 1,356.22) = 16473 MB/s
test bench_swizzle_dyn_64_mask16 ... bench:     127,292.10 ns/iter (+/- 811.17) = 16475 MB/s
test bench_swizzle_dyn_64_mask32 ... bench:     127,435.99 ns/iter (+/- 1,624.41) = 16456 MB/s
test bench_swizzle_dyn_64_mask64 ... bench:     127,359.49 ns/iter (+/- 1,162.59) = 16466 MB/s
test bench_swizzle_dyn_24        ... bench:   3,403,487.80 ns/iter (+/- 20,527.12) = 616 MB/s
test bench_swizzle_dyn_24_mask16 ... bench:     833,268.80 ns/iter (+/- 8,846.27) = 2516 MB/s
test bench_swizzle_dyn_43        ... bench:   3,989,622.70 ns/iter (+/- 136,142.64) = 525 MB/s
test bench_swizzle_dyn_43_mask16 ... bench:     925,779.40 ns/iter (+/- 6,602.74) = 2265 MB/s
test bench_swizzle_dyn_5_mask_8  ... bench:   9,831,859.25 ns/iter (+/- 2,805,111.90) = 213 MB/s

Benchmark data for new version

-Ctarget-cpu=x86-64

test bench_swizzle_dyn_16        ... bench:   1,757,825.75 ns/iter (+/- 55,621.98) = 1193 MB/s
test bench_swizzle_dyn_16_mask16 ... bench:     885,512.05 ns/iter (+/- 25,654.57) = 2368 MB/s
test bench_swizzle_dyn_32        ... bench:   2,805,398.55 ns/iter (+/- 66,245.41) = 747 MB/s
test bench_swizzle_dyn_32_mask16 ... bench:     904,389.00 ns/iter (+/- 13,162.55) = 2318 MB/s
test bench_swizzle_dyn_32_mask32 ... bench:     904,787.30 ns/iter (+/- 25,749.03) = 2317 MB/s
test bench_swizzle_dyn_64        ... bench:   5,401,521.75 ns/iter (+/- 165,573.85) = 388 MB/s
test bench_swizzle_dyn_64_mask16 ... bench:   1,274,823.30 ns/iter (+/- 38,571.44) = 1645 MB/s
test bench_swizzle_dyn_64_mask32 ... bench:     812,329.91 ns/iter (+/- 34,655.58) = 2581 MB/s
test bench_swizzle_dyn_64_mask64 ... bench:     820,164.60 ns/iter (+/- 38,105.78) = 2556 MB/s
test bench_swizzle_dyn_24        ... bench:   2,520,968.30 ns/iter (+/- 101,413.91) = 831 MB/s
test bench_swizzle_dyn_24_mask16 ... bench:   1,133,691.24 ns/iter (+/- 43,145.54) = 1849 MB/s
test bench_swizzle_dyn_43        ... bench:   4,126,429.50 ns/iter (+/- 141,499.13) = 508 MB/s
test bench_swizzle_dyn_43_mask16 ... bench:     886,000.79 ns/iter (+/- 26,400.43) = 2366 MB/s
test bench_swizzle_dyn_5_mask_8  ... bench:   7,799,985.50 ns/iter (+/- 327,156.35) = 268 MB/s

-Ctarget-cpu=x86-64-v2

test bench_swizzle_dyn_16        ... bench:     142,425.58 ns/iter (+/- 2,086.47) = 14724 MB/s
test bench_swizzle_dyn_16_mask16 ... bench:     137,679.62 ns/iter (+/- 2,961.24) = 15232 MB/s
test bench_swizzle_dyn_32        ... bench:     180,386.56 ns/iter (+/- 40,374.21) = 11625 MB/s
test bench_swizzle_dyn_32_mask16 ... bench:     117,059.84 ns/iter (+/- 263.47) = 17915 MB/s
test bench_swizzle_dyn_32_mask32 ... bench:     122,496.84 ns/iter (+/- 5,551.35) = 17120 MB/s
test bench_swizzle_dyn_64        ... bench:     255,301.92 ns/iter (+/- 6,439.50) = 8214 MB/s
test bench_swizzle_dyn_64_mask16 ... bench:     118,240.99 ns/iter (+/- 3,802.15) = 17736 MB/s
test bench_swizzle_dyn_64_mask32 ... bench:     122,202.14 ns/iter (+/- 6,491.88) = 17161 MB/s
test bench_swizzle_dyn_64_mask64 ... bench:     208,805.05 ns/iter (+/- 7,945.68) = 10043 MB/s
test bench_swizzle_dyn_24        ... bench:     237,441.03 ns/iter (+/- 26,824.54) = 8832 MB/s
test bench_swizzle_dyn_24_mask16 ... bench:     134,570.62 ns/iter (+/- 3,092.15) = 15584 MB/s
test bench_swizzle_dyn_43        ... bench:     503,438.20 ns/iter (+/- 919.14) = 4165 MB/s
test bench_swizzle_dyn_43_mask16 ... bench:     120,647.25 ns/iter (+/- 329.87) = 17382 MB/s
test bench_swizzle_dyn_5_mask_8  ... bench:     352,197.78 ns/iter (+/- 34,416.03) = 5954 MB/s

-Ctarget-cpu=x86-64-v3

test bench_swizzle_dyn_16        ... bench:     154,768.45 ns/iter (+/- 2,525.78) = 13550 MB/s
test bench_swizzle_dyn_16_mask16 ... bench:     152,077.10 ns/iter (+/- 280.91) = 13790 MB/s
test bench_swizzle_dyn_32        ... bench:     140,406.54 ns/iter (+/- 5,891.90) = 14936 MB/s
test bench_swizzle_dyn_32_mask16 ... bench:     135,856.60 ns/iter (+/- 1,557.41) = 15436 MB/s
test bench_swizzle_dyn_32_mask32 ... bench:     140,208.33 ns/iter (+/- 1,010.13) = 14957 MB/s
test bench_swizzle_dyn_64        ... bench:     182,426.04 ns/iter (+/- 492.10) = 11495 MB/s
test bench_swizzle_dyn_64_mask16 ... bench:     130,687.91 ns/iter (+/- 300.31) = 16047 MB/s
test bench_swizzle_dyn_64_mask32 ... bench:     133,016.57 ns/iter (+/- 611.21) = 15766 MB/s
test bench_swizzle_dyn_64_mask64 ... bench:     154,378.03 ns/iter (+/- 929.31) = 13584 MB/s
test bench_swizzle_dyn_24        ... bench:     167,513.82 ns/iter (+/- 3,303.83) = 12519 MB/s
test bench_swizzle_dyn_24_mask16 ... bench:     138,114.07 ns/iter (+/- 1,129.57) = 15184 MB/s
test bench_swizzle_dyn_43        ... bench:     248,154.55 ns/iter (+/- 1,220.99) = 8450 MB/s
test bench_swizzle_dyn_43_mask16 ... bench:     138,646.46 ns/iter (+/- 2,367.86) = 15125 MB/s
test bench_swizzle_dyn_5_mask_8  ... bench:     484,605.70 ns/iter (+/- 1,178.23) = 4327 MB/s

-Ctarget-cpu=x86-64-v4

test bench_swizzle_dyn_16        ... bench:     147,056.86 ns/iter (+/- 7,901.26) = 14260 MB/s
test bench_swizzle_dyn_16_mask16 ... bench:     145,653.23 ns/iter (+/- 765.60) = 14398 MB/s
test bench_swizzle_dyn_32        ... bench:     128,586.98 ns/iter (+/- 1,309.26) = 16309 MB/s
test bench_swizzle_dyn_32_mask16 ... bench:     125,415.66 ns/iter (+/- 439.47) = 16721 MB/s
test bench_swizzle_dyn_32_mask32 ... bench:     128,567.36 ns/iter (+/- 236.51) = 16311 MB/s
test bench_swizzle_dyn_64        ... bench:     154,575.97 ns/iter (+/- 179.12) = 13567 MB/s
test bench_swizzle_dyn_64_mask16 ... bench:     124,714.50 ns/iter (+/- 242.85) = 16815 MB/s
test bench_swizzle_dyn_64_mask32 ... bench:     127,455.67 ns/iter (+/- 1,334.81) = 16454 MB/s
test bench_swizzle_dyn_64_mask64 ... bench:     130,072.39 ns/iter (+/- 957.03) = 16123 MB/s
test bench_swizzle_dyn_24        ... bench:     236,086.73 ns/iter (+/- 11,341.07) = 8882 MB/s
test bench_swizzle_dyn_24_mask16 ... bench:     128,803.36 ns/iter (+/- 2,010.54) = 16281 MB/s
test bench_swizzle_dyn_43        ... bench:     373,447.80 ns/iter (+/- 531.31) = 5615 MB/s
test bench_swizzle_dyn_43_mask16 ... bench:     130,216.60 ns/iter (+/- 932.21) = 16104 MB/s
test bench_swizzle_dyn_5_mask_8  ... bench:     313,044.87 ns/iter (+/- 14,027.41) = 6699 MB/s

-Ctarget-cpu=icelake-server (avx512vbmi)

test bench_swizzle_dyn_16        ... bench:     139,096.00 ns/iter (+/- 534.63) = 15077 MB/s
test bench_swizzle_dyn_16_mask16 ... bench:     137,995.96 ns/iter (+/- 2,989.46) = 15197 MB/s
test bench_swizzle_dyn_32        ... bench:     118,391.74 ns/iter (+/- 253.91) = 17713 MB/s
test bench_swizzle_dyn_32_mask16 ... bench:     118,585.59 ns/iter (+/- 10,503.54) = 17684 MB/s
test bench_swizzle_dyn_32_mask32 ... bench:     118,472.72 ns/iter (+/- 1,708.84) = 17701 MB/s
test bench_swizzle_dyn_64        ... bench:     146,002.83 ns/iter (+/- 181.27) = 14363 MB/s
test bench_swizzle_dyn_64_mask16 ... bench:     135,991.02 ns/iter (+/- 120.40) = 15421 MB/s
test bench_swizzle_dyn_64_mask32 ... bench:     135,965.09 ns/iter (+/- 128.03) = 15424 MB/s
test bench_swizzle_dyn_64_mask64 ... bench:     135,971.99 ns/iter (+/- 172.91) = 15423 MB/s
test bench_swizzle_dyn_24        ... bench:     130,304.84 ns/iter (+/- 8,965.54) = 16094 MB/s
test bench_swizzle_dyn_24_mask16 ... bench:     121,858.66 ns/iter (+/- 63,820.69) = 17209 MB/s
test bench_swizzle_dyn_43        ... bench:     186,879.93 ns/iter (+/- 436.80) = 11221 MB/s
test bench_swizzle_dyn_43_mask16 ... bench:     158,348.00 ns/iter (+/- 410.78) = 13243 MB/s
test bench_swizzle_dyn_5_mask_8  ... bench:     420,093.25 ns/iter (+/- 3,317.77) = 4992 MB/s
