PTX-JIT compilation for mixed-join kernels #17763

Draft: wants to merge 2 commits into base branch-25.02
lamarrr (Contributor) commented Jan 17, 2025

Description

This pull request documents driver PTX-JIT compilation of some of cuDF's kernels.
Follows up on #17399

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@lamarrr requested a review from a team as a code owner January 17, 2025 18:48
@github-actions bot added the "libcudf" and "CMake" labels Jan 17, 2025
@lamarrr added the "DO NOT MERGE" label and removed the "libcudf" and "CMake" labels Jan 17, 2025
@lamarrr marked this pull request as draft January 17, 2025 18:52

copy-pr-bot bot commented Jan 17, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

lamarrr (Contributor, Author) commented Jan 17, 2025

Initial runtime results, comparing ["sass_join.json", "ptx_join.json"] (Ref is the SASS build, Cmp is the PTX-JIT build):

mixed_inner_join

[0] NVIDIA RTX A6000

| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| I32 | 0 | 1000 | 1000 | 107.412 us | 45.73% | 113.981 us | 12.96% | 6.569 us | 6.12% | PASS |
| I32 | 0 | 100000 | 1000 | 134.812 us | 37.34% | 148.843 us | 12.20% | 14.032 us | 10.41% | PASS |
| I32 | 0 | 10000000 | 1000 | 3.707 ms | 2.37% | 4.590 ms | 2.10% | 883.255 us | 23.83% | FAIL |
| I32 | 0 | 100000 | 100000 | 155.543 us | 33.90% | 169.347 us | 13.92% | 13.805 us | 8.88% | PASS |
| I32 | 0 | 10000000 | 100000 | 4.560 ms | 2.89% | 5.406 ms | 1.95% | 845.752 us | 18.55% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 13.935 ms | 4.74% | 13.667 ms | 0.99% | -267.641 us | -1.92% | FAIL |
| I32 | 1 | 1000 | 1000 | 117.985 us | 50.90% | 116.200 us | 14.38% | -1.785 us | -1.51% | PASS |
| I32 | 1 | 100000 | 1000 | 162.909 us | 32.78% | 160.931 us | 12.59% | -1.979 us | -1.21% | PASS |
| I32 | 1 | 10000000 | 1000 | 4.833 ms | 2.81% | 4.925 ms | 1.45% | 91.243 us | 1.89% | FAIL |
| I32 | 1 | 100000 | 100000 | 178.845 us | 33.18% | 176.964 us | 10.36% | -1.881 us | -1.05% | PASS |
| I32 | 1 | 10000000 | 100000 | 5.563 ms | 2.88% | 5.650 ms | 1.81% | 86.714 us | 1.56% | PASS |
| I32 | 1 | 10000000 | 10000000 | 7.171 ms | 2.81% | 7.280 ms | 1.27% | 108.791 us | 1.52% | FAIL |
| I64 | 0 | 1000 | 1000 | 114.046 us | 40.88% | 117.443 us | 10.67% | 3.397 us | 2.98% | PASS |
| I64 | 0 | 100000 | 1000 | 148.410 us | 37.02% | 158.075 us | 17.16% | 9.665 us | 6.51% | PASS |
| I64 | 0 | 10000000 | 1000 | 4.023 ms | 2.65% | 4.878 ms | 2.56% | 854.558 us | 21.24% | FAIL |
| I64 | 0 | 100000 | 100000 | 162.808 us | 33.72% | 175.782 us | 11.31% | 12.974 us | 7.97% | PASS |
| I64 | 0 | 10000000 | 100000 | 4.827 ms | 2.66% | 5.632 ms | 1.28% | 805.072 us | 16.68% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 13.510 ms | 2.41% | 13.850 ms | 1.16% | 340.445 us | 2.52% | FAIL |
| I64 | 1 | 1000 | 1000 | 119.155 us | 36.64% | 115.593 us | 16.72% | -3.562 us | -2.99% | PASS |
| I64 | 1 | 100000 | 1000 | 166.345 us | 38.40% | 166.325 us | 15.47% | -0.020 us | -0.01% | PASS |
| I64 | 1 | 10000000 | 1000 | 4.874 ms | 2.81% | 4.971 ms | 1.69% | 96.543 us | 1.98% | FAIL |
| I64 | 1 | 100000 | 100000 | 175.518 us | 35.36% | 178.723 us | 12.08% | 3.206 us | 1.83% | PASS |
| I64 | 1 | 10000000 | 100000 | 5.647 ms | 2.58% | 5.729 ms | 1.50% | 81.439 us | 1.44% | PASS |
| I64 | 1 | 10000000 | 10000000 | 7.244 ms | 2.62% | 7.376 ms | 1.13% | 131.554 us | 1.82% | FAIL |

mixed_left_join

[0] NVIDIA RTX A6000

| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| I32 | 0 | 1000 | 1000 | 113.690 us | 39.79% | 117.048 us | 14.64% | 3.358 us | 2.95% | PASS |
| I32 | 0 | 100000 | 1000 | 142.752 us | 37.52% | 158.756 us | 14.56% | 16.005 us | 11.21% | PASS |
| I32 | 0 | 10000000 | 1000 | 4.098 ms | 2.71% | 5.078 ms | 1.89% | 980.068 us | 23.92% | FAIL |
| I32 | 0 | 100000 | 100000 | 166.782 us | 35.92% | 177.156 us | 16.49% | 10.374 us | 6.22% | PASS |
| I32 | 0 | 10000000 | 100000 | 4.946 ms | 2.72% | 5.964 ms | 1.34% | 1.018 ms | 20.59% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 13.618 ms | 2.53% | 14.020 ms | 0.88% | 401.839 us | 2.95% | FAIL |
| I32 | 1 | 1000 | 1000 | 120.569 us | 46.47% | 116.652 us | 21.16% | -3.917 us | -3.25% | PASS |
| I32 | 1 | 100000 | 1000 | 169.803 us | 33.67% | 167.155 us | 16.28% | -2.649 us | -1.56% | PASS |
| I32 | 1 | 10000000 | 1000 | 5.369 ms | 2.58% | 5.437 ms | 1.56% | 67.425 us | 1.26% | PASS |
| I32 | 1 | 100000 | 100000 | 185.911 us | 40.39% | 181.608 us | 17.17% | -4.303 us | -2.31% | PASS |
| I32 | 1 | 10000000 | 100000 | 6.151 ms | 2.33% | 6.218 ms | 1.61% | 66.182 us | 1.08% | PASS |
| I32 | 1 | 10000000 | 10000000 | 7.792 ms | 2.39% | 7.882 ms | 1.33% | 90.290 us | 1.16% | PASS |
| I64 | 0 | 1000 | 1000 | 115.506 us | 37.48% | 118.522 us | 13.41% | 3.016 us | 2.61% | PASS |
| I64 | 0 | 100000 | 1000 | 153.142 us | 33.45% | 167.061 us | 11.69% | 13.919 us | 9.09% | PASS |
| I64 | 0 | 10000000 | 1000 | 4.437 ms | 2.79% | 5.459 ms | 1.67% | 1.022 ms | 23.03% | FAIL |
| I64 | 0 | 100000 | 100000 | 170.649 us | 41.94% | 184.774 us | 15.84% | 14.125 us | 8.28% | PASS |
| I64 | 0 | 10000000 | 100000 | 5.290 ms | 2.43% | 6.219 ms | 1.70% | 929.149 us | 17.56% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 13.782 ms | 2.41% | 14.206 ms | 0.92% | 424.127 us | 3.08% | FAIL |
| I64 | 1 | 1000 | 1000 | 121.638 us | 45.41% | 117.509 us | 16.47% | -4.129 us | -3.39% | PASS |
| I64 | 1 | 100000 | 1000 | 172.242 us | 32.60% | 175.784 us | 17.28% | 3.542 us | 2.06% | PASS |
| I64 | 1 | 10000000 | 1000 | 5.427 ms | 2.52% | 5.495 ms | 1.42% | 68.333 us | 1.26% | PASS |
| I64 | 1 | 100000 | 100000 | 182.457 us | 34.83% | 183.586 us | 15.90% | 1.129 us | 0.62% | PASS |
| I64 | 1 | 10000000 | 100000 | 6.233 ms | 2.45% | 6.292 ms | 1.40% | 59.039 us | 0.95% | PASS |
| I64 | 1 | 10000000 | 10000000 | 7.875 ms | 2.20% | 7.987 ms | 1.20% | 111.406 us | 1.41% | FAIL |

mixed_full_join

[0] NVIDIA RTX A6000

| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| I32 | 0 | 1000 | 1000 | 164.518 us | 37.67% | 164.318 us | 17.27% | -0.200 us | -0.12% | PASS |
| I32 | 0 | 100000 | 1000 | 175.712 us | 38.90% | 187.289 us | 11.75% | 11.576 us | 6.59% | PASS |
| I32 | 0 | 10000000 | 1000 | 4.610 ms | 3.29% | 5.357 ms | 2.12% | 747.442 us | 16.21% | FAIL |
| I32 | 0 | 100000 | 100000 | 225.323 us | 43.12% | 233.060 us | 10.65% | 7.737 us | 3.43% | PASS |
| I32 | 0 | 10000000 | 100000 | 5.076 ms | 2.88% | 6.087 ms | 1.87% | 1.010 ms | 19.90% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 14.737 ms | 2.68% | 15.127 ms | 1.02% | 390.140 us | 2.65% | FAIL |
| I32 | 1 | 1000 | 1000 | 171.217 us | 40.73% | 166.232 us | 20.98% | -4.986 us | -2.91% | PASS |
| I32 | 1 | 100000 | 1000 | 224.302 us | 33.73% | 218.112 us | 13.01% | -6.190 us | -2.76% | PASS |
| I32 | 1 | 10000000 | 1000 | 5.712 ms | 3.07% | 5.776 ms | 1.83% | 64.391 us | 1.13% | PASS |
| I32 | 1 | 100000 | 100000 | 244.861 us | 38.04% | 234.514 us | 11.25% | -10.347 us | -4.23% | PASS |
| I32 | 1 | 10000000 | 100000 | 6.502 ms | 3.05% | 6.563 ms | 1.89% | 61.251 us | 0.94% | PASS |
| I32 | 1 | 10000000 | 10000000 | 8.712 ms | 4.72% | 8.692 ms | 1.38% | -19.533 us | -0.22% | PASS |
| I64 | 0 | 1000 | 1000 | 181.308 us | 57.38% | 165.903 us | 18.05% | -15.405 us | -8.50% | PASS |
| I64 | 0 | 100000 | 1000 | 198.376 us | 46.57% | 196.898 us | 15.37% | -1.478 us | -0.74% | PASS |
| I64 | 0 | 10000000 | 1000 | 4.788 ms | 6.16% | 5.725 ms | 1.75% | 937.132 us | 19.57% | FAIL |
| I64 | 0 | 100000 | 100000 | 232.964 us | 40.92% | 238.416 us | 9.20% | 5.452 us | 2.34% | PASS |
| I64 | 0 | 10000000 | 100000 | 5.450 ms | 3.94% | 6.341 ms | 1.49% | 891.110 us | 16.35% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 15.027 ms | 3.46% | 15.320 ms | 1.07% | 293.508 us | 1.95% | FAIL |
| I64 | 1 | 1000 | 1000 | 177.338 us | 56.10% | 165.684 us | 14.84% | -11.654 us | -6.57% | PASS |
| I64 | 1 | 100000 | 1000 | 223.218 us | 30.43% | 226.542 us | 15.60% | 3.324 us | 1.49% | PASS |
| I64 | 1 | 10000000 | 1000 | 5.887 ms | 6.33% | 5.830 ms | 1.73% | -56.611 us | -0.96% | PASS |
| I64 | 1 | 100000 | 100000 | 245.615 us | 45.85% | 236.938 us | 12.15% | -8.677 us | -3.53% | PASS |
| I64 | 1 | 10000000 | 100000 | 6.609 ms | 3.66% | 6.638 ms | 1.86% | 28.872 us | 0.44% | PASS |
| I64 | 1 | 10000000 | 10000000 | 9.133 ms | 6.41% | 8.796 ms | 1.46% | -336.952 us | -3.69% | FAIL |

mixed_left_semi_join

[0] NVIDIA RTX A6000

| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| I32 | 0 | 1000 | 1000 | 98.279 us | 34.13% | 95.040 us | 8.14% | -3.239 us | -3.30% | PASS |
| I32 | 0 | 100000 | 1000 | 177.786 us | 17.45% | 177.586 us | 4.03% | -0.200 us | -0.11% | PASS |
| I32 | 0 | 10000000 | 1000 | 7.853 ms | 2.97% | 7.804 ms | 1.58% | -48.581 us | -0.62% | PASS |
| I32 | 0 | 100000 | 100000 | 187.587 us | 42.10% | 181.900 us | 10.67% | -5.687 us | -3.03% | PASS |
| I32 | 0 | 10000000 | 100000 | 8.984 ms | 2.94% | 8.832 ms | 1.40% | -152.295 us | -1.70% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 16.492 ms | 2.21% | 16.348 ms | 0.84% | -143.934 us | -0.87% | FAIL |
| I32 | 1 | 1000 | 1000 | 107.840 us | 54.27% | 103.017 us | 12.41% | -4.823 us | -4.47% | PASS |
| I32 | 1 | 100000 | 1000 | 164.156 us | 32.76% | 162.940 us | 11.85% | -1.216 us | -0.74% | PASS |
| I32 | 1 | 10000000 | 1000 | 6.195 ms | 3.16% | 6.164 ms | 1.50% | -30.804 us | -0.50% | PASS |
| I32 | 1 | 100000 | 100000 | 171.964 us | 26.85% | 168.187 us | 13.71% | -3.777 us | -2.20% | PASS |
| I32 | 1 | 10000000 | 100000 | 6.502 ms | 3.40% | 6.484 ms | 1.59% | -18.457 us | -0.28% | PASS |
| I32 | 1 | 10000000 | 10000000 | 8.177 ms | 2.60% | 8.174 ms | 1.21% | -3.047 us | -0.04% | PASS |
| I64 | 0 | 1000 | 1000 | 86.600 us | 49.69% | 82.140 us | 9.83% | -4.459 us | -5.15% | PASS |
| I64 | 0 | 100000 | 1000 | 168.093 us | 17.77% | 166.030 us | 7.20% | -2.063 us | -1.23% | PASS |
| I64 | 0 | 10000000 | 1000 | 8.035 ms | 2.71% | 8.002 ms | 1.31% | -32.785 us | -0.41% | PASS |
| I64 | 0 | 100000 | 100000 | 188.939 us | 23.40% | 185.053 us | 8.56% | -3.886 us | -2.06% | PASS |
| I64 | 0 | 10000000 | 100000 | 9.583 ms | 5.63% | 9.161 ms | 1.37% | -422.491 us | -4.41% | FAIL |
| I64 | 0 | 10000000 | 10000000 | 16.734 ms | 2.17% | 16.582 ms | 0.91% | -152.639 us | -0.91% | FAIL |
| I64 | 1 | 1000 | 1000 | 102.164 us | 53.77% | 98.082 us | 18.70% | -4.082 us | -4.00% | PASS |
| I64 | 1 | 100000 | 1000 | 163.273 us | 29.00% | 162.970 us | 11.65% | -0.303 us | -0.19% | PASS |
| I64 | 1 | 10000000 | 1000 | 6.489 ms | 3.47% | 6.481 ms | 1.61% | -8.145 us | -0.13% | PASS |
| I64 | 1 | 100000 | 100000 | 172.470 us | 26.75% | 168.477 us | 8.09% | -3.993 us | -2.32% | PASS |
| I64 | 1 | 10000000 | 100000 | 6.562 ms | 3.20% | 6.532 ms | 1.63% | -30.486 us | -0.46% | PASS |
| I64 | 1 | 10000000 | 10000000 | 8.000 ms | 2.78% | 7.976 ms | 1.29% | -24.018 us | -0.30% | PASS |

mixed_left_anti_join

[0] NVIDIA RTX A6000

| Key | Nullable | left_size | right_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| I32 | 0 | 1000 | 1000 | 98.898 us | 38.28% | 95.586 us | 23.15% | -3.312 us | -3.35% | PASS |
| I32 | 0 | 100000 | 1000 | 178.511 us | 17.45% | 177.858 us | 6.39% | -0.653 us | -0.37% | PASS |
| I32 | 0 | 10000000 | 1000 | 7.880 ms | 3.08% | 7.819 ms | 1.58% | -61.460 us | -0.78% | PASS |
| I32 | 0 | 100000 | 100000 | 184.580 us | 20.89% | 182.020 us | 6.99% | -2.560 us | -1.39% | PASS |
| I32 | 0 | 10000000 | 100000 | 9.010 ms | 2.85% | 8.850 ms | 1.39% | -160.067 us | -1.78% | FAIL |
| I32 | 0 | 10000000 | 10000000 | 16.512 ms | 2.12% | 16.356 ms | 0.50% | -155.944 us | -0.94% | FAIL |
| I32 | 1 | 1000 | 1000 | 106.826 us | 40.95% | 103.516 us | 19.07% | -3.310 us | -3.10% | PASS |
| I32 | 1 | 100000 | 1000 | 164.458 us | 28.57% | 163.130 us | 15.14% | -1.328 us | -0.81% | PASS |
| I32 | 1 | 10000000 | 1000 | 6.225 ms | 3.13% | 6.192 ms | 1.60% | -33.197 us | -0.53% | PASS |
| I32 | 1 | 100000 | 100000 | 172.748 us | 30.53% | 168.596 us | 11.88% | -4.152 us | -2.40% | PASS |
| I32 | 1 | 10000000 | 100000 | 6.531 ms | 3.23% | 6.512 ms | 1.69% | -18.180 us | -0.28% | PASS |
| I32 | 1 | 10000000 | 10000000 | 8.215 ms | 2.65% | 8.212 ms | 1.65% | -2.930 us | -0.04% | PASS |
| I64 | 0 | 1000 | 1000 | 86.315 us | 53.71% | 82.487 us | 20.29% | -3.828 us | -4.44% | PASS |
| I64 | 0 | 100000 | 1000 | 169.257 us | 20.24% | 166.349 us | 13.10% | -2.909 us | -1.72% | PASS |
| I64 | 0 | 10000000 | 1000 | 8.062 ms | 2.85% | 8.020 ms | 1.46% | -41.852 us | -0.52% | PASS |
| I64 | 0 | 100000 | 100000 | 190.510 us | 28.57% | 185.310 us | 6.23% | -5.200 us | -2.73% | PASS |
| I64 | 0 | 10000000 | 100000 | 9.290 ms | 2.70% | 9.181 ms | 1.69% | -108.463 us | -1.17% | PASS |
| I64 | 0 | 10000000 | 10000000 | 16.754 ms | 2.06% | 16.599 ms | 0.97% | -154.941 us | -0.92% | PASS |
| I64 | 1 | 1000 | 1000 | 102.948 us | 56.62% | 98.676 us | 19.82% | -4.272 us | -4.15% | PASS |
| I64 | 1 | 100000 | 1000 | 166.421 us | 41.01% | 164.187 us | 14.01% | -2.234 us | -1.34% | PASS |
| I64 | 1 | 10000000 | 1000 | 6.517 ms | 3.14% | 6.514 ms | 1.63% | -2.796 us | -0.04% | PASS |
| I64 | 1 | 100000 | 100000 | 175.468 us | 38.06% | 169.063 us | 7.29% | -6.405 us | -3.65% | PASS |
| I64 | 1 | 10000000 | 100000 | 6.595 ms | 3.12% | 6.558 ms | 1.54% | -36.826 us | -0.56% | PASS |
| I64 | 1 | 10000000 | 10000000 | 8.038 ms | 2.78% | 8.006 ms | 1.41% | -31.744 us | -0.39% | PASS |
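
For reading the tables above: Diff is Cmp Time minus Ref Time, and %Diff is that difference relative to Ref Time. The Status column appears to mark a row FAIL whenever the absolute %Diff exceeds the smaller of the two noise values, which would explain why some rows where the PTX build is faster are also marked FAIL. That rule is an inference from the rows, not something stated in this thread, and rows where the two figures are nearly equal cannot be checked from the rounded values. A minimal sketch, with `Row` and the function names chosen here purely for illustration:

```cpp
#include <algorithm>
#include <cmath>
#include <string>

// One row of a comparison table. Times are in microseconds; noise values
// are percentages as printed in the table. (Hypothetical type.)
struct Row {
  double ref_time_us;
  double ref_noise_pct;
  double cmp_time_us;
  double cmp_noise_pct;
};

// Diff column: comparison time minus reference time.
double diff_us(Row const& r) { return r.cmp_time_us - r.ref_time_us; }

// %Diff column: the difference relative to the reference time.
double pct_diff(Row const& r) { return 100.0 * diff_us(r) / r.ref_time_us; }

// ASSUMED rule, inferred from the tables rather than from the tool's source:
// a row passes when |%Diff| does not exceed the smaller noise value.
std::string status(Row const& r)
{
  double const min_noise = std::min(r.ref_noise_pct, r.cmp_noise_pct);
  return std::abs(pct_diff(r)) <= min_noise ? "PASS" : "FAIL";
}
```

Under this reading, a FAIL only says the difference is larger than the measurement noise; the sign of %Diff says whether the PTX build was slower (positive) or faster (negative).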

lamarrr (Contributor, Author) commented Jan 17, 2025

There is also a 90-second PTX-JIT compile time for the mixed-join kernels, and it is forced to run for all the modules at startup. I'll investigate how to make it compile the PTX only when it is used.
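
One general way to defer such a cost to first use is lazy initialization. The sketch below is not the mechanism this PR uses and does not touch the CUDA driver; `lazy_module` and the `loader` callback are hypothetical names, and the callback stands in for whatever step hands the PTX to the driver.

```cpp
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical wrapper: runs an expensive load step once, on first use,
// instead of eagerly at startup.
template <typename Handle>
class lazy_module {
 public:
  explicit lazy_module(std::function<Handle()> loader) : loader_(std::move(loader)) {}

  // The loader runs on the first call only; later calls return the cached
  // handle. The mutex makes concurrent first calls safe.
  Handle const& get()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!loaded_) {
      handle_ = loader_();
      loaded_ = true;
    }
    return handle_;
  }

 private:
  std::function<Handle()> loader_;
  std::mutex mutex_;
  bool loaded_ = false;
  Handle handle_{};
};
```

With one such wrapper per module, a process that never calls a mixed-join kernel would never pay the JIT cost for it, at the price of a one-time delay on the first call that does.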

lamarrr (Contributor, Author) commented Jan 17, 2025

SASS-compiled (cuDF's all-arch build) vs PTX (compute_60): binary size and build time

SASS

parallel build time: 2922 seconds
libcudf.so size: 625M

PTX

parallel build time: 2806 seconds
libcudf.so size: 616M

| Object File | SASS Build Time | SASS Binary Size | PTX Build Time | PTX Binary Size |
|---|---|---|---|---|
| CMakeFiles/cudf.dir/src/join/mixed_join_size_kernel_nulls.cu.o | 19:06 min | 3.630 MB | 2:22 min | 1.620 MB |
| CMakeFiles/cudf.dir/src/join/mixed_join_kernel_nulls.cu.o | 18:40 min | 3.614 MB | 2:14 min | 1.591 MB |
| CMakeFiles/cudf.dir/src/join/mixed_join_size_kernel.cu.o | 10:20 min | 3.176 MB | 77.814 s | 1.367 MB |
| CMakeFiles/cudf.dir/src/join/mixed_join_kernel.cu.o | 9:02 min | 3.183 MB | 76.023 s | 1.350 MB |
| CMakeFiles/cudf.dir/src/join/mixed_join_kernels_semi.cu.o | 5:08 min | 2.065 MB | 50.980 s | 880.776 KB |
| CMakeFiles/cudf.dir/src/join/mixed_join_semi.cu.o | 89.156 s | 2.266 MB | 20.180 s | 1.760 MB |
| CMakeFiles/cudf.dir/src/join/mixed_join.cu.o | 49.300 s | 1.759 MB | 17.363 s | 1.593 MB |
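
The build times above mix two notations ("19:06 min" and "77.814 s"), so comparing rows takes a small conversion. A minimal sketch, assuming only those two formats occur; `to_seconds` and `build_speedup` are illustrative names:

```cpp
#include <string>

// Convert a duration as printed in the table ("19:06 min" or "77.814 s")
// to seconds. Assumes only these two formats occur.
double to_seconds(std::string const& text)
{
  auto const colon = text.find(':');
  if (colon == std::string::npos) { return std::stod(text); }  // "77.814 s"
  double const minutes = std::stod(text.substr(0, colon));     // "19"
  double const seconds = std::stod(text.substr(colon + 1));    // "06 min" -> 6
  return 60.0 * minutes + seconds;
}

// Ratio of SASS build time to PTX build time for one object file.
double build_speedup(std::string const& sass, std::string const& ptx)
{
  return to_seconds(sass) / to_seconds(ptx);
}
```

For mixed_join_size_kernel_nulls.cu.o, 19:06 min is 1146 s and 2:22 min is 142 s, so the PTX-only object builds roughly 8 times faster.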

@github-actions bot added the "libcudf" and "CMake" labels Jan 21, 2025