Introduction
There are some areas where JIT-compiled kernels can provide performance improvements over existing libcudf functions.
Please note that this issue is focused on CUDA C++ features in libcudf that use Jitify and NVRTC, rather than cuDF-python features using Numba to generate PTX from user-defined Python functions.
JIT transforms, JIT projection expressions
JIT transforms, or UDF (user defined function) transforms, can be used to fuse together multiple binary ops or function calls within a single kernel. This eliminates the materialization of intermediates and for complex expressions can lead to significant speedup. We've written a custom "polynomials" benchmark in #17695 that shows >10x speedup for JIT-compiled kernels versus binary ops and AST (abstract syntax tree) implementations.
- support decimal types (#17968)
- support multiple column inputs (single column output) (#17881)
- compare imbalanced_tree benchmarks for JIT vs binary ops vs AST (#18032)
- collect data on NDS and NDS-H runtime impact of JIT-compiled expressions (#18127)
- support operators with string input and fixed-width output, removing `jit::column_device_view` (#18378)
- support operators with string input and string output (#18490)
- implement null-aware transforms
JIT filters
- support UDF filters (#19070)
- implement null-aware filters
JIT aggregation
JIT aggregations, or UDAFs (user-defined aggregation functions), can be used to perform complex transformations on the groups of a groupby aggregation. libcudf supports both CUDA and PTX aggregation kinds.
Some examples of UDAFs could include "compute score" with additional flexibility for feature engineering. Here are some "compute score" examples from the archived TorchArrow project.
To support some of these functions, the user might create a struct column that contains a list of ids, a list of targets, and a score per target. Ref: https://pytorch.org/torcharrow/beta/functional.html
| Function | Description |
| --- | --- |
| get_score_sum | Return the sum of all the scores in matching_id_scores that have a corresponding id in matching_ids that is also in input_ids. |
| get_score_min | Return the min of all the scores in matching_id_scores that have a corresponding id in matching_ids that is also in input_ids. |
| get_score_max | Return the max of all the scores in matching_id_scores that have a corresponding id in matching_ids that is also in input_ids. |
JIT join
Currently libcudf uses mixed_join to fuse a hash join with a post-filter. Mixed joins accept an AST predicate that is applied thread-per-row when the probe table's equality keys are found in the build table. Mixed joins have poor warp occupancy due to heavy register pressure, a result of combining hash join and AST expression functionality in a single kernel.
- implement conditional joins
- implement mixed joins
- implement custom expressions as keys
One alternative would be to use code gen to check the post-equality predicate and JIT-compile the resulting kernel. Please see #15366 for some additional context.
Improving JIT infrastructure
As part of expanding JIT functionality in libcudf, we will need better tools for tracking JIT-compilation time (NVIDIA/jitify#137). We will also need better tools for JIT cache management such as clearing and pre-populating. Collaboration with Spark-RAPIDS and other partners will be critical for success.
JIT benchmarking
We could write libcudf UDFs for some of the operations in UDFBench; also see the 2025 UDFBench paper. Some example UDFs can be found at https://github.com/athenarc/UDFBench/tree/main/engines/duckdb/udfs.