[FEA] Expand JIT functionality in libcudf #18023

@GregoryKimball

Introduction

There are some areas where JIT-compiled kernels can provide performance improvements over existing libcudf functions.

Please note that this issue is focused on CUDA C++ features in libcudf that use Jitify and NVRTC, rather than cuDF-python features that use Numba to generate PTX from user-defined Python functions.

JIT transforms, JIT projection expressions

JIT transforms, or UDF (user-defined function) transforms, can be used to fuse multiple binary ops or function calls into a single kernel. This eliminates the materialization of intermediates and, for complex expressions, can lead to significant speedups. We've written a custom "polynomials" benchmark in #17695 that shows >10x speedup for JIT-compiled kernels versus binary ops and AST (abstract syntax tree) implementations.
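To make the "materialization of intermediates" point concrete, here is a minimal host-side C++ sketch (not libcudf code and not the #17695 benchmark). The function names `poly_unfused` and `poly_fused` are hypothetical, and the polynomial (a*x + b)*x + c is just an illustrative choice. The unfused version mimics a chain of separate binary ops that each write a full-size temporary column; the fused version is roughly what a single JIT-compiled kernel would do per row.

```cpp
#include <cstddef>
#include <vector>

// Unfused: every step mirrors one binary op and writes a full-size
// intermediate column, so memory traffic grows with expression depth.
std::vector<double> poly_unfused(const std::vector<double>& x, double a, double b, double c) {
  const std::size_t n = x.size();
  std::vector<double> t1(n), t2(n), t3(n), out(n);
  for (std::size_t i = 0; i < n; ++i) t1[i] = a * x[i];      // op 1: multiply
  for (std::size_t i = 0; i < n; ++i) t2[i] = t1[i] + b;     // op 2: add
  for (std::size_t i = 0; i < n; ++i) t3[i] = t2[i] * x[i];  // op 3: multiply
  for (std::size_t i = 0; i < n; ++i) out[i] = t3[i] + c;    // op 4: add
  return out;
}

// Fused: one pass over the input, one output column, no temporaries.
// This is the per-row work a single fused kernel would perform.
std::vector<double> poly_fused(const std::vector<double>& x, double a, double b, double c) {
  std::vector<double> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = (a * x[i] + b) * x[i] + c;
  }
  return out;
}
```

Both functions compute the same values; the difference is that the unfused version reads and writes three extra columns, which on a GPU would translate to extra global-memory traffic and extra kernel launches.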

JIT Filters

JIT aggregation

JIT aggregations, or UDAFs (user defined aggregation functions), can be used to complete complex transformations on the groups of a groupby aggregation. libcudf supports both CUDA and PTX aggregation kinds.

Some examples of UDAFs could include "compute score" with additional flexibility for feature engineering. Here are some "compute score" examples from the archived TorchArrow project.

To support some of these functions, the user might create a struct column that contains a list of ids, a list of targets, and a score per target. Ref: https://pytorch.org/torcharrow/beta/functional.html

get_score_sum | Return the sum of all the scores in matching_id_scores whose corresponding id in matching_ids is also in input_ids.
get_score_min | Return the min of all the scores in matching_id_scores whose corresponding id in matching_ids is also in input_ids.
get_score_max | Return the max of all the scores in matching_id_scores whose corresponding id in matching_ids is also in input_ids.
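The per-row semantics described above can be sketched in plain C++ as follows. This is an illustration of the descriptions only, not the TorchArrow or libcudf implementation. The helper `matching_scores` is hypothetical, and the empty-match behavior (0 for the sum, NaN for min and max) is an assumption made for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <unordered_set>
#include <vector>

// Hypothetical helper: collect the scores whose id in matching_ids also
// appears in input_ids. matching_ids and matching_id_scores are parallel lists.
std::vector<double> matching_scores(const std::vector<std::int64_t>& input_ids,
                                    const std::vector<std::int64_t>& matching_ids,
                                    const std::vector<double>& matching_id_scores) {
  const std::unordered_set<std::int64_t> wanted(input_ids.begin(), input_ids.end());
  const std::size_t n = std::min(matching_ids.size(), matching_id_scores.size());
  std::vector<double> out;
  for (std::size_t i = 0; i < n; ++i) {
    if (wanted.count(matching_ids[i]) != 0) out.push_back(matching_id_scores[i]);
  }
  return out;
}

double get_score_sum(const std::vector<std::int64_t>& input_ids,
                     const std::vector<std::int64_t>& matching_ids,
                     const std::vector<double>& matching_id_scores) {
  const auto s = matching_scores(input_ids, matching_ids, matching_id_scores);
  return std::accumulate(s.begin(), s.end(), 0.0);  // assumption: 0 when nothing matches
}

double get_score_min(const std::vector<std::int64_t>& input_ids,
                     const std::vector<std::int64_t>& matching_ids,
                     const std::vector<double>& matching_id_scores) {
  const auto s = matching_scores(input_ids, matching_ids, matching_id_scores);
  // Assumption: NaN signals "no matching id"; a real UDAF might emit null instead.
  if (s.empty()) return std::numeric_limits<double>::quiet_NaN();
  return *std::min_element(s.begin(), s.end());
}

double get_score_max(const std::vector<std::int64_t>& input_ids,
                     const std::vector<std::int64_t>& matching_ids,
                     const std::vector<double>& matching_id_scores) {
  const auto s = matching_scores(input_ids, matching_ids, matching_id_scores);
  if (s.empty()) return std::numeric_limits<double>::quiet_NaN();
  return *std::max_element(s.begin(), s.end());
}
```

In a UDAF setting, each call would correspond to one row (or one group) of the struct column, with the three lists taken from its child columns.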

JIT join

Currently libcudf uses mixed_join to fuse a hash join with a post-filter. Mixed joins accept an AST predicate that is applied thread-per-row when the probe table equality keys are found in the build table. Mixed joins have poor warp occupancy due to heavy register pressure, a result of combining hash join and AST expression functionality in a single kernel.

  • implement conditional joins
  • implement mixed joins
  • implement custom expressions as keys

One alternative would be to use code gen to check the post-equality predicate and JIT-compile the resulting kernel. Please see #15366 for some additional context.
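As a rough illustration of the mixed-join semantics discussed here, the following host-side C++ sketch performs an inner hash join on an equality key and then applies a post-equality predicate to each candidate pair. It is not the libcudf mixed_join implementation; `Row` and `mixed_inner_join` are hypothetical names. The predicate is a template parameter so the compiler can inline it, which loosely mirrors the idea of generating and JIT-compiling the predicate check instead of interpreting an AST at run time.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row type: one equality key plus one payload column.
struct Row {
  std::int64_t key;
  double value;
};

// Returns (probe_index, build_index) pairs where the keys are equal AND
// pred(probe_row, build_row) holds. Because Predicate is a compile-time
// type, its body is compiled directly into the probe loop.
template <typename Predicate>
std::vector<std::pair<std::size_t, std::size_t>> mixed_inner_join(
    const std::vector<Row>& probe, const std::vector<Row>& build, Predicate pred) {
  // Build phase: hash the build table on the equality key.
  std::unordered_multimap<std::int64_t, std::size_t> table;
  for (std::size_t b = 0; b < build.size(); ++b) table.emplace(build[b].key, b);

  // Probe phase: look up equal keys, then check the post-equality predicate.
  std::vector<std::pair<std::size_t, std::size_t>> out;
  for (std::size_t p = 0; p < probe.size(); ++p) {
    const auto range = table.equal_range(probe[p].key);
    for (auto it = range.first; it != range.second; ++it) {
      if (pred(probe[p], build[it->second])) out.emplace_back(p, it->second);
    }
  }
  return out;
}
```

A call such as `mixed_inner_join(probe, build, [](const Row& l, const Row& r) { return l.value > r.value; })` keeps only key matches where the probe value exceeds the build value. The order of matches for duplicate build keys is unspecified in this sketch.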

Improving JIT infrastructure

As part of expanding JIT functionality in libcudf, we will need better tools for tracking JIT-compilation time (NVIDIA/jitify#137). We will also need better tools for JIT cache management such as clearing and pre-populating. Collaboration with Spark-RAPIDS and other partners will be critical for success.
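To sketch what "tracking JIT-compilation time" and "clearing and pre-populating" a cache could look like, here is a small, purely hypothetical C++ class. `KernelCache` and all of its members are invented for illustration and do not correspond to the Jitify or libcudf APIs; the compile step is injected as a callback so the sketch stays independent of NVRTC.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical in-memory kernel cache keyed by source text.
class KernelCache {
 public:
  using Compiler = std::function<std::string(const std::string&)>;

  explicit KernelCache(Compiler compile) : compile_(std::move(compile)) {}

  // Return the compiled artifact, compiling (and timing) only on a miss.
  const std::string& get(const std::string& source) {
    auto it = cache_.find(source);
    if (it != cache_.end()) {
      ++hits_;
      return it->second;
    }
    return compile_and_store(source);
  }

  // Pre-populate: compile a known set of kernels ahead of first use.
  void prepopulate(const std::vector<std::string>& sources) {
    for (const auto& s : sources) {
      if (cache_.find(s) == cache_.end()) compile_and_store(s);
    }
  }

  // Drop all cached artifacts; statistics are kept.
  void clear() { cache_.clear(); }

  std::size_t hits() const { return hits_; }
  std::size_t compilations() const { return compilations_; }
  std::chrono::steady_clock::duration total_compile_time() const { return compile_time_; }

 private:
  const std::string& compile_and_store(const std::string& source) {
    const auto start = std::chrono::steady_clock::now();
    std::string artifact = compile_(source);  // stand-in for the real JIT compile
    compile_time_ += std::chrono::steady_clock::now() - start;
    ++compilations_;
    return cache_.emplace(source, std::move(artifact)).first->second;
  }

  Compiler compile_;
  std::unordered_map<std::string, std::string> cache_;
  std::size_t hits_ = 0;
  std::size_t compilations_ = 0;
  std::chrono::steady_clock::duration compile_time_{};
};
```

A real design would likely key on more than the source text (for example target architecture and compile flags) and persist artifacts on disk; those details are left out here.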

JIT benchmarking

We could write libcudf UDFs for some of the operations in UDFBench; also see the 2025 paper here. Some example UDFs can be found at https://github.com/athenarc/UDFBench/tree/main/engines/duckdb/udfs.

Labels: Python, feature request, libcudf
