Conversation

TomAugspurger (Contributor) commented Sep 19, 2025

Description

This adds a new option to cudf-polars' configuration, allowing users to specify the memory resource to create by default.

Currently, users either get the default behavior (typically a managed memory resource) or can pass in a concrete memory resource (see the new docs added in this PR).

In some cases, you might want to pass in a description of the memory resource to use:

  1. In our unit tests, we might want to specify a CudaAsyncMemoryResource with a relatively small initial pool size
  2. In a distributed environment, you can't pass around concrete memory resource objects, since they can't be serialized

This is strictly more flexible than the current option of setting POLARS_GPU_ENABLE_CUDA_MANAGED_MEMORY. Setting POLARS_GPU_ENABLE_CUDA_MANAGED_MEMORY=0 gets you a CudaAsyncMemoryResource with its default initial_pool_size and release_threshold. With this system, you can set

CUDF_POLARS__MEMORY_RESOURCE_CONFIG__QUALNAME="rmm.mr.CudaAsyncMemoryResource"

to get the same thing, or

CUDF_POLARS__MEMORY_RESOURCE_CONFIG__QUALNAME="rmm.mr.CudaAsyncMemoryResource"
CUDF_POLARS__MEMORY_RESOURCE_CONFIG__OPTIONS='{"initial_pool_size": 256, "release_threshold": 256}'

to configure the pool.
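
As a rough sketch (not the PR's actual implementation), the two variables could be consumed along these lines: resolve the dotted qualname with importlib and treat the options string as JSON keyword arguments.

    import importlib
    import json
    import os

    # Read the environment variables shown above.
    qualname = os.environ["CUDF_POLARS__MEMORY_RESOURCE_CONFIG__QUALNAME"]
    options = json.loads(
        os.environ.get("CUDF_POLARS__MEMORY_RESOURCE_CONFIG__OPTIONS", "{}")
    )

    # Resolve "rmm.mr.CudaAsyncMemoryResource" to the class itself...
    module_name, _, cls_name = qualname.rpartition(".")
    mr_cls = getattr(importlib.import_module(module_name), cls_name)

    # ...and instantiate it with the decoded options, e.g.
    # CudaAsyncMemoryResource(initial_pool_size=256, release_threshold=256).
    mr = mr_cls(**options)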

I'd recommend deprecating POLARS_GPU_ENABLE_CUDA_MANAGED_MEMORY to reduce the number of ways this can be configured, though we should take our time with that.

copy-pr-bot commented Sep 19, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Examples
--------
>>> MemoryResourceConfig(
...     qualname="rmm.mr.CudaAsyncMemoryResource",

TomAugspurger (Contributor, Author) commented on this docstring snippet:

Are people OK with "qualname" here? I want to avoid locking us to MRs that happen to be defined in rmm.mr.
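
(Since a qualname is just a dotted import path, resolved with importlib as in the sketch above, it would accept any importable memory resource class, not only those defined in rmm.mr.)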

TomAugspurger (Contributor, Author) commented Sep 24, 2025

The only other feature I might want to add here is expanding the MemoryResourceConfig to handle "nested" memory resource configurations. Right now, you can't describe an RMM memory resource whose inner (upstream) memory resource needs configuring itself. I'm not sure whether that's worth doing here or not.

For example, to express our default memory resource:

    mr = rmm.mr.PrefetchResourceAdaptor(
        rmm.mr.PoolMemoryResource(
            rmm.mr.ManagedMemoryResource(),
            initial_pool_size=free_memory,
        )
    )

Aside from free_memory itself being dynamic, this could be expressed as something like:

{
    "qualname": "rmm.mr.PrefetchResourceAdaptor",
    "options": {
        "upstream_mr": {
            "qualname": "rmm.mr.PoolMemoryResource",
            "options": {
                "upstream_mr": {
                    "qualname": "rmm.mr.ManagedMemoryResource",
                    "options": {}
                },
                "initial_pool_size": 256
            }
        }
    }
}

That relies on pattern matching a dict with {"qualname": ..., "options": ...} to mean "this is a memory resource config", which is probably sufficient.
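
A minimal sketch of that pattern matching, assuming a hypothetical build_mr helper (not the PR's implementation):

    import importlib

    def _resolve(qualname: str):
        # "pkg.module.ClassName" -> the ClassName object from pkg.module.
        module_name, _, attr = qualname.rpartition(".")
        return getattr(importlib.import_module(module_name), attr)

    def build_mr(config: dict):
        # Any option value shaped like {"qualname": ..., "options": ...} is
        # itself a memory resource config: build it recursively, innermost-out.
        kwargs = {
            key: build_mr(value)
            if isinstance(value, dict) and "qualname" in value
            else value
            for key, value in config.get("options", {}).items()
        }
        return _resolve(config["qualname"])(**kwargs)

Applied to the JSON above, build_mr would produce PrefetchResourceAdaptor(PoolMemoryResource(ManagedMemoryResource(), initial_pool_size=256)).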
