Prevent PyDataFrame serialization #17364

Draft
wants to merge 51 commits into
base: branch-25.02

Conversation

pentschev
Member

Description

Prevent PyDataFrame serialization and enable Distributed scheduler if available.

This includes the changes from #17262, which must therefore be merged first, and also depends on the changes from #17352.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

rjzamora and others added 30 commits November 6, 2024 14:58
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
pentschev and others added 14 commits November 14, 2024 11:46
This is a prototype implementation of rapidsai/build-infra#139

The work that this builds on:
* rapidsai/gha-tools#118, which adds a shell wrapper that automatically creates spans for the commands that it wraps. It also uses the `opentelemetry-instrument` command to set up monkeypatching for supported Python libraries, if the command is Python-based.
* https://github.com/rapidsai/shared-workflows/tree/add-telemetry, which installs the gha-tools work from above and sets necessary environment variables. This is only done for the conda-cpp-build.yaml shared workflow at the time of submitting this PR.

The goal of this PR is to observe telemetry data sent from a GitHub Actions build triggered by this PR as a proof of concept. Once it all works, the remaining work is:

* merge rapidsai/gha-tools#118
* Move the opentelemetry-related install stuff in https://github.com/rapidsai/shared-workflows/compare/add-telemetry?expand=1#diff-ca6188672785b5d214aaac2bf77ce0528a48481b2a16b35aeb78ea877b2567bcR118-R125 into https://github.com/rapidsai/ci-imgs, and rebuild ci-imgs
* expand coverage to other shared workflows
* Incorporate the changes from this PR to other jobs and to other repos

Authors:
  - Mike Sarahan (https://github.com/msarahan)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#16924
Updates cmake to 3.28.6 in the JNI Dockerfile used to build the cudf jar. This helps avoid a bug in older cmake where FindCUDAToolkit can fail to find cufile libraries.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Gera Shegalov (https://github.com/gerashegalov)

URL: rapidsai#17342
@pentschev pentschev added feature request New feature or request 2 - In Progress Currently a work in progress non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Nov 19, 2024
@pentschev pentschev requested review from a team as code owners November 19, 2024 18:29
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 19, 2024
@github-actions github-actions bot added Java Affects Java cuDF API. pylibcudf Issues specific to the pylibcudf package labels Nov 19, 2024
@pentschev pentschev marked this pull request as draft November 19, 2024 18:30
Comment on lines -732 to +739
-        pdf = pl.DataFrame._from_pydf(df)
+        df = DataFrame.from_polars(df)
         if projection is not None:
-            pdf = pdf.select(projection)
-        df = DataFrame.from_polars(pdf)
+            df = df.select(projection)
Member


We were previously applying the column projection before converting from CPU to GPU. Now we are immediately moving everything to GPU, and then applying the column projection. I think we want this to be something like:

        if projection is not None:
            df = df.select(projection)
        df = DataFrame.from_polars(df)
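
To make the cost difference concrete, here is a minimal, library-free sketch of the two orderings. `HostFrame`, `DeviceFrame`, `from_host`, and the transfer counter are hypothetical stand-ins invented for illustration; they are not Polars or cudf.polars APIs, and the counter only simulates a CPU-to-GPU copy.

```python
from __future__ import annotations


class HostFrame:
    """Hypothetical stand-in for a CPU-side frame (not a Polars class)."""

    def __init__(self, columns: dict[str, list[int]]) -> None:
        self.columns = columns

    def select(self, names: list[str]) -> HostFrame:
        # Keep only the requested columns, still on the "host".
        return HostFrame({name: self.columns[name] for name in names})


class DeviceFrame:
    """Hypothetical stand-in for a GPU-side frame (not a cudf.polars class)."""

    columns_copied = 0  # simulated host-to-device transfer counter

    def __init__(self, columns: dict[str, list[int]]) -> None:
        self.columns = columns

    @classmethod
    def from_host(cls, frame: HostFrame) -> DeviceFrame:
        # Every column present at conversion time counts as one transfer.
        cls.columns_copied += len(frame.columns)
        return cls(dict(frame.columns))

    def select(self, names: list[str]) -> DeviceFrame:
        # Projection on the "device": no further transfers are counted.
        return DeviceFrame({name: self.columns[name] for name in names})


def convert_then_project(frame: HostFrame, projection: list[str] | None) -> DeviceFrame:
    # Ordering in the new diff: move everything, then drop columns.
    out = DeviceFrame.from_host(frame)
    if projection is not None:
        out = out.select(projection)
    return out


def project_then_convert(frame: HostFrame, projection: list[str] | None) -> DeviceFrame:
    # Ordering suggested in the review: drop columns, then move the rest.
    if projection is not None:
        frame = frame.select(projection)
    return DeviceFrame.from_host(frame)


def columns_copied(func, frame: HostFrame, projection: list[str] | None) -> int:
    # Reset the counter, run one ordering, and report the simulated transfers.
    DeviceFrame.columns_copied = 0
    func(frame, projection)
    return DeviceFrame.columns_copied
```

Both orderings return the same columns; in this toy model they differ only in how many columns cross the simulated host-to-device boundary, which is the concern raised above.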

Labels
2 - In Progress Currently a work in progress cudf.polars Issues specific to cudf.polars feature request New feature or request Java Affects Java cuDF API. non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: In Progress

5 participants