Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Support decimal dtypes #17113

Draft
wants to merge 1 commit into
base: branch-24.12
Choose a base branch
from

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Oct 17, 2024

Description

Polars only has 128 bit decimals. This currently loses the precision for round trips.

I can run a groupby and it seems to work.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Polars only has 128 bit decimals. This currently loses the precision
for round trips.
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars labels Oct 17, 2024
@wence-
Copy link
Contributor Author

wence- commented Oct 17, 2024

Conversion back to arrow doesn't work because we lost the precision argument

@vyasr
Copy link
Contributor

vyasr commented Oct 30, 2024

Conversion back to arrow doesn't work because we lost the precision argument

Would resolving #17080 (which should be possible with arrow 18) resolve this? Could we carry around extra metadata in the interim to manage this (necessary until we drop support for older versions of arrow anyway)?

@wence-
Copy link
Contributor Author

wence- commented Oct 30, 2024

Conversion back to arrow doesn't work because we lost the precision argument

Would resolving #17080 (which should be possible with arrow 18) resolve this? Could we carry around extra metadata in the interim to manage this (necessary until we drop support for older versions of arrow anyway)?

So I don't really understand decimals, but AFAICT, polars only has an underlying Decimal128 dtype (rather than separate 32, 64, and 128 types like libcudf does). When I therefore go from arrow column to libcudf column, because libcudf's decimal type only cares about the scale and not the precision, we upgrade all of the precisions to "full" (whatever that means). Now when I go back again, the precision argument is different from when we started.

We haven't really looked at dtypes that need extra metadata in cudf-polars yet, but both decimals and struct dtypes do need that. And then, transitively, list dtypes also need it. So I think we're going to have to do something. One thing might just be to keep the polars dtype around and use that in various places.

@wence-
Copy link
Contributor Author

wence- commented Oct 30, 2024

As an example:

import polars as pl
import pylibcudf as plc

df = pl.DataFrame({"a": pl.Series([1, 2, 3], dtype=pl.Decimal(precision=5, scale=3))})

print(df.to_arrow())

print(plc.interop.to_arrow(plc.interop.from_arrow(df.to_arrow())))

pyarrow.Table
a: decimal128(5, 3)
----
a: [[1.000,2.000,3.000]]
pyarrow.Table
: decimal128(38, 3) not null
----
: [[1.000,2.000,3.000]]

So the data are all there, but the precision is missing

@vyasr
Copy link
Contributor

vyasr commented Oct 31, 2024

Ah OK so this is about the values stored in the dtype, not the dtype itself being lost (which is what my question was about, i.e. whether getting 32 and 64 bit decimals in arrow would help; although even then I guess you'd need those explicitly added in polars too).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.polars Issues specific to cudf.polars Python Affects Python cuDF API.
Projects
Status: In Progress
Status: In Progress
Development

Successfully merging this pull request may close these issues.

2 participants