Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor numpy array input in as_column #14651

Merged
merged 35 commits into from
Apr 5, 2024

Conversation

mroeschke
Copy link
Contributor

@mroeschke mroeschke commented Dec 19, 2023

Description

Simplifies the numpy array input logic to as_column to be

if object/string dtype like:
    # parse with pandas with inference
elif numeric-like dtype or datelike with nat:
    # parse with pyarrow (due to np.nan/np.nat/nan_is_null handling)
else:
    # create column from buffer

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 19, 2023
@mroeschke mroeschke requested a review from a team as a code owner December 19, 2023 00:17
@mroeschke mroeschke requested review from wence- and bdice December 19, 2023 00:17
python/cudf/cudf/core/column/column.py Show resolved Hide resolved
Comment on lines 2232 to 2235
is_nat = np.isnat(arbitrary)
if (nan_as_null is None or nan_as_null) and np.isnat(
arbitrary
).any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is_nat = np.isnat(arbitrary)
if (nan_as_null is None or nan_as_null) and np.isnat(
arbitrary
).any():
is_nat = np.isnat(arbitrary)
if (nan_as_null is None or nan_as_null) and is_nat.any():

Let's not compute isnat more often than necessary.

return as_column(
pa.array(arbitrary),
dtype=dtype,
nan_as_null=nan_as_null,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the cases we're handling where, and why does this one go via pyarrow whereas the nan_as_null == False or not any(is_nat) goes via as_buffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the pyarrow route, we want to convert NaT to NA which pyarrow does nicely be default (nan_as_null=True)

In the buffer route, we want to consider NaT as NA in the mask but maintain NaT as a value (nan_as_null=False)

I'll leave a comment about this

elif arbitrary.dtype.kind in "mM":
time_unit = get_time_unit(arbitrary)
if time_unit in ("D", "W", "M", "Y"):
# TODO: Raise in these cases instead of downcasting to s?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably yes because not all valid datetimes with a coarser resolution can be represented with s resolution.

@mroeschke mroeschke marked this pull request as draft January 11, 2024 19:55
@mroeschke mroeschke changed the base branch from branch-24.02 to branch-24.04 January 31, 2024 22:04
@vyasr
Copy link
Contributor

vyasr commented Feb 27, 2024

Is this intended to still be in draft?

@mroeschke
Copy link
Contributor Author

Is this intended to still be in draft?

Thanks for checking in on this. Yes, I think there are some failing tests I am still working through

@mroeschke mroeschke marked this pull request as ready for review March 15, 2024 23:15
@mroeschke mroeschke changed the base branch from branch-24.04 to branch-24.06 March 18, 2024 22:04
Comment on lines 2061 to 2064
if pd.isna(arbitrary).any():
arbitrary = pa.array(arbitrary)
else:
arbitrary = pd.Series(arbitrary)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we just always use pyarrow here? Do we lose something from non-nullable data that pandas captures better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are cases where pandas will infer dtype=object into a non-object type that would help match pandas behavior e.g.

In [1]: import cudf, pandas as pd, numpy as np

# e.g a `to_numpy()` round-trip
In [2]: pd.Series(np.array([pd.Timestamp(2020, 1, 1)], dtype=object))
Out[2]: 
0   2020-01-01
dtype: datetime64[ns]

I'll add a comment about that here

dtype=dtype,
nan_as_null=nan_as_null,
)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superfluous else since we're returning above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Removed

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one looks good too, thanks for all the ongoing cleanup work.

Comment on lines 2051 to 2055
# Handle case that `arbitrary` elements are cupy arrays
if len(arbitrary) > 0 and hasattr(
arbitrary[0], "__cuda_array_interface__"
):
return as_column(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems "oddly specific". This is something like we have a list of cupy arrays?

Copy link
Contributor Author

@mroeschke mroeschke Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I had added this since I thought there was a unit test that would hit this branch. I am no longer encountering that after running the test suite again so I'll actually remove it for now

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit c5eb324 into rapidsai:branch-24.06 Apr 5, 2024
69 checks passed
@mroeschke mroeschke deleted the ref/np_array_handling branch April 5, 2024 23:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants