-
Notifications
You must be signed in to change notification settings - Fork 926
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor numpy array input in as_column #14651
Refactor numpy array input in as_column #14651
Conversation
is_nat = np.isnat(arbitrary) | ||
if (nan_as_null is None or nan_as_null) and np.isnat( | ||
arbitrary | ||
).any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_nat = np.isnat(arbitrary) | |
if (nan_as_null is None or nan_as_null) and np.isnat( | |
arbitrary | |
).any(): | |
is_nat = np.isnat(arbitrary) | |
if (nan_as_null is None or nan_as_null) and is_nat.any(): |
Let's not compute isnat
more often than necessary.
return as_column( | ||
pa.array(arbitrary), | ||
dtype=dtype, | ||
nan_as_null=nan_as_null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the cases we're handling where, and why does this one go via pyarrow whereas the nan_as_null == False
or not any(is_nat)
goes via as_buffer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the pyarrow route, we want to convert NaT to NA which pyarrow does nicely be default (nan_as_null=True
)
In the buffer route, we want to consider NaT as NA in the mask but maintain NaT as a value (nan_as_null=False
)
I'll leave a comment about this
elif arbitrary.dtype.kind in "mM": | ||
time_unit = get_time_unit(arbitrary) | ||
if time_unit in ("D", "W", "M", "Y"): | ||
# TODO: Raise in these cases instead of downcasting to s? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguably yes because not all valid datetimes with a coarser resolution can be represented with s
resolution.
Is this intended to still be in draft? |
Thanks for checking in on this. Yes, I think there are some failing tests I am still working through |
if pd.isna(arbitrary).any(): | ||
arbitrary = pa.array(arbitrary) | ||
else: | ||
arbitrary = pd.Series(arbitrary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we just always use pyarrow here? Do we lose something from non-nullable data that pandas captures better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are cases where pandas will infer dtype=object
into a non-object type that would help match pandas behavior e.g.
In [1]: import cudf, pandas as pd, numpy as np
# e.g a `to_numpy()` round-trip
In [2]: pd.Series(np.array([pd.Timestamp(2020, 1, 1)], dtype=object))
Out[2]:
0 2020-01-01
dtype: datetime64[ns]
I'll add a comment about that here
dtype=dtype, | ||
nan_as_null=nan_as_null, | ||
) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Superfluous else since we're returning above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this one looks good too, thanks for all the ongoing cleanup work.
# Handle case that `arbitrary` elements are cupy arrays | ||
if len(arbitrary) > 0 and hasattr( | ||
arbitrary[0], "__cuda_array_interface__" | ||
): | ||
return as_column( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems "oddly specific". This is something like we have a list of cupy arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. I had added this since I thought there was a unit test that would hit this branch. I am no longer encountering that after running the test suite again so I'll actually remove it for now
/merge |
Description
Simplifies the numpy array input logic to
as_column
to beChecklist