Refactor numpy array input in as_column #14651

mroeschke · 2023-12-19T00:17:32Z

Description

Simplifies the numpy array input logic in as_column to:

if object/string dtype like:
    # parse with pandas with inference
elif numeric-like dtype or datelike with nat:
    # parse with pyarrow (due to np.nan/np.nat/nan_is_null handling)
else:
    # create column from buffer
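
A minimal sketch of the branching described above, for illustration only. The helper name `choose_route` and its string return values are hypothetical and are not part of cudf; it only reports which path the description would pick for a given numpy array, and it treats any dtype not named in the description as the buffer case.

```python
import numpy as np


def choose_route(arbitrary: np.ndarray) -> str:
    """Return the conversion path the description would pick.

    Hypothetical helper for illustration; not cudf's implementation.
    """
    kind = arbitrary.dtype.kind
    if kind in "OSU":
        # object/string-like dtype: parse with pandas so it can infer a type
        return "pandas"
    if kind in "biuf" or (kind in "mM" and np.isnat(arbitrary).any()):
        # numeric-like, or datelike containing NaT: parse with pyarrow
        return "pyarrow"
    # everything else (e.g. datelike without NaT): build the column from a buffer
    return "buffer"


print(choose_route(np.array([1.0, np.nan])))
print(choose_route(np.array(["2020-01-01"], dtype="datetime64[s]")))
```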

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

wence- · 2024-01-09T15:52:05Z

python/cudf/cudf/core/column/column.py

+            is_nat = np.isnat(arbitrary)
+            if (nan_as_null is None or nan_as_null) and np.isnat(
+                arbitrary
+            ).any():


Suggested change

-            is_nat = np.isnat(arbitrary)
-            if (nan_as_null is None or nan_as_null) and np.isnat(
-                arbitrary
-            ).any():
+            is_nat = np.isnat(arbitrary)
+            if (nan_as_null is None or nan_as_null) and is_nat.any():

Let's not compute isnat more often than necessary.

wence- · 2024-01-09T15:53:57Z

python/cudf/cudf/core/column/column.py

+                return as_column(
+                    pa.array(arbitrary),
+                    dtype=dtype,
+                    nan_as_null=nan_as_null,


Which cases are we handling where, and why does this one go via pyarrow whereas the nan_as_null == False or not any(is_nat) case goes via as_buffer?

In the pyarrow route, we want to convert NaT to NA, which pyarrow does nicely by default (nan_as_null=True).

In the buffer route, we want to consider NaT as NA in the mask but maintain NaT as a value (nan_as_null=False).

I'll leave a comment about this
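
As a rough illustration of the buffer route described in this reply, the numpy-only sketch below derives a validity mask from `np.isnat` while leaving the NaT sentinel in the underlying data. The variable names `valid_mask` and `raw` are made up for this example, and this is not how cudf builds its buffers; it only shows that the mask and the stored value are separate pieces of information.

```python
import numpy as np

arbitrary = np.array(["2020-01-01", "NaT"], dtype="datetime64[s]")

# Compute the NaT positions once and reuse them
is_nat = np.isnat(arbitrary)

# Validity mask: True where the element is a real value, False where it is NaT
valid_mask = ~is_nat

# The data itself is untouched: NaT is still stored as the int64 minimum
raw = arbitrary.view("int64")

print(valid_mask.tolist())
print(raw.tolist())
```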

wence- · 2024-01-09T16:03:46Z

python/cudf/cudf/core/column/column.py

+        elif arbitrary.dtype.kind in "mM":
+            time_unit = get_time_unit(arbitrary)
+            if time_unit in ("D", "W", "M", "Y"):
+                # TODO: Raise in these cases instead of downcasting to s?


Arguably yes because not all valid datetimes with a coarser resolution can be represented with s resolution.
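
The point about coarser resolutions can be checked with plain integer arithmetic. numpy stores a datetime64 value as a signed 64-bit count of units since the epoch, so a day count that fits in 64 bits may no longer fit once it is multiplied out to seconds. The helper `day_offset_fits_in_seconds` below is hypothetical and only sketches that bound; it does not reflect what cudf does when it downcasts.

```python
INT64_MAX = 2**63 - 1
SECONDS_PER_DAY = 86_400


def day_offset_fits_in_seconds(days: int) -> bool:
    """True if a day offset from the epoch fits in int64 at second resolution."""
    return abs(days) * SECONDS_PER_DAY <= INT64_MAX


# Largest day offset that is still representable at second resolution
max_days = INT64_MAX // SECONDS_PER_DAY

print(day_offset_fits_in_seconds(18_262))        # 2020-01-01 as days since epoch
print(day_offset_fits_in_seconds(max_days + 1))  # valid at "D", too large at "s"
```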

vyasr · 2024-02-27T19:19:12Z

Is this intended to still be in draft?

mroeschke · 2024-02-27T19:22:42Z

Is this intended to still be in draft?

Thanks for checking in on this. Yes, I think there are some failing tests I am still working through.

vyasr · 2024-04-04T19:12:03Z

python/cudf/cudf/core/column/column.py

+            if pd.isna(arbitrary).any():
+                arbitrary = pa.array(arbitrary)
+            else:
+                arbitrary = pd.Series(arbitrary)


What happens if we just always use pyarrow here? Do we lose something from non-nullable data that pandas captures better?

There are cases where pandas will infer a non-object dtype from dtype=object input, which helps match pandas behavior, e.g.

    In [1]: import cudf, pandas as pd, numpy as np

    # e.g. a `to_numpy()` round-trip
    In [2]: pd.Series(np.array([pd.Timestamp(2020, 1, 1)], dtype=object))
    Out[2]:
    0   2020-01-01
    dtype: datetime64[ns]

I'll add a comment about that here
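
A small sketch of the two checks discussed in this thread, assuming pandas and numpy are installed. It mirrors the `pd.isna(arbitrary).any()` test from the diff and the pandas inference shown in the reply; the variable names are illustrative, and the example checks only the dtype kind because the inferred unit may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Object array without missing values: pandas infers a datetime dtype
arbitrary = np.array([pd.Timestamp(2020, 1, 1)], dtype=object)
inferred = pd.Series(arbitrary)

# Object array with a missing value: this is the case the diff sends to pyarrow
with_missing = np.array([pd.Timestamp(2020, 1, 1), None], dtype=object)
has_na = bool(pd.isna(with_missing).any())

print(arbitrary.dtype, "->", inferred.dtype)
print(has_na)
```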

vyasr · 2024-04-04T19:12:55Z

python/cudf/cudf/core/column/column.py

+                        dtype=dtype,
+                        nan_as_null=nan_as_null,
+                    )
+                else:


Superfluous else since we're returning above.

Good point. Removed

wence-

I think this one looks good too, thanks for all the ongoing cleanup work.

wence- · 2024-04-05T08:46:45Z

python/cudf/cudf/core/column/column.py

+            # Handle case that `arbitrary` elements are cupy arrays
+            if len(arbitrary) > 0 and hasattr(
+                arbitrary[0], "__cuda_array_interface__"
+            ):
+                return as_column(


This seems "oddly specific". Is this for a case like having a list of cupy arrays?

Correct. I had added this since I thought there was a unit test that would hit this branch. I am no longer encountering that after running the test suite again, so I'll remove it for now.

mroeschke · 2024-04-05T23:01:05Z

/merge

mroeschke added 2 commits December 18, 2023 16:13

Refactor numpy array input in as_column

aac2c1e

Merge remote-tracking branch 'upstream/branch-24.02' into ref/np_array_handling

5afd82c

mroeschke added the Python (Affects Python cuDF API.), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels Dec 19, 2023

mroeschke requested a review from a team as a code owner December 19, 2023 00:17

mroeschke requested review from wence- and bdice December 19, 2023 00:17

mroeschke added 7 commits December 18, 2023 16:55

make bool and int data go through arrow

18be5c0

Merge remote-tracking branch 'upstream/branch-24.02' into ref/np_array_handling

1eb5cf0

Fix typo, just build buffer again for datelike types

ac403ca

Merge remote-tracking branch 'upstream/branch-24.02' into ref/np_array_handling

2655e26

Fix ==

e192238

Treat NAs in np array as NULL with arrow

8778e29

Merge remote-tracking branch 'upstream/branch-24.02' into ref/np_array_handling

dbca8d4

wence- reviewed Jan 9, 2024

mroeschke added 3 commits January 11, 2024 11:42

Merge remote-tracking branch 'upstream/branch-24.02' into ref/np_array_handling

2013017

Reuse is_nat

e75893a

Add comments about NaT behavior

bdb8da0

mroeschke marked this pull request as draft January 11, 2024 19:55

mroeschke changed the base branch from branch-24.02 to branch-24.04 January 31, 2024 22:04

mroeschke added 6 commits January 31, 2024 14:04

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

b39a75e

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

94baa34

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

edd2080

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

66d4403

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

ba963ad

Trigger CI

192f376

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

1c923b4

mroeschke added 10 commits February 27, 2024 17:54

Fix some tests

1917103

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

99c1419

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

a969b84

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

92856de

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

147ba2b

fix some concat tests

ef56817

Dont create mask if no NAs

ab5d10f

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

e2fb5b4

Fix test, add error for unitless datetlike

96f1491

Merge remote-tracking branch 'upstream/branch-24.04' into ref/np_array_handling

912e5b9

mroeschke marked this pull request as ready for review March 15, 2024 23:15

Merge remote-tracking branch 'upstream/branch-24.06' into ref/np_array_handling

95ae6b2

mroeschke changed the base branch from branch-24.04 to branch-24.06 March 18, 2024 22:04

vyasr reviewed Apr 4, 2024

mroeschke added 2 commits April 4, 2024 16:31

Merge remote-tracking branch 'upstream/branch-24.06' into ref/np_array_handling

bee6c91

Address review

2c64990

wence- approved these changes Apr 5, 2024

mroeschke added 3 commits April 5, 2024 11:59

Merge remote-tracking branch 'upstream/branch-24.06' into ref/np_array_handling

4d57148

Remove carveout for np array of cupy objects

3ffd9e6

Merge remote-tracking branch 'upstream/branch-24.06' into ref/np_array_handling

15fff60

rapids-bot bot merged commit c5eb324 into rapidsai:branch-24.06 Apr 5, 2024
69 checks passed

mroeschke deleted the ref/np_array_handling branch April 5, 2024 23:01
