PERF: faster _coerce_to_data_and_mask() for astype("Float64") #60121

Merged

@auderson (Contributor) commented Oct 29, 2024

Use np.isnan for floats instead of libmissing.is_numeric_na. Also add a fast path for booleans.
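
A minimal sketch of the idea described above, using only NumPy. The helper name `sketch_mask` is hypothetical, and the all-False mask for booleans is an assumption about what the fast path amounts to; the actual branches inside `_coerce_to_data_and_mask()` may differ.

```python
import numpy as np


def sketch_mask(values):
    """Hypothetical helper: build a missing-value mask for a NumPy array."""
    values = np.asarray(values)
    if values.dtype.kind == "b":
        # Booleans cannot hold NaN, so the mask is all False
        # (assumed shape of the boolean fast path).
        return np.zeros(values.shape, dtype=bool)
    if values.dtype.kind == "f":
        # Vectorized NaN check for float input.
        return np.isnan(values)
    raise TypeError(f"unsupported dtype in this sketch: {values.dtype}")


float_mask = sketch_mask(np.array([1.0, np.nan, 3.0]))
bool_mask = sketch_mask(np.array([True, False, True]))
print(float_mask.tolist())  # [False, True, False]
print(bool_mask.tolist())   # [False, False, False]
```

Both branches avoid a per-element Python-level check, which is where the speedup is expected to come from.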

Prev: [image]
New: [image]

@auderson (Contributor Author)

Failures seem unrelated: FAILED pandas/tests/io/test_parquet.py::TestParquetPyArrow::test_timezone_aware_index[timezone_aware_date_list1] - AssertionError: DataFrame.index are different

@auderson (Contributor Author) commented Oct 29, 2024

The asv results don't look as impressive as the results above; I'm guessing the scale of these benchmarks is much smaller.
[image]

@auderson (Contributor Author)

@rhshadrach friendly ping

@jorisvandenbossche added the Performance and NA - MaskedArrays labels Oct 30, 2024

@jorisvandenbossche (Member) left a comment

This looks good to me, thanks a lot @auderson!

@jorisvandenbossche added this to the 3.0 milestone Oct 30, 2024

    elif values.dtype.kind == "f":
        # np.isnan is faster than is_numeric_na() for floats
        # github issue: #60066
        mask = np.isnan(values)

Member

Could we ever get masked float values here with pd.NA?

Member

There is a

    if not copy:
        values = np.asarray(values)
    else:
        values = np.array(values, copy=copy)

above, so values is guaranteed to be a NumPy array at this point.
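
A small sketch of why the dtype check settles the question, using only NumPy. The `NA` sentinel below is a stand-in for `pd.NA` (an assumption made so the example does not need pandas): an array that holds such an object cannot have a float dtype, so it would not reach a `dtype.kind == "f"` branch.

```python
import numpy as np

# Stand-in for pd.NA (assumption; not the real pandas object).
NA = object()

floats = np.array([1.0, np.nan, 3.0])

# copy=False path: np.asarray returns the same ndarray, no copy.
same = np.asarray(floats)
# copy=True path: np.array allocates a new buffer.
copied = np.array(floats, copy=True)

# Mixing floats with an arbitrary Python object falls back to object dtype.
with_sentinel = np.asarray([1.0, NA, 3.0])

print(same is floats)                    # True
print(np.shares_memory(copied, floats))  # False
print(floats.dtype.kind)                 # f
print(with_sentinel.dtype.kind)          # O
```

Either way the result is a plain ndarray, and np.isnan is only applied when its dtype kind is "f".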

@mroeschke merged commit 00d4189 into pandas-dev:main Oct 30, 2024
56 of 57 checks passed

@mroeschke (Member)

Thanks @auderson

Successfully merging this pull request may close these issues.

PERF: df.astype("float64[pyarrow]") is slow, df.astype("Float64") is super slow