Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: support EAs in nargsort without warning #25595

Closed

Conversation

jorisvandenbossche
Copy link
Member

Closes #25439

@jorisvandenbossche jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff ExtensionArray Extending pandas with custom dtypes or arrays. labels Mar 7, 2019
@jorisvandenbossche jorisvandenbossche added this to the 0.24.2 milestone Mar 7, 2019
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want a whatsnew?

@@ -239,7 +241,9 @@ def nargsort(items, kind='quicksort', ascending=True, na_position='last'):
"""

# specially handle Categorical
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is slightly stale now.

@@ -239,7 +241,9 @@ def nargsort(items, kind='quicksort', ascending=True, na_position='last'):
"""

# specially handle Categorical
if is_categorical_dtype(items):
if is_extension_array_dtype(items):
if isinstance(items, ABCIndexClass):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so when we have ExtensionArrays-backed indexes in the future, this is liable to break? Or we'll need a ._values as part of the interface...

I don't suppose items = items._values_for_argsort works here? In theory, DatetimeArray could just use i8 values for sorting, bypassing the warning?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

._values is defined to be "the best array representation", so if the Index is backed by an EA, it will always return the EA? (also if we have an ExtensionIndex). I can maybe also use the public .array now?

I don't suppose items = items._values_for_argsort

No, because we want to dispatch to the argsort method of the EA, I think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I mis-remembered where the warning was coming from (I thought it was implemented on DatetimeArray).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why exactly is this change needed? should this not just defer to the index .argsort, which for DTA handles this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Index.argsort has a different signature (and that's not specific to DatetimeTZ, but the general Index.argsort, so I would rather not deal with that here. Can open an issue for that)

@jorisvandenbossche
Copy link
Member Author

Probably want a whatsnew?

There is no behaviour change, just the warning, so thought it might not be needed. But can certainly add one.


def test_nargsort_datetimearray_warning(self, recwarn):
# https://github.com/pandas-dev/pandas/issues/25439
# can be removed once the FutureWarning for np.array(DTA) is removed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are not using tm.assert_produces_warning, this looks non-standard

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I want to assert that a specific warning is not raised. I found it a bit clunky as well, so if you know a better way, all ears.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with pytest.raises(None): is the cleanest, but I think our linter disallows that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming you mean pytest.warns(None): in that case you still have to do something like

with pytest.warns(None) as record:
    ....

assert len(record) == 0

giving basically the same as I did above (but above I did it with the recwarn fixture instead of the with statement. Happy to use the with statement if that is more typically used / easier to understand. Actually now thinking about it, I would find that clearer myself :-))

Copy link
Contributor

@TomAugspurger TomAugspurger Mar 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I meant pytest.warns. I really thought that pytest.warns(None) asserted that there were no warnings, but apparently not.

In theory, you could rely on our global "fail on warning" setting from the one CI job, but I don't think I trust that enough :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tm.assert_produces_warning(None) already does this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But not for a specific warning, what I tried to do here.
But OK, maybe it is not that important to do it (the reason I wanted to do that was to avoid unrelated failures, eg when numpy starts raising a warning on something unrelated used under the hood).

Will change to tm.assert_produces_warning(None)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But not for a specific warning, what I tried to do here.
But OK, maybe it is not that important to do it (the reason I wanted to do that was to avoid unrelated failures, eg when numpy starts raising a warning on something unrelated used under the hood).

sure, but try to guard against unexpected warnings is a bit of a rabbit hole

@@ -239,7 +241,9 @@ def nargsort(items, kind='quicksort', ascending=True, na_position='last'):
"""

# specially handle Categorical
if is_categorical_dtype(items):
if is_extension_array_dtype(items):
if isinstance(items, ABCIndexClass):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why exactly is this change needed? should this not just defer to the index .argsort, which for DTA handles this?

@jorisvandenbossche
Copy link
Member Author

There is actually a broader issue here (also the reason that the tests are failing, eg IntervalArray.argsort is failing on data with NaNs): the EA.argsort API is not clearly defined (or not tested), and the check I edited here was originally specific for categorical data, so it also assumed that NaNs would be sorted first (-1 in the codes). While in general, I think from an argsort method it is expected to sort them last.

Eg

In [12]: pd.Categorical(['a', 'b', np.nan]).argsort()                   
Out[12]: array([2, 0, 1])

In [13]: pd.Categorical(['a', 'b', np.nan]).sort_values()                 
Out[13]: 
[a, b, NaN]
Categories (2, object): [a, b]

so in the above, argsort and sort_values have a different behaviour regarding missing values.

If we want to generalize this, this "sorting last" should be moved inside Categorical.argsort. But that would mean for the case where core.sorting.nargsort wants to put them first, you do this moving first/last twice (moving last inside Categorical.argsort, moving back first inside nargsort).

Unless we would add the na_position to the EA.argsort interface.

@jorisvandenbossche
Copy link
Member Author

@TomAugspurger would you be fine with catching this specific warning in our code for 0.24.2 (because my explanation above has too much impact for 0.24.2)?
Or what is the reason for objecting filtering the warning? It's certainly not good practice, but showing this warning is just plain confusing for users, as there is nothing they should do anything about.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 10, 2019 via email

@jorisvandenbossche
Copy link
Member Author

OK, opened #25629 for that

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.24.2, 0.25.0 Mar 10, 2019
@jreback
Copy link
Contributor

jreback commented Apr 20, 2019

can you merge master

@jreback
Copy link
Contributor

jreback commented Jun 8, 2019

superseded by #26354

@jreback jreback closed this Jun 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FutureWarning when sorting tz-aware datetimeindex
3 participants