Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: ExtensionArray.argsort places the missing value at the end #26354

Closed
wants to merge 24 commits into from

Conversation

makbigc
Copy link
Contributor

@makbigc makbigc commented May 12, 2019

_values_for_argsort returns different values for missing value in different EA. For example, missing value in PandasArray gets np.nan and in PeriodArray gets -9223372036854775808 (in my machine). np.argsort places np.nan at the end and the negative number in the beginning.

In this PR, ExtensionArray.argsort separates the array into two parts, data and missing values. Sort the data part and finally append missing part to it. The approach can work in general, no matter what the replacement of missing value in values_for_argsort is.

@codecov
Copy link

codecov bot commented May 12, 2019

Codecov Report

Merging #26354 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26354      +/-   ##
==========================================
- Coverage   91.86%   91.85%   -0.01%     
==========================================
  Files         179      179              
  Lines       50700    50701       +1     
==========================================
- Hits        46576    46573       -3     
- Misses       4124     4128       +4
Flag Coverage Δ
#multiple 90.45% <100%> (ø) ⬆️
#single 41.12% <53.33%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/integer.py 96.25% <ø> (-0.06%) ⬇️
pandas/core/arrays/base.py 98.87% <100%> (-0.01%) ⬇️
pandas/core/arrays/period.py 98.3% <100%> (ø) ⬆️
pandas/core/arrays/categorical.py 95.82% <100%> (+0.02%) ⬆️
pandas/core/sorting.py 98.36% <100%> (ø) ⬆️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.88% <0%> (-0.12%) ⬇️
pandas/core/arrays/sparse.py 93.59% <0%> (ø) ⬆️
pandas/core/arrays/numpy_.py 94.49% <0%> (ø) ⬆️
pandas/core/arrays/timedeltas.py 88.82% <0%> (ø) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2d2606d...84f87a8. Read the comment docs.

@codecov
Copy link

codecov bot commented May 12, 2019

Codecov Report

Merging #26354 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26354      +/-   ##
==========================================
- Coverage   92.04%   92.03%   -0.01%     
==========================================
  Files         175      175              
  Lines       52292    52308      +16     
==========================================
+ Hits        48133    48144      +11     
- Misses       4159     4164       +5
Flag Coverage Δ
#multiple 90.59% <100%> (ø) ⬆️
#single 40.7% <26.66%> (-0.18%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/base.py 99.48% <100%> (+0.04%) ⬆️
pandas/core/arrays/datetimelike.py 97.71% <100%> (ø) ⬆️
pandas/core/arrays/period.py 98.53% <100%> (ø) ⬆️
pandas/core/arrays/integer.py 96.36% <100%> (+0.01%) ⬆️
pandas/core/arrays/categorical.py 95.95% <100%> (ø) ⬆️
pandas/core/sorting.py 98.35% <100%> (ø) ⬆️
pandas/core/arrays/numpy_.py 94.49% <100%> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.01% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.6% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb0376c...73f88a5. Read the comment docs.

@@ -57,6 +57,12 @@ def data_for_sorting(dtype):
return integer_array([1, 2, 0], dtype=dtype)


@pytest.fixture
def data_multiple_nan(dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we not already have a suitable fixture for this? or just change data_misssing_for_sorting to have another nan; adding this creates even more boilerplate and makes testing even more complex

pandas/core/arrays/integer.py Outdated Show resolved Hide resolved
def permutation(mask):
# Return a permutation which maps the indices of the
# subarray without nan to the indices of the original array.
permu = np.arange(len(mask))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just mask.cumsum() no? this logic is pretty complicated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not have a function, just in-line this

# subarray without nan to the indices of the original array.
permu = np.arange(len(mask))
nan_loc = np.arange(len(mask))[mask]
offset = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have logic like this elsewhere in the codebase perhaps have a look at pandas/core/indexing.py

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or pandas/core/sorting.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sub function permutation is simplified.

@gfyoung gfyoung added API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels May 12, 2019
@@ -40,6 +40,7 @@ Other Enhancements
- :meth:`DataFrame.query` and :meth:`DataFrame.eval` now supports quoting column names with backticks to refer to names with spaces (:issue:`6508`)
- :func:`merge_asof` now gives a more clear error message when merge keys are categoricals that are not equal (:issue:`26136`)
- :meth:`pandas.core.window.Rolling` supports exponential (or Poisson) window type (:issue:`21303`)
- :meth:`ExtensionArray.argsort` places the nan at the end of the sorted array (:issue:`21801`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go under API breaking changes.

@@ -370,13 +370,15 @@ def _values_for_argsort(self) -> np.ndarray:
ndarray
The transformed values should maintain the ordering between values
within the array.
ndarray
The mask which indicates the NaN values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NaN -> NA.

@makbigc
Copy link
Contributor Author

makbigc commented May 24, 2019

@TomAugspurger do you have anything else to implement?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 24, 2019 via email

@@ -252,6 +252,7 @@ Other API Changes
- The ``arg`` argument in :meth:`pandas.core.groupby.DataFrameGroupBy.agg` has been renamed to ``func`` (:issue:`26089`)
- The ``arg`` argument in :meth:`pandas.core.window._Window.aggregate` has been renamed to ``func`` (:issue:`26372`)
- Most Pandas classes had a ``__bytes__`` method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as a part of dropping Python2 (:issue:`26447`)
- :meth:`ExtensionArray.argsort` places the nan at the end of the sorted array (:issue:`21801`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double-backticks around nan

def permutation(mask):
# Return a permutation which maps the indices of the
# subarray without nan to the indices of the original array.
permu = np.arange(len(mask))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not have a function, just in-line this

if mask.any():
notmask = ~mask
notnull = np.argsort(values[notmask], kind=kind, **kwargs)
permu = permutation(mask)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might have some utilities in pandas/core/sorting.py which are helpful here (or as a possible future PR to pull common code out)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find the common code in pandas/core/sorting.py. Can we go ahead? I want to implement argmin for EA based on this PR.

pandas/core/arrays/integer.py Outdated Show resolved Hide resolved
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @jreback was thinking of nargsort in pandas/core/sorting. Does that work? it may not be able to handles EAs yet.

This will need some tests. I would re-use the values_for_sorting fixture and do a .take to get some missing values.

@@ -252,6 +252,7 @@ Other API Changes
- The ``arg`` argument in :meth:`pandas.core.groupby.DataFrameGroupBy.agg` has been renamed to ``func`` (:issue:`26089`)
- The ``arg`` argument in :meth:`pandas.core.window._Window.aggregate` has been renamed to ``func`` (:issue:`26372`)
- Most Pandas classes had a ``__bytes__`` method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as a part of dropping Python2 (:issue:`26447`)
- :meth:`ExtensionArray.argsort` places the ``nan`` at the end of the sorted array (:issue:`21801`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nan to missing values. Not all EAs use NaN as their missing value marker.

@@ -370,13 +370,17 @@ def _values_for_argsort(self) -> np.ndarray:
ndarray
The transformed values should maintain the ordering between values
within the array.
ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you have multiple return values of the same type, you should use names

values : ndarray
    The transformed values ...
mask : ndarray
    The mask indicating which values are NA...

# permu maps the indices of the subarray
# without nan to the indices of the original array.
permu = np.arange(len(mask))
permu = permu[~mask]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~mask -> notmask.


notnull = permu[notnull]
allnan = np.arange(len(self))[mask]
result = np.append(notnull, allnan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Append copies, which I don't think we want. Use np.concatenate.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think core.sorting.nargsort should be updated to dispatch to ExtensionArrays (the categorical path can be made general for any extension type).

And are we OK with doing a hard break for Categorical here? It has had this "NA first" behaviour already for a long time I think, so we might need to deprecate this first?

Is there a reason that you removed the tests? (this should certainly be tested)

@@ -361,7 +361,7 @@ def isna(self) -> ArrayLike:
"""
raise AbstractMethodError(self)

def _values_for_argsort(self) -> np.ndarray:
def _values_for_argsort(self) -> Tuple[np.ndarray, np.ndarray]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change for ExtensionArray authors. I am not sure how many already rely on this, but should we deprecate it first? (eg handle both cases (only values and values+mask) where it is used, and raise warning if only values are returned)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or actually, is it needed to change this? As we can get the mask just from self.isna() ?


# permu maps the indices of the subarray
# without nan to the indices of the original array.
permu = np.arange(len(mask))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is the name "permu" coming from?
Maybe have a look in pd.core.sorting.nargsort to use more or less the same names in the code.

@makbigc
Copy link
Contributor Author

makbigc commented Jun 6, 2019

  1. @jorisvandenbossche The case of Categorical will be handled in a separate PR. As you said, the 'NA first' behaviour is required in the test test_sort_values_na_position and test_sort_index_na_position_with_categories.

  2. sorting.nargsort calls __array__ of EAs which has a similar output as _values_for_argsort. So some _values_for_argsort methods are removed.

    items = np.asanyarray(items)

@jreback
Copy link
Contributor

jreback commented Jun 6, 2019

can you add some tests to assert the results of the sort

@jreback
Copy link
Contributor

jreback commented Jun 8, 2019

@makbigc can you also add any tests that are in #25595 as well (this PR will supersed)

@@ -361,23 +362,6 @@ def isna(self) -> ArrayLike:
"""
raise AbstractMethodError(self)

def _values_for_argsort(self) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we leave this and just enhance the signature, passing na_position ? then the default implementation will just work

Copy link
Contributor Author

@makbigc makbigc Jun 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_values_for_argsort is added backed to ExtensionArray interface. Actually, _values_for_argsort which calls np.array does the same thing as np.asanyarray in nargsort. But the code in ExtensionArray.argsort is organized in more general way.

DatetimeLikeArray and PeriodArray's own _values_for_argsort methods are deleted. Otherwise, NA values in PeriodArray would get a negative number from self._data which would be placed in the beginning in nargsort.

@pep8speaks
Copy link

pep8speaks commented Jun 9, 2019

Hello @makbigc! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-14 08:10:10 UTC

@makbigc
Copy link
Contributor Author

makbigc commented Jun 9, 2019

The test_nargsort_datetimearray_warning was already added in #25629.
Another test_nargsort_extension_array is added.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mentioned some perf issues, is Int64 ok on perf? what about PandasArrays? (do we have asv's for this?)

pandas/core/arrays/base.py Outdated Show resolved Hide resolved
@@ -620,9 +620,6 @@ def _values_for_factorize(self):
def _from_factorized(cls, values, original):
return cls(values, dtype=original.dtype)

def _values_for_argsort(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to cast to object dtype for datetimetz. I think this can be reverted.

In [38]: a = pd.date_range('2000', periods=100, tz='US/Eastern')._data

In [39]: %timeit a.argsort()
325 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [40]: b = pd.date_range('2000', periods=100)._data

In [41]: %timeit b.argsort()
44.8 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This should be reve

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@makbigc I believe you reverted this and pref is back to normal, can you confirm (also do we hav an asv for this example that @TomAugspurger gave; if not can you add).

@makbigc
Copy link
Contributor Author

makbigc commented Jun 10, 2019

Jeff. The nargsort is the wrapper of np.argsort. argsort of EAs has been using np.argsort. The replacement of argsort with nargsort doesn't improve the performance. This PR only standardizes the position of NA values after sorting.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good, some questions.

pandas/core/arrays/categorical.py Show resolved Hide resolved
@@ -501,23 +501,6 @@ def value_counts(self, dropna=True):

return Series(array, index=index)

def _values_for_argsort(self) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does we have a benchmark for sorting Int64? so does this change things?

Copy link
Contributor Author

@makbigc makbigc Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't found the benchmark for argsort. But this change slows down the performance.
HEAD:

In [2]: arr = pd.array([3, None, 1], dtype='Int64')                             

In [3]: %timeit arr.argsort()                                                   
28.3 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

master:

In [2]: arr = pd.array([3, None, 1], dtype='Int64')                             

In [3]: %timeit arr.argsort()                                                   
5.73 µs ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this benchmark (make the arrays longer though)

@@ -193,6 +193,11 @@ def test_searchsorted(self, data_for_sorting):
if not data_for_sorting.ordered:
raise pytest.skip(reason="searchsorted requires ordered data.")

def test_argsort_nan_last(self, data_missing_for_sorting):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this being skipped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NA-first behaviour of Categorical will be changed to NA-last in the next PR. So the original _values_for_argsort is placed here temporarily.

The corresponding test for NA-first behaviour of Categorical will also be changed.

@makbigc
Copy link
Contributor Author

makbigc commented Jun 14, 2019

It is better to pass EA directly to nargsort. If the output of EA._values_for_argsort is passed to nargsort, the information of NA values' location may be lost in nargsort.

For example, PeriodArray is passed directly to nargsort. The location of its NA values can be obtained by calling isna(). After calling _values_for_argsort and passing the output np.ndarray to nargsort, the NA values are replaced with negative number. nargsort can no longer retrieve the location of NA values.

The nargsort has been modified in #26854. So the _values_for_argsort is called there after the retrieval of location of NA values (i.e. calling isna()). Besides, no matter what the replacement value for NA in _values_for_argsort, nargsort can still function properly.

@@ -403,13 +404,13 @@ def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
"""
# Implementor note: You have two places to override the behavior of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you augment the doc-string for argsort / _values_for_argsort to say that NaNs are placed at the end.

@@ -501,23 +501,6 @@ def value_counts(self, dropna=True):

return Series(array, index=index)

def _values_for_argsort(self) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this benchmark (make the arrays longer though)

@@ -188,6 +189,14 @@ def test_nargsort_datetimearray_warning(self):
with tm.assert_produces_warning(None):
nargsort(data)

@pytest.mark.parametrize('data', [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually instead of here, can you add the test in pandas/tests/extension/base/missing.py (and it will be generic with the data_missing fixure)

@jreback
Copy link
Contributor

jreback commented Jun 27, 2019

@makbigc can you merge master

@jorisvandenbossche jorisvandenbossche changed the title [ENH] ExtensionArray.argsort places the missing value at the end API: ExtensionArray.argsort places the missing value at the end Jun 27, 2019
@makbigc
Copy link
Contributor Author

makbigc commented Jun 30, 2019

After #26854, it is better to pass EA directly to nargsort, because the information about the positions of NaN values can be preserved. Then nargsort calls EA._values_for_argsort itself. This PR is continued in #27137 in which the change of code is smaller.

@makbigc makbigc closed this Jun 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: argsort behaviour for ExtensionArray with missing values
6 participants