-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: ExtensionArray.argsort places the missing value at the end #26354
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26354 +/- ##
==========================================
- Coverage 91.86% 91.85% -0.01%
==========================================
Files 179 179
Lines 50700 50701 +1
==========================================
- Hits 46576 46573 -3
- Misses 4124 4128 +4
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26354 +/- ##
==========================================
- Coverage 92.04% 92.03% -0.01%
==========================================
Files 175 175
Lines 52292 52308 +16
==========================================
+ Hits 48133 48144 +11
- Misses 4159 4164 +5
Continue to review full report at Codecov.
|
@@ -57,6 +57,12 @@ def data_for_sorting(dtype): | |||
return integer_array([1, 2, 0], dtype=dtype) | |||
|
|||
|
|||
@pytest.fixture | |||
def data_multiple_nan(dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we not already have a suitable fixture for this? or just change data_misssing_for_sorting to have another nan; adding this creates even more boilerplate and makes testing even more complex
pandas/core/arrays/base.py
Outdated
def permutation(mask): | ||
# Return a permutation which maps the indices of the | ||
# subarray without nan to the indices of the original array. | ||
permu = np.arange(len(mask)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just mask.cumsum() no? this logic is pretty complicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not have a function, just in-line this
pandas/core/arrays/base.py
Outdated
# subarray without nan to the indices of the original array. | ||
permu = np.arange(len(mask)) | ||
nan_loc = np.arange(len(mask))[mask] | ||
offset = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have logic like this elsewhere in the codebase perhaps have a look at pandas/core/indexing.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or pandas/core/sorting.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sub function permutation is simplified.
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -40,6 +40,7 @@ Other Enhancements | |||
- :meth:`DataFrame.query` and :meth:`DataFrame.eval` now supports quoting column names with backticks to refer to names with spaces (:issue:`6508`) | |||
- :func:`merge_asof` now gives a more clear error message when merge keys are categoricals that are not equal (:issue:`26136`) | |||
- :meth:`pandas.core.window.Rolling` supports exponential (or Poisson) window type (:issue:`21303`) | |||
- :meth:`ExtensionArray.argsort` places the nan at the end of the sorted array (:issue:`21801`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go under API breaking changes.
pandas/core/arrays/base.py
Outdated
@@ -370,13 +370,15 @@ def _values_for_argsort(self) -> np.ndarray: | |||
ndarray | |||
The transformed values should maintain the ordering between values | |||
within the array. | |||
ndarray | |||
The mask which indicates the NaN values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NaN -> NA.
@TomAugspurger do you have anything else to implement? |
Looks like Jeff had some.
… On May 23, 2019, at 22:05, Mak Sze Chun ***@***.***> wrote:
@TomAugspurger do you have anything else to implement?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -252,6 +252,7 @@ Other API Changes | |||
- The ``arg`` argument in :meth:`pandas.core.groupby.DataFrameGroupBy.agg` has been renamed to ``func`` (:issue:`26089`) | |||
- The ``arg`` argument in :meth:`pandas.core.window._Window.aggregate` has been renamed to ``func`` (:issue:`26372`) | |||
- Most Pandas classes had a ``__bytes__`` method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as a part of dropping Python2 (:issue:`26447`) | |||
- :meth:`ExtensionArray.argsort` places the nan at the end of the sorted array (:issue:`21801`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double-backticks around nan
pandas/core/arrays/base.py
Outdated
def permutation(mask): | ||
# Return a permutation which maps the indices of the | ||
# subarray without nan to the indices of the original array. | ||
permu = np.arange(len(mask)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not have a function, just in-line this
pandas/core/arrays/base.py
Outdated
if mask.any(): | ||
notmask = ~mask | ||
notnull = np.argsort(values[notmask], kind=kind, **kwargs) | ||
permu = permutation(mask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might have some utilities in pandas/core/sorting.py which are helpful here (or as a possible future PR to pull common code out)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find the common code in pandas/core/sorting.py. Can we go ahead? I want to implement argmin for EA based on this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @jreback was thinking of nargsort
in pandas/core/sorting. Does that work? it may not be able to handles EAs yet.
This will need some tests. I would re-use the values_for_sorting
fixture and do a .take
to get some missing values.
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -252,6 +252,7 @@ Other API Changes | |||
- The ``arg`` argument in :meth:`pandas.core.groupby.DataFrameGroupBy.agg` has been renamed to ``func`` (:issue:`26089`) | |||
- The ``arg`` argument in :meth:`pandas.core.window._Window.aggregate` has been renamed to ``func`` (:issue:`26372`) | |||
- Most Pandas classes had a ``__bytes__`` method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as a part of dropping Python2 (:issue:`26447`) | |||
- :meth:`ExtensionArray.argsort` places the ``nan`` at the end of the sorted array (:issue:`21801`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nan
to missing values. Not all EAs use NaN as their missing value marker.
pandas/core/arrays/base.py
Outdated
@@ -370,13 +370,17 @@ def _values_for_argsort(self) -> np.ndarray: | |||
ndarray | |||
The transformed values should maintain the ordering between values | |||
within the array. | |||
ndarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you have multiple return values of the same type, you should use names
values : ndarray
The transformed values ...
mask : ndarray
The mask indicating which values are NA...
pandas/core/arrays/base.py
Outdated
# permu maps the indices of the subarray | ||
# without nan to the indices of the original array. | ||
permu = np.arange(len(mask)) | ||
permu = permu[~mask] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
~mask
-> notmask
.
pandas/core/arrays/base.py
Outdated
|
||
notnull = permu[notnull] | ||
allnan = np.arange(len(self))[mask] | ||
result = np.append(notnull, allnan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Append copies, which I don't think we want. Use np.concatenate
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think core.sorting.nargsort should be updated to dispatch to ExtensionArrays (the categorical path can be made general for any extension type).
And are we OK with doing a hard break for Categorical here? It has had this "NA first" behaviour already for a long time I think, so we might need to deprecate this first?
Is there a reason that you removed the tests? (this should certainly be tested)
pandas/core/arrays/base.py
Outdated
@@ -361,7 +361,7 @@ def isna(self) -> ArrayLike: | |||
""" | |||
raise AbstractMethodError(self) | |||
|
|||
def _values_for_argsort(self) -> np.ndarray: | |||
def _values_for_argsort(self) -> Tuple[np.ndarray, np.ndarray]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change for ExtensionArray authors. I am not sure how many already rely on this, but should we deprecate it first? (eg handle both cases (only values and values+mask) where it is used, and raise warning if only values are returned)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or actually, is it needed to change this? As we can get the mask just from self.isna()
?
pandas/core/arrays/base.py
Outdated
|
||
# permu maps the indices of the subarray | ||
# without nan to the indices of the original array. | ||
permu = np.arange(len(mask)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is the name "permu" coming from?
Maybe have a look in pd.core.sorting.nargsort to use more or less the same names in the code.
|
can you add some tests to assert the results of the sort |
pandas/core/arrays/base.py
Outdated
@@ -361,23 +362,6 @@ def isna(self) -> ArrayLike: | |||
""" | |||
raise AbstractMethodError(self) | |||
|
|||
def _values_for_argsort(self) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we leave this and just enhance the signature, passing na_position
? then the default implementation will just work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_values_for_argsort
is added backed to ExtensionArray interface. Actually, _values_for_argsort
which calls np.array
does the same thing as np.asanyarray
in nargsort
. But the code in ExtensionArray.argsort
is organized in more general way.
DatetimeLikeArray
and PeriodArray
's own _values_for_argsort
methods are deleted. Otherwise, NA values in PeriodArray
would get a negative number from self._data
which would be placed in the beginning in nargsort
.
The test_nargsort_datetimearray_warning was already added in #25629. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mentioned some perf issues, is Int64 ok on perf? what about PandasArrays? (do we have asv's for this?)
pandas/core/arrays/datetimelike.py
Outdated
@@ -620,9 +620,6 @@ def _values_for_factorize(self): | |||
def _from_factorized(cls, values, original): | |||
return cls(values, dtype=original.dtype) | |||
|
|||
def _values_for_argsort(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to cast to object dtype for datetimetz. I think this can be reverted.
In [38]: a = pd.date_range('2000', periods=100, tz='US/Eastern')._data
In [39]: %timeit a.argsort()
325 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [40]: b = pd.date_range('2000', periods=100)._data
In [41]: %timeit b.argsort()
44.8 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This should be reve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@makbigc I believe you reverted this and pref is back to normal, can you confirm (also do we hav an asv for this example that @TomAugspurger gave; if not can you add).
Jeff. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good, some questions.
@@ -501,23 +501,6 @@ def value_counts(self, dropna=True): | |||
|
|||
return Series(array, index=index) | |||
|
|||
def _values_for_argsort(self) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does we have a benchmark for sorting Int64? so does this change things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't found the benchmark for argsort
. But this change slows down the performance.
HEAD:
In [2]: arr = pd.array([3, None, 1], dtype='Int64')
In [3]: %timeit arr.argsort()
28.3 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
master:
In [2]: arr = pd.array([3, None, 1], dtype='Int64')
In [3]: %timeit arr.argsort()
5.73 µs ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add this benchmark (make the arrays longer though)
@@ -193,6 +193,11 @@ def test_searchsorted(self, data_for_sorting): | |||
if not data_for_sorting.ordered: | |||
raise pytest.skip(reason="searchsorted requires ordered data.") | |||
|
|||
def test_argsort_nan_last(self, data_missing_for_sorting): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this being skipped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NA-first
behaviour of Categorical
will be changed to NA-last
in the next PR. So the original _values_for_argsort
is placed here temporarily.
The corresponding test for NA-first
behaviour of Categorical
will also be changed.
It is better to pass EA directly to For example, The |
@@ -403,13 +404,13 @@ def argsort(self, ascending=True, kind='quicksort', *args, **kwargs): | |||
""" | |||
# Implementor note: You have two places to override the behavior of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you augment the doc-string for argsort / _values_for_argsort to say that NaNs are placed at the end.
@@ -501,23 +501,6 @@ def value_counts(self, dropna=True): | |||
|
|||
return Series(array, index=index) | |||
|
|||
def _values_for_argsort(self) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add this benchmark (make the arrays longer though)
@@ -188,6 +189,14 @@ def test_nargsort_datetimearray_warning(self): | |||
with tm.assert_produces_warning(None): | |||
nargsort(data) | |||
|
|||
@pytest.mark.parametrize('data', [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually instead of here, can you add the test in pandas/tests/extension/base/missing.py (and it will be generic with the data_missing fixure)
@makbigc can you merge master |
git diff upstream/master -u -- "*.py" | flake8 --diff
_values_for_argsort
returns different values for missing value in different EA. For example, missing value inPandasArray
getsnp.nan
and inPeriodArray
gets-9223372036854775808
(in my machine).np.argsort
placesnp.nan
at the end and the negative number in the beginning.In this PR,
ExtensionArray.argsort
separates the array into two parts, data and missing values. Sort the data part and finally append missing part to it. The approach can work in general, no matter what the replacement of missing value invalues_for_argsort
is.