-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: merging an Integer EA rasises #23262
Conversation
makbigc
commented
Oct 21, 2018
- closes BUG: merging on an Integer EA raises #23020
- 1 test added
- whatsnew entry
Hello @makbigc! Thanks for submitting the PR.
|
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -1109,6 +1109,7 @@ Reshaping | |||
- Bug in :func:`merge_asof` when merging on float values within defined tolerance (:issue:`22981`) | |||
- Bug in :func:`pandas.concat` when concatenating a multicolumn DataFrame with tz-aware data against a DataFrame with a different number of columns (:issue`22796`) | |||
- Bug in :func:`merge_asof` where confusing error message raised when attempting to merge with missing values (:issue:`23189`) | |||
- Bug in :func:`pandas.merge` when merging on an Integer extension array (:issue:`23020`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just move this issue number to the IntegerArray section as that is first announcing IA
@@ -1842,6 +1842,16 @@ def test_merging_with_bool_or_int_cateorical_column(self, category_column, | |||
CDT(categories, ordered=ordered)) | |||
assert_frame_equal(expected, result) | |||
|
|||
def test_merge_on_int_array(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to pandas/tests/extension/base/reshape (and use the fixtures)
pandas/core/reshape/merge.py
Outdated
@@ -1533,6 +1534,10 @@ def _factorize_keys(lk, rk, sort=True): | |||
|
|||
lk = ensure_int64(lk.codes) | |||
rk = ensure_int64(rk) | |||
elif is_extension_array_dtype(lk) and is_extension_array_dtype(rk): | |||
klass = libhashtable.Factorizer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is actually more complicated here
elif is_extension_array_dtype(lk) and is_extension_array_dtype(rk) and lk.dtype == rk.dtype:
klass = ...
lk, _ = lk._values_for_factorize()
rk, _ = rk._values_for_factorize()
and actually you can do this before any of the other if statements. That way if the _values_for_factorize
returns an int64 array then it will just work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you merge master and update |
406dd00
to
54dd18f
Compare
@jreback before the push, I already rebased my code. But all checks have failed. Did my change break the checks? Or rebasing hasn't get the code to the latest, has it? Please help. |
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -1332,6 +1332,12 @@ Sparse | |||
- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`) | |||
- Bug in :meth:`SparseArary.unique` not returning the unique values (:issue:`19595`) | |||
|
|||
IntegerArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just add this onto the big section on IntegerArray which already exists
@@ -170,3 +170,11 @@ def test_merge(self, data, na_value): | |||
[data[0], data[0], data[1], data[2], na_value], | |||
dtype=data.dtype)}) | |||
self.assert_frame_equal(res, exp[['ext', 'int1', 'key', 'int2']]) | |||
|
|||
def test_merge_on_int_array(self, df_merge_on_int_array): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to be generic, see the test_merge example. it should work for any array type.
pandas/tests/extension/conftest.py
Outdated
@@ -104,3 +106,9 @@ def data_for_grouping(): | |||
def box_in_series(request): | |||
"""Whether to box the data in a Series""" | |||
return request.param | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not needed
@makbigc can you clarify
It'd be easier to help if you push your changes so that we can see what you have. |
Could you also address @jreback's comments? |
Please give me some more time. I am working on it. |
Thanks. No rush for now, if you think you'll have some time by next week
that'd be great.
…On Thu, Nov 8, 2018 at 8:04 AM Mak Sze Chun ***@***.***> wrote:
Reopened #23262 <#23262>.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#23262 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABQHItIaWxb7_FAxylHQu4eD9wslHz0Yks5utDnYgaJpZM4XyfK9>
.
|
0585c30
to
efa30c7
Compare
b3006e2
to
d3efb23
Compare
@makbigc can you merge master. |
d3efb23
to
b8ed3e9
Compare
Codecov Report
@@ Coverage Diff @@
## master #23262 +/- ##
==========================================
- Coverage 92.2% 92.2% -0.01%
==========================================
Files 162 162
Lines 51714 51722 +8
==========================================
+ Hits 47682 47689 +7
- Misses 4032 4033 +1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #23262 +/- ##
==========================================
+ Coverage 92.28% 92.28% +<.01%
==========================================
Files 162 162
Lines 51831 51837 +6
==========================================
+ Hits 47830 47837 +7
+ Misses 4001 4000 -1
Continue to review full report at Codecov.
|
@jreback rebased. Please tell me if anything needs to be done. |
There's a linting error: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=4750
|
6ae8f20
to
a69f7b7
Compare
The test is rewritten. So the EA is the merging key, not like the previous test in which integer array is the merging key for every cases. |
pandas/core/reshape/merge.py
Outdated
@@ -1533,6 +1534,10 @@ def _factorize_keys(lk, rk, sort=True): | |||
|
|||
lk = ensure_int64(lk.codes) | |||
rk = ensure_int64(rk) | |||
elif is_extension_array_dtype(lk) and is_extension_array_dtype(rk): | |||
klass = libhashtable.Factorizer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too sure about the azure failure, but notice that the data generated includes a duplicate in the first three values data = [I, a, a, ... That data is randomly generated. It looks like your I believe the only requirement on
|
@makbigc I'm going to push a few changes here if that's OK. |
I started to write a test that would include NA values in the merge key, but I think the expected behavior of that is up in the air right now: #22491. |
# GH 23020 | ||
df = pd.DataFrame({'A': pd.Series([1, 2, np.nan], dtype='Int64'), | ||
'B': 1}) | ||
result = pd.merge(df, df, on='A') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think can remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this does test merging on an NA key, which the generic test doesn't. So maybe leave it?
Updated the coercion of lk / rk to do
in two distinct blocks. |
@TomAugspurger Thanks for your reply and help. |
thanks @makbigc and @TomAugspurger maybe we should create some asv's for EA's generally |