Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: merging an Integer EA rasises #23262

Merged
merged 15 commits into from
Dec 19, 2018
Merged

Conversation

makbigc
Copy link
Contributor

@makbigc makbigc commented Oct 21, 2018

@pep8speaks
Copy link

Hello @makbigc! Thanks for submitting the PR.

@@ -1109,6 +1109,7 @@ Reshaping
- Bug in :func:`merge_asof` when merging on float values within defined tolerance (:issue:`22981`)
- Bug in :func:`pandas.concat` when concatenating a multicolumn DataFrame with tz-aware data against a DataFrame with a different number of columns (:issue`22796`)
- Bug in :func:`merge_asof` where confusing error message raised when attempting to merge with missing values (:issue:`23189`)
- Bug in :func:`pandas.merge` when merging on an Integer extension array (:issue:`23020`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just move this issue number to the IntegerArray section as that is first announcing IA

@@ -1842,6 +1842,16 @@ def test_merging_with_bool_or_int_cateorical_column(self, category_column,
CDT(categories, ordered=ordered))
assert_frame_equal(expected, result)

def test_merge_on_int_array(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to pandas/tests/extension/base/reshape (and use the fixtures)

@@ -1533,6 +1534,10 @@ def _factorize_keys(lk, rk, sort=True):

lk = ensure_int64(lk.codes)
rk = ensure_int64(rk)
elif is_extension_array_dtype(lk) and is_extension_array_dtype(rk):
klass = libhashtable.Factorizer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is actually more complicated here

elif is_extension_array_dtype(lk) and is_extension_array_dtype(rk) and lk.dtype == rk.dtype:
     klass = ...
     lk, _ = lk._values_for_factorize()
     rk, _ = rk._values_for_factorize()

and actually you can do this before any of the other if statements. That way if the _values_for_factorize returns an int64 array then it will just work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@makbigc did you try @jreback's suggesting of factorizing the EAs before getting into the main if block?

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode ExtensionArray Extending pandas with custom dtypes or arrays. labels Oct 26, 2018
@jreback
Copy link
Contributor

jreback commented Nov 3, 2018

can you merge master and update

@makbigc
Copy link
Contributor Author

makbigc commented Nov 4, 2018

@jreback before the push, I already rebased my code. But all checks have failed. Did my change break the checks? Or rebasing hasn't get the code to the latest, has it? Please help.

@@ -1332,6 +1332,12 @@ Sparse
- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`)
- Bug in :meth:`SparseArary.unique` not returning the unique values (:issue:`19595`)

IntegerArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just add this onto the big section on IntegerArray which already exists

@@ -170,3 +170,11 @@ def test_merge(self, data, na_value):
[data[0], data[0], data[1], data[2], na_value],
dtype=data.dtype)})
self.assert_frame_equal(res, exp[['ext', 'int1', 'key', 'int2']])

def test_merge_on_int_array(self, df_merge_on_int_array):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be generic, see the test_merge example. it should work for any array type.

@@ -104,3 +106,9 @@ def data_for_grouping():
def box_in_series(request):
"""Whether to box the data in a Series"""
return request.param


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed

@TomAugspurger
Copy link
Contributor

@makbigc can you clarify

But all checks have failed?

It'd be easier to help if you push your changes so that we can see what you have.

@TomAugspurger
Copy link
Contributor

Could you also address @jreback's comments?

@makbigc
Copy link
Contributor Author

makbigc commented Nov 8, 2018

Please give me some more time. I am working on it.

@makbigc makbigc closed this Nov 8, 2018
@makbigc makbigc reopened this Nov 8, 2018
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 8, 2018 via email

@makbigc makbigc force-pushed the bug-fix-23020 branch 2 times, most recently from 0585c30 to efa30c7 Compare November 17, 2018 14:36
doc/source/whatsnew/v0.24.0.rst Outdated Show resolved Hide resolved
pandas/core/reshape/merge.py Outdated Show resolved Hide resolved
pandas/tests/extension/base/reshaping.py Outdated Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

@makbigc can you merge master.

@codecov
Copy link

codecov bot commented Dec 5, 2018

Codecov Report

Merging #23262 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23262      +/-   ##
==========================================
- Coverage    92.2%    92.2%   -0.01%     
==========================================
  Files         162      162              
  Lines       51714    51722       +8     
==========================================
+ Hits        47682    47689       +7     
- Misses       4032     4033       +1
Flag Coverage Δ
#multiple 90.6% <100%> (ø) ⬆️
#single 43.01% <0%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 94.35% <100%> (+0.06%) ⬆️
pandas/io/json/json.py 92.61% <0%> (-0.48%) ⬇️
pandas/util/testing.py 87.51% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f2c716...b8ed3e9. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 5, 2018

Codecov Report

Merging #23262 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23262      +/-   ##
==========================================
+ Coverage   92.28%   92.28%   +<.01%     
==========================================
  Files         162      162              
  Lines       51831    51837       +6     
==========================================
+ Hits        47830    47837       +7     
+ Misses       4001     4000       -1
Flag Coverage Δ
#multiple 90.68% <100%> (ø) ⬆️
#single 43% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/reshape/merge.py 94.3% <100%> (+0.01%) ⬆️
pandas/core/indexes/period.py 93.09% <0%> (+0.03%) ⬆️
pandas/core/reshape/tile.py 94.82% <0%> (+0.06%) ⬆️
pandas/util/testing.py 87.58% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 216986d...6bae6ca. Read the comment docs.

@makbigc
Copy link
Contributor Author

makbigc commented Dec 6, 2018

@jreback rebased. Please tell me if anything needs to be done.

@TomAugspurger
Copy link
Contributor

There's a linting error: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=4750

2018-12-05T15:26:49.5779150Z Check import format using isort
2018-12-05T15:26:55.4954327Z ERROR: /home/vsts/work/1/s/pandas/core/reshape/merge.py Imports are incorrectly sorted.
2018-12-05T15:26:55.5184554Z Check import format using isort DONE

doc/source/whatsnew/v0.24.0.rst Outdated Show resolved Hide resolved
pandas/core/reshape/merge.py Outdated Show resolved Hide resolved
pandas/tests/extension/base/reshaping.py Outdated Show resolved Hide resolved
@makbigc
Copy link
Contributor Author

makbigc commented Dec 16, 2018

The test is rewritten. So the EA is the merging key, not like the previous test in which integer array is the merging key for every cases.
The old test is placed as a special case in the tests/reshape/merge/test_merge.py.
Would you give me some hints to pass azure test? It seems the new test doesn't work with python 2.7.15.
Thanks.

pandas/core/arrays/sparse.py Outdated Show resolved Hide resolved
@@ -1533,6 +1534,10 @@ def _factorize_keys(lk, rk, sort=True):

lk = ensure_int64(lk.codes)
rk = ensure_int64(rk)
elif is_extension_array_dtype(lk) and is_extension_array_dtype(rk):
klass = libhashtable.Factorizer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@makbigc did you try @jreback's suggesting of factorizing the EAs before getting into the main if block?

pandas/tests/extension/base/reshaping.py Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor

I'm not too sure about the azure failure, but notice that the data generated includes a duplicate in the first three values

data = [I, a, a, ...

That data is randomly generated. It looks like your expected may not handle cases where the key is not unique.

I believe the only requirement on data are that it's

  • length 100
  • the first two values are distinct
  • the first two values are not NA.

@TomAugspurger
Copy link
Contributor

@makbigc I'm going to push a few changes here if that's OK.

* generic release note
* removed _values_for_factorize changes
* made tests depend on just the first two
* do the factorization first
@TomAugspurger
Copy link
Contributor

I started to write a test that would include NA values in the merge key, but I think the expected behavior of that is up in the air right now: #22491.

# GH 23020
df = pd.DataFrame({'A': pd.Series([1, 2, np.nan], dtype='Int64'),
'B': 1})
result = pd.merge(df, df, on='A')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think can remove this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this does test merging on an NA key, which the generic test doesn't. So maybe leave it?

@TomAugspurger
Copy link
Contributor

Updated the coercion of lk / rk to do

  1. all the pre-processing on the values (datetimetz, categorical, extension)
  2. Hashtable / dtype validate

in two distinct blocks.

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Dec 18, 2018
@makbigc
Copy link
Contributor Author

makbigc commented Dec 19, 2018

@TomAugspurger Thanks for your reply and help.

@jreback jreback merged commit 14c33b0 into pandas-dev:master Dec 19, 2018
@jreback
Copy link
Contributor

jreback commented Dec 19, 2018

thanks @makbigc and @TomAugspurger

maybe we should create some asv's for EA's generally

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 20, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: merging on an Integer EA raises
4 participants