-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
merge on index vs merge on column - different NaN handling #13371
Comments
#13170 will cast join keys if possible. Yeah I would agree that if the index has the same name as a join column and it is not joined on (e.g. So why don't you do a pull-requests to implement your suggestions; see how much is needed to change to make consistent. |
I started looking at #14076 to see if I can do something intelligent, and then found the discussion here. @jreback has a comment from Jun 6 that is confusing to me. Let's assume you are doing a left-join,
In the first case, this is like the 4th merge in the original example, and I think the behavior is correct. In the second case, this is like the 2nd merge above, but I think the name of the resulting column should come from the left DataFrame index name, because the left's DataFrame index is what is preserved as a column in the result. In the 2nd merge example above, it comes from the right, which is confusing. In the third case, since there is no name for the left index, we could raise an error, or maybe the resulting name of the column should be called "left_index" (or something like that), to make it clear that it was that particular index that is used to create the values for the column. When the merges involve a MultiIndex, then I think these are the cases:
Looking forward to your opinion. |
Hi @Dr-Irv! As far as potential solutions go, here are my thoughts for each of the three cases you identify: 1) I agree this works as expected, although I'd like to get rid of the suspicious type casting behaviour (hopefully #13170 fixes this already, but I think there might be something fishy going on in the background that tells me "left"-merging on the left index is handled incorrectly). 2) I think your solution is the right one. For the sake of argument, we could keep both sets of columns and name them e.g. colName_x, colName_y (as in the non-"left"-merge). However, I think this is potentially misleading as it hides the fact that these were columns used for merging (it also gets messy if, as you say only some names match and some do not). So I agree with you. 3) We could raise an error, or we could retain the left index as it is and drop the "right_on" columns. I prefer the latter since it works similarly to joining two dataframes on index and, since we are "left"-merging, we are saying that the joining index/column we care about is the left one. The right one might be useful as a way to get us other values/fields we care about (contained in the right dataframe), but we should not necessarily expect the "right_on" column(s) to be appropriately named and to have reasonable values beyond the ones already included in "left_index". I'd like the solution to be minimal, avoiding many "case/switch"-type statements. Let me know what you think. PS: |
@ialong Regarding your idea of what to do when the left index has no name at all (or in the case of a Namely, suppose you are doing a left merge where you have Consider this behavior with a
This yields:
Note that the index columns
This latter DataFrame has Now, I imagine that if we were to change this undocumented behavior, we'd have a compatibility issue, but maybe that should be handled via a new parameter to Curious to get feedback from @jreback . |
@Dr-Irv I agree! Although maybe it's enough to have an argument like "reset_index=False". It does not seem like erroneous behaviour since in your example the column names actually agree. The problem (and what we need to fix) is that, if the names differ, those that survive are the names of the right dataframe's columns, without the appropriate NaNs in place. The more I think about this, the more it seems like merges should always be done on columns, rather than indices, and then we "set_index" as specified by either the inputs or the arguments to merge. |
@ialong You wrote: "if the names differ, those that survive are the names of the right dataframe's columns,". I'm not sure I agree. If you are doing a left join, then I think the names of the left dataframe index (or columns) should survive. I've had situations where I want to merge using the index on one DataFrame and columns from another, so I don't think we should remove that functionality as you suggest in the last paragraph. I think we just need agreement on what the behavior should be in this case. And we have to work through the 4 cases of left, right, inner, and outer joins, when one frame is joining on index, and the other on columns, and determine what the outcome should be in each of those cases. |
@Dr-Irv That's the behaviour of my second example actually (maybe things have changed since then?): index in "left" is called "c" but right_on='d' as you can see the result is a column "d" with no NaNs, a meaningless index and no "c" column anywhere. Of course if we were merging on two columns (rather than index and column), both survive (first example).
Sorry I realise this was not very clear, what I meant was that the merge routine should (internally) be called on columns. |
@ialong wrote:
I am suggesting that the current behavior should be changed, so that in your second example, the column name would be 'c' that survived.
I am not at all familiar with the current implementation, so I have no opinion on that! |
There no activity for more than two years. Let's close this now and re-open if required. |
can someone help me with python homework? |
In the following examples there are three different merge behaviours when it comes to handling NaNs.
They are all based on pd.merge(..., how="left", ...)
The difference depends on:
1) Whether we are merging using an index or a column.
2) Whether the column keys we are merging on are the same value or not (i.e. if left_on = right_on).
Arguably, if we specify "left" as the merging criterion, the desired behaviour is to have NaNs in the columns coming from the right dataframe where there is no match between the left and right dataframes' key columns (see first merge in example below, 'd' and 'e' columns).
The problem is, if we are merging on left's index, the NaNs get filled with the index values from the left dataframe even if the names of the two columns don't match ('c' and 'd' in the example). We are thus led to believe there was a perfect match between the index of the left dataframe and the "key" column of the right dataframe ('d' here).
Gotchas:
-There is something puzzling going on with the new indices of the resulting dataframe (when merging on index).
-Type casting occurs when merging on index, perhaps suggesting NaNs are explicitly filled in a second step.
Proposed behaviour:
Maybe it is simply a matter of removing this NaN filling step.
Better yet, the "key" column in the merged dataframe should perhaps bear the name of left's index not of the "right_on" key (provided we used left's index to merge). I.e. in the second merge of the example, the 'd' column should be called 'c'.
This is really the source of the confusion when the two names are different. When they are the same the "no NaN" behaviour is arguably legitimate.
Also it might be worthwhile to cast the final column back to the original dtype if there are no NaNs.
Maybe this is not really an issue though, more something to be aware of. I would be interested in hearing any motivation behind this behaviour.
Looking forward to reading your thoughts!
Code Sample:
Output:
output of
pd.show_versions()
:INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 21.0.0
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: 0.7.2
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None
The text was updated successfully, but these errors were encountered: