DataFrame.copy(deep=True) is not a deep copy of the index #19862
Comments
Indexes are immutable. Changing an index's underlying data is going to cause all sorts of problems. |
ok. I think the documentation of copy is unclear then: |
Hmm, I would expect a copy of a dataframe to be truly deep when deep=True |
Which bit is unclear? The indices are copied, they are different objects:
In [3]: df1.index is df2.index
Out[3]: False
But the underlying data are shared between indexes since they're immutable. I am noticing that http://pandas-docs.github.io/pandas-docs-travis/dsintro.html doesn't have a section for
|
kind of the same issue as: #19505 meaning docs need a bit more |
Is there any reason the underlying data in the index is not copied? It seems that the df.values is actually copied just the indices are not? |
Performance. Since indices are immutable, the underlying data can safely be shared. There's no reason to copy it. DataFrames / series are mutable, so the data need to be copied.
And just to be clear, the index is a copy, since they are different objects. It's the underlying values (which users should not be mutating) that are not copied. |
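To make the distinction above concrete, here is a minimal sketch that checks object identity and memory sharing separately. The frame, the column name `a`, and the variable names are illustrative assumptions, not taken from the original report. Whether the index buffers are shared may depend on the pandas version, so the sketch only reports that value.

```python
import numpy as np
import pandas as pd

# Illustrative frame; labels and values are assumptions for this sketch.
df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=pd.Index([10, 20, 30]))
df2 = df1.copy(deep=True)

# The copy carries its own Index object...
index_is_new_object = df1.index is not df2.index

# ...and its own column data.
column_data_shared = np.shares_memory(df1["a"].to_numpy(), df2["a"].to_numpy())

# Whether the index buffers are shared is the point under discussion;
# the result may differ between pandas versions, so it is only reported.
index_data_shared = np.shares_memory(df1.index.values, df2.index.values)

print("index is a new object:", index_is_new_object)
print("column data shared:", column_data_shared)
print("index data shared:", index_data_shared)
```

If the last line prints `True`, the two Index objects are distinct wrappers around the same buffer, which is the behaviour described in this comment.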
ok. I'll close this. |
Apparently there is a
But, I am not sure we actually want to document this? |
yes, this was really only and never implemented (or meant to be), should be removed. |
IMO,
So, IMO, Re:
This is not a valid argument IMO - it's up to me as a user (consenting adults and all...) what I do with my objects, including the indexes, and if I make a deep copy, it's a justified expectation (I would even argue: a built-in expectation of the word "deep") that this will not mess with the original. Plus, if I'm already deep-copying the much larger |
There are other problems as well that are not related to copying that make directly changing underlying values a bad idea. For example, the internal hashtable that is used for indexing will no longer be valid if you change the underlying values of an index (so indexing will give wrong results).
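The stale-hashtable problem can be illustrated with a toy model. `ToyIndex` below is a hypothetical stand-in written for this sketch. It is not pandas' actual index engine, and it only shows why a cached label-to-position table goes wrong once the array behind it is mutated.

```python
import numpy as np


class ToyIndex:
    """Hypothetical index that lazily caches a label -> position table."""

    def __init__(self, values):
        self.values = np.asarray(values)
        self._table = None  # built on first lookup, never invalidated

    def get_loc(self, label):
        if self._table is None:
            self._table = {v: i for i, v in enumerate(self.values.tolist())}
        return self._table[label]


idx = ToyIndex([10, 20, 30])
first = idx.get_loc(20)        # builds the cache and returns position 1

idx.values[1] = 99             # mutate the data behind the cache's back

stale = idx.get_loc(20)        # still answered from the old table
missing = 99 in idx._table     # the new label is unknown to the table

print(first, stale, missing, idx.values[1])
```

The lookup for `20` still succeeds although that label no longer exists in the data, and `99` cannot be found although it does.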
For DataFrame that might be true (depending on its size), but not for Series. To be clear, I am personally not necessarily against changing this (IMO this would make the behaviour more straightforward, at cost of some performance. So a trade-off, of which I am not fully sure on which side I am), only answering some of your arguments. One additional thing. You mention the comparison to the stdlib deep copy behaviour, but note that even the |
Isn't that moving the goal posts? It is within the power of pandas to influence how its own indexes are handled, whereas arbitrary python objects can obviously be quite complicated. But even then, the meaning of
|
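For reference, this is what "deep" means in the standard library's `copy` module, sketched on plain nested containers. The dictionary and its keys are illustrative assumptions, and nothing here says how pandas objects behave.

```python
import copy

# Illustrative nested structure; the keys are assumptions for this sketch.
original = {"index": [10, 20, 30], "data": [[1, 2], [3, 4]]}

shallow = copy.copy(original)      # new dict, same inner lists
deep = copy.deepcopy(original)     # new dict, recursively copied contents

shallow["data"][0][0] = -1         # visible through `original`
deep["index"][0] = 999             # not visible through `original`

print(original["data"][0][0])      # changed via the shallow copy
print(original["index"][0])        # untouched by the deep copy
```

With `copy.deepcopy`, no mutation of the copy's contents reaches the original, which is the expectation being argued for in this thread.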
The example looks to work on master. Could use a test
|
I ran into this problem while coding today. Glad to see it was already reported. Just to chime in and agree with a couple of points:
|
Thanks for the commentary @DanielGoldfarb, happy to have contributions to improve things |
@jreback |
After For large dataframes, this is not the case. Just to be sure, is this the same issue? Are the indexes different? If it is the case, this is really disturbing and I'm afraid this may have led people to wrong results (I only found it by chance). |
I also ran into this today, discovered that even if the id of the index was different on the copy, modifying the I am totally in line with @DanielGoldfarb's point 1:
A fix for this could be composed of the following elements:
Would this be OK for everyone? |
Encountered a similar issue: if a dataframe contains nested dataframes, Code to reproduce the issue:
import pandas as pd
df1 = pd.DataFrame({"foo": [pd.DataFrame({"bar": [1]})]})
print(df1)
df2 = df1.copy(deep=True)
df_inner = df2.loc[0, "foo"]
df_inner *= 2
print(df2)
print(df1)
Echoing @DanielGoldfarb 's comment, that the argument |
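One possible workaround for the nested case above is to copy each nested frame explicitly when building the copy. This is a sketch under assumptions, not an official pandas recipe: it assumes every cell in the `foo` column holds a DataFrame, and the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"foo": [pd.DataFrame({"bar": [1]})]})

# Assumption: every cell in "foo" is a DataFrame, so each one can be
# copied on its own instead of relying on df1.copy(deep=True).
df2 = pd.DataFrame(
    {"foo": [inner.copy(deep=True) for inner in df1["foo"]]},
    index=df1.index,
)

inner2 = df2.loc[0, "foo"]
inner2 *= 2  # in-place change on the copied nested frame only

original_value = df1.loc[0, "foo"].loc[0, "bar"]
copied_value = df2.loc[0, "foo"].loc[0, "bar"]
print(original_value, copied_value)
```

Because the nested frames in `df2` are separate objects, the in-place multiplication no longer reaches `df1`.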
Code Sample, a copy-pastable example if possible
Problem description
DataFrame.copy(deep=True) is not a deep copy of the index.
In pandas/pandas/core/indexes/base.py (line 787 in a00154d), maybe deep should be set to True?
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.1
scipy: 0.19.1
pyarrow: 0.8.0
xarray: 0.9.6
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.5.0