
DataFrame.copy(deep=True) is not a deep copy of the index #19862

skaae opened this issue Feb 23, 2018 · 20 comments

skaae commented Feb 23, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])
df1.index.name = "foo"
print(df1)

# create deep copy of df1 and change a value in the index
df2 = df1.copy(deep=True)
df2.index.name = "bar"
df2.index.values[0] = 'c'  # changes both df1 and df2

print(df1)
print(df2)

Problem description

DataFrame.copy(deep=True) is not a deep copy of the index.

In

def copy(self, name=None, deep=False, dtype=None, **kwargs):

maybe deep should be set to True?

Expected Output

     foo  muu
foo          
a    NaN  NaN
b    NaN  NaN
     foo  muu
foo          
c    NaN  NaN
b    NaN  NaN
     foo  muu
bar          
c    NaN  NaN
b    NaN  NaN

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.1
scipy: 0.19.1
pyarrow: 0.8.0
xarray: 0.9.6
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.5.0

@TomAugspurger (Contributor)

df2.index.values[0] = 'c' # changes both df1 and df2

Indexes are immutable. Changing their underlying data is going to cause all sorts of problems.

skaae (Author) commented Feb 23, 2018

ok. I think the documentation of copy is unclear then: Make a deep copy, including a copy of the data and the indices.

@bordingj

Hmmm, I would expect a copy of a dataframe to be truly deep when deep=True.

@TomAugspurger (Contributor)

Which bit is unclear? The indices are copied, they are different objects:

In [3]: df1.index is df2.index
Out[3]: False

But the underlying data are shared between indexes since they're immutable.

I am noticing that http://pandas-docs.github.io/pandas-docs-travis/dsintro.html doesn't have a section for Index. It'd be good to add a short one stating that

  1. They're containers for labels, used in indexing and alignment
  2. They're immutable
  3. There are many specializations of the Index for various dtypes.
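Points 1 and 2 can be seen in a couple of lines. This is a sketch, not an official docs example; note that whether the deep copy shares the index *buffer* depends on the pandas version (the sharing is what this issue is about):

```python
import pandas as pd

idx = pd.Index(['a', 'b'])

# Point 2: item assignment on an Index raises -- the object is immutable.
try:
    idx[0] = 'c'
except TypeError as exc:
    print('Index is immutable:', exc)

# A deep-copied frame gets a distinct Index *object*; only the underlying
# numpy buffer may still be shared on affected pandas versions.
df1 = pd.DataFrame(index=idx, columns=['foo', 'muu'])
df2 = df1.copy(deep=True)
print(df1.index is df2.index)  # False: distinct Index objects
```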

jreback (Contributor) commented Feb 23, 2018

kind of the same issue as: #19505

meaning docs need a bit more

skaae (Author) commented Feb 23, 2018

But the underlying data are shared between indexes since they're immutable.

Is there any reason the underlying data in the index is not copied? It seems that df.values is actually copied, just the indices are not?

TomAugspurger (Contributor) commented Feb 23, 2018

Is there any reason the underlying data in the index is not copied?

Performance. Since indices are immutable, the underlying data can safely be shared. There's no reason to copy it. DataFrames / series are mutable, so the data need to be copied.

It seems that df.values is actually copied, just the indices are not?

And just to be clear, the index is a copy, since they are different objects. It's the underlying values (which users should not be mutating) that are not copied.

skaae (Author) commented Feb 28, 2018

ok. I'll close this.

@jorisvandenbossche (Member)

Apparently there is a deep='all' that deals with exactly this (whether or not to also copy the underlying index data). To illustrate with the original example:

In [21]: df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])

In [22]: df2 = df1.copy(deep=True)

In [23]: df2.index.values[0] = 'c'

In [24]: df1
Out[24]: 
     foo  muu
c    NaN  NaN    <--- updated
b    NaN  NaN

In [25]: df3 = df1.copy(deep='all')

In [26]: df3.index.values[1] = 'd'

In [27]: df1
Out[27]: 
     foo  muu
c    NaN  NaN
b    NaN  NaN    <--- not updated

But, deep='all' is completely undocumented, and as far as I can find from a quick search, also only used once in our own code base (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/reduction.pyx#L537)

I am not sure we actually want to document this?
But then we should maybe just remove that ability?

jreback (Contributor) commented Mar 23, 2018

yes, this was never really implemented (or meant to be); it should be removed.

@jreback jreback added the Compat (pandas objects compatability with Numpy or Python functions), Difficulty Intermediate, and Clean labels Mar 23, 2018
@jreback jreback added this to the 0.23.0 milestone Mar 23, 2018
@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018
@h-vetinari (Contributor)

IMO, copy(deep=True) should completely sever all connections between the original and the copied object - compare the official python docs (https://docs.python.org/3/library/copy.html):

A deep copy constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original.

So, IMO, deep=True should come to mean what deep='all' does currently (and the latter can then be removed).

Re:

Indexes are immutable. Changing their underlying data is going to cause all sorts of problems.

This is not a valid argument IMO - it's up to me as a user (consenting adults and all...) what I do with my objects, including the indexes, and if I make a deep copy, it's a justified expectation (I would even argue: a built-in expectation of the word "deep") that this will not mess with the original.

Plus, if I'm already deep-copying the much larger values of a DF, not copying the index only saves a comparatively irrelevant amount of memory.
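Until the semantics change, a user-side workaround is possible. The helper below is hypothetical (not a pandas API) and assumes a flat, non-Multi index; it forces a freshly allocated index buffer on top of copy(deep=True):

```python
import numpy as np
import pandas as pd

def deep_copy_with_index(df):
    # Hypothetical helper: deep-copy the values, then rebuild the index
    # from a freshly allocated array so no buffer is shared.
    out = df.copy(deep=True)
    out.index = pd.Index(np.array(df.index.values, copy=True),
                         name=df.index.name)
    return out

df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])
df2 = deep_copy_with_index(df1)
df2.index.values[0] = 'c'
print(df1.index.tolist())  # ['a', 'b'] -- the original is untouched
```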

@jorisvandenbossche (Member)

Indexes are immutable. Changing their underlying data is going to cause all sorts of problems.

This is not a valid argument IMO - it's up to me as a user (consenting adults and all...) what I do with my objects, including the indexes

There are other problems as well, not related to copying, that make directly changing the underlying values a bad idea. For example, the internal hashtable used for indexing will no longer be valid if you change the underlying values of an index (so indexing will give wrong results).
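Rather than writing through .values (which leaves that cached hashtable stale), label changes can go through APIs that build a fresh Index. A small sketch of the safe alternatives:

```python
import pandas as pd

df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])

# Build a new Index instead of mutating the existing buffer.
df2 = df1.rename(index={'a': 'c'})       # per-label mapping
df3 = df1.set_axis(['c', 'b'], axis=0)   # wholesale replacement

print(df2.index.tolist())  # ['c', 'b']
print(df1.index.tolist())  # ['a', 'b'] -- original unchanged
```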

not copying the index only saves a comparatively irrelevant amount of memory.

For DataFrame that might be true (depending on its size), but not for Series.


To be clear, I am personally not necessarily against changing this (IMO it would make the behaviour more straightforward, at the cost of some performance, so it is a trade-off, and I am not fully sure which side of it I am on); I am only answering some of your arguments.

One additional thing. You mention the comparison to the stdlib deep copy behaviour, but note that even the deep='all' is not comparable to that (it does copy the index, but it still does not copy python objects inside the values recursively).

h-vetinari (Contributor) commented Aug 13, 2018

One additional thing. You mention the comparison to the stdlib deep copy behaviour, but note that even the deep='all' is not comparable to that (it does copy the index, but it still does not copy python objects inside the values recursively).

Isn't that moving the goal posts? It is within the power of pandas to influence how its own indexes are handled, whereas arbitrary python objects can obviously be quite complicated.

But even then, the meaning of deep in vanilla python follows the "complete separation" interpretation:

from copy import deepcopy
x = [0, 1]
x.append(x)
x
# [0, 1, [...]]
y = deepcopy(x)
y[2][0] = 10  # same for arbitrarily many times "[2]"
y
# [10, 1, [...]]
x
# [0, 1, [...]]

@mroeschke (Member)

The example looks to work on master. It could use a test:

In [38]: df1 = pd.DataFrame(index=['a', 'b'], columns=['foo', 'muu'])
    ...: df1.index.name = "foo"
    ...: print(df1)
    ...:
    ...: # create deep copy of df1 and change a value in the index
    ...: df2 = df1.copy(deep=True)
    ...: df2.index.name = "bar"
    ...: df2.index.values[0] = 'c'  # changes both df1 and df2
    ...:
    ...: print(df1)
    ...: print(df2)
     foo  muu
foo
a    NaN  NaN
b    NaN  NaN
     foo  muu
foo
c    NaN  NaN
b    NaN  NaN
     foo  muu
bar
c    NaN  NaN
b    NaN  NaN

In [39]: pd.__version__
Out[39]: '1.1.0.dev0+1216.gd4d58f960'

@mroeschke mroeschke removed the Clean and Compat (pandas objects compatability with Numpy or Python functions) labels Apr 10, 2020
@DanielGoldfarb

I ran into this problem while coding today. Glad to see it was already reported. Just to chime in and agree with a couple of points:

  1. I believe all of the few languages with which I am familiar (including python), adhere to the convention that "deep" means a complete severance of all connections between the original and the copied object. Imho it's therefore difficult to justify that deep=True should be anything other than deep='all'.

  2. I'm not sure I can agree with this statement, unqualified:

    Since indices are immutable, the underlying data can safely be shared

    It seems to me that the underlying data of an immutable object should also be immutable, or not shared, or, as one person commented above, considered private. Pandas, on the other hand, officially gives the user direct read/write access to the underlying mutable data, via DataFrame.Index.values and DataFrame.Index.array. Either deep=True should be the same as deep='all' (my personal preference for reasons stated above) or perhaps DataFrame.Index.values and DataFrame.Index.array should return a copy (thus treating the underlying data as private). Or perhaps both of these changes should be made, in order to be consistent with both ideas: that "deep" really should mean deep, and "immutable" really should mean immutable.

  3. That said, I understand that there may be performance reasons to find a middle ground. I would be OK using deep='all' if all of this were completely documented: that deep=True does not meet the common expectation of "deep," and that deep='all' is necessary to achieve that. I think it should also be documented with a clear warning that DataFrame.Index.values and DataFrame.Index.array provide dangerous mutable access to the underlying data of an immutable object (thereby allowing the user to corrupt that object).

    In fact this is exactly how I discovered this bug today. It never occurred to me that DataFrame.Index.values would allow me to directly manipulate the Index (because the documentation said Index is immutable). I wanted to change the index, but knowing that the index was immutable I wrote the code like this:

    df2 = df1.copy()
    values = df2.index.values
    values[3] = foo
    df2.index = pd.Index(values)

    I was surprised to find that the index of df1 had also changed. It took me a while to realize that the index (of both dataframes) changed when I did values[3] = foo (not, as I originally thought, when I reassigned df2.index=).
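A variant of that snippet that cannot reach back into df1 goes through Index.tolist(), which materializes a brand-new Python list (the data and the 'foo' label here are made up):

```python
import pandas as pd

df1 = pd.DataFrame({'x': range(5)}, index=['a', 'b', 'c', 'd', 'e'])

df2 = df1.copy()
labels = df2.index.tolist()   # a fresh list, independent of the buffer
labels[3] = 'foo'             # mutate the list, not the shared array
df2.index = pd.Index(labels)

print(df1.index.tolist())  # ['a', 'b', 'c', 'd', 'e'] -- unchanged
print(df2.index.tolist())  # ['a', 'b', 'c', 'foo', 'e']
```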

jreback (Contributor) commented Mar 4, 2021

thanks for the commentary @DanielGoldfarb, happy to have contributions to improve things

@DanielGoldfarb

@jreback
Jeff, thanks for the response. Have the maintainers made any decisions as to which direction to go with this? With a little guidance, I would be happy to contribute code and/or documentation.
--Daniel

goulvenguillard-bib commented Mar 5, 2021

After df_copy = df_orig.copy(deep=True), I expect list((df_orig==df_copy).all(axis=1)).count(True) to match len(df_orig).

For large dataframes, this is not the case. Just to be sure, is this the same issue? Are the indexes different?

If it is the case, this is really disturbing and I'm afraid this may have led people to wrong results (I only found it by chance).
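That particular symptom may be a different effect than index sharing: elementwise == treats NaN as unequal to itself, so any row containing a NaN compares unequal even between a frame and its faithful deep copy, while DataFrame.equals treats NaNs in matching positions as equal. A sketch with made-up data:

```python
import numpy as np
import pandas as pd

df_orig = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, 3.0]})
df_copy = df_orig.copy(deep=True)

# Elementwise comparison: NaN != NaN, so the NaN row drops out.
matches = list((df_orig == df_copy).all(axis=1)).count(True)
print(matches, len(df_orig))        # 1 2

# equals() treats NaNs in the same positions as equal.
print(df_orig.equals(df_copy))      # True
```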

smarie (Contributor) commented Apr 1, 2022

I also ran into this today, and discovered that even though the id of the index was different on the copy, modifying the cp.index.to_numpy() values was corrupting the original.

I am totally in line with @DanielGoldfarb's point 1:

I believe all of the few languages with which I am familiar (including python), adhere to the convention that "deep" means a complete severance of all connections between the original and the copied object. Imho it's therefore difficult to justify that deep=True should be anything other than deep='all'.

A fix for this could be composed of the following elements:

  • have deep=True behave (and be documented) as intuition suggests, that is, with absolutely no shared items between the copy and the original. The deep='all' alias can stay around, but if it is not yet official, maybe it would be better to drop it now.

  • accept a new deep='values' where only the values are deep-copied. This is therefore the same behaviour as today's deep=True. Make this the default to preserve legacy compatibility and speed.

  • optionally accept a new deep='index' where only the index is deep-copied. I would not really know why this would be needed, but it is there for symmetry of the API.

Would this be OK for everyone?

ChiQiao commented Aug 23, 2022

I encountered a similar issue: if a dataframe contains nested dataframes, copy(deep=True) is not actually "deep".

Code to reproduce the issue:

import pandas as pd

df1 = pd.DataFrame({"foo": [pd.DataFrame({"bar": [1]})]})
print(df1)
df2 = df1.copy(deep=True)
df_inner = df2.loc[0, "foo"]
df_inner *= 2
print(df2)
print(df1)

  1. I believe all of the few languages with which I am familiar (including python), adhere to the convention that "deep" means a complete severance of all connections between the original and the copied object. Imho it's therefore difficult to justify that deep=True should be anything other than deep='all'.

Echoing @DanielGoldfarb's comment: the argument deep=True in the copy method is misleading.
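Since copy(deep=True) does not recurse into Python objects stored in cells (and, as far as I can tell, copy.deepcopy on a frame delegates to it), one workaround is to deep-copy the object-dtype cells explicitly. The helper below is hypothetical, not a pandas API:

```python
import copy
import pandas as pd

def recursive_copy(df):
    # Hypothetical helper: deep-copy the frame, then deep-copy every
    # cell of the object-dtype columns so nested objects are severed too.
    out = df.copy(deep=True)
    for col in out.columns[out.dtypes == object]:
        out[col] = [copy.deepcopy(v) for v in out[col]]
    return out

df1 = pd.DataFrame({'foo': [pd.DataFrame({'bar': [1]})]})
df2 = recursive_copy(df1)
df2.loc[0, 'foo'].loc[0, 'bar'] = 99    # mutate the nested frame in the copy
print(df1.loc[0, 'foo'].loc[0, 'bar'])  # 1 -- the original nested frame is intact
```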

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022