
ENH: A cheap copy on a dataframe #19102

Closed
has2k1 opened this issue Jan 6, 2018 · 6 comments · Fixed by #56614
Labels
Copy / view semantics Enhancement Needs Discussion Requires discussion from core team before further action

Comments

@has2k1
Contributor

has2k1 commented Jan 6, 2018

With the deprecation, there is no cheap way to tell pandas that the user knows they are modifying a copied dataframe. Libraries that use pandas cannot afford to always copy() after sub-setting. My comment in the PR thread gives a specific example.

If resurrecting is_copy is not an option, then a better solution could be a method that can be applied inline.

df = pd.DataFrame({'A':[1,2,3]})
df2 = df[df.A>2].as_copy()               # New method, as_copy
df2['B'] = 2
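A minimal sketch of what such a method could do, copying only when the frame's data is a view. This assumes the internal `_mgr.is_view` attribute as the view check, which is not public API and may change between pandas versions; `as_copy` itself is the hypothetical helper proposed above, not an existing pandas method:

```python
import pandas as pd

def as_copy(df):
    # Hypothetical helper: return a real copy only when the frame's
    # data is a view on another object; otherwise return df unchanged.
    # Relies on the internal _mgr.is_view attribute (not public API).
    return df.copy() if df._mgr.is_view else df

df = pd.DataFrame({'A': [1, 2, 3]})
df2 = as_copy(df[df.A > 2])   # boolean masks already materialize new data
df2['B'] = 2                  # df itself stays untouched
```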
@jreback
Contributor

jreback commented Jan 6, 2018

I am not sure how you think the above would work. If you are filtering and then adding columns, you need to copy. You can avoid the memory penalty by reassigning to the original variable, which lets the original be garbage collected. I actually doubt this is an issue in practice unless you are very near memory limits in any event.

In [2]: df = pd.DataFrame({'A':[1,2,3]})
   ...: df = df[df.A>2].copy()
   ...: df['B'] = 2

We won't have copy-on-write in the current version of pandas, so this is an intractable problem. Defensive copying is the solution, and to be honest it's not a big deal to do this. Virtually every pandas operation copies anyhow.
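To illustrate the point that most operations copy anyway, a quick check (a sketch, assuming NumPy is available alongside pandas) shows the boolean filter already materializes new data, so the explicit `.copy()` mainly signals intent rather than doubling the allocation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
filtered = df[df.A > 2]

# Boolean indexing builds a new array, so filtered does not share
# memory with df even before calling .copy().
shared = np.shares_memory(filtered['A'].to_numpy(), df['A'].to_numpy())
print(shared)
```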

@jreback jreback closed this as completed Jan 6, 2018
@jreback jreback added this to the won't fix milestone Jan 6, 2018
@TomAugspurger
Contributor

@has2k1 can you explain what this line is supposed to do?

df2 = df[df.A>2].as_copy()               # New method, as_copy

Is that supposed to copy if and only if the underlying data is a view, and not otherwise?

@has2k1
Contributor Author

has2k1 commented Jan 6, 2018

Is that supposed to copy if and only if the underlying data is a view, and not otherwise?

Yes

@has2k1
Contributor Author

has2k1 commented Jan 6, 2018

I actually doubt this is an issue in practice unless you are very nearly at memory limits in any event.

Virtually every pandas operation copies anyhow.

@jreback, when there are hundreds of operations that involve copying, the penalty of an extra copy becomes significant. It is even worthwhile to tease out the operations where the inplace argument actually avoids a copy, e.g. reset_index.

As the issue does not seem to be a wider concern, I'll re-evaluate the usage patterns and maybe switch over to _is_copy until copy-on-write saves the day.

@jreback
Contributor

jreback commented Jan 6, 2018

@has2k1 I find it hard to believe that a plotting library is actually affected by copy operations. Good luck.

@jorisvandenbossche
Member

I don't think we should be so dismissive of the needs of other projects. A plotting library can certainly care about memory usage: plotnine can work with a lot of data because it does not necessarily plot it all, but calculates statistics (e.g. hist, boxplot) for different factors. I can imagine that in those operations you regularly need to subset a dataframe.

We have seen reports of IMO valid use cases in both plotnine and sklearn. IMO we need to think about some way to give such power users a means to deal with this.
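The thread's eventual resolution (the linked #56614) came via Copy-on-Write, which can be enabled explicitly in pandas 2.x. Under CoW, a filtered frame can be modified without a defensive copy and without SettingWithCopyWarning; a sketch, with a version guard since the option only exists in the 2.x series:

```python
import pandas as pd

# Copy-on-Write is opt-in in the pandas 2.x series; the version
# guard keeps this sketch runnable on other versions too.
if pd.__version__.startswith("2."):
    pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({'A': [1, 2, 3]})
df2 = df[df.A > 2]
df2['B'] = 2   # with CoW: copies lazily on modification; df is untouched
```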
