
Views, Copies, and the SettingWithCopyWarning Issue #10954

Closed
@nickeubank


As pandas approaches its 1.0 release, I would like to raise a concern about one aspect of the pandas architecture that I think is a threat to its widespread adoption: how pandas works with copies and views when setting values (what I will refer to here as the SettingWithCopyWarning issue).

The summary of my concern is the following:

1. SettingWithCopyWarning is a threat to data integrity
2. It is unreasonable to expect the average user to avoid a `SettingWithCopyWarning` issue, as doing
    so requires keeping track of the plethora of factors that determine what generates a copy and what
    generates a view.
    2a. Views made sense in `numpy`, but not in `pandas`
    2b. Chain-indexing is a much more subtle problem than suggested in the `pandas` docs. 
3. Given (1) and (2), data integrity in `pandas` relies on users noticing a non-exception warning in the
    flow of their output.
4. Even aside from the threat to data integrity, this behavior is unpythonic, and likely to frustrate and
    alienate lots of potential users of `pandas`.
5. I think solutions can be found that would have only limited effects on performance for the majority of  
    users

Taking each of these in turn:

(1) SettingWithCopyWarning is a threat to data integrity

The fact that assignment operations do different things depending on whether the target is a view or a copy has already been recognized as a threat to the predictability of pandas. Indeed, the reason a warning was added is that users were consistently asking why pandas was doing unanticipated things when SettingWithCopyWarning came into play.

(2) It is unreasonable to expect the average user to avoid a SettingWithCopyWarning issue, as doing so requires keeping track of the plethora of factors that determine what generates a copy and what generates a view.

Figuring out when a function will return a copy and when it will return a view in pandas is not simple. Indeed, the pandas documentation doesn't even try to explain when each will occur (see http://pandas-docs.github.io/pandas-docs-travis/indexing.html?highlight=views#indexing-view-versus-copy):

"The reason for having the SettingWithCopy warning is this. Sometimes when you slice an array you will simply get a view back, which means you can set it no problem. However, even a single dtyped array can generate a copy if it is sliced in a particular way. A multi-dtyped DataFrame (meaning it has say float and object data), will almost always yield a copy. Whether a view is created is dependent on the memory layout of the array."

(2a) Views made sense in numpy, but not in pandas
Views entered the pandas lexicon via numpy. But the reason they were so useful in numpy is that they were predictable because numpy arrays are always single-typed. In pandas, no such consistent, predictable behavior exists.
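
To make the numpy half of this concrete, here is a minimal sketch of the rule numpy users can rely on: basic slicing returns a view, while integer-array or boolean indexing returns a copy. The array and variable names are illustrative only.

```python
import numpy as np

arr = np.arange(10)

basic = arr[2:6]           # basic slice: a view onto arr's memory
fancy = arr[[2, 3, 4, 5]]  # integer-array indexing: an independent copy
masked = arr[arr > 5]      # boolean-mask indexing: also a copy

basic[0] = 100  # writes through to arr
fancy[0] = -1   # changes only the copy

print(np.shares_memory(arr, basic))  # the view shares memory with arr
print(np.shares_memory(arr, fancy))  # the copy does not
print(arr[2])                        # reflects the write made through `basic`
```

Because the outcome depends only on how the array is indexed, a numpy user can tell from the indexing expression alone whether an assignment will propagate to the parent.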

(2b) Chain-indexing is a much more subtle problem than suggested in the pandas docs
At first glance, the pandas docs suggest that the SettingWithCopyWarning is easily avoided by avoiding chain-indexing or using .loc. This, I fear, is misleading for two reasons. First, the canonical example of chain indexing in the docs (dfmi['one']['second'] = value) seems to suggest that one can avoid chain indexing by just not falling into the trap of this kind of double slicing. The problem, however, is that these slices need not appear near one another. I know I've had trouble with code of the form:

```python
df2 = dfmi['one']

# Lots of intermediate code that doesn't change dfmi or df2

df2['second'] = 5
```

Moreover, using .loc only solves this problem if one notices the chained indexing and attempts to fix it in one place. Just consistently using .loc[] (for example, in both the first and second problematic slicings above) would not solve the problem.
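
For anyone who wants to try this, here is a self-contained sketch of the two-step pattern, with and without .loc. The `dfmi` frame, its column labels, and the `delayed_set` helper are made up for illustration. Whether the parent changes, and which warnings (if any) are emitted, depends on the pandas version, so the sketch reports what it observes instead of asserting an outcome.

```python
import warnings

import pandas as pd


def delayed_set(use_loc):
    """Run the two-step pattern and report (parent_changed, warning_names)."""
    # Illustrative frame with two-level column labels, standing in for dfmi.
    columns = pd.MultiIndex.from_product([["one", "two"], ["first", "second"]])
    dfmi = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=columns)
    before = dfmi.copy()

    # First indexing step.
    df2 = dfmi.loc[:, "one"] if use_loc else dfmi["one"]

    # ... intermediate code that touches neither dfmi nor df2 ...

    # Second indexing step, recording any warning instead of printing it.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        if use_loc:
            df2.loc[:, "second"] = 5
        else:
            df2["second"] = 5

    parent_changed = not dfmi.equals(before)
    return parent_changed, [w.category.__name__ for w in caught]


for use_loc in (False, True):
    changed, names = delayed_set(use_loc)
    print(f"use_loc={use_loc}: parent changed={changed}, warnings={names}")
```

The point of the sketch is that nothing in the second assignment, read on its own, says whether `dfmi` is affected; one has to run it to find out.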

(3) Given (1) and (2), data integrity in pandas relies on users noticing a non-exception warning in the flow of their output.

This seems really problematic. If a user is printing values as they go along (which CS developers may not do, but casual interactive users often do to monitor the progress of their code), these warnings are easy to miss. And that seems very dangerous.
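
As an illustration of the difference between a warning that scrolls past and one that cannot be missed, here is a small sketch using only the standard library's `warnings` filters. `DemoCopyWarning` and `risky_assignment` are hypothetical stand-ins, not pandas names; with pandas, the filter would target whichever warning category the installed version actually emits.

```python
import warnings


class DemoCopyWarning(UserWarning):
    """Hypothetical stand-in for a library warning such as SettingWithCopyWarning."""


def risky_assignment():
    # Stand-in for an assignment that only warns and then carries on.
    warnings.warn("value may have been set on a copy", DemoCopyWarning)
    return "finished"


def run(escalate):
    with warnings.catch_warnings():
        # "error" turns matching warnings into exceptions; "ignore" hides them.
        warnings.simplefilter("error" if escalate else "ignore", DemoCopyWarning)
        try:
            return risky_assignment()
        except DemoCopyWarning:
            return "stopped"


print(run(escalate=False))  # the code runs to completion despite the warning
print(run(escalate=True))   # the same code halts at the warning
```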

(4) Even aside from the threat to data integrity, this behavior is unpythonic, and likely to frustrate and alienate lots of potential users of pandas

I suspect I come to pandas from a different perspective than many developers. I am an economist and political scientist who has gotten deeper and deeper into computer science over the past several years for instrumental purposes. As a result, I think I have a pretty good sense of how applied users approach something like pandas, and I can just see this peculiarity of pandas driving this class of users batty. I've taken a year of computer science course work and am one of the most technically trained social scientists I know, and it drives me batty.

It's also unpythonic -- the behavior of basic operators (like =) should not depend on the type of columns in a DataFrame. Python 3 changed the behavior of the / operator because it was felt the behavior of / should not depend on whether you were working with floats or ints. Since whether functions return a view or copy is in large part (but not exclusively) a function of whether a DataFrame is single or multi-typed (which occurs when some columns are floats and some are ints), we have the same problem -- the behavior of a basic operator (=) depends on data types.
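
A short sketch of both halves of that comparison: Python 3's `/` behaves the same for ints and floats, and a DataFrame can be checked for whether it is single- or multi-typed. The frames and the `is_single_dtype` helper are illustrative, not part of pandas.

```python
import pandas as pd

# Python 3: `/` is true division whether the operands are ints or floats.
print(7 / 2, 7.0 / 2)  # both are 3.5
print(7 // 2)          # floor division is spelled separately

# Illustrative frames: one single-dtype, one mixing ints and floats.
single = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
mixed = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})


def is_single_dtype(df):
    """Return True when every column shares one dtype."""
    return df.dtypes.nunique() == 1


print(is_single_dtype(single))
print(is_single_dtype(mixed))
```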

In other words, if one of the aims of pandas is to essentially supplant R among applied data scientists, then I think this is a major threat to achieving that goal.

(5) I think solutions can be found that would have only limited effects on performance for the majority of users
pandas uses views because they're so damn fast, so I understand the reluctance to drop them, but I think there are ways to minimize the performance hit. Obviously more talented developers with a better understanding of the innards of pandas may have better suggestions, but hopefully this can get the ball rolling.

* Solution 1: Move `views` to the background.
   When a user tries to look at an object and it's possible to return a view, do so. But just never let a 
   user assign values to a view -- any time an attempt is made to set on a view, convert it to a copy
   before executing the assignment. Views will still operate in the background providing high speed 
   data access in read-only environments, but users don't have to worry about what they're dealing 
   with. Users who *really* need access to views can work with `numpy` arrays. 

  (I would also note that given the unpredictability of when one will get a view or copy, it's not clear to 
  me how anyone can write code that takes advantage of the behavior of views, which makes me 
  doubt there are many people for whom this would seriously impact performance or written code, but 
  I'd be happy to hear if anyone has workarounds!)

* Solution 2: Create an indexer that always returns copies (like .take(), but for axis labels).
   This would at least give users who want to avoid views altogether a way to do so without littering
   their code with `.copy()`s.

* Solution 3: Change the `SettingWithCopyWarning` to an exception by default.
  This is currently a setting, but the global default is for it to be a warning. Personally, I still don't like
  this solution since, as a result of (2), this means `pandas` will now raise exceptions unpredictably, but
  at least data integrity will be preserved.
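
To make Solution 1 a bit more concrete, here is a rough sketch of the "view for reading, copy on first write" idea. It uses a hypothetical `LazyCopy` wrapper around a numpy array; nothing here is existing pandas behavior or API, and a real design inside pandas would be far more involved.

```python
import numpy as np


class LazyCopy:
    """Hypothetical wrapper: reads use the view, the first write makes a copy."""

    def __init__(self, view):
        self._data = view
        self._is_private = False  # True once we hold our own copy

    def __getitem__(self, key):
        # Reads never copy. (A real design would also need to guard the
        # arrays returned here, which this sketch does not attempt.)
        return self._data[key]

    def __setitem__(self, key, value):
        if not self._is_private:
            # First write: detach from the parent before modifying anything.
            self._data = self._data.copy()
            self._is_private = True
        self._data[key] = value


parent = np.arange(5)
child = LazyCopy(parent[1:4])  # cheap: still a view at this point

child[0] = 99                  # triggers the copy, then writes to it

print(parent)    # parent is untouched
print(child[0])  # the write landed in the private copy
```

Under this scheme the cost of a copy is paid only by code that actually writes, and an assignment never silently reaches back into the parent.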

pandas is a brilliant tool, and a huge improvement on everything else out there. I am eager to see it become the standard not only among python users, but among data analysts more broadly. Hopefully, by addressing this issue, we can help make this happen.

With that in mind, I would like to suggest the need for two things:

  1. A discussion about the desirability of the various solutions proposed above
  2. Volunteers to help implement this change. Unfortunately, I don't have the programming sophistication or knowledge of pandas internals to take this on alone, and this is likely too big an undertaking for any one individual, so a team is likely to be necessary.
