Description
As `pandas` approaches its 1.0 release, I would like to raise a concern about one aspect of the `pandas` architecture that I think is a threat to its widespread adoption: how `pandas` works with copies and views when setting values (what I will refer to here as the `SettingWithCopyWarning` issue).
The summary of my concern is the following:
1. SettingWithCopyWarning is a threat to data integrity
2. It is unreasonable to expect the average user to avoid a `SettingWithCopyWarning` issue, as doing
so requires keeping track of the plethora of factors that determine what generates a copy and what
generates a view.
2a. Views made sense in `numpy`, but not in `pandas`
2b. Chain-indexing is a much more subtle problem than suggested in the `pandas` docs.
3. Given (1) and (2), data integrity in `pandas` relies on users noticing a non-exception warning in the
flow of their output.
4. Even aside from the threat to data integrity, this behavior is unpythonic, and likely to frustrate and
alienate lots of potential users of `pandas`.
5. I think solutions can be found that would have only limited effects on performance for the majority of
users.
Taking each of these in turn:
(1) `SettingWithCopyWarning` is a threat to data integrity
The fact that assignment operations do different things depending on whether the target is a view or a copy has already been recognized as a threat to the predictability of `pandas`. Indeed, the warning was added because users were consistently asking why `pandas` was doing unanticipated things when `SettingWithCopyWarning` came into play.
(2) It is unreasonable to expect the average user to avoid a `SettingWithCopyWarning` issue, as doing so requires keeping track of the plethora of factors that determine what generates a copy and what generates a view.
Figuring out when a function will return a copy and when it will return a view in `pandas` is not simple. Indeed, the `pandas` documentation doesn't even try to explain when each will occur (link http://pandas-docs.github.io/pandas-docs-travis/indexing.html?highlight=views#indexing-view-versus-copy):
"The reason for having the SettingWithCopy warning is this. Sometimes when you slice an array you will simply get a view back, which means you can set it no problem. However, even a single dtyped array can generate a copy if it is sliced in a particular way. A multi-dtyped DataFrame (meaning it has say float and object data), will almost always yield a copy. Whether a view is created is dependent on the memory layout of the array."
(2a) Views made sense in `numpy`, but not in `pandas`
Views entered the `pandas` lexicon via `numpy`. But the reason they were so useful in `numpy` is that they were predictable because `numpy` arrays are always single-typed. In `pandas`, no such consistent, predictable behavior exists.
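To make the `numpy` side of this concrete, here is a minimal sketch (the array and variable names are mine, purely for illustration). With a single-typed array the rule is easy to state: basic slicing gives a view, while indexing with a list of positions gives a copy.

```python
import numpy as np

arr = np.arange(6)            # one dtype, one block of memory

sliced = arr[1:4]             # basic slicing: a view onto arr
picked = arr[[1, 2, 3]]       # integer-list ("fancy") indexing: a copy

sliced[0] = 100               # writes through to arr
picked[1] = -1                # only changes the copy

print(arr)                              # arr[1] has changed, arr[2] has not
print(np.shares_memory(arr, sliced))    # the view shares arr's memory
print(np.shares_memory(arr, picked))    # the copy does not
```

The point is only that the outcome here can be read off the indexing expression itself, without knowing anything else about the data.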
(2b) Chain-indexing is a much more subtle problem than suggested in the `pandas` docs
At first glance, the `pandas` docs suggest that the `SettingWithCopyWarning` is easily avoided by avoiding chain-indexing or using `.loc`. This, I fear, is misleading for two reasons. First, the canonical example of chain indexing in the docs (`dfmi['one']['second'] = value`) seems to suggest that one can avoid chain indexing by just not falling into the trap of this kind of double slicing. The problem, however, is that these slices need not appear near one another. I know I've had trouble with code of the form:
df2 = dfmi['one']
# Lots of intermediate code that doesn't change dfmi or df2
df2['second'] = 5
Second, using `.loc` only solves this problem if one notices the chained indexing and attempts to fix it in one place. Just consistently using `.loc[]` (for example, in both the first and second problematic slicings above) would not solve the problem.
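A minimal, self-contained sketch of this separated pattern follows. The frame, column names, and the boolean filter are made up for illustration and are not the `dfmi` example from the docs. Whether a warning is actually emitted depends on the `pandas` version and its settings, so the sketch just records whatever is raised rather than assuming a particular result. A boolean mask returns new data, so `df` itself should be left untouched by the later assignment.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4.0, 5.0, 6.0]})

# First indexing step.
subset = df[df["one"] > 1]

# ... lots of intermediate code that doesn't change df or subset ...

# Second indexing step, far from the first one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    subset["two"] = 0.0

print([w.category.__name__ for w in caught])  # warnings raised, if any
print(df["two"].tolist())                     # values in the original frame

# Taking an explicit copy up front makes the intent unambiguous.
safe = df[df["one"] > 1].copy()
safe["two"] = 0.0
```

Nothing about the second assignment, read on its own, tells you that `subset` came from an earlier indexing operation, which is why this is so easy to miss.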
(3) Given (1) and (2), data integrity in `pandas` relies on users noticing a non-exception warning in the flow of their output.
This seems really problematic. If a user is printing values as they go along (which CS developers may not do, but interactive casual users often do to monitor the progress of their code), these warnings are easy to miss. And that seems very dangerous.
(4) Even aside from the threat to data integrity, this behavior is unpythonic, and likely to frustrate and alienate lots of potential users of `pandas`
I suspect I come to `pandas` from a different perspective than many developers. I am an economist and political scientist who has gotten deeper and deeper into computer science over the past several years for instrumental purposes. As a result, I think I have a pretty good sense of how applied users approach something like `pandas`, and I can just see this peculiarity of `pandas` driving this class of users batty. I've taken a year of computer science course work and am one of the most technically trained social scientists I know, and it drives me batty.
It's also unpythonic -- the behavior of basic operators (like `=`) should not depend on the type of columns in a `DataFrame`. Python 3 changed the behavior of the `/` operator because it was felt the behavior of `/` should not depend on whether you were working with `float`s or `int`s. Since whether functions return a view or copy is in large part (but not exclusively) a function of whether a `DataFrame` is single or multi-typed (which occurs when some columns are `float`s and some are `int`s), we have the same problem -- the behavior of a basic operator (`=`) depends on data types.
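For anyone who hasn't run into the `/` change, the Python 3 behavior I'm referring to looks like this (a trivial sketch; the variable names are just for illustration):

```python
# In Python 3, `/` is true division whatever the operand types are.
int_result = 7 / 2          # two ints
float_result = 7.0 / 2.0    # two floats

# Floor division is still available, but you have to ask for it explicitly.
floor_result = 7 // 2

print(int_result, float_result, floor_result)
```

The same operator gives the same kind of answer for `int`s and `float`s, which is the property I'd like `=` to have in `pandas`.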
In other words, if one of the aims of `pandas` is to essentially supplant `R` among applied data scientists, then I think this is a major threat to achieving that goal.
(5) I think solutions can be found that would have only limited effects on performance for the majority of users
`pandas` uses views because they're so damn fast, so I understand the reluctance to drop them, but I think there are ways to minimize the performance hit. Obviously more talented developers with a better understanding of the innards of `pandas` may have better suggestions, but hopefully this can get the ball rolling.
* Solution 1: Move `views` to the background.
When a user tries to look at an object and it's possible to return a view, do so. But just never let a
user assign values to a view -- any time an attempt is made to set on a view, convert it to a copy
before executing the assignment. Views will still operate in the background providing high speed
data access in read-only environments, but users don't have to worry about what they're dealing
with. Users who *really* need access to views can work with `numpy` arrays.
(I would also note that given the unpredictability of when one will get a view or copy, it's not clear to
me how anyone can write code that takes advantage of the behavior of views, which makes me
doubt there are many people for whom this would seriously impact performance or written code, but
I'd be happy to hear if anyone has workarounds!)
* Solution 2: Create an indexer that always returns copies (like `.take()`, but for axis labels).
This would at least give users who want to avoid views altogether a way to do so without littering
their code with `.copy()`s.
* Solution 3: Change the `SettingWithCopyWarning` to an exception by default.
This is currently a setting, but the global default is for it to be a warning. Personally, I still don't like
this solution since, as a result of (2), this means `pandas` will now raise exceptions unpredictably, but
at least data integrity will be preserved.
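To illustrate the idea behind Solution 1, here is a toy sketch on top of a plain `numpy` array. The `CopyOnWriteSlice` class is entirely hypothetical and says nothing about how `pandas` is, or would be, implemented. Reads are served from a view, and the first attempt to set a value converts the view into a private copy.

```python
import numpy as np


class CopyOnWriteSlice:
    """Hypothetical wrapper: read through a view, copy on the first write."""

    def __init__(self, parent, key):
        self._data = parent[key]   # a view when `key` is a basic slice
        self._owns_data = False

    def __getitem__(self, key):
        # Scalar reads are cheap and need no copy. (A slice key would hand
        # back a raw array; a fuller version would wrap that result too.)
        return self._data[key]

    def __setitem__(self, key, value):
        if not self._owns_data:
            # First write: stop sharing memory with the parent.
            self._data = self._data.copy()
            self._owns_data = True
        self._data[key] = value


parent = np.arange(5)
child = CopyOnWriteSlice(parent, slice(1, 4))

first_read = child[0]                                   # served from the view
shared_before = np.shares_memory(parent, child._data)   # still sharing memory
child[0] = 99                                           # triggers the copy
shared_after = np.shares_memory(parent, child._data)    # no longer sharing

print(parent, child[0], shared_before, shared_after)
```

With something like this, the user never has to know whether they are holding a view or a copy: assignment always affects only the object they assigned to.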
`pandas` is a brilliant tool, and a huge improvement on everything else out there. I am eager to see it become the standard not only among Python users, but among data analysts more broadly. Hopefully, by addressing this issue, we can help make this happen.
With that in mind, I would like to suggest the need for two things:
- A discussion about the desirability of the various solutions proposed above
- Volunteers to help implement this change. Unfortunately, I don't have the programming sophistication or knowledge of `pandas` internals to take this on alone, and this is likely too big an undertaking for any one individual, so a team is likely to be necessary.