-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Provide an errors parameter to fillna #15653
Conversation
Try it out yourself: pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='raise')
pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='force')
pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='ignore')
|
@jreback Two API questions:
|
This fails seemingly every
|
Yeah, I think |
pandas/core/generic.py
Outdated
inplace = validate_bool_kwarg(inplace, 'inplace') | ||
|
||
from pandas import Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you won't do any of this here, this will all be done in pandas\core\missing.py
pandas/core/internals.py
Outdated
if errors == 'force' or errors == 'raise': | ||
# we cannot coerce the underlying object, so | ||
# make an ObjectBlock | ||
return self.to_object_block(mgr=mgr).fillna(original_value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, this should all be done in missing
pandas/types/missing.py
Outdated
@@ -391,3 +394,36 @@ def na_value_for_dtype(dtype): | |||
elif is_bool_dtype(dtype): | |||
return False | |||
return np.nan | |||
|
|||
|
|||
def valid_fill_value(value, dtype, is_period=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> validate_fill_value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just returns a bool
, the action verb validate
implies to me raising an error or processing the data somehow. Like in e.g. maybe_upcast
. I can rename this, but I suggest reserving validate_fill_value
as the name for the fillna
sub-impl I'm moving to core/missing.py
per your review.
pandas/types/missing.py
Outdated
---------- | ||
value : scalar | ||
dtype: string / dtype | ||
is_period : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take this parameter out, you can skip period validation for now if its not working
pandas/core/missing.py
Outdated
@@ -621,3 +622,24 @@ def fill_zeros(result, x, y, name, fill): | |||
result = result.reshape(shape) | |||
|
|||
return result | |||
|
|||
|
|||
def maybe_fill(obj, value, errors): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is better as validate_fill_value
, just raising if its an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would work if we had coerce
and raise
modes only, but with the addition of ignore
we have to flag the non-conformal column so that the fill routine can skip it. Hence why I modified this method to return a bool
.
How else would you implement ignore
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cleanest way I can think of is writing a separate maybe_ignore
method which factors that bit of logic out. But, you will have to run a is_valid_fill_value
operation twice. Maybe that's OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a non-idempotent (probably not the right word for this) function that either raises or returns a flag depending on the mode.
For the fillna
use-case alone one way to avoid this combination would be to add an if-else
to the method body, and split this logic into two separate methods in missing
(so e.g. if mode is errors do validate_fill_value
, otherwise do continue_to_fill
).
This would be fine for fillna
alone. But, I'm trying to keep an eye out also for extending this raise/ignore/coerce stuff to other methods as well in some later PRs (#15533), and obviously you can't go about inserting this same if-else
block everywhere all willy-nilly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having a function that takes an argument, then returns (with possibly modified), and potentially raises is just fine as well. You can always raise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validation functions on the other hand will generally just return None
OR raise. returning a boolean is a bit odd actually. You want simple as possible for things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I agree with you about the function signature, will do.
The trouble is that we need some mechanism for telling the underlying function (fillna
) to short-circuit. So we would need e.g.
try:
missing.validate_fill_value(self, value)
except ValueError:
if errors == 'ignore':
return
I don't know of any way of implementing that in a separate function without passing a flag somewhere. If we want to wrap this whole sequence of actions in one function call, that would have to be part of the method signature.
But, looking at this again, I think it'd actually be best to just keep this logic in fillna
for now, and figure out to generalize it more broadly (so we don't have to rewrite this same bit in shift
et. al.) in a different PR later.
pandas/core/missing.py
Outdated
|
||
def maybe_fill(obj, value, errors): | ||
""" | ||
fillna error coercion routine, returns whether or not to continue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a doc-string
pandas/core/missing.py
Outdated
fillna error coercion routine, returns whether or not to continue. | ||
""" | ||
from pandas import Series | ||
if isinstance(obj, Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use ABCSeries
(import at top)
pandas/core/missing.py
Outdated
from pandas import Series | ||
if isinstance(obj, Series): | ||
if errors is None or errors == 'coerce': | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errors must be coerce, raise, ignore
, not is not acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, I provided coerce
as the default kwarg and passed that on through the fillna
method stack. However whilst Sparse and Categorical series accept an errors
param they raise immediately if it is not None
(because it's not implemented/doesn't make sense for them). So the default (coerce
) was causing those routines to raise—and a lot of test failures.
I remade the default param to None
and introduced this modified check to address this concern.
How should this be handled instead?
pandas/types/missing.py
Outdated
@@ -391,3 +394,30 @@ def na_value_for_dtype(dtype): | |||
elif is_bool_dtype(dtype): | |||
return False | |||
return np.nan | |||
|
|||
|
|||
def valid_fill_value(value, dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> is_valid_fill_value
ebec92c
to
54c8406
Compare
pandas/core/missing.py
Outdated
_ensure_float64) | ||
|
||
is_float_dtype, is_datetime64_dtype, | ||
is_datetime64tz_dtype, is_integer_dtype, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave the formatting as is
pandas/core/missing.py
Outdated
|
||
Fillna error coercion routine. | ||
|
||
Construct a piecewise polynomial in the Bernstein basis, compatible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dangers of copy-paste.
pandas/core/missing.py
Outdated
parsed as a sequence of sub-Series later on. | ||
value : object | ||
The value to be used as a fill for the object. | ||
errors : {'raise', 'coerce', or 'ignore'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not a parameter
I'm investigating the five tests in the suite that the current code still doesn't pass, and I've hit something I can't explain originating in At a high level the bug is the fact that with the following bit:
While this PR gives:
This difference seems to be emitting from line 268 in
In
In this PR:
But how is this possible? This PR only adds a bit of if-else logic to the And the current master seems to pass all tests, including this one...huh. I wouldn't call this an edge case, the old behavior definitely needs to be preserved. Maybe I'm missing something, going to have to circle back to this later. |
@ResidentMario this is related to this bug: #15613 I think with the default |
pandas/core/internals.py
Outdated
inplace=inplace, | ||
downcast=False) | ||
if errors == 'coerce' or errors == 'raise': | ||
# we cannot coerce the underlying object, so |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise
shoulnd't coerce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should have already caught and raised in this case, no need to have this need. Seems I made a bit of a mess while figuring out a working code path.
downcast=False, | ||
errors='ignore') | ||
else: # errors == 'ignore' | ||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is tricky for datetimes, they auto-infer.
Thanks for the clarification, this is seems to be the same as your comment. Going to need to fiddle with it to find a workaround. A more minimal example: pd.Series([pd.Timestamp('2015-01-01'), pd.NaT]).fillna("?") Fills with |
Should probably write the tests first, this got picked up because a test happened to depend on this behavior, maybe there are other cases also like this. |
Looking at the tests, I see three dissatisfactory cases:
The second one needs to be solved, the first and last are limitations of |
Looks like it was simpler than I had thought. The problem is that to keep
To the |
87854e3
to
5010d5c
Compare
Does this perhaps also fix the problem reported in #16112? |
No, trying an error mode on a sparse matrix will raise an error for now. |
Suppose we have: import pandas as pd
from io import StringIO
res = pd.read_csv(StringIO(""",dates1,dates2,floatcol,intcol,stringcol
0,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
1,1970-01-01 00:00:00.000000000,,,,
2,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
3,1970-01-01 00:00:00.000000000,,,,
4,1970-01-01 00:00:00.000000000,,,,
5,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
6,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
7,1970-01-01 00:00:00.000000000,,,,
8,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000001,,,
9,1970-01-01 00:00:00.000000001,1970-01-01 00:00:00.000000001,,,"""), index_col=0) On master:
On the feature branch:
On master even though the first column is already full and doesn't receive any boolean fills, the column Since I now force this operation to be done column-by-column, But, this test relies on the old behavior, because it then attempts: res.fillna(True).astype(bool) Which fails with:
Because while I suspect the other remaining failures are similar. |
On master: import pandas as pd
from io import StringIO
res = pd.read_csv(StringIO(""",dates1,dates2
0,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000
1,1970-01-01 00:00:00.000000000,"""), index_col=0)
res['dates1'] = pd.to_datetime(res['dates1'])
res.astype(bool) Raises:
|
a1a39e1
to
a30fe18
Compare
a30fe18
to
ee1f011
Compare
@ResidentMario your last is correct. if you need to do booleans with that you have to fillna first.
|
@jreback Unsure what you mean above, can you clarify? |
@ResidentMario I mean your example above is exactly as expected an not an error. |
Returning to this after a long absence. @jreback's solution works in the reproduction case, but can't be used in implementation. The problem is that it uses The brute-force way of resolving this, mentioned at #16155, is to upcast the But I'm just not seeing any hooks I can use to get in there and fix just this corner case specifically, either. I'm stumped. Welcoming suggestions from others. This is the last remaining test failure in this PR (assuming the driver failures are benign). There's still work to do with checking the impact on performance and going over the tests again, but this is otherwise reasonably close to a finish. |
closing as stale. this is a bit easier in current master FYI. |
git diff upstream/master | flake8 --diff