ENH: Provide an errors parameter to fillna #15653

ResidentMario · 2017-03-11T01:47:44Z

progress towards ENH: standardize fill_value behavior across the API #15533
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

ResidentMario · 2017-03-11T01:50:04Z

Try it out yourself:

pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='raise')
pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='force')
pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='ignore')

types/missing.py:valid_fill_value has a test suite in test_missing.py, but I'm confused as to where I should put the tests for the new fillna bits...

ResidentMario · 2017-03-11T02:02:19Z

@jreback Two API questions:

Pretty sure it should be coerce, not force. See e.g. here.
Should the default behavior change? You stated in discussion that the default behavior should be ignore, not force. The latter is the default right now, however. This seems like a really big change, I think it'd be wise to have a deprecation cycle first...

ResidentMario · 2017-03-11T04:24:53Z

This fails seemingly every sparse test in the test suite. 👀

sparse things obviously already have hardcoded dtype requirements so I should find a way of disabling this feature for all such types.

TomAugspurger · 2017-03-11T12:24:00Z

Yeah, I think 'coerce' instead of 'force'. Not sure about changing the default. I don't think ignore should be the default. I think raise is more consistent with to_numeric. But it's a big change, and maybe not worth it.

jreback · 2017-03-11T17:34:38Z

pandas/core/generic.py

        inplace = validate_bool_kwarg(inplace, 'inplace')

+        from pandas import Series


you won't do any of this here, this will all be done in pandas\core\missing.py

jreback · 2017-03-11T17:34:56Z

pandas/core/internals.py

+            if errors == 'force' or errors == 'raise':
+                # we cannot coerce the underlying object, so
+                # make an ObjectBlock
+                return self.to_object_block(mgr=mgr).fillna(original_value,


same here, this should all be done in missing

jreback · 2017-03-11T17:35:35Z

pandas/types/missing.py

@@ -391,3 +394,36 @@ def na_value_for_dtype(dtype):
    elif is_bool_dtype(dtype):
        return False
    return np.nan
+
+
+def valid_fill_value(value, dtype, is_period=False):


-> validate_fill_value

This just returns a bool, the action verb validate implies to me raising an error or processing the data somehow. Like in e.g. maybe_upcast. I can rename this, but I suggest reserving validate_fill_value as the name for the fillna sub-impl I'm moving to core/missing.py per your review.

jreback · 2017-03-11T17:35:58Z

pandas/types/missing.py

+    ----------
+    value : scalar
+    dtype: string / dtype
+    is_period : bool


take this parameter out, you can skip period validation for now if its not working

jreback · 2017-03-12T15:19:25Z

pandas/core/missing.py

@@ -621,3 +622,24 @@ def fill_zeros(result, x, y, name, fill):
            result = result.reshape(shape)

    return result
+
+
+def maybe_fill(obj, value, errors):


this is better as validate_fill_value, just raising if its an error

That would work if we had coerce and raise modes only, but with the addition of ignore we have to flag the non-conformal column so that the fill routine can skip it. Hence why I modified this method to return a bool.

How else would you implement ignore?

The cleanest way I can think of is writing a separate maybe_ignore method which factors that bit of logic out. But, you will have to run a is_valid_fill_value operation twice. Maybe that's OK.

This is a non-idempotent (probably not the right word for this) function that either raises or returns a flag depending on the mode.

For the fillna use-case alone one way to avoid this combination would be to add an if-else to the method body, and split this logic into two separate methods in missing (so e.g. if mode is errors do validate_fill_value, otherwise do continue_to_fill).

This would be fine for fillna alone. But, I'm trying to keep an eye out also for extending this raise/ignore/coerce stuff to other methods as well in some later PRs (#15533), and obviously you can't go about inserting this same if-else block everywhere all willy-nilly.

having a function that takes an argument, then returns (with possibly modified), and potentially raises is just fine as well. You can always raise.

validation functions on the other hand will generally just return None OR raise. returning a boolean is a bit odd actually. You want simple as possible for things.

@jreback I agree with you about the function signature, will do.

The trouble is that we need some mechanism for telling the underlying function (fillna) to short-circuit. So we would need e.g.

try: missing.validate_fill_value(self, value) except ValueError: if errors == 'ignore': return

I don't know of any way of implementing that in a separate function without passing a flag somewhere. If we want to wrap this whole sequence of actions in one function call, that would have to be part of the method signature.

But, looking at this again, I think it'd actually be best to just keep this logic in fillna for now, and figure out to generalize it more broadly (so we don't have to rewrite this same bit in shift et. al.) in a different PR later.

jreback · 2017-03-12T15:19:31Z

pandas/core/missing.py

+
+def maybe_fill(obj, value, errors):
+    """
+    fillna error coercion routine, returns whether or not to continue.


pls add a doc-string

jreback · 2017-03-12T15:19:47Z

pandas/core/missing.py

+    fillna error coercion routine, returns whether or not to continue.
+    """
+    from pandas import Series
+    if isinstance(obj, Series):


use ABCSeries (import at top)

jreback · 2017-03-12T15:20:18Z

pandas/core/missing.py

+    from pandas import Series
+    if isinstance(obj, Series):
+        if errors is None or errors == 'coerce':
+            return True


errors must be coerce, raise, ignore, not is not acceptable.

Previously, I provided coerce as the default kwarg and passed that on through the fillna method stack. However whilst Sparse and Categorical series accept an errors param they raise immediately if it is not None (because it's not implemented/doesn't make sense for them). So the default (coerce) was causing those routines to raise—and a lot of test failures.

I remade the default param to None and introduced this modified check to address this concern.

How should this be handled instead?

jreback · 2017-03-12T15:21:13Z

pandas/types/missing.py

@@ -391,3 +394,30 @@ def na_value_for_dtype(dtype):
    elif is_bool_dtype(dtype):
        return False
    return np.nan
+
+
+def valid_fill_value(value, dtype):


-> is_valid_fill_value

jreback · 2017-04-02T22:28:53Z

pandas/core/missing.py

-                                 _ensure_float64)
-
+                                 is_float_dtype, is_datetime64_dtype,
+                                 is_datetime64tz_dtype, is_integer_dtype,


leave the formatting as is

jreback · 2017-04-02T22:29:20Z

pandas/core/missing.py

+
+    Fillna error coercion routine.
+
+    Construct a piecewise polynomial in the Bernstein basis, compatible


The dangers of copy-paste.

jreback · 2017-04-02T22:29:29Z

pandas/core/missing.py

+        parsed as a sequence of sub-Series later on.
+    value : object
+        The value to be used as a fill for the object.
+    errors : {'raise', 'coerce', or 'ignore'}


this is not a parameter

ResidentMario · 2017-04-02T23:10:38Z

I'm investigating the five tests in the suite that the current code still doesn't pass, and I've hit something I can't explain originating in TestDataFrameMissingData.test_fillna_datetime_columns.

At a high level the bug is the fact that with the following bit:

>>> pd.DataFrame({'A': [-1, -2, np.nan],
              'B': [pd.Timestamp('2013-01-01'), pd.Timestamp('2013-01-02'), pd.NaT],
              'C': ['foo', 'bar', None],
              'D': ['foo2', 'bar2', None]},
              index=pd.date_range('20130110', periods=3)).fillna('?')

0.19.2 gives:

            A                    B    C     D
2013-01-10 -1  2013-01-01 00:00:00  foo  foo2
2013-01-11 -2  2013-01-02 00:00:00  bar  bar2
2013-01-12  ?                    ?    ?     ?

While this PR gives:

            A                    B    C     D
2013-01-10 -1  2013-01-01 00:00:00  foo  foo2
2013-01-11 -2  2013-01-02 00:00:00  bar  bar2
2013-01-12  ?           	NaT   ?     ?

This difference seems to be emitting from line 268 in frame.py, where in DataFrame.__init__ you have:

    if isinstance(data, BlockManager):
        mgr = self._init_mgr(data, axes=dict(index=index, columns=columns),
                             dtype=dtype, copy=copy)

In 0.19.2 this emits:

BlockManager
Items: Index(['A', 'B', 'C', 'D'], dtype='object')
Axis 1: DatetimeIndex(['2013-01-10', '2013-01-11', '2013-01-12'], dtype='datetime64[ns]', freq='D')
ObjectBlock: slice(0, 4, 1), 4 x 3, dtype: object

In this PR:

BlockManager
Items: Index(['A', 'B', 'C', 'D'], dtype='object')
Axis 1: DatetimeIndex(['2013-01-10', '2013-01-11', '2013-01-12'], dtype='datetime64[ns]', freq='D')
DatetimeBlock: slice(1, 2, 1), 1 x 3, dtype: datetime64[ns]
ObjectBlock: [0, 2, 3], 3 x 3, dtype: object

But how is this possible? This PR only adds a bit of if-else logic to the fillna method; it doesn't touch any of these deep internals (which are mostly beyond me anyway). So it shouldn't cause any differences in how block managers operate.

And the current master seems to pass all tests, including this one...huh.

I wouldn't call this an edge case, the old behavior definitely needs to be preserved.

Maybe I'm missing something, going to have to circle back to this later.

jreback · 2017-04-03T15:04:04Z

@ResidentMario this is related to this bug: #15613 I think

with the default errors='ignore' this should return as 0.19.2

jreback · 2017-04-03T15:05:10Z

pandas/core/internals.py

-                                                        inplace=inplace,
-                                                        downcast=False)
+            if errors == 'coerce' or errors == 'raise':
+                # we cannot coerce the underlying object, so


raise shoulnd't coerce.

Yes, we should have already caught and raised in this case, no need to have this need. Seems I made a bit of a mess while figuring out a working code path.

jreback · 2017-04-03T15:05:23Z

pandas/core/internals.py

+                                                            downcast=False,
+                                                            errors='ignore')
+            else:  # errors == 'ignore'
+                return self


this is tricky for datetimes, they auto-infer.

ResidentMario · 2017-04-04T01:39:45Z

Thanks for the clarification, this is seems to be the same as your comment. Going to need to fiddle with it to find a workaround.

A more minimal example:

pd.Series([pd.Timestamp('2015-01-01'), pd.NaT]).fillna("?")

Fills with ? on master and NaT on fillna-errors.

ResidentMario · 2017-04-04T01:49:15Z

Should probably write the tests first, this got picked up because a test happened to depend on this behavior, maybe there are other cases also like this.

ResidentMario · 2017-04-04T17:55:16Z

Looking at the tests, I see three dissatisfactory cases:

Period columns are always dtype('O'), and there's no easy way of checking whether or not a column is period-type (previously discussed).
date-type things auto-coerce to NaT, so their default behavior is equivalent to errors="ignore" for everything else (see above).
A column with a bool and nan is dtype('O'). So e.g. we can't prevent Series([True, np.nan]).fillna('foobar', errors='ignore') from producing True, 'foobar'.

The second one needs to be solved, the first and last are limitations of pandas nullity handling that probably need to be lived with.

ResidentMario · 2017-04-16T21:22:59Z

Looks like it was simpler than I had thought.

The problem is that to keep generic/fillna working with categorical types, I pass None instead of 'coerce' down as the default value to internals/fillna. Adding:

if not errors:
    errors = 'coerce'

To the fillna method body fixed this problem. For now, this is OK. But, I'm hoping I'll find a better solution later.

kernc · 2017-04-24T17:37:54Z

Does this perhaps also fix the problem reported in #16112?

ResidentMario · 2017-04-26T20:30:21Z

No, trying an error mode on a sparse matrix will raise an error for now.

ResidentMario · 2017-04-26T21:37:02Z

TestDataFrameOperators.test_timestamp_compare fails because of something remarkably subtle.

Suppose we have:

import pandas as pd
from io import StringIO

res = pd.read_csv(StringIO(""",dates1,dates2,floatcol,intcol,stringcol
0,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
1,1970-01-01 00:00:00.000000000,,,,
2,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
3,1970-01-01 00:00:00.000000000,,,,
4,1970-01-01 00:00:00.000000000,,,,
5,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
6,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
7,1970-01-01 00:00:00.000000000,,,,
8,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000001,,,
9,1970-01-01 00:00:00.000000001,1970-01-01 00:00:00.000000001,,,"""), index_col=0)

On master:

In [14]: res.fillna(True).dtypes
Out[14]: 
dates1        object
dates2        object
floatcol     float64
intcol       float64
stringcol    float64
dtype: object

On the feature branch:

In [14]: res.fillna(True).dtypes
Out[14]: 
dates1        datetime64[ns]
dates2        object
floatcol     float64
intcol       float64
stringcol    float64
dtype: object

On master even though the first column is already full and doesn't receive any boolean fills, the column dtype still gets upcast to object. I think that's because fillna internally runs the operation, and the upcast, on the 2-by-n "double datetimes" block.

Since I now force this operation to be done column-by-column, fillna will no longer do this—it will leave that first column's dtype untouched. That's not a bug; that's a feature!

But, this test relies on the old behavior, because it then attempts:

res.fillna(True).astype(bool)

Which fails with:

*** TypeError: cannot astype a datetimelike from [datetime64[ns]] to [bool]

Because while object columns can be cast to bool, datetime64 columns cannot.

I suspect the other remaining failures are similar.

ResidentMario · 2017-04-26T22:40:10Z

On master:

import pandas as pd
from io import StringIO

res = pd.read_csv(StringIO(""",dates1,dates2
0,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000
1,1970-01-01 00:00:00.000000000,"""), index_col=0)

res['dates1'] = pd.to_datetime(res['dates1'])
res.astype(bool)

Raises:

TypeError: cannot astype a datetimelike from [datetime64[ns]] to [bool]

jreback · 2017-04-28T21:14:33Z

@ResidentMario your last is correct. if you need to do booleans with that you have to fillna first.

In [25]: pd.to_datetime(res['dates2'])
Out[25]: 
0   1970-01-01
1          NaT
Name: dates2, dtype: datetime64[ns]

In [26]: pd.to_datetime(res['dates2']).fillna(False).astype(bool)
Out[26]: 
0     True
1    False
Name: dates2, dtype: bool

ResidentMario · 2017-04-29T02:21:18Z

@jreback Unsure what you mean above, can you clarify?

jreback · 2017-04-29T18:29:56Z

@ResidentMario I mean your example above is exactly as expected an not an error.

ResidentMario · 2017-06-11T20:27:22Z

Returning to this after a long absence.

@jreback's solution works in the reproduction case, but can't be used in implementation. The problem is that it uses fillna, and this changeset is a feature patch for...fillna.

The brute-force way of resolving this, mentioned at #16155, is to upcast the datetime to object dtype beforehand. That doesn't seem right at all.

But I'm just not seeing any hooks I can use to get in there and fix just this corner case specifically, either.

I'm stumped. Welcoming suggestions from others.

This is the last remaining test failure in this PR (assuming the driver failures are benign). There's still work to do with checking the impact on performance and going over the tests again, but this is otherwise reasonably close to a finish.

jreback · 2017-08-17T10:31:07Z

closing as stale. this is a bit easier in current master FYI.

jreback reviewed Mar 11, 2017

View reviewed changes

jreback added Error Reporting Incorrect or improved errors from pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Mar 11, 2017

jreback requested changes Mar 12, 2017

View reviewed changes

ResidentMario force-pushed the fillna-errors branch from ebec92c to 54c8406 Compare April 2, 2017 20:04

jreback reviewed Apr 2, 2017

View reviewed changes

jreback reviewed Apr 3, 2017

View reviewed changes

ResidentMario force-pushed the fillna-errors branch from 87854e3 to 5010d5c Compare April 17, 2017 17:54

This was referenced Apr 26, 2017

BUG: Exception to the boolean frame comparison guarantee #16155

Closed

BUG: fillna returns frame when inplace=True if fill values are specified as a dict #16156

Closed

ResidentMario force-pushed the fillna-errors branch from a1a39e1 to a30fe18 Compare April 28, 2017 04:11

ResidentMario added 2 commits April 28, 2017 00:14

Add fill validation methods.

f051112

Clobber commits together to ease merge.

ee1f011

ResidentMario force-pushed the fillna-errors branch from a30fe18 to ee1f011 Compare April 28, 2017 04:15

tm.assertRaises -> pytest.raises

9dea754

Short-circuit when passing DataFrame-on-DataFrame

9b15e87

Merge branch 'master' into fillna-errors

863da12

jreback closed this Aug 17, 2017

		inplace = validate_bool_kwarg(inplace, 'inplace')

		from pandas import Series


		Fillna error coercion routine.

		Construct a piecewise polynomial in the Bernstein basis, compatible

Uh oh!

ENH: Provide an errors parameter to fillna #15653

ENH: Provide an errors parameter to fillna #15653

Uh oh!

Conversation

ResidentMario commented Mar 11, 2017

Uh oh!

ResidentMario commented Mar 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ResidentMario commented Mar 11, 2017

Uh oh!

ResidentMario commented Mar 11, 2017

Uh oh!

TomAugspurger commented Mar 11, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ResidentMario Mar 26, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ResidentMario Mar 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ResidentMario commented Apr 2, 2017

Uh oh!

jreback commented Apr 3, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ResidentMario commented Apr 4, 2017

Uh oh!

ResidentMario commented Apr 4, 2017

Uh oh!

ResidentMario commented Apr 4, 2017

Uh oh!

ResidentMario commented Apr 16, 2017

Uh oh!

ResidentMario commented Mar 11, 2017 •

edited

Loading

ResidentMario Mar 26, 2017 •

edited

Loading

ResidentMario Mar 12, 2017 •

edited

Loading

ResidentMario commented Apr 26, 2017 •

edited

Loading

ResidentMario commented Apr 26, 2017 •

edited

Loading