Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Provide an errors parameter to fillna #15653

Closed
wants to merge 5 commits into from

Conversation

ResidentMario
Copy link
Contributor

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Mar 11, 2017

Try it out yourself:

pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='raise')
pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='force')
pd.DataFrame({'a': [1,2,np.nan], 'b': [np.nan, 2, 3]}).fillna({'a': 'foo', 'b': 'bar'}, errors='ignore')

types/missing.py:valid_fill_value has a test suite in test_missing.py, but I'm confused as to where I should put the tests for the new fillna bits...

@ResidentMario
Copy link
Contributor Author

@jreback Two API questions:

  1. Pretty sure it should be coerce, not force. See e.g. here.
  2. Should the default behavior change? You stated in discussion that the default behavior should be ignore, not force. The latter is the default right now, however. This seems like a really big change, I think it'd be wise to have a deprecation cycle first...

@ResidentMario
Copy link
Contributor Author

This fails seemingly every sparse test in the test suite. 👀

sparse things obviously already have hardcoded dtype requirements so I should find a way of disabling this feature for all such types.

@TomAugspurger
Copy link
Contributor

Yeah, I think 'coerce' instead of 'force'. Not sure about changing the default. I don't think ignore should be the default. I think raise is more consistent with to_numeric. But it's a big change, and maybe not worth it.

inplace = validate_bool_kwarg(inplace, 'inplace')

from pandas import Series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you won't do any of this here, this will all be done in pandas\core\missing.py

if errors == 'force' or errors == 'raise':
# we cannot coerce the underlying object, so
# make an ObjectBlock
return self.to_object_block(mgr=mgr).fillna(original_value,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, this should all be done in missing

@@ -391,3 +394,36 @@ def na_value_for_dtype(dtype):
elif is_bool_dtype(dtype):
return False
return np.nan


def valid_fill_value(value, dtype, is_period=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> validate_fill_value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just returns a bool, the action verb validate implies to me raising an error or processing the data somehow. Like in e.g. maybe_upcast. I can rename this, but I suggest reserving validate_fill_value as the name for the fillna sub-impl I'm moving to core/missing.py per your review.

----------
value : scalar
dtype: string / dtype
is_period : bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take this parameter out, you can skip period validation for now if its not working

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Mar 11, 2017
@@ -621,3 +622,24 @@ def fill_zeros(result, x, y, name, fill):
result = result.reshape(shape)

return result


def maybe_fill(obj, value, errors):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is better as validate_fill_value, just raising if its an error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would work if we had coerce and raise modes only, but with the addition of ignore we have to flag the non-conformal column so that the fill routine can skip it. Hence why I modified this method to return a bool.

How else would you implement ignore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanest way I can think of is writing a separate maybe_ignore method which factors that bit of logic out. But, you will have to run a is_valid_fill_value operation twice. Maybe that's OK.

Copy link
Contributor Author

@ResidentMario ResidentMario Mar 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a non-idempotent (probably not the right word for this) function that either raises or returns a flag depending on the mode.

For the fillna use-case alone one way to avoid this combination would be to add an if-else to the method body, and split this logic into two separate methods in missing (so e.g. if mode is errors do validate_fill_value, otherwise do continue_to_fill).

This would be fine for fillna alone. But, I'm trying to keep an eye out also for extending this raise/ignore/coerce stuff to other methods as well in some later PRs (#15533), and obviously you can't go about inserting this same if-else block everywhere all willy-nilly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having a function that takes an argument, then returns (with possibly modified), and potentially raises is just fine as well. You can always raise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validation functions on the other hand will generally just return None OR raise. returning a boolean is a bit odd actually. You want simple as possible for things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I agree with you about the function signature, will do.

The trouble is that we need some mechanism for telling the underlying function (fillna) to short-circuit. So we would need e.g.

        try:
            missing.validate_fill_value(self, value)
        except ValueError:
            if errors == 'ignore':
                return

I don't know of any way of implementing that in a separate function without passing a flag somewhere. If we want to wrap this whole sequence of actions in one function call, that would have to be part of the method signature.

But, looking at this again, I think it'd actually be best to just keep this logic in fillna for now, and figure out to generalize it more broadly (so we don't have to rewrite this same bit in shift et. al.) in a different PR later.


def maybe_fill(obj, value, errors):
"""
fillna error coercion routine, returns whether or not to continue.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a doc-string

fillna error coercion routine, returns whether or not to continue.
"""
from pandas import Series
if isinstance(obj, Series):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ABCSeries (import at top)

from pandas import Series
if isinstance(obj, Series):
if errors is None or errors == 'coerce':
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors must be coerce, raise, ignore, not is not acceptable.

Copy link
Contributor Author

@ResidentMario ResidentMario Mar 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, I provided coerce as the default kwarg and passed that on through the fillna method stack. However whilst Sparse and Categorical series accept an errors param they raise immediately if it is not None (because it's not implemented/doesn't make sense for them). So the default (coerce) was causing those routines to raise—and a lot of test failures.

I remade the default param to None and introduced this modified check to address this concern.

How should this be handled instead?

@@ -391,3 +394,30 @@ def na_value_for_dtype(dtype):
elif is_bool_dtype(dtype):
return False
return np.nan


def valid_fill_value(value, dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> is_valid_fill_value

_ensure_float64)

is_float_dtype, is_datetime64_dtype,
is_datetime64tz_dtype, is_integer_dtype,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave the formatting as is


Fillna error coercion routine.

Construct a piecewise polynomial in the Bernstein basis, compatible
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dangers of copy-paste.

parsed as a sequence of sub-Series later on.
value : object
The value to be used as a fill for the object.
errors : {'raise', 'coerce', or 'ignore'}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a parameter

@ResidentMario
Copy link
Contributor Author

I'm investigating the five tests in the suite that the current code still doesn't pass, and I've hit something I can't explain originating in TestDataFrameMissingData.test_fillna_datetime_columns.

At a high level the bug is the fact that with the following bit:

>>> pd.DataFrame({'A': [-1, -2, np.nan],
              'B': [pd.Timestamp('2013-01-01'), pd.Timestamp('2013-01-02'), pd.NaT],
              'C': ['foo', 'bar', None],
              'D': ['foo2', 'bar2', None]},
              index=pd.date_range('20130110', periods=3)).fillna('?')

0.19.2 gives:

            A                    B    C     D
2013-01-10 -1  2013-01-01 00:00:00  foo  foo2
2013-01-11 -2  2013-01-02 00:00:00  bar  bar2
2013-01-12  ?                    ?    ?     ?

While this PR gives:

            A                    B    C     D
2013-01-10 -1  2013-01-01 00:00:00  foo  foo2
2013-01-11 -2  2013-01-02 00:00:00  bar  bar2
2013-01-12  ?           	NaT   ?     ?

This difference seems to be emitting from line 268 in frame.py, where in DataFrame.__init__ you have:

    if isinstance(data, BlockManager):
        mgr = self._init_mgr(data, axes=dict(index=index, columns=columns),
                             dtype=dtype, copy=copy)

In 0.19.2 this emits:

BlockManager
Items: Index(['A', 'B', 'C', 'D'], dtype='object')
Axis 1: DatetimeIndex(['2013-01-10', '2013-01-11', '2013-01-12'], dtype='datetime64[ns]', freq='D')
ObjectBlock: slice(0, 4, 1), 4 x 3, dtype: object

In this PR:

BlockManager
Items: Index(['A', 'B', 'C', 'D'], dtype='object')
Axis 1: DatetimeIndex(['2013-01-10', '2013-01-11', '2013-01-12'], dtype='datetime64[ns]', freq='D')
DatetimeBlock: slice(1, 2, 1), 1 x 3, dtype: datetime64[ns]
ObjectBlock: [0, 2, 3], 3 x 3, dtype: object

But how is this possible? This PR only adds a bit of if-else logic to the fillna method; it doesn't touch any of these deep internals (which are mostly beyond me anyway). So it shouldn't cause any differences in how block managers operate.

And the current master seems to pass all tests, including this one...huh.

I wouldn't call this an edge case, the old behavior definitely needs to be preserved.

Maybe I'm missing something, going to have to circle back to this later.

@jreback
Copy link
Contributor

jreback commented Apr 3, 2017

@ResidentMario this is related to this bug: #15613 I think

with the default errors='ignore' this should return as 0.19.2

inplace=inplace,
downcast=False)
if errors == 'coerce' or errors == 'raise':
# we cannot coerce the underlying object, so
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise shoulnd't coerce.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should have already caught and raised in this case, no need to have this need. Seems I made a bit of a mess while figuring out a working code path.

downcast=False,
errors='ignore')
else: # errors == 'ignore'
return self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is tricky for datetimes, they auto-infer.

@ResidentMario
Copy link
Contributor Author

Thanks for the clarification, this is seems to be the same as your comment. Going to need to fiddle with it to find a workaround.

A more minimal example:

pd.Series([pd.Timestamp('2015-01-01'), pd.NaT]).fillna("?")

Fills with ? on master and NaT on fillna-errors.

@ResidentMario
Copy link
Contributor Author

Should probably write the tests first, this got picked up because a test happened to depend on this behavior, maybe there are other cases also like this.

@ResidentMario
Copy link
Contributor Author

Looking at the tests, I see three dissatisfactory cases:

  1. Period columns are always dtype('O'), and there's no easy way of checking whether or not a column is period-type (previously discussed).
  2. date-type things auto-coerce to NaT, so their default behavior is equivalent to errors="ignore" for everything else (see above).
  3. A column with a bool and nan is dtype('O'). So e.g. we can't prevent Series([True, np.nan]).fillna('foobar', errors='ignore') from producing True, 'foobar'.

The second one needs to be solved, the first and last are limitations of pandas nullity handling that probably need to be lived with.

@ResidentMario
Copy link
Contributor Author

Looks like it was simpler than I had thought.

The problem is that to keep generic/fillna working with categorical types, I pass None instead of 'coerce' down as the default value to internals/fillna. Adding:

if not errors:
    errors = 'coerce'

To the fillna method body fixed this problem. For now, this is OK. But, I'm hoping I'll find a better solution later.

@kernc
Copy link
Contributor

kernc commented Apr 24, 2017

Does this perhaps also fix the problem reported in #16112?

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Apr 26, 2017

No, trying an error mode on a sparse matrix will raise an error for now.

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Apr 26, 2017

TestDataFrameOperators.test_timestamp_compare fails because of something remarkably subtle.

Suppose we have:

import pandas as pd
from io import StringIO

res = pd.read_csv(StringIO(""",dates1,dates2,floatcol,intcol,stringcol
0,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
1,1970-01-01 00:00:00.000000000,,,,
2,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
3,1970-01-01 00:00:00.000000000,,,,
4,1970-01-01 00:00:00.000000000,,,,
5,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
6,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000,,,
7,1970-01-01 00:00:00.000000000,,,,
8,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000001,,,
9,1970-01-01 00:00:00.000000001,1970-01-01 00:00:00.000000001,,,"""), index_col=0)

On master:

In [14]: res.fillna(True).dtypes
Out[14]: 
dates1        object
dates2        object
floatcol     float64
intcol       float64
stringcol    float64
dtype: object

On the feature branch:

In [14]: res.fillna(True).dtypes
Out[14]: 
dates1        datetime64[ns]
dates2        object
floatcol     float64
intcol       float64
stringcol    float64
dtype: object

On master even though the first column is already full and doesn't receive any boolean fills, the column dtype still gets upcast to object. I think that's because fillna internally runs the operation, and the upcast, on the 2-by-n "double datetimes" block.

Since I now force this operation to be done column-by-column, fillna will no longer do this—it will leave that first column's dtype untouched. That's not a bug; that's a feature!

But, this test relies on the old behavior, because it then attempts:

res.fillna(True).astype(bool)

Which fails with:

*** TypeError: cannot astype a datetimelike from [datetime64[ns]] to [bool]

Because while object columns can be cast to bool, datetime64 columns cannot.

I suspect the other remaining failures are similar.

@ResidentMario
Copy link
Contributor Author

On master:

import pandas as pd
from io import StringIO

res = pd.read_csv(StringIO(""",dates1,dates2
0,1970-01-01 00:00:00.000000000,1970-01-01 00:00:00.000000000
1,1970-01-01 00:00:00.000000000,"""), index_col=0)

res['dates1'] = pd.to_datetime(res['dates1'])
res.astype(bool)

Raises:

TypeError: cannot astype a datetimelike from [datetime64[ns]] to [bool]

@jreback
Copy link
Contributor

jreback commented Apr 28, 2017

@ResidentMario your last is correct. if you need to do booleans with that you have to fillna first.

In [25]: pd.to_datetime(res['dates2'])
Out[25]: 
0   1970-01-01
1          NaT
Name: dates2, dtype: datetime64[ns]

In [26]: pd.to_datetime(res['dates2']).fillna(False).astype(bool)
Out[26]: 
0     True
1    False
Name: dates2, dtype: bool

@ResidentMario
Copy link
Contributor Author

@jreback Unsure what you mean above, can you clarify?

@jreback
Copy link
Contributor

jreback commented Apr 29, 2017

@ResidentMario I mean your example above is exactly as expected an not an error.

@ResidentMario
Copy link
Contributor Author

Returning to this after a long absence.

@jreback's solution works in the reproduction case, but can't be used in implementation. The problem is that it uses fillna, and this changeset is a feature patch for...fillna.

The brute-force way of resolving this, mentioned at #16155, is to upcast the datetime to object dtype beforehand. That doesn't seem right at all.

But I'm just not seeing any hooks I can use to get in there and fix just this corner case specifically, either.

I'm stumped. Welcoming suggestions from others.

This is the last remaining test failure in this PR (assuming the driver failures are benign). There's still work to do with checking the impact on performance and going over the tests again, but this is otherwise reasonably close to a finish.

@jreback
Copy link
Contributor

jreback commented Aug 17, 2017

closing as stale. this is a bit easier in current master FYI.

@jreback jreback closed this Aug 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants