Skip to content

Conversation

@pulkitmaloo
Copy link
Contributor

@pulkitmaloo pulkitmaloo commented Aug 2, 2018

@datapythonista datapythonista added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data Categorical Categorical Data Type labels Aug 2, 2018
@TomAugspurger
Copy link
Contributor

Failure is at https://travis-ci.org/pandas-dev/pandas/jobs/411345500#L2287-L2289

Basically, you're fix only works if there are missing values.

This may fix it, but may break other things

diff --git a/pandas/core/strings.py b/pandas/core/strings.py
index 8097ed196..06fd1391e 100644
--- a/pandas/core/strings.py
+++ b/pandas/core/strings.py
@@ -1855,7 +1855,12 @@ class StringMethods(NoNewAttributesMixin):
         # before the transformation...
         if use_codes and self._is_categorical:
             result = take_1d(result, self._orig.cat.codes)
-            result[isna(result)] = na
+            missing = isna(result)
+
+            if missing.any():
+                result_type = np.result_type(result, na)
+                result = result.astype(result_type, copy=False)
+                result[isna(result)] = na
 
         if not hasattr(result, 'ndim') or not hasattr(result, 'dtype'):
             return result

@pep8speaks
Copy link

pep8speaks commented Aug 17, 2018

Hello @pulkitmaloo! Thanks for updating the PR.

Comment last updated on September 24, 2018 at 01:15 Hours UTC

@TomAugspurger
Copy link
Contributor

Looks like a maybe unrelated CI failure. Can you merge master and push again?

@TomAugspurger
Copy link
Contributor

And can you ensure (and test) that na is handled correctly for all the string functions? It looks like at least match isn't handled correctly.

@codecov
Copy link

codecov bot commented Aug 24, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@960a73f). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #22170   +/-   ##
=========================================
  Coverage          ?   92.19%           
=========================================
  Files             ?      169           
  Lines             ?    50950           
  Branches          ?        0           
=========================================
  Hits              ?    46975           
  Misses            ?     3975           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.62% <100%> (?)
#single 42.27% <0%> (?)
Impacted Files Coverage Δ
pandas/core/strings.py 98.58% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 960a73f...44b36a4. Read the comment docs.

- cython=0.28.2
- jinja2=2.8
- numexpr=2.4.4 # we test that we correctly don't use an unsupported numexpr
- numpy=1.9.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you revert this (and the rest of the .yaml)

@pulkitmaloo
Copy link
Contributor Author

@jreback Thanks for the feedback, I have updated the PR as you requested. However, the changes in yaml files leads to circleci failure.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2018

@pulkitmaloo you need to rebase on master

@jreback jreback added this to the 0.24.0 milestone Sep 25, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, can you add a whatsnew note in bug fixes

^^^^^^^

-
- Bug in :class:`StringMethods` where `str_contains()` was not filling missing values with given argument for na for a categorical Series (:issue:`22158`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use double backticks on Series, can you use the actual reference here, e.g. :func:`Series.str.contains`

Copy link
Contributor Author

@pulkitmaloo pulkitmaloo Sep 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Thanks, I've updated it. Please let me know if there's anything else you'd like to change.

Copy link
Contributor Author

@pulkitmaloo pulkitmaloo Oct 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Can you please review and merge this PR.

Co-Authored-By: pulkitmaloo <maloo.pulkit@gmail.com>
values = Series(["a", "b", "c", "a", np.nan], dtype="category")
result = values.str.contains('a', na=True).astype(object)
expected = Series([True, False, False, True, True], dtype=np.object_)
assert isinstance(result, Series)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed. The assert_series_equal checks that.

assert res.loc[2] == "foo"
# na for category
values = Series(["a", "b", "c", "a", np.nan], dtype="category")
result = values.str.contains('a', na=True).astype(object)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the .astype(object). What's the dtype when you don't do that? Categorical? If so, I'd rather we document that as part of the API and test it.


# na for objects
values = Series(["a", "b", "c", "a", np.nan])
result = values.str.contains('a', na=True).astype(object)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment for the astype object.

@TomAugspurger
Copy link
Contributor

@pulkitmaloo are you able to update?

@jreback
Copy link
Contributor

jreback commented Nov 18, 2018

@TomAugspurger should be fixed up.

@jreback jreback merged commit 2af56d4 into pandas-dev:master Nov 20, 2018
@jreback
Copy link
Contributor

jreback commented Nov 20, 2018

thanks @pulkitmaloo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data

Projects

None yet

Development

Successfully merging this pull request may close these issues.

.str.contains(..., na=False) consistency between categorical and object series

5 participants