API: allow nan-likes in StringArray constructor #41412

Closed
wants to merge 41 commits into from
3e1784d
API: allow nan-likes in StringArray constructor
lithomas1 May 10, 2021
96ff1da
Revert weird changes & Fix stuff
lithomas1 May 11, 2021
418e1d2
Remove failing test
lithomas1 May 11, 2021
25a6c4d
Changes from code review
lithomas1 May 19, 2021
47d68f7
Merge branch 'master' of https://github.com/pandas-dev/pandas into st…
lithomas1 May 19, 2021
8257dbd
typo
lithomas1 May 20, 2021
922436a
Update lib.pyi
lithomas1 May 21, 2021
2f28086
Update lib.pyx
lithomas1 May 29, 2021
3ee2198
Update lib.pyx
lithomas1 May 29, 2021
9426a52
Merge branch 'master' of https://github.com/pandas-dev/pandas into st…
lithomas1 May 30, 2021
3ee55f2
Updates
lithomas1 May 30, 2021
fe4981a
Update lib.pyx
lithomas1 May 30, 2021
a66948a
Update lib.pyx
lithomas1 May 30, 2021
e852719
Update lib.pyx
lithomas1 May 31, 2021
91b73bb
disallow invalid nans in stringarray constructor
lithomas1 Jun 4, 2021
42ec3e4
Merge branch 'master' into stringarray-nan
lithomas1 Jun 4, 2021
41f49d2
add to _from_sequence and fixes
lithomas1 Jun 4, 2021
62cc5be
address code review
lithomas1 Jun 4, 2021
29909f3
Merge branch 'master' into stringarray-nan
lithomas1 Jun 4, 2021
153b6b4
Fix failures
lithomas1 Jun 5, 2021
b27a839
maybe fix benchmarks?
lithomas1 Jun 5, 2021
ed5b953
Partially address code review
lithomas1 Jun 5, 2021
caa5705
Test coerce=False
lithomas1 Jun 6, 2021
2d75031
move benchmarks
lithomas1 Jun 7, 2021
52a00d1
accidental formatting changes
lithomas1 Jun 7, 2021
8dc0b66
Fix
lithomas1 Jun 8, 2021
1bacaed
Merge branch 'master' into stringarray-nan
lithomas1 Jun 8, 2021
66be087
missing import from conflict
lithomas1 Jun 8, 2021
7b058cd
Merge branch 'master' into stringarray-nan
lithomas1 Jun 9, 2021
1be1bdf
Merge branch 'pandas-dev:master' into stringarray-nan
lithomas1 Jun 22, 2021
3c57094
remove old whatsnew
lithomas1 Jul 21, 2021
03738a9
Merge branch 'master' of https://github.com/pandas-dev/pandas into st…
lithomas1 Jul 21, 2021
12351de
move whatsnew
lithomas1 Jul 21, 2021
889829a
Merge branch 'master' into stringarray-nan
lithomas1 Oct 4, 2021
358000f
typo
lithomas1 Oct 4, 2021
c649b1d
Merge branch 'master' into stringarray-nan
lithomas1 Oct 16, 2021
5e5aa9c
Merge branch 'master' into stringarray-nan
lithomas1 Nov 27, 2021
eb7d8f2
Merge branch 'master' into stringarray-nan
lithomas1 Dec 18, 2021
2426319
Merge branch 'master' into stringarray-nan
lithomas1 Dec 27, 2021
20817a7
address comments
lithomas1 Dec 27, 2021
33d8f9a
accept any float nan w/ util.is_nan
lithomas1 Dec 27, 2021
add to _from_sequence and fixes
lithomas1 committed Jun 4, 2021
commit 41f49d21d8da2bbdcc37d33714d009ea2b862049
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.3.0.rst
@@ -644,7 +644,7 @@ Other API changes
 - Partially initialized :class:`CategoricalDtype` (i.e. those with ``categories=None`` objects will no longer compare as equal to fully initialized dtype objects.
 - Accessing ``_constructor_expanddim`` on a :class:`DataFrame` and ``_constructor_sliced`` on a :class:`Series` now raise an ``AttributeError``. Previously a ``NotImplementedError`` was raised (:issue:`38782`)
 - Added new ``engine`` and ``**engine_kwargs`` parameters to :meth:`DataFrame.to_sql` to support other future "SQL engines". Currently we still only use ``SQLAlchemy`` under the hood, but more engines are planned to be supported such as ``turbodbc`` (:issue:`36893`)
-- :class:`StringArray` now accepts nan-likes(``None``, ``nan``, ``NaT``, ``NA``, Decimal("NaN")) in its constructor in addition to strings.
+- :class:`StringArray` now accepts nan-likes(``None``, ``nan``, ``NA``) in its constructor in addition to strings.
 - Removed redundant ``freq`` from :class:`PeriodIndex` string representation (:issue:`41653`)


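As a rough illustration of the revised whatsnew entry, the sketch below checks whether a value is a string or one of the three nan-likes it names (``None``, ``nan``, ``NA``). It is not the pandas implementation: ``NA`` is a hypothetical stand-in for ``pandas.NA`` so that the example runs without pandas, the helper names are invented for this sketch, and it assumes that the values dropped from the entry (``NaT``, ``Decimal("NaN")``) are rejected.

```python
import math
from decimal import Decimal


class _NAType:
    """Hypothetical stand-in for pandas.NA so the sketch needs no pandas import."""

    def __repr__(self) -> str:
        return "<NA>"


NA = _NAType()


def is_accepted_nan_like(value: object) -> bool:
    """Return True for the nan-likes named in the revised entry: None, nan, NA."""
    if value is None or value is NA:
        return True
    # Only float NaN counts here; Decimal("NaN") is not a float and is rejected.
    return isinstance(value, float) and math.isnan(value)


def validate_string_values(values: list) -> None:
    """Raise ValueError unless every element is a str or an accepted nan-like."""
    for value in values:
        if not isinstance(value, str) and not is_accepted_nan_like(value):
            # Message text copied from the _validate context line in this diff.
            raise ValueError("StringArray requires a sequence of strings or pandas.NA")


# Strings mixed with the three accepted nan-likes pass without error.
validate_string_values(["a", None, float("nan"), NA])
# A value dropped from the whatsnew list is treated as invalid in this sketch.
print(is_accepted_nan_like(Decimal("NaN")))  # False
```

For user code, the docstring changed in this commit still points to ``pandas.array`` with ``dtype="string"`` as the stable way to build a ``StringArray``.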
18 changes: 14 additions & 4 deletions pandas/core/arrays/string_.py
@@ -145,7 +145,7 @@ class StringArray(PandasArray):

            Currently, this expects an object-dtype ndarray
            where the elements are Python strings
-           or nan-likes(``None``, ``nan``, ``NaT``, ``NA``, Decimal("NaN")).
+           or nan-likes(``None``, ``nan``, ``NA``).
            This may change without warning in the future. Use
            :meth:`pandas.array` with ``dtype="string"`` for a stable way of
            creating a `StringArray` from any sequence.
@@ -239,23 +239,33 @@ def _validate(self):
             raise ValueError("StringArray requires a sequence of strings or pandas.NA")

     @classmethod
-    def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy=False):
+    def _from_sequence(
+        cls, scalars, *, dtype: Dtype | None = None, copy=False, coerce=True
Contributor

Is coerce: bool enough here?

This is like errors='coerce' for coerce=True and errors='raise' for coerce=False; I guess 'ignore' would be meaningless.

But I still think the errors= keyword is better for flexibility.

Member

will take a look soon-ish. I'm wary of adding keywords here

+    ):
         if dtype:
             assert dtype == "string"

         from pandas.core.arrays.masked import BaseMaskedArray

         if isinstance(scalars, BaseMaskedArray):
             # avoid costly conversion to object dtype
+            if coerce:
+                coerce = "non-null"
+            else:
+                coerce = None
             na_values = scalars._mask
             result = scalars._data
-            result = lib.ensure_string_array(result, copy=copy, coerce="non-null")
+            result = lib.ensure_string_array(result, copy=copy, coerce=coerce)
             result[na_values] = StringDtype.na_value

         else:
             # convert non-na-likes to str, and nan-likes to StringDtype.na_value
+            if coerce:
+                coerce = "all"
+            else:
+                coerce = "strict-null"
             result = lib.ensure_string_array(
-                scalars, na_value=StringDtype.na_value, copy=copy
+                scalars, na_value=StringDtype.na_value, copy=copy, coerce=coerce
             )

         # Manually creating new array avoids the validation step in the __init__, so is
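The hunk above turns the boolean `coerce` argument into one of four mode values before calling `lib.ensure_string_array`, and it does so by rebinding `coerce` itself. A minimal sketch of the same mapping as a standalone helper follows. The mode values (`"non-null"`, `None`, `"all"`, `"strict-null"`) are taken from the diff; the helper name `resolve_coerce_mode` and its `masked` parameter are hypothetical and appear nowhere in the PR.

```python
from typing import Optional


def resolve_coerce_mode(coerce: bool, *, masked: bool) -> Optional[str]:
    """Map the boolean flag to the mode value used in the diff.

    masked=True corresponds to the BaseMaskedArray branch, masked=False to the
    general branch. Both names are assumptions made for this sketch.
    """
    if masked:
        # Masked arrays carry their own mask, so nulls are handled separately.
        return "non-null" if coerce else None
    # General path: coerce everything, or accept only strict nan-likes.
    return "all" if coerce else "strict-null"


mode = resolve_coerce_mode(True, masked=False)
print(mode)  # all
```

Keeping the mapping in one place would leave the `coerce` parameter a plain bool for the whole function body, which may be easier to type-check than rebinding it to a string or `None`.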
16 changes: 13 additions & 3 deletions pandas/core/arrays/string_arrow.py
@@ -237,7 +237,9 @@ def __init__(self, values):
             )

     @classmethod
-    def _from_sequence(cls, scalars, dtype: Dtype | None = None, copy: bool = False):
+    def _from_sequence(
+        cls, scalars, dtype: Dtype | None = None, copy: bool = False, coerce=True
+    ):
         from pandas.core.arrays.masked import BaseMaskedArray

         _chk_pyarrow_available()
@@ -247,11 +249,19 @@ def _from_sequence(cls, scalars, dtype: Dtype | None = None, copy: bool = False)
             # numerical issues with Float32Dtype
             na_values = scalars._mask
             result = scalars._data
-            result = lib.ensure_string_array(result, copy=copy, coerce="non-null")
+            if coerce:
+                coerce = "non-null"
+            else:
+                coerce = None
+            result = lib.ensure_string_array(result, copy=copy, coerce=coerce)
             return cls(pa.array(result, mask=na_values, type=pa.string()))

         # convert non-na-likes to str
-        result = lib.ensure_string_array(scalars, copy=copy)
+        if coerce:
+            coerce = "all"
+        else:
+            coerce = "strict-null"
+        result = lib.ensure_string_array(scalars, copy=copy, coerce=coerce)
         return cls(pa.array(result, type=pa.string(), from_pandas=True))

     @classmethod
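Both `_from_sequence` implementations in this commit take the same boolean `coerce` flag. The review comment on string_.py suggests an `errors=` keyword instead, with `errors='coerce'` matching `coerce=True`, `errors='raise'` matching `coerce=False`, and `'ignore'` having no sensible meaning. A minimal sketch of that translation follows. The function name `errors_to_coerce` and its error message are hypothetical and were not proposed in the PR.

```python
# Only the two values the reviewer considered meaningful are allowed.
_VALID_ERRORS = ("coerce", "raise")


def errors_to_coerce(errors: str = "coerce") -> bool:
    """Translate a hypothetical errors= keyword into the boolean coerce flag."""
    if errors not in _VALID_ERRORS:
        # 'ignore' lands here, matching the reviewer's remark that it is meaningless.
        raise ValueError(f"errors must be one of {_VALID_ERRORS}, got {errors!r}")
    return errors == "coerce"


print(errors_to_coerce("coerce"))  # True
print(errors_to_coerce("raise"))   # False
```

A string keyword leaves room for further modes later, which is the flexibility argument in the review comment; the maintainer's reply notes a reluctance to add keywords to `_from_sequence` at all.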