Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split fastpath IntegerArray constructor and general purpose constructor #22070

Merged

Conversation

jorisvandenbossche
Copy link
Member

Related to the discussion we were having on the PR: #21160 (@jreback @shoyer).

Opening this PR for discussion, to show what I meant there, to hopefully make it more concrete discussion.

Briefly the changes / rationale:

  • Separate the "fastpath" and general purpose constructor (in master everything is passed through coerce_to_array:
    • IntegerArray is the fastpath if you know you have a proper values and mask
    • integer_array as general purpose constructor function, that eg parses lists, detects NaNs etc (exact API to discuss of course)
  • IMO this clear separation makes the code cleaner (in all internal usage we know what we want). In the past we often added a fastpath keyword to the constructor, but IMO we should not repeat that pattern in new code (it is overloading the main constructor by combining things).
  • This also has the benefit of improved performance. Eg on an array of 1e5 values, the current constructor has an overhead of ca 80 µs, not much of course but that means that a simple slice is 82 instead of 2 µs (of course, this improvement does not need the exact proposal in this PR, you could certainly do a PR adding a fastpath keyword that achieves the same boost).

Note that this is not yet a fully implemented PR:

  • Docstrings still need to be updated
  • Tests for new behaviour needs to be added
  • coerce_to_array can be simplified a little bit, as currently the mask argument is nowhere used (unless we foresee use cases for that?)

@jreback
Copy link
Contributor

jreback commented Jul 26, 2018

looks interesting - will look tomorrow

@jbrockmendel
Copy link
Member

+1 on having a fastpath-like constructor, +1 on it not being accessed via a fastpath kwarg. -0.5 on introducing a new naming convention for constructors.

We don't need to require it of downstream EA authors, but for pandas-internal EAs we should have some kind of consistent patterns for EA constructors. Since _simple_new+_shallow_copy are already well-established (and used all over the Datetime/Timedelta/Period EA mixins), why not stick with that?

@gfyoung gfyoung added ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action labels Jul 27, 2018
@pandas-dev pandas-dev deleted a comment from codecov bot Jul 27, 2018
@jorisvandenbossche
Copy link
Member Author

@jbrockmendel good questions.

_shallow_copy is in this case not fully applicable I think, as IntegerArray currently has no metadata (which I think is the purpose of _shallow_copy to make a copy but pass through metadata). At least for now, I don't think there is a use case for it in the IntegerArray implementation.
But of course, if we start sharing code or for other reasons, I might make sense to add such a method at some point (which for IntegerArray would be identical to the fastpath constructor at the moment). But if we want to generalize that, we might need to consider expanding the EA interface for this (if the need comes up).

For _simple_new: historically, we have overloaded the class constructors to do many things (and exposed as the main user API entry-point), but personally I think it is a better design to have designated method/functions for such constructors (which is of course debatable). In the EA interface we already have _from_sequence for some part of this general constructor from values, because we decided that we don't want to pin the class constructor on doing this.
So that leaves open the possibility to use the class constructor for the fastpath / _simple_new, so why not use it? I think in general, such a fastpath constructor / _simple_new would only be used within the class itself (eg the IntervalIndex or other interval-related functionality outside the array never calls the fastpath of IntervalArray), so we are kind of free what to do here.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good to me, onboard with this change. may need some doc updates in whatsnew (and add this PR number to the original whatsnew).

@@ -76,7 +76,7 @@ def construct_from_string(cls, string):
"'{}'".format(cls, string))


def to_integer_array(values, dtype=None):
def integer_array(values, dtype=None, copy=False):
"""
Infer and return an integer array of the values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update doc-string

self._data, self._mask = coerce_to_array(
values, dtype=dtype, mask=mask, copy=copy)
if not (isinstance(values, np.ndarray)
and np.issubdtype(values.dtype, np.integer)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is is_integer_dtype

if not (isinstance(values, np.ndarray)
and np.issubdtype(values.dtype, np.integer)):
raise TypeError("values should be integer numpy array")
if not (isinstance(mask, np.ndarray) and mask.dtype == np.bool_):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use is_bool_dtype

@@ -94,7 +94,8 @@ def to_integer_array(values, dtype=None):
------
TypeError if incompatible types
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there still a ref to to_integer_array? (I see it in the diff, but I also see that you changed it above)

@jreback jreback added this to the 0.24.0 milestone Jul 28, 2018
return type(self)(self._data[item],
mask=self._mask[item],
dtype=self.dtype)
return type(self)(self._data[item], self._mask[item])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback additional question: what do you find of writing IntegerArray(...) instead of type(self)(...) ?

Python perfectly allows that (and is the same here, as we don't subclass this one further), and I personally find that easier to read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like using type() as I expect this to be subclasses for BooleanArray, though it could be that this not needed for that, but want to keep open that possiblitiy.

@codecov
Copy link

codecov bot commented Jul 28, 2018

Codecov Report

Merging #22070 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22070      +/-   ##
==========================================
- Coverage   92.05%   92.05%   -0.01%     
==========================================
  Files         169      169              
  Lines       50709    50719      +10     
==========================================
+ Hits        46679    46688       +9     
- Misses       4030     4031       +1
Flag Coverage Δ
#multiple 90.46% <100%> (-0.01%) ⬇️
#single 42.24% <7.4%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/__init__.py 100% <ø> (ø) ⬆️
pandas/core/arrays/integer.py 94.48% <100%> (-0.19%) ⬇️
pandas/core/series.py 93.72% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.43% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f7f266c...5948ae5. Read the comment docs.

@jorisvandenbossche
Copy link
Member Author

So the failures is because I now changed the code so that a list of numpy scalars is always coerced to int64 (before it took the type of the scalars).
Such a change is consistent with eg Series, but then we need to more thoroughly pass through the dtype to _from_sequence in the tests to get this passing.

@jorisvandenbossche
Copy link
Member Author

@jreback can you have a look at this again?

@jreback
Copy link
Contributor

jreback commented Aug 14, 2018

yes

@jorisvandenbossche
Copy link
Member Author

ping

@jreback jreback merged commit 83be235 into pandas-dev:master Aug 20, 2018
@jreback
Copy link
Contributor

jreback commented Aug 20, 2018

thanks @jorisvandenbossche nice improvement

@jorisvandenbossche jorisvandenbossche deleted the integer-array-constructor branch August 20, 2018 11:14
@jorisvandenbossche
Copy link
Member Author

Thanks!

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants