ENH: add mask-aware implementation of factorize algos #30037

Closed
jorisvandenbossche opened this issue Dec 4, 2019 · 2 comments · Fixed by #48109
Labels
Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff); Enhancement; ExtensionArray (Extending pandas with custom dtypes or arrays.); NA - MaskedArrays (Related to pd.NA and nullable extension arrays)

Comments

@jorisvandenbossche
Member

Now that we are starting to have mask-based dtypes/arrays (integer, boolean), we should also look into making our algos work with such masked arrays. One example where we could explore this is factorize / unique.

Currently, BooleanArray and IntegerArray need to convert their masked array into a single numpy array using a specified "NA sentinel" that the algo can recognize. This happens through ExtensionArray._values_for_factorize, which returns a (numpy array, NA sentinel) tuple.
In practice this means that the boolean array is converted to integer (with NA as -1), and IntegerArray is converted to float array with NA as NaN, so the algos can handle this.
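
The sentinel approach described above can be sketched in plain Python. This is only an illustration, not pandas code: lists stand in for NumPy arrays, and `to_sentinel` is a hypothetical helper name.

```python
def to_sentinel(values, mask, sentinel=-1):
    """Collapse (values, mask) into one list, writing `sentinel` where mask is True."""
    return [sentinel if is_na else int(v) for v, is_na in zip(values, mask)]


# Boolean data with one missing entry; the value stored under the mask is arbitrary.
values = [True, False, False, True]
mask = [False, False, True, False]

encoded = to_sentinel(values, mask)
print(encoded)  # [1, 0, -1, 1]
```

The conversion needs a wider type than the original data (bool becomes int here) just to make room for the sentinel, which is the cost a mask-aware algorithm would avoid.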

We should look into:

  • Can we adapt, or make a specific version of, the unique/factorize hashtable class so that it takes a mask instead of an NA sentinel?
  • We could then have a variant of ExtensionArray._values_for_factorize that returns (array, mask) instead of (array, NA sentinel).
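
A minimal sketch of what a mask-aware factorize could look like, assuming a plain dict as the hash table and lists in place of NumPy arrays. The name `factorize_with_mask` is hypothetical and is not the pandas HashTable API.

```python
def factorize_with_mask(values, mask):
    """Return (codes, uniques); masked positions get code -1 and never enter the table."""
    table = {}    # value -> code, standing in for the hash table
    uniques = []
    codes = []
    for value, is_na in zip(values, mask):
        if is_na:
            # Missingness comes from the mask, so the stored value is ignored.
            codes.append(-1)
            continue
        code = table.get(value)
        if code is None:
            code = len(uniques)
            table[value] = code
            uniques.append(value)
        codes.append(code)
    return codes, uniques


# -1 appears as real data here and is not confused with a missing value.
values = [10, 20, 0, 10, -1]
mask = [False, False, True, False, False]

codes, uniques = factorize_with_mask(values, mask)
print(codes)    # [0, 1, -1, 0, 2]
print(uniques)  # [10, 20, -1]
```

Because the mask carries the missing-value information, no value in the data's own domain has to be reserved as a sentinel, and the input never needs to be cast to another dtype.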

@jorisvandenbossche added the Enhancement, Algos, and ExtensionArray labels on Dec 4, 2019
@jorisvandenbossche added this to the Contributions Welcome milestone on Dec 4, 2019
@TomAugspurger
Contributor

> and IntegerArray is converted to float array with NA as NaN

On master, IntegerArray uses an object-dtype ndarray, with NA as NaN. We probably need object to avoid floating-point imprecision for large integers.

In [3]: a = pd.array([1, 2, None])

In [4]: a._values_for_factorize()
Out[4]: (array([1, 2, nan], dtype=object), nan)

I suppose we could avoid that object-dtype when there are no missing values, but that's straying from the original issue.
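
The floating-point imprecision for large integers mentioned above can be seen with plain Python floats, which are 64-bit doubles like NumPy's float64. This is just an illustration of why a float cast is lossy, not pandas code.

```python
big = 2**53 + 1        # 9007199254740993, not exactly representable as a 64-bit float
as_float = float(big)

print(int(as_float))               # 9007199254740992
print(int(as_float) == big)        # False
print(float(2**53) == as_float)    # True: two distinct integers collapse to one float
```

Two different integers mapping to the same float would be merged into one unique value by factorize, which is why a float representation is not safe for the full int64 range.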

@jorisvandenbossche
Member Author

So #33064 added mask support to HashTable.factorize, which is used in the factorize methods.

Something that is still missing is, e.g., unique support, which for ExtensionArrays has the following base implementation:

uniques = unique(self.astype(object))
return self._from_sequence(uniques, dtype=self.dtype)

So this goes through an object cast, and it is also what the masked arrays use.

The HashTable implementation itself already supports masks, though, so I think it is mostly a matter of exposing a mask argument in HashTable.unique (just as we did for HashTable.factorize), and having a way to call this from the masked array class (for factorize, that goes through factorize_array in algorithms.py).
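
As a rough sketch of the idea, a mask-aware unique could return both the unique values and a mask for them, so that no object cast is needed. This assumes plain lists and a set in place of the real hash table; `unique_with_mask` is a hypothetical name, and keeping a single NA at its first position is an assumption of this sketch.

```python
def unique_with_mask(values, mask):
    """Return (unique_values, unique_mask), keeping one NA entry at its first position."""
    seen = set()
    seen_na = False
    out_values, out_mask = [], []
    for value, is_na in zip(values, mask):
        if is_na:
            if not seen_na:
                seen_na = True
                out_values.append(0)   # placeholder; the result mask marks it as NA
                out_mask.append(True)
        elif value not in seen:
            seen.add(value)
            out_values.append(value)
            out_mask.append(False)
    return out_values, out_mask


values = [3, 1, 0, 3, 0, 1]
mask = [False, False, True, False, True, False]

uniq_values, uniq_mask = unique_with_mask(values, mask)
print(uniq_values)  # [3, 1, 0]
print(uniq_mask)    # [False, False, True]
```

The (values, mask) pair that comes back is already in the form a masked array stores internally, so the result could be wrapped directly without going through object dtype.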

The HashTable classes also have other methods (lookup, map_locations, get_labels_groupby), but those can be covered in separate issues, I think.
