Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: support masked arrays in hashtable-based functions (duplicated, isin, mode) #45776

Closed
jorisvandenbossche opened this issue Feb 2, 2022 · 4 comments · Fixed by #55340
Closed
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff duplicated duplicated, drop_duplicates Enhancement isin isin method NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 2, 2022

We have some algorithms which are based on the class-based HashTable implementation (hashtable_class_helper.pxi.in), such as unique, and those are already partly updated to support masked arrays as input (eg #33064). This is covered in #30037.

Besides that, we also have some algorithms that are written as functions that directly use khash (in hashtable_func_helper.pxi.in), and more specifically this are: value_counts, duplicated, ismember (isin) and mode.

For some we already have support for the masked arrays at a higher level: MaskedArray.isin calculates isin only for the data part, and then updates the result with the mask (this is possible since this is a purely element-wise operation).
For value_counts we have a workaround in place also at the MaskedArray level: we only pass the non-NA values (self._data[~self._mask]) to the underlying value counts algo.

But for duplicated (and thus also drop_duplicates) and mode, we still just coerce to a numpy array before passing it into the cython algo, which means casting to object dtype if there are missing values.

Possible solutions:

  • Implement a version of the cython algos that also take a mask ndarray as input
  • Add workarounds for the duplicated/mode algos on the higher level as well to avoid casting to object

This might also need some performance investigation to see which approach is most interesting (eg for the workaround, casting to float instead of to object might already be better for the nullable int/bool dtypes. EDIT: although that's not possible for large integers).

@jorisvandenbossche jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Feb 2, 2022
@jbrockmendel
Copy link
Member

Looking at this it is clear how this works with dropna=True, not so sure what to do with dropna=False.

@jorisvandenbossche
Copy link
Member Author

@jbrockmendel can you expand a bit on that?

Looking at the value_count_{dtype} implementation:

val = {{to_c_type}}(values[i])
if not is_nan_{{c_type}}(val) or not dropna:
k = kh_get_{{ttype}}(table, val)
if k != table.n_buckets:
table.vals[k] += 1
else:
k = kh_put_{{ttype}}(table, val, &ret)
table.vals[k] = 1
result_keys.append(val)

So if we also pass a mask, we can't just use val, because in that case the val itself can have a random value.

I suppose one option would be to keep track of the null count separately (mask.sum()), and if dropna=False append that count to the result manually? (instead of getting the count from the hashtable with kh_get_..).
The main question in that case is what the return value of value_count should be, because the first element of the return (the result_keys) would then need to include an NA as well if the result_counts include the NA count. Maybe we can have a version of value_count that, if getting passed a mask, also returns a mask next to the result_keys ndarray.

If we want to maintain the "order of occurrence" of the result, also for NAs (and not just append the NA count to the end), that would require updating the Vector class as well to be able to "append an NA" and keep track of a mask and return an ndarray/mask combo.

These are just some thought explorations, there are certainly other ways to approach this.
Given those complications, it might also be worth keeping workarounds at a higher level (as we already do for MaskedArray.value_counts, where we append the NA count to the result with dropping the NAs).

@jbrockmendel
Copy link
Member

Looking at this it is clear how this works with dropna=True, not so sure what to do with dropna=False.

can you expand a bit on that?

I guess I was imagining keeping track of mask-based NAs within the cython function (partially motivated by the fact that value_counts dispatches directly to mode, so avoiding prost-processing mode would be convenient).

I'll revisit this fresh on Monday (taking tomorrow AFK) and hopefully will be more coherent.

@jbrockmendel
Copy link
Member

Hey @jorisvandenbossche, I'm still dealing with [something TBD] that is keeping me from operating at 100%, but I can show you the branch I had going for this before getting sidetracked: https://github.com/pandas-dev/pandas/compare/main...jbrockmendel:nullable-value_counts?expand=1

@simonjayhawkins simonjayhawkins added duplicated duplicated, drop_duplicates isin isin method labels Jun 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff duplicated duplicated, drop_duplicates Enhancement isin isin method NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants