Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#911 fix #945

Merged
merged 3 commits into from
Apr 6, 2022
Merged

#911 fix #945

merged 3 commits into from
Apr 6, 2022

Conversation

endremborza
Copy link
Contributor

Hello.

This should fix #911
sorry, as this is sort of a duplicate of #928 but I couldn't make commitlint work there, so I just opened a new one.

I modified a test a little, as it seems to me some versions of pandas secondary sort on index, and that is what messed up the tests on #928

also, pre-commit seems all kinds of broken, but I linted manually best I could

@fabclmnt fabclmnt requested review from sbrugman and fabclmnt March 29, 2022 23:42
@ieaves
Copy link
Contributor

ieaves commented Mar 30, 2022

Thanks for taking this on @endremborza, it looks like my old PR was broken after the repository transferred ownership over to ydataai. Would you mind adding a unit test for this? I believe the current implementation will fail for the following:

data = [1,2,3,4]
weights = [1,1,1,100]
weighted_median(data, weights)
-> TypeError: only integer scalar arrays can be converted to a scalar index

The issue arises on this line

w_median = (data[np.array(weights) == np.max(weights)])[0]

Because data is still technically a list and can't be indexed by a boolean array.

I'm not particularly familiar with this function but according to the doc notes it's intended to support lists and arrays, if we are forced into coercion to accomplish that, might as well make it explicit and take advantage of numpy functionality everywhere. Something like

    if not isinstance(data, np.ndarray):
        data = np.array(data)
    if not isinstance(weights, np.ndarray):
        weights = np.array(weights)
        
    s_data, s_weights = map(np.sort, [data, weights])
    midpoint = 0.5 * np.sum(s_weights)

    if s_weights[-1] > midpoint:
        w_median = data[weights == np.max(weights)][0]
    else:
        cs_weights = np.cumsum(s_weights)
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights[idx] == midpoint:
            w_median = np.mean(s_data[idx : idx + 2])
        else:
            w_median = s_data[idx + 1]
    return w_median

The type signature should also be fixed, you could go with something like this

from typing import Union

def weighted_median(data: Union[list, np.ndarray], weights: Union[list, np.ndarray]) -> Union[int, float]:

I'm not 100% sure about the return signature but the base implementation certainly supports float returns for float data.

Finally, this isn't an issue for you @endremborza but @sbrugman. Who is the primary contact / maintainer at ydataai? It doesn't look like the automated tests are running on PRs anymore based on this specific instance.

@endremborza
Copy link
Contributor Author

weighted_median is only called once in the whole codebase, here:
https://github.com/ydataai/pandas-profiling/blob/22a49d11a53e7c9205b4882d10500d97a76e34ce/src/pandas_profiling/model/pandas/describe_categorical_pandas.py#L182

and I'm quite positive that there is no way .values returns a list, so adding casting seems unnecessary until this function is utilized more.

should a unit test still cover this one function with arrays?

@ieaves
Copy link
Contributor

ieaves commented Mar 31, 2022

and I'm quite positive that there is no way .values returns a list, so adding casting seems unnecessary until this function is utilized more.

Doesn't the current PR propose to solve issue #911 by casting weights to np.array? Am I missing something?

@endremborza
Copy link
Contributor Author

Yes, but never from a list, but from a pd.arrays.PandasArray that pandas seems to return now sometimes for values for some reason, and numpy does not allow that as an indexer. This cast makes sure that this is never a problem.

@ieaves
Copy link
Contributor

ieaves commented Mar 31, 2022

Exactly! So either way you're casting right? List, pandasArray, whatever doesn't really matter.

@endremborza
Copy link
Contributor Author

Alright, but for the data parameter it doesn't really matter, as being indexed should not be a problem, and in the else block, the casted weights is not used, so my reasoning was that this way it's shorter and a bit more efficient.

What you're suggesting is more explicit though, and if that's more important I'm fine with that. As it usually is in OS, someone has to make a decision, but its not straightforward, who 🙂

@ieaves
Copy link
Contributor

ieaves commented Mar 31, 2022

Just a suggestion, the above solution is strictly more efficient than the PR in a few ways:

  1. There are at most two coercions to np.ndarray where the current PR has a a minimum of two with up to three.
  2. It removes the unnecessary O(n) call to any() by using the sorted array.
  3. Uses np.sort which is substantially faster than sorted on numpy arrays.

I'm just a contributor as well though :). Feel free to take it or leave it!

@endremborza
Copy link
Contributor Author

Oh sorry, fair enough. I didn't acknowledge the rest of the changes only the 2 casting lines, as I didn't have the original function to compare it to. I just applied it

@ieaves ieaves mentioned this pull request Mar 31, 2022
@sbrugman sbrugman merged commit 1f69b1b into ydataai:develop Apr 6, 2022
@sbrugman
Copy link
Collaborator

sbrugman commented Apr 6, 2022

@endremborza and @ieaves Thanks a ton for your help :)

@jfsantos-ds
Copy link

Seems fixed!
Apparently pandas 1.4.x changed the output type of the .values property on the Series in some instances. However, according to the pandas documentation, .values is not the recommended way to obtain the numpy array. Instead this should be done via .to_numpy().
Ensuring the desired dtypes and avoiding the casting on the util method, it seems that in length_summary_vc we could just change the weighted_median call arguments to:

        "median_length": weighted_median(
            length_counts.index.to_numpy(), weights=length_counts.to_numpy()
        ),

@endremborza this would be my suggestion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ProfileReport produces error output when using arg minimal=False Potential incompatiblity with Pandas 1.4.0
4 participants