#911 fix #945

endremborza · 2022-03-28T21:14:44Z

Hello.

This should fix #911
sorry, as this is sort of a duplicate of #928 but I couldn't make commitlint work there, so I just opened a new one.

I modified a test a little, as it seems to me some versions of pandas secondary sort on index, and that is what messed up the tests on #928

also, pre-commit seems all kinds of broken, but I linted manually best I could

ieaves · 2022-03-30T23:50:18Z

Thanks for taking this on @endremborza, it looks like my old PR was broken after the repository transferred ownership over to ydataai. Would you mind adding a unit test for this? I believe the current implementation will fail for the following:

data = [1,2,3,4]
weights = [1,1,1,100]
weighted_median(data, weights)
-> TypeError: only integer scalar arrays can be converted to a scalar index

The issue arises on this line

w_median = (data[np.array(weights) == np.max(weights)])[0]

Because data is still technically a list and can't be indexed by a boolean array.

I'm not particularly familiar with this function but according to the doc notes it's intended to support lists and arrays, if we are forced into coercion to accomplish that, might as well make it explicit and take advantage of numpy functionality everywhere. Something like

    if not isinstance(data, np.ndarray):
        data = np.array(data)
    if not isinstance(weights, np.ndarray):
        weights = np.array(weights)
        
    s_data, s_weights = map(np.sort, [data, weights])
    midpoint = 0.5 * np.sum(s_weights)

    if s_weights[-1] > midpoint:
        w_median = data[weights == np.max(weights)][0]
    else:
        cs_weights = np.cumsum(s_weights)
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights[idx] == midpoint:
            w_median = np.mean(s_data[idx : idx + 2])
        else:
            w_median = s_data[idx + 1]
    return w_median

The type signature should also be fixed, you could go with something like this

from typing import Union

def weighted_median(data: Union[list, np.ndarray], weights: Union[list, np.ndarray]) -> Union[int, float]:

I'm not 100% sure about the return signature but the base implementation certainly supports float returns for float data.

Finally, this isn't an issue for you @endremborza but @sbrugman. Who is the primary contact / maintainer at ydataai? It doesn't look like the automated tests are running on PRs anymore based on this specific instance.

endremborza · 2022-03-30T23:59:25Z

weighted_median is only called once in the whole codebase, here:
https://github.com/ydataai/pandas-profiling/blob/22a49d11a53e7c9205b4882d10500d97a76e34ce/src/pandas_profiling/model/pandas/describe_categorical_pandas.py#L182

and I'm quite positive that there is no way .values returns a list, so adding casting seems unnecessary until this function is utilized more.

should a unit test still cover this one function with arrays?

ieaves · 2022-03-31T03:43:16Z

and I'm quite positive that there is no way .values returns a list, so adding casting seems unnecessary until this function is utilized more.

Doesn't the current PR propose to solve issue #911 by casting weights to np.array? Am I missing something?

endremborza · 2022-03-31T09:09:25Z

Yes, but never from a list, but from a pd.arrays.PandasArray that pandas seems to return now sometimes for values for some reason, and numpy does not allow that as an indexer. This cast makes sure that this is never a problem.

ieaves · 2022-03-31T14:58:40Z

Exactly! So either way you're casting right? List, pandasArray, whatever doesn't really matter.

endremborza · 2022-03-31T17:55:10Z

Alright, but for the data parameter it doesn't really matter, as being indexed should not be a problem, and in the else block, the casted weights is not used, so my reasoning was that this way it's shorter and a bit more efficient.

What you're suggesting is more explicit though, and if that's more important I'm fine with that. As it usually is in OS, someone has to make a decision, but its not straightforward, who 🙂

ieaves · 2022-03-31T19:10:34Z

Just a suggestion, the above solution is strictly more efficient than the PR in a few ways:

There are at most two coercions to np.ndarray where the current PR has a a minimum of two with up to three.
It removes the unnecessary O(n) call to any() by using the sorted array.
Uses np.sort which is substantially faster than sorted on numpy arrays.

I'm just a contributor as well though :). Feel free to take it or leave it!

endremborza · 2022-03-31T19:29:58Z

Oh sorry, fair enough. I didn't acknowledge the rest of the changes only the 2 casting lines, as I didn't have the original function to compare it to. I just applied it

sbrugman · 2022-04-06T11:35:44Z

@endremborza and @ieaves Thanks a ton for your help :)

jfsantos-ds · 2022-04-06T11:54:26Z

Seems fixed!
Apparently pandas 1.4.x changed the output type of the .values property on the Series in some instances. However, according to the pandas documentation, .values is not the recommended way to obtain the numpy array. Instead this should be done via .to_numpy().
Ensuring the desired dtypes and avoiding the casting on the util method, it seems that in length_summary_vc we could just change the weighted_median call arguments to:

        "median_length": weighted_median(
            length_counts.index.to_numpy(), weights=length_counts.to_numpy()
        ),

@endremborza this would be my suggestion

endremborza added 2 commits March 28, 2022 22:49

fix: cast to np.array

e16ec20

fix: possibly align with index sorting

7394bfc

fabclmnt requested review from sbrugman and fabclmnt March 29, 2022 23:42

fix: explicit casting and numpy utilization

007f1d8

ieaves mentioned this pull request Mar 31, 2022

Issue fix for #911 #928

Closed

sbrugman merged commit 1f69b1b into ydataai:develop Apr 6, 2022

jfsantos-ds linked an issue Apr 6, 2022 that may be closed by this pull request

ProfileReport produces error output when using arg minimal=False #923

Closed

jfsantos-ds mentioned this pull request Apr 7, 2022

ProfileReport produces error output when using arg minimal=False #923

Closed

sbrugman mentioned this pull request May 7, 2022

pandas-profiling giving random error #949

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#911 fix #945

#911 fix #945

endremborza commented Mar 28, 2022

ieaves commented Mar 30, 2022

endremborza commented Mar 30, 2022

ieaves commented Mar 31, 2022

endremborza commented Mar 31, 2022

ieaves commented Mar 31, 2022

endremborza commented Mar 31, 2022

ieaves commented Mar 31, 2022 •

edited

Loading

endremborza commented Mar 31, 2022

sbrugman commented Apr 6, 2022

jfsantos-ds commented Apr 6, 2022

#911 fix #945

#911 fix #945

Conversation

endremborza commented Mar 28, 2022

ieaves commented Mar 30, 2022

endremborza commented Mar 30, 2022

ieaves commented Mar 31, 2022

endremborza commented Mar 31, 2022

ieaves commented Mar 31, 2022

endremborza commented Mar 31, 2022

ieaves commented Mar 31, 2022 • edited Loading

endremborza commented Mar 31, 2022

sbrugman commented Apr 6, 2022

jfsantos-ds commented Apr 6, 2022

ieaves commented Mar 31, 2022 •

edited

Loading