Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent Results Between Pandas and Polars using cut (and qcut)? #18236

Open
2 tasks done
pinpss opened this issue Aug 16, 2024 · 3 comments
Open
2 tasks done

Inconsistent Results Between Pandas and Polars using cut (and qcut)? #18236

pinpss opened this issue Aug 16, 2024 · 3 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@pinpss
Copy link

pinpss commented Aug 16, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I asked this question on StackOverflow here but am unsure if it’s related to a bug.

I’m switching from Pandas to Polars to create quantile-based portfolios, aiming to categorize a numerical variable into equal-sized portfolios using quantile breakpoints. Therefore, I am using the cut function.

However, I’m seeing discrepancies in the bins generated by Pandas and Polars, resulting in inconsistent outcomes between the two implementations.

Function for quantile-based binning using Pandas

    import pandas as pd
    import polars as pl
    import numpy as np
    from sklearn.metrics import confusion_matrix
    
    # Set random seed for reproducibility
    np.random.seed(42)

    def nqtile(df, sorting_variable, nq):
        # Calculate quantile breakpoints
        breakpoints = np.quantile(
            df[sorting_variable].dropna(), 
            np.linspace(0, 1, nq + 1), 
            method="linear"
        )
        
        # Assign observations to bins
        nqports = pd.cut(
            df[sorting_variable],
            bins=nq,
            labels=range(1, breakpoints.size),
            include_lowest=True,
            right=False
        )
        
        return nqports
    

Function for quantile-based binning using Polars

    def nqtile_pl(df, sorting_variable, nq):
        # Calculate quantile breakpoints
        breakpoints = np.quantile(
            df[sorting_variable].drop_nulls(), 
            np.linspace(0, 1, nq + 1), 
            method="linear"
        )
        
        # Remove the first and last breakpoints for binning
        breakpoints_1 = breakpoints[1:-1]
        labels_list = [str(i) for i in range(1, nq + 1)]
        
        # Assign observations to bins
        nqports = pl.Series.cut(
            df[sorting_variable],
            breaks=breakpoints_1,
            labels=labels_list,
            left_closed=False
        )
        return nqports

Example

  data = {
       'id': np.random.choice(range(1, 20), size=1000, replace=True),
       'random_number': np.random.randint(1, 100, size=1000)
   }
   
   # Create a Pandas DataFrame
   df = pd.DataFrame(data)
   
   # Apply Pandas-based quantile binning
   df['port_Q_pd'] = nqtile(df, 'random_number', 5)
   df['port_Q_pd'] = df['port_Q_pd'].astype(int)
   
   # Convert to Polars DataFrame
   df_pl = pl.DataFrame(df)
   
   # Apply Polars-based quantile binning and add to the DataFrame
   df_pl = df_pl.with_columns(
       nqtile_pl(df_pl, 'random_number', 5).alias('port_Q_pl')
   )
   
   # Convert back to Pandas for comparison
   df_conf = df_pl.to_pandas()
   
   # Ensure both columns are integers for comparison
   df_conf['port_Q_pl'] = df_conf['port_Q_pl'].astype(int)
   df_conf['port_Q_pd'] = df_conf['port_Q_pd'].astype(int)
   
   # Calculate the confusion matrix to compare the results
   cm = confusion_matrix(df_conf['port_Q_pl'], df_conf['port_Q_pd'])



   [[207   0   0   0   0]
    [  0 200   0   0   0]
    [  0   0 183  13   0]
    [  0   0   0 184  19]
    [  0   0   0   0 194]]

Log output

No response

Issue description

Discrepancies in the bins generated by Pandas and Polars, resulting in inconsistent outcomes between the two implementations.

Expected behavior

Same bins for both packages.

Installed versions

--------Version info---------
Polars:               1.5.0
Index type:           UInt32
Platform:             macOS-14.2.1-arm64-arm-64bit
Python:               3.9.6 (default, Nov 10 2023, 13:38:27) 
[Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.9.1
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           2.0.31
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@pinpss pinpss added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Aug 16, 2024
@cmdlineluser
Copy link
Contributor

I'm not too familiar with these functions, but this is an attempt to understand what the actual problem is with a minimal example.

pandas seems to use a specific formula when bins=scalar1

import pandas as pd
import polars as pl
import numpy as np

random_number = [88, 24, 3, 22, 53, 2, 88, 30]
n_bins = 5

breakpoints = np.quantile(random_number, np.linspace(0, 1, n_bins + 1), method="linear")

breakpoints_pd = np.linspace(min(random_number), max(random_number), n_bins + 1, endpoint=True)
breakpoints_pd[-1] += (max(random_number) - min(random_number)) * 0.0001 # right=False

Which seems to produce different breakpoints:

breakpoints
# array([ 2. , 10.6, 23.6, 34.6, 74. , 88. ])

breakpoints_pd
# array([ 2.    , 19.2   , 36.4   , 53.6   , 70.8   , 88.0086])
pd.cut(pd.Series(random_number), n_bins)
# 0     (70.8, 88.0]
# 1     (19.2, 36.4]
# 2    (1.914, 19.2]
# 3     (19.2, 36.4]
# 4     (36.4, 53.6]
# 5    (1.914, 19.2]
# 6     (70.8, 88.0]
# 7     (19.2, 36.4]

pd.cut(pd.Series(random_number), n_bins, retbins=True)[-1]
# array([ 1.914, 19.2  , 36.4  , 53.6  , 70.8  , 88.   ])

Using cut with breakpoints_pd seems to produce similar results apart from the NaN

pd.cut(pd.Series(random_number), breakpoints_pd)
# 0    (70.8, 88.009]
# 1      (19.2, 36.4]
# 2       (2.0, 19.2]
# 3      (19.2, 36.4]
# 4      (36.4, 53.6]
# 5               NaN
# 6    (70.8, 88.009]
# 7      (19.2, 36.4]

pl.Series(random_number).cut(breakpoints_pd)
# shape: (8,)
# Series: '' [cat]
# [
# 	"(70.8, 88.0086]"
# 	"(19.2, 36.4]"
# 	"(2, 19.2]"
# 	"(19.2, 36.4]"
# 	"(36.4, 53.599999999999994]"
# 	"(-inf, 2]"
# 	"(70.8, 88.0086]"
# 	"(19.2, 36.4]"
# ]

But I'm not sure if this is expected or not?

Footnotes

  1. https://github.com/pandas-dev/pandas/blob/523afa840a84bababa3d7708bd9a4030cec0b025/pandas/core/reshape/tile.py#L386

@mcrumiller
Copy link
Contributor

mcrumiller commented Aug 16, 2024

There has been a PR in the pipeline for a while that addresses this: #16942.

Edit: that may only be for hist, let me check.

@xuJ14
Copy link

xuJ14 commented Aug 23, 2024

The behavior is discussed thoroughly here #10468

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

4 participants