Inconsistent Results Between Pandas and Polars using cut (and qcut)? #18236

pinpss · 2024-08-16T13:22:18Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I asked this question on StackOverflow here but am unsure if it’s related to a bug.

I’m switching from Pandas to Polars to create quantile-based portfolios, aiming to categorize a numerical variable into equal-sized portfolios using quantile breakpoints. Therefore, I am using the cut function.

However, I’m seeing discrepancies in the bins generated by Pandas and Polars, resulting in inconsistent outcomes between the two implementations.

Function for quantile-based binning using Pandas

    import pandas as pd
    import polars as pl
    import numpy as np
    from sklearn.metrics import confusion_matrix
    
    # Set random seed for reproducibility
    np.random.seed(42)

    def nqtile(df, sorting_variable, nq):
        # Calculate quantile breakpoints
        breakpoints = np.quantile(
            df[sorting_variable].dropna(), 
            np.linspace(0, 1, nq + 1), 
            method="linear"
        )
        
        # Assign observations to bins
        nqports = pd.cut(
            df[sorting_variable],
            bins=nq,
            labels=range(1, breakpoints.size),
            include_lowest=True,
            right=False
        )
        
        return nqports

Function for quantile-based binning using Polars

    def nqtile_pl(df, sorting_variable, nq):
        # Calculate quantile breakpoints
        breakpoints = np.quantile(
            df[sorting_variable].drop_nulls(), 
            np.linspace(0, 1, nq + 1), 
            method="linear"
        )
        
        # Remove the first and last breakpoints for binning
        breakpoints_1 = breakpoints[1:-1]
        labels_list = [str(i) for i in range(1, nq + 1)]
        
        # Assign observations to bins
        nqports = pl.Series.cut(
            df[sorting_variable],
            breaks=breakpoints_1,
            labels=labels_list,
            left_closed=False
        )
        return nqports

Example

  data = {
       'id': np.random.choice(range(1, 20), size=1000, replace=True),
       'random_number': np.random.randint(1, 100, size=1000)
   }
   
   # Create a Pandas DataFrame
   df = pd.DataFrame(data)
   
   # Apply Pandas-based quantile binning
   df['port_Q_pd'] = nqtile(df, 'random_number', 5)
   df['port_Q_pd'] = df['port_Q_pd'].astype(int)
   
   # Convert to Polars DataFrame
   df_pl = pl.DataFrame(df)
   
   # Apply Polars-based quantile binning and add to the DataFrame
   df_pl = df_pl.with_columns(
       nqtile_pl(df_pl, 'random_number', 5).alias('port_Q_pl')
   )
   
   # Convert back to Pandas for comparison
   df_conf = df_pl.to_pandas()
   
   # Ensure both columns are integers for comparison
   df_conf['port_Q_pl'] = df_conf['port_Q_pl'].astype(int)
   df_conf['port_Q_pd'] = df_conf['port_Q_pd'].astype(int)
   
   # Calculate the confusion matrix to compare the results
   cm = confusion_matrix(df_conf['port_Q_pl'], df_conf['port_Q_pd'])



   [[207   0   0   0   0]
    [  0 200   0   0   0]
    [  0   0 183  13   0]
    [  0   0   0 184  19]
    [  0   0   0   0 194]]

Log output

No response

Issue description

Discrepancies in the bins generated by Pandas and Polars, resulting in inconsistent outcomes between the two implementations.

Expected behavior

Same bins for both packages.

Installed versions

--------Version info---------
Polars:               1.5.0
Index type:           UInt32
Platform:             macOS-14.2.1-arm64-arm-64bit
Python:               3.9.6 (default, Nov 10 2023, 13:38:27) 
[Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.9.1
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           2.0.31
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

The text was updated successfully, but these errors were encountered:

cmdlineluser · 2024-08-16T14:52:36Z

I'm not too familiar with these functions, but this is an attempt to understand what the actual problem is with a minimal example.

pandas seems to use a specific formula when bins=scalar¹

import pandas as pd
import polars as pl
import numpy as np

random_number = [88, 24, 3, 22, 53, 2, 88, 30]
n_bins = 5

breakpoints = np.quantile(random_number, np.linspace(0, 1, n_bins + 1), method="linear")

breakpoints_pd = np.linspace(min(random_number), max(random_number), n_bins + 1, endpoint=True)
breakpoints_pd[-1] += (max(random_number) - min(random_number)) * 0.0001 # right=False

Which seems to produce different breakpoints:

breakpoints
# array([ 2. , 10.6, 23.6, 34.6, 74. , 88. ])

breakpoints_pd
# array([ 2.    , 19.2   , 36.4   , 53.6   , 70.8   , 88.0086])

pd.cut(pd.Series(random_number), n_bins)
# 0     (70.8, 88.0]
# 1     (19.2, 36.4]
# 2    (1.914, 19.2]
# 3     (19.2, 36.4]
# 4     (36.4, 53.6]
# 5    (1.914, 19.2]
# 6     (70.8, 88.0]
# 7     (19.2, 36.4]

pd.cut(pd.Series(random_number), n_bins, retbins=True)[-1]
# array([ 1.914, 19.2  , 36.4  , 53.6  , 70.8  , 88.   ])

Using cut with breakpoints_pd seems to produce similar results apart from the NaN

pd.cut(pd.Series(random_number), breakpoints_pd)
# 0    (70.8, 88.009]
# 1      (19.2, 36.4]
# 2       (2.0, 19.2]
# 3      (19.2, 36.4]
# 4      (36.4, 53.6]
# 5               NaN
# 6    (70.8, 88.009]
# 7      (19.2, 36.4]

pl.Series(random_number).cut(breakpoints_pd)
# shape: (8,)
# Series: '' [cat]
# [
# 	"(70.8, 88.0086]"
# 	"(19.2, 36.4]"
# 	"(2, 19.2]"
# 	"(19.2, 36.4]"
# 	"(36.4, 53.599999999999994]"
# 	"(-inf, 2]"
# 	"(70.8, 88.0086]"
# 	"(19.2, 36.4]"
# ]

But I'm not sure if this is expected or not?

https://github.com/pandas-dev/pandas/blob/523afa840a84bababa3d7708bd9a4030cec0b025/pandas/core/reshape/tile.py#L386 ↩

mcrumiller · 2024-08-16T15:38:33Z

There has been a PR in the pipeline for a while that addresses this: #16942.

Edit: that may only be for hist, let me check.

xuJ14 · 2024-08-23T03:04:50Z

The behavior is discussed thoroughly here #10468

pinpss added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Aug 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent Results Between Pandas and Polars using cut (and qcut)? #18236

Inconsistent Results Between Pandas and Polars using cut (and qcut)? #18236

pinpss commented Aug 16, 2024

cmdlineluser commented Aug 16, 2024

mcrumiller commented Aug 16, 2024 •

edited

Loading

xuJ14 commented Aug 23, 2024

Inconsistent Results Between Pandas and Polars using cut (and qcut)? #18236

Inconsistent Results Between Pandas and Polars using cut (and qcut)? #18236

Comments

pinpss commented Aug 16, 2024

Checks

Reproducible example

Function for quantile-based binning using Pandas

Function for quantile-based binning using Polars

Example

Log output

Issue description

Expected behavior

Installed versions

cmdlineluser commented Aug 16, 2024

Footnotes

mcrumiller commented Aug 16, 2024 • edited Loading

xuJ14 commented Aug 23, 2024

mcrumiller commented Aug 16, 2024 •

edited

Loading