Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Currently, Pandas does not offer a method for calculating pairwise chi-square tests between columns in a DataFrame or between two Series objects. Chi-square tests are useful for understanding associations between categorical variables. While correlation methods like .corr() serve to evaluate relationships among continuous data, there is no equivalent method for categorical data.
Researchers and data analysts who work with categorical data currently need to rely on external libraries or custom code to perform chi-square tests across columns in a DataFrame or between two Series.
Potential Benefits May Include
- Swifter Categorical Data Analysis: Enable exploration of associations within categorical data directly within Pandas.
- Consistent API: By mimicking the structure and options of .corr(), the .chi2() method will feel intuitive.
- Enhanced Efficiency: Avoids the need to transfer data between Pandas and other libraries.
- Optimized for Large Datasets: Uses Cython to improve performance, making it feasible to compute pairwise chi-square tests even on large datasets.
Potential Use Cases + Target Users
- Data Scientists and Statisticians performing exploratory analysis of categorical data.
- Researchers in fields where categorical variables are prevalent.
- Data Analysts in business domains where categorical variables are prevalent.
Using the Titanic dataset, the ideal output could look as follows:
import pandas as pd
import numpy as np
import seaborn as sns
df = sns.load_dataset('titanic')
df.chi2()
                sex  embarked     class       who      deck
sex         0.00000   0.00126   0.00021   0.00000   0.00774
embarked    0.00126   0.00000   0.00000   0.00440   0.05592
class       0.00021   0.00000   0.00000   0.00000   0.00000
who         0.00000   0.00440   0.00000   0.00000   0.00003
deck        0.00774   0.05592   0.00000   0.00003   0.00000
Feature Description
Solution
A .chi2() method on both the DataFrame and Series classes would provide an efficient, consistent way to perform pairwise chi-square tests and produce a correlation-matrix-like output, which we could call a chi2 matrix:
- DataFrame.chi2(): Performs pairwise chi-square tests for all categorical or integer columns within a DataFrame, returning a symmetric matrix similar to the DataFrame.corr() method. Users can choose to output either p-values or chi-square statistics, and an adjustable max_categories parameter limits the inclusion of columns with too many unique values.
- Series.chi2(other_series): Performs a chi-square test between two Series objects. It returns either the p-value or the chi-square statistic.
Both would have an optional verbose mode to include degrees of freedom in the output.
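For readers less familiar with the test itself, the quantity each pairwise comparison would report can be sketched with existing pandas and NumPy calls. This is only an illustration of the textbook Pearson statistic, not the proposed implementation: the helper name `chi2_statistic` is hypothetical, and no continuity correction is applied (SciPy's `chi2_contingency` applies Yates' correction to 2x2 tables by default, so its numbers can differ).

```python
import numpy as np
import pandas as pd


def chi2_statistic(a: pd.Series, b: pd.Series) -> tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for two Series.

    Hypothetical helper for illustration; assumes both Series share an index.
    """
    # Pairwise deletion: keep only rows where both values are present.
    mask = a.notna() & b.notna()
    observed = pd.crosstab(a[mask], b[mask]).to_numpy(dtype=float)

    # Expected counts under independence: row_total * col_total / n.
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / observed.sum()

    stat = float(((observed - expected) ** 2 / expected).sum())
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return stat, dof


s1 = pd.Series(["dog", "dog", "cat", "dog"])
s2 = pd.Series(["apple", "orange", "apple", "orange"])
stat, dof = chi2_statistic(s1, s2)
```

Note that the degrees of freedom here come from the contingency table after missing rows are dropped, which can differ from counting unique values on the full Series.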
Potential Code pandas/core/frame.py
from pandas._libs.algos import nanchi2  # proposed Cython routine
import numpy as np
import pandas as pd


class DataFrame:
    # Other methods ...

    def chi2(
        self,
        output: str = "p-value",
        max_categories: int = 40,
        verbose: bool = False,
    ) -> pd.DataFrame:
        """
        Compute pairwise chi-square analysis of categorical columns, excluding NA/null values.

        Parameters
        ----------
        output : {'p-value', 'chi2stat'}, default 'p-value'
            Determines output format:

            * 'p-value': returns a matrix of p-values from chi-square tests.
            * 'chi2stat': returns a matrix of chi-square statistics. If `verbose=True`,
              each entry is a tuple (chi2_statistic, degrees_of_freedom, p-value).
        max_categories : int, default 40
            Maximum number of unique values allowed for `object` and `int` data types to be included
            in the chi-square calculations. Columns with more than `max_categories` unique values are excluded.
        verbose : bool, default False
            If True and `output="chi2stat"`, each entry in the matrix contains
            (chi2_statistic, degrees_of_freedom, p-value).

        Returns
        -------
        DataFrame
            Chi-square matrix with pairwise comparisons between columns.

        Raises
        ------
        ValueError
            If the DataFrame contains no columns meeting the criteria for chi-square analysis.

        Notes
        -----
        Only categorical data and `int`/`object` data types with at most `max_categories` unique
        values are used. Identical columns return p-value=1.0 and chi2stat=0.0 for optimization.

        Examples
        --------
        >>> import pandas as pd
        >>> df = pd.DataFrame({
        ...     "A": ["dog", "dog", "cat", "dog"],
        ...     "B": ["apple", "orange", "apple", "orange"],
        ...     "C": [1, 2, 1, 2]
        ... })
        >>> df.chi2(output="p-value")
        """
        # Filter columns by dtype and number of unique values
        valid_columns = [
            col
            for col in self.columns
            if (
                pd.api.types.is_object_dtype(self[col])
                or pd.api.types.is_integer_dtype(self[col])
                or isinstance(self[col].dtype, pd.CategoricalDtype)
            )
            and self[col].nunique(dropna=True) <= max_categories
        ]
        if not valid_columns:
            raise ValueError(
                "No columns meet the criteria for chi-square analysis. "
                "Ensure categorical, `int`, or `object` columns with at most "
                f"{max_categories} unique values are present."
            )

        # Encode each column as float category codes. pd.factorize marks missing
        # values with -1, which is mapped to NaN here so nanchi2 can skip them.
        # (String columns cannot be cast to float directly.)
        data = np.column_stack(
            [pd.factorize(self[col])[0] for col in valid_columns]
        ).astype(float)
        data[data < 0] = np.nan

        # Use the proposed nanchi2 function from _libs.algos for the chi-square calculation
        chi2_matrix = nanchi2(data, max_categories=max_categories, output=output)

        # Handle verbose output for chi2stat
        if output == "chi2stat" and verbose:
            # The statistic matrix holds no p-values, so request them separately
            p_matrix = nanchi2(data, max_categories=max_categories, output="p-value")
            result = pd.DataFrame(index=valid_columns, columns=valid_columns, dtype=object)
            for i, col1 in enumerate(valid_columns):
                for j, col2 in enumerate(valid_columns):
                    if i == j:
                        # .at stores the tuple as a single cell value
                        result.at[col1, col2] = (0.0, 0, 1.0)  # Identical columns
                    else:
                        dof = (self[col1].nunique() - 1) * (self[col2].nunique() - 1)
                        result.at[col1, col2] = (chi2_matrix[i, j], dof, p_matrix[i, j])
            return result

        # Convert result to DataFrame for standard output
        return pd.DataFrame(chi2_matrix, index=valid_columns, columns=valid_columns)
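The `nanchi2` routine used above is part of the proposal and is not defined in this issue. As a rough, pure-NumPy reference for what such a kernel might compute, here is a sketch under several assumptions: the name `nanchi2_reference` is hypothetical, the input is a 2D float array of category codes with NaN for missing values, missing values are removed pairwise, p-values come from `scipy.stats.chi2.sf`, and no continuity correction is applied. A Cython version would presumably replace the Python loops.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist


def nanchi2_reference(data: np.ndarray, output: str = "p-value") -> np.ndarray:
    """Pairwise chi-square tests between columns of a 2D float array of category codes."""
    n_cols = data.shape[1]
    result = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        # Diagonal convention taken from the proposal's Notes section.
        result[i, i] = 1.0 if output == "p-value" else 0.0
        for j in range(i + 1, n_cols):
            x, y = data[:, i], data[:, j]
            keep = ~(np.isnan(x) | np.isnan(y))  # pairwise deletion of missing rows
            if not keep.any():
                continue  # no complete pairs: leave NaN
            # Re-label the remaining codes as 0..k-1 and count co-occurrences.
            _, xi = np.unique(x[keep], return_inverse=True)
            _, yi = np.unique(y[keep], return_inverse=True)
            observed = np.zeros((xi.max() + 1, yi.max() + 1))
            np.add.at(observed, (xi, yi), 1)
            dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
            if dof == 0:
                continue  # a constant column: the test is undefined, leave NaN
            expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
            stat = ((observed - expected) ** 2 / expected).sum()
            value = stat if output == "chi2stat" else chi2_dist.sf(stat, dof)
            result[i, j] = result[j, i] = value
    return result


# Two coded columns; the fourth row has a missing value and is skipped.
codes = np.array([[0, 0], [0, 1], [1, 0], [0, np.nan], [0, 1]], dtype=float)
stats = nanchi2_reference(codes, output="chi2stat")
pvals = nanchi2_reference(codes, output="p-value")
```

Returning NaN for pairs with no complete rows or with a constant column is a design choice of this sketch; the proposal does not say how those cases should be handled.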
Potential Code pandas/core/series.py
from pandas._libs.algos import nanchi2  # proposed Cython routine
import numpy as np
import pandas as pd


class Series:
    # Other methods ...

    def chi2(
        self,
        other: pd.Series,
        output: str = "p-value",
        max_categories: int = 40,
        verbose: bool = False,
    ) -> float | tuple[float, int, float]:
        """
        Compute chi-square association between this Series and another Series, excluding NA/null values.

        Parameters
        ----------
        other : Series
            The other Series with which to compute the chi-square statistic.
        output : {'p-value', 'chi2stat'}, default 'p-value'
            Determines output format:

            * 'p-value': returns the p-value from the chi-square test.
            * 'chi2stat': returns the chi-square statistic.
        max_categories : int, default 40
            Maximum number of unique values allowed for `object` and `int` data types to be included
            in the chi-square calculations. A Series with more than `max_categories` unique values
            raises a ValueError.
        verbose : bool, default False
            If True, returns a tuple with (chi2_statistic, degrees_of_freedom, p-value).
            Ignored if `output` is 'p-value'.

        Returns
        -------
        float or tuple
            Chi-square test result. If `output="p-value"`, returns the p-value.
            If `output="chi2stat"`, returns the chi-square statistic. If `verbose=True`
            with `output="chi2stat"`, returns a tuple with
            (chi2_statistic, degrees_of_freedom, p-value).

        Raises
        ------
        TypeError
            If `other` is not a Series.
        ValueError
            If the Series have incompatible lengths, unsupported data types, or excessive unique values.

        Notes
        -----
        Only categorical data and `int`/`object` data types with at most `max_categories` unique
        values are used. Identical Series return p-value=1.0 and chi2stat=0.0 for optimization.

        Examples
        --------
        >>> s1 = pd.Series(["dog", "dog", "cat", "dog"])
        >>> s2 = pd.Series(["apple", "orange", "apple", "orange"])
        >>> s1.chi2(s2, output="p-value")
        """
        # Ensure the other input is a Series and has compatible length
        if not isinstance(other, pd.Series):
            raise TypeError("`other` must be a Series.")
        if len(self) != len(other):
            raise ValueError("Both Series must have the same length.")

        # Check that both Series meet the unique value criterion
        if self.nunique(dropna=True) > max_categories or other.nunique(dropna=True) > max_categories:
            raise ValueError(
                "Both Series must have fewer than `max_categories` unique values for chi-square analysis."
            )

        # Check that both Series have supported dtypes
        for name, series in (("Series", self), ("`other`", other)):
            if not (
                isinstance(series.dtype, pd.CategoricalDtype)
                or pd.api.types.is_integer_dtype(series)
                or pd.api.types.is_object_dtype(series)
            ):
                raise ValueError(f"{name} must be of type 'int', 'object', or 'category'.")

        # Check if the Series are identical and optimize by returning expected values
        if self.equals(other):
            if output == "p-value":
                return 1.0
            return (0.0, 0, 1.0) if verbose else 0.0

        # Encode both Series as float category codes in a 2D array for nanchi2.
        # pd.factorize marks missing values with -1, which is mapped to NaN.
        data = np.column_stack(
            [pd.factorize(self)[0], pd.factorize(other)[0]]
        ).astype(float)
        data[data < 0] = np.nan
        chi2_matrix = nanchi2(data, max_categories=max_categories, output=output)

        # Retrieve the appropriate output format
        if output == "p-value" or not verbose:
            return chi2_matrix[0, 1]

        # Verbose chi2stat output: the statistic matrix holds no p-values,
        # so request them separately
        p_matrix = nanchi2(data, max_categories=max_categories, output="p-value")
        dof = (self.nunique() - 1) * (other.nunique() - 1)
        return (chi2_matrix[0, 1], dof, p_matrix[0, 1])
Potential Code doc/source/reference/api/pandas.DataFrame.chi2.rst
.. _pandas.DataFrame.chi2:

pandas.DataFrame.chi2
=====================

DataFrame.chi2(output='p-value', max_categories=40, verbose=False) -> DataFrame

Compute pairwise chi-square analysis of categorical columns, excluding NA/null values.

This method calculates the chi-square association between pairs of columns in a DataFrame, comparing categorical columns or those with a limited number of unique values (default: 40). The output can either be a matrix of p-values or chi-square statistics.

Parameters
----------
output : {'p-value', 'chi2stat'}, default 'p-value'
    Determines output format:

    * 'p-value': returns a matrix of p-values from chi-square tests.
    * 'chi2stat': returns a matrix of chi-square statistics. If `verbose=True`,
      each entry is a tuple (chi2_statistic, degrees_of_freedom, p-value).
max_categories : int, default 40
    Maximum number of unique values allowed for `object` and `int` data types to be included
    in the chi-square calculations. Columns with more than `max_categories` unique values are excluded.
verbose : bool, default False
    If True and `output="chi2stat"`, each entry in the matrix contains
    (chi2_statistic, degrees_of_freedom, p-value).

Returns
-------
DataFrame
    Symmetric chi-square matrix with pairwise comparisons between columns.

Raises
------
ValueError
    If the DataFrame contains no columns meeting the criteria for chi-square analysis.

Notes
-----
Only categorical data and `int`/`object` data types with at most `max_categories` unique
values are included. Identical columns return p-value=1.0 and chi2stat=0.0 for optimization.

Examples
--------
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "A": ["dog", "dog", "cat", "dog"],
...     "B": ["apple", "orange", "apple", "orange"],
...     "C": [1, 2, 1, 2]
... })
>>> df.chi2(output="p-value")
          A         B         C
A  1.000000  0.300000  0.200000
B  0.300000  1.000000  0.150000
C  0.200000  0.150000  1.000000

See Also
--------
pandas.DataFrame.corr : Compute pairwise correlation of columns.
pandas.DataFrame.corrwith : Compute pairwise correlation with another DataFrame or Series.
pandas.Series.chi2 : Compute chi-square association with another Series.
Potential Code doc/source/reference/api/pandas.Series.chi2.rst
.. _pandas.Series.chi2:

pandas.Series.chi2
==================

Series.chi2(other, output='p-value', max_categories=40, verbose=False) -> float or tuple

Compute chi-square association between this Series and another Series, excluding NA/null values.

This method calculates the chi-square association between two Series, comparing categorical values or those with a limited number of unique values (default: 40). The output can be the p-value, the chi-square statistic, or additional details if `verbose=True`.

Parameters
----------
other : Series
    The other Series with which to compute the chi-square statistic.
output : {'p-value', 'chi2stat'}, default 'p-value'
    Determines output format:

    * 'p-value': returns the p-value from the chi-square test.
    * 'chi2stat': returns the chi-square statistic.
max_categories : int, default 40
    Maximum number of unique values allowed for `object` and `int` data types to be included
    in the chi-square calculations. A Series with more than `max_categories` unique values
    raises a ValueError.
verbose : bool, default False
    If True and `output="chi2stat"`, returns a tuple with (chi2_statistic, degrees_of_freedom, p-value).

Returns
-------
float or tuple
    Chi-square test result. If `output="p-value"`, returns the p-value.
    If `output="chi2stat"`, returns the chi-square statistic. If `verbose=True`
    with `output="chi2stat"`, returns a tuple (chi2_statistic, degrees_of_freedom, p-value).

Raises
------
TypeError
    If `other` is not a Series.
ValueError
    If the Series have incompatible lengths, unsupported data types, or excessive unique values.

Notes
-----
Only categorical data and `int`/`object` data types with at most `max_categories` unique
values are used. Identical Series return p-value=1.0 and chi2stat=0.0 for optimization.

Examples
--------
>>> import pandas as pd
>>> s1 = pd.Series(["dog", "dog", "cat", "dog"])
>>> s2 = pd.Series(["apple", "orange", "apple", "orange"])
>>> s1.chi2(s2, output="p-value")
0.300000

See Also
--------
pandas.Series.corr : Compute correlation with another Series.
pandas.DataFrame.chi2 : Compute pairwise chi-square association between columns of a DataFrame.
Potential Code pandas/tests/frame/methods/test_chi2.py
import numpy as np
import pytest

import pandas as pd
from pandas import DataFrame
import pandas._testing as tm


class TestDataFrameChi2:
    def test_chi2_basic(self):
        # Test basic functionality with categorical data
        df = DataFrame({
            "A": ["dog", "dog", "cat", "dog"],
            "B": ["apple", "orange", "apple", "orange"],
            "C": [1, 2, 1, 2]
        })
        result = df.chi2()
        assert result.shape == (3, 3)
        assert result.index.equals(df.columns)
        assert result.columns.equals(df.columns)

    def test_chi2_output_p_value(self):
        # Test output="p-value"
        df = DataFrame({
            "A": ["yes", "no", "yes", "yes"],
            "B": ["high", "low", "medium", "medium"],
            "C": [1, 2, 1, 3]
        })
        result = df.chi2(output="p-value")
        assert result.shape == (3, 3)
        assert 0 <= result.loc["A", "B"] <= 1  # p-value range check

    def test_chi2_output_chi2stat(self):
        # Test output="chi2stat"
        df = DataFrame({
            "A": ["up", "down", "up", "down"],
            "B": ["high", "medium", "medium", "low"],
            "C": [1, 2, 2, 1]
        })
        result = df.chi2(output="chi2stat")
        assert result.shape == (3, 3)
        assert isinstance(result.loc["A", "B"], float)  # Check statistic is a float

    def test_chi2_max_categories(self):
        # Test max_categories threshold: every column exceeds the default of 40,
        # so no column is left to analyze
        df = DataFrame({
            "A": ["cat" + str(i) for i in range(50)],
            "B": ["type" + str(i) for i in range(50)]
        })
        with pytest.raises(ValueError, match="No columns meet the criteria for chi-square analysis"):
            df.chi2()

    def test_chi2_na_handling(self):
        # Test handling of NaNs
        df = DataFrame({
            "A": ["yes", "no", np.nan, "yes"],
            "B": ["high", np.nan, "medium", "medium"],
            "C": [1, 2, 1, np.nan]
        })
        result = df.chi2(output="p-value")
        assert result.loc["A", "B"] >= 0  # p-value should be non-negative
        # The NaN upcasts "C" to float, so the dtype filter excludes it
        assert "C" not in result.columns

    def test_chi2_identical_columns(self):
        # Test optimization for identical columns
        df = DataFrame({
            "A": ["dog", "dog", "cat", "dog"],
            "B": ["dog", "dog", "cat", "dog"],
            "C": [1, 2, 1, 2]
        })
        result = df.chi2(output="p-value")
        assert result.loc["A", "B"] == 1.0  # Identical columns should return p-value=1.0

    def test_chi2_non_categorical_data(self):
        # Continuous (float) columns are skipped by the dtype filter rather than raising
        df = DataFrame({
            "A": [1.5, 2.5, 3.5, 4.5],  # Continuous numeric data
            "B": ["apple", "orange", "apple", "orange"],
            "C": ["yes", "no", "yes", "yes"]
        })
        result = df.chi2()
        assert list(result.columns) == ["B", "C"]

    def test_chi2_single_column(self):
        # Test single column DataFrame
        df = DataFrame({
            "A": ["dog", "dog", "cat", "dog"]
        })
        result = df.chi2()
        assert result.shape == (1, 1)
        assert result.loc["A", "A"] == 1.0  # Single column should return p-value=1.0
Potential Code pandas/tests/series/methods/test_chi2.py
import numpy as np
import pytest

import pandas as pd
from pandas import Series
import pandas._testing as tm


class TestSeriesChi2:
    def test_chi2_basic(self):
        # Basic functionality with categorical data
        s1 = Series(["dog", "dog", "cat", "dog"])
        s2 = Series(["apple", "orange", "apple", "orange"])
        result = s1.chi2(s2)
        assert isinstance(result, float)  # Expecting a single p-value

    def test_chi2_output_p_value(self):
        # Test output="p-value" explicitly
        s1 = Series(["yes", "no", "yes", "yes"])
        s2 = Series(["high", "low", "medium", "medium"])
        result = s1.chi2(s2, output="p-value")
        assert 0 <= result <= 1  # p-value should be within this range

    def test_chi2_output_chi2stat(self):
        # Test output="chi2stat"
        s1 = Series(["up", "down", "up", "down"])
        s2 = Series(["high", "medium", "medium", "low"])
        result = s1.chi2(s2, output="chi2stat")
        assert isinstance(result, float)  # Expecting chi-square statistic as a float

    def test_chi2_verbose_output(self):
        # Test verbose output for chi2stat
        s1 = Series(["yes", "no", "yes", "yes"])
        s2 = Series(["high", "low", "medium", "medium"])
        result = s1.chi2(s2, output="chi2stat", verbose=True)
        assert isinstance(result, tuple)  # Should return tuple in verbose mode
        assert len(result) == 3  # Tuple should contain (chi2_statistic, degrees_of_freedom, p-value)

    def test_chi2_max_categories(self):
        # Test max_categories threshold
        s1 = Series(["cat" + str(i) for i in range(50)])  # Exceeds default max_categories of 40
        s2 = Series(["type" + str(i % 3) for i in range(50)])
        with pytest.raises(ValueError, match="must have fewer than `max_categories` unique values"):
            s1.chi2(s2)

    def test_chi2_na_handling(self):
        # Test handling of NaNs
        s1 = Series(["yes", "no", np.nan, "yes"])
        s2 = Series(["high", np.nan, "medium", "medium"])
        result = s1.chi2(s2, output="p-value")
        assert 0 <= result <= 1 or np.isnan(result)  # Allow p-value or NaN

    def test_chi2_identical_series(self):
        # Test optimization for identical Series
        s1 = Series(["dog", "dog", "cat", "dog"])
        s2 = s1.copy()  # Identical Series
        result = s1.chi2(s2, output="p-value")
        assert result == 1.0  # Identical series should return p-value=1.0

    def test_chi2_non_categorical_data(self):
        # Test error handling for non-categorical data
        s1 = Series([1.5, 2.5, 3.5, 4.5])  # Continuous numeric data
        s2 = Series(["apple", "orange", "apple", "orange"])
        with pytest.raises(ValueError, match="must be of type 'int', 'object', or 'category'"):
            s1.chi2(s2)

    def test_chi2_mismatched_lengths(self):
        # Test error handling for mismatched Series lengths
        s1 = Series(["dog", "dog", "cat", "dog"])
        s2 = Series(["apple", "orange", "apple"])  # Mismatched length
        with pytest.raises(ValueError, match="Both Series must have the same length"):
            s1.chi2(s2)
Alternative Solutions
Currently, to perform chi-square tests on pairs of categorical columns in a DataFrame, users can rely on a combination of the following libraries and approaches:
Using SciPy's chi2_contingency Function
- Import chi2_contingency from scipy.stats and compute chi-square values using a contingency table or pd.crosstab() for each pair of categorical columns.
- Example:
    import pandas as pd
    from scipy.stats import chi2_contingency
    import seaborn as sns

    # Load a dataset and calculate chi-square for a pair of columns
    df = sns.load_dataset('titanic')
    chi2_result = chi2_contingency(pd.crosstab(df['pclass'], df['embark_town']))
- This approach requires manually constructing pd.crosstab tables for each pair of columns (or doing so in a loop), making it cumbersome for pairwise analysis across multiple columns. It also lacks an optimized and integrated way to produce pairwise matrices directly within Pandas.
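To make the "in a loop" workaround concrete, a minimal sketch of the manual pairwise approach is shown here. The helper name `pairwise_chi2_pvalues` and the small example frame are made up for illustration; only `pd.crosstab` and `scipy.stats.chi2_contingency` are existing APIs. The diagonal is left as NaN because no test is run for a column against itself, and `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def pairwise_chi2_pvalues(df: pd.DataFrame) -> pd.DataFrame:
    """Build a symmetric matrix of chi-square p-values, one column pair at a time."""
    cols = list(df.columns)
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    for a, b in combinations(cols, 2):
        # One contingency table and one test per unordered pair of columns.
        table = pd.crosstab(df[a], df[b])
        p_value = chi2_contingency(table)[1]
        out.loc[a, b] = out.loc[b, a] = p_value
    return out


# Hypothetical example data.
example = pd.DataFrame({
    "A": ["x", "x", "y", "y", "x", "y", "x", "y"],
    "B": ["u", "v", "u", "v", "u", "v", "u", "v"],
    "C": [1, 1, 2, 2, 3, 3, 1, 2],
})
result = pairwise_chi2_pvalues(example)
```

Every pair rebuilds its own crosstab in Python, which is the overhead a built-in method would aim to remove.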
Other Third-Party Libraries:
- Other libraries such as seaborn or statsmodels facilitate chi-square tests and visualizations, which may weigh against implementing this in Pandas. However, the same can be said for correlation, which is available in many other libraries.
Built-in functionality would streamline categorical data analysis within Pandas, aligning with the goal of being a comprehensive tool for data manipulation and analysis.
Additional Context
Searched for related issues, found none. However I may have missed them. Thanks to all in the world of Pandas for consideration, review, and efforts.