ENH: A .chi2() method on the DataFrame and Series class that will resemble the .corr() methods

### Feature Type

- [X] Adding new functionality to pandas

- [ ] Changing existing functionality in pandas

- [ ] Removing existing functionality in pandas


### Problem Description

**Problem Description**

Currently, Pandas does not offer a method for calculating pairwise chi-square tests between columns in a `DataFrame` or between two `Series` objects. Chi-square tests are useful for understanding associations between categorical variables. While correlation methods like `.corr()` evaluate relationships among continuous data, there is no equivalent method for categorical data.

Researchers and data analysts who work with categorical data currently need to rely on external libraries or custom code to perform chi-square tests across columns in a `DataFrame` or between two `Series`. 

**Potential Benefits May Include**

1. **Swifter Categorical Data Analysis**: Enable exploration of associations within categorical data directly within Pandas.
2. **Consistent API**: By mimicking the structure and options of `.corr()`, the `.chi2()` method will feel intuitive.
3. **Enhanced Efficiency**: Avoids the need to transfer data between Pandas and other libraries.
4. **Optimized for Large Datasets**: A Cython implementation could improve performance, making it feasible to compute pairwise chi-square tests even on large datasets.

**Potential Use Cases + Target Users**

- **Data Scientists** and **Statisticians** performing exploratory analysis of categorical data.
- **Researchers** in fields where categorical variables are prevalent.
- **Data Analysts** in business domains where categorical variables are prevalent.

Using the Titanic dataset, the ideal output could look as follows:

```Python
import pandas as pd
import numpy as np
import seaborn as sns

df = sns.load_dataset('titanic')
df.chi2()

              sex  embarked    class      who     deck
sex       0.00000   0.00126  0.00021  0.00000  0.00774
embarked  0.00126   0.00000  0.00000  0.00440  0.05592
class     0.00021   0.00000  0.00000  0.00000  0.00000
who       0.00000   0.00440  0.00000  0.00000  0.00003
deck      0.00774   0.05592  0.00000  0.00003  0.00000
```



### Feature Description

**Solution**

A `.chi2()` method on both the `DataFrame` and `Series` classes would provide an efficient, consistent way to run pairwise chi-square tests and produce a correlation-matrix-like output, which could be called a chi2 matrix:

- **`DataFrame.chi2()`**: To perform pairwise chi-square tests for all categorical or integer columns within a `DataFrame`, returning a symmetric matrix similar to the `DataFrame.corr()` method. Users can choose to output either p-values or chi-square statistics, and an adjustable `max_categories` parameter limits the inclusion of columns with too many unique values.
  
- **`Series.chi2(other_series)`**: Performs a chi-square test between two `Series` objects. It returns either the p-value or chi-square statistic.

Both would have optional `verbose` modes to include degrees of freedom values in the output.
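For reference, the quantities each matrix entry would report follow the standard Pearson chi-square test of independence. The notation below ($O_{ij}$, $E_{ij}$, $r$, $c$, $n$) is introduced here for illustration and is not part of the proposed API:

```latex
% O_{ij}: observed count in row i, column j of the r x c contingency table
% n: total number of observations (rows with a missing value excluded)
E_{ij} = \frac{\left(\sum_{k=1}^{c} O_{ik}\right)\left(\sum_{k=1}^{r} O_{kj}\right)}{n},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
\mathrm{dof} = (r - 1)(c - 1),
\qquad
p = P\left(X \ge \chi^2\right), \quad X \sim \chi^2_{\mathrm{dof}}
```

The degrees-of-freedom expression matches the `(nunique() - 1) * (nunique() - 1)` computation used in the potential code.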

**Potential Code `pandas/core/frame.py`**

```Python
from pandas._libs.algos import nanchi2
import numpy as np
import pandas as pd

class DataFrame:
    # Other methods ...

    def chi2(
        self,
        output: str = "p-value",
        max_categories: int = 40,
        verbose: bool = False
    ) -> pd.DataFrame:
        """
        Compute pairwise chi-square analysis of categorical columns, excluding NA/null values.

        Parameters
        ----------
        output : {'p-value', 'chi2stat'}, default 'p-value'
            Determines output format:
            * 'p-value': returns a matrix of p-values from chi-square tests.
            * 'chi2stat': returns a matrix of chi-square statistics. If `verbose=True`,
              each entry is a tuple (chi2_statistic, degrees_of_freedom, p-value).
        max_categories : int, default 40
            Maximum number of unique values allowed for `object` and `int` data types to be included
            in the chi-square calculations. Columns with more than `max_categories` unique values are excluded.
        verbose : bool, default False
            If True and `output="chi2stat"`, each entry in the matrix contains (chi2_statistic, degrees_of_freedom, p-value).

        Returns
        -------
        DataFrame
            Chi-square matrix with pairwise comparisons between columns.

        Raises
        ------
        ValueError
            If the DataFrame contains no columns meeting the criteria for chi-square analysis.

        Notes
        -----
        Only categorical data and `int`/`object` data types with at most `max_categories` unique
        values are used. Identical columns return p-value=1.0 and chi2stat=0.0 for optimization.

        Examples
        --------
        >>> import pandas as pd
        >>> df = pd.DataFrame({
        ...     "A": ["dog", "dog", "cat", "dog"],
        ...     "B": ["apple", "orange", "apple", "orange"],
        ...     "C": [1, 2, 1, 2]
        ... })
        >>> df.chi2(output="p-value")
        """

        # Filter columns by dtype and unique values
        valid_columns = [
            col for col in self.columns
            if (
                pd.api.types.is_object_dtype(self[col])
                or pd.api.types.is_integer_dtype(self[col])
                # is_categorical_dtype is deprecated; check the dtype directly
                or isinstance(self[col].dtype, pd.CategoricalDtype)
            )
            and self[col].nunique(dropna=True) <= max_categories
        ]
        if not valid_columns:
            raise ValueError(
                "No columns meet the criteria for chi-square analysis. "
                "Ensure categorical, `int`, or `object` columns with at most "
                f"{max_categories} unique values are present."
            )

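        # Prepare data array with valid columns. String categories cannot be cast
        # to float directly, so encode each column as integer codes first
        # (missing values get code -1, which is mapped back to NaN).
        codes = self[valid_columns].apply(lambda s: s.astype("category").cat.codes)
        data = codes.where(codes >= 0).to_numpy(dtype=float)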

        # Use the nanchi2 function from _libs.algos for efficient chi-square calculation
        chi2_matrix = nanchi2(data, max_categories=max_categories, output=output)

        # Handle verbose output for chi2stat
        if output == "chi2stat" and verbose:
            # chi2_matrix holds statistics only, so request the p-values separately
            pval_matrix = nanchi2(data, max_categories=max_categories, output="p-value")
            result = pd.DataFrame(index=valid_columns, columns=valid_columns, dtype=object)
            for i, col1 in enumerate(valid_columns):
                for j, col2 in enumerate(valid_columns):
                    if i == j:
                        result.at[col1, col2] = (0.0, 0, 1.0)  # Identical columns
                    else:
                        chi2_stat = chi2_matrix[i, j]
                        dof = (self[col1].nunique() - 1) * (self[col2].nunique() - 1)
                        p_val = pval_matrix[i, j]
                        result.at[col1, col2] = (chi2_stat, dof, p_val)
            return result

        # Convert result to DataFrame for standard output
        result = pd.DataFrame(chi2_matrix, index=valid_columns, columns=valid_columns)
        return result
```
**Potential Code `pandas/core/series.py`**
```Python
from pandas._libs.algos import nanchi2
import numpy as np
import pandas as pd

class Series:
    # Other methods ...

    def chi2(
        self,
        other: pd.Series,
        output: str = "p-value",
        max_categories: int = 40,
        verbose: bool = False
    ) -> float:
        """
        Compute chi-square association between this Series and another Series, excluding NA/null values.

        Parameters
        ----------
        other : Series
            The other Series with which to compute the chi-square statistic.
        output : {'p-value', 'chi2stat'}, default 'p-value'
            Determines output format:
            * 'p-value': returns the p-value from the chi-square test.
            * 'chi2stat': returns the chi-square statistic.
        max_categories : int, default 40
            Maximum number of unique values allowed for `object` and `int` data types to be included
            in the chi-square calculations. Series with more than `max_categories` unique values are excluded.
        verbose : bool, default False
            If True, returns a tuple with (chi2_statistic, degrees_of_freedom, p-value). 
            Ignored if `output` is 'p-value'.

        Returns
        -------
        float or tuple
            Chi-square test result. If `output="p-value"`, returns the p-value. 
            If `output="chi2stat"`, returns the chi-square statistic. If `verbose=True`, 
            returns a tuple with (chi2_statistic, degrees_of_freedom, p-value).

        Raises
        ------
        ValueError
            If the Series have incompatible lengths, unsupported data types, or excessive unique values.

        Notes
        -----
        Only categorical data and `int`/`object` data types with at most `max_categories` unique
        values are used. Identical Series return p-value=1.0 and chi2stat=0.0 for optimization.

        Examples
        --------
        >>> s1 = pd.Series(["dog", "dog", "cat", "dog"])
        >>> s2 = pd.Series(["apple", "orange", "apple", "orange"])
        >>> s1.chi2(s2, output="p-value")
        """

        # Ensure the other input is a Series and has compatible length
        if not isinstance(other, pd.Series):
            raise TypeError("`other` must be a Series.")
        if len(self) != len(other):
            raise ValueError("Both Series must have the same length.")

        # Check if both Series meet unique value criteria and have supported dtypes
        if (self.nunique(dropna=True) > max_categories or other.nunique(dropna=True) > max_categories):
            raise ValueError(
                "Both Series must have fewer than `max_categories` unique values for chi-square analysis."
            )
        # is_categorical_dtype is deprecated; check the dtype directly instead
        if not (
            isinstance(self.dtype, pd.CategoricalDtype) or
            pd.api.types.is_integer_dtype(self) or
            pd.api.types.is_object_dtype(self)
        ):
            raise ValueError("Series must be of type 'int', 'object', or 'category'.")

        if not (
            isinstance(other.dtype, pd.CategoricalDtype) or
            pd.api.types.is_integer_dtype(other) or
            pd.api.types.is_object_dtype(other)
        ):
            raise ValueError("`other` must be of type 'int', 'object', or 'category'.")

        # Check if the Series are identical and optimize by returning expected values
        if self.equals(other):
            return 1.0 if output == "p-value" else 0.0

        # Prepare the data as a 2D array for nanchi2 function
        data = np.vstack([self.fillna(np.nan), other.fillna(np.nan)]).T
        chi2_matrix = nanchi2(data, max_categories=max_categories, output=output)

        # Retrieve the appropriate output format
        if output == "p-value":
            return chi2_matrix[0, 1]

        chi2_stat = chi2_matrix[0, 1]
        if not verbose:
            return chi2_stat

        # chi2_matrix holds the statistic only, so request the p-value separately
        dof = (self.nunique() - 1) * (other.nunique() - 1)
        p_val = nanchi2(data, max_categories=max_categories, output="p-value")[0, 1]
        return (chi2_stat, dof, p_val)
```
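Both methods above delegate to `nanchi2`, which does not exist in pandas today. As a rough idea of what such a routine might compute, here is a pure NumPy/SciPy sketch. The name `nanchi2_reference`, its signature, and the choice to leave the diagonal and undefined pairs as NaN are assumptions made for illustration. It applies no continuity correction, so 2x2 results differ from the default of `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist


def nanchi2_reference(data: np.ndarray, output: str = "p-value") -> np.ndarray:
    """Pairwise Pearson chi-square over columns of category codes (NaN = missing)."""
    n_cols = data.shape[1]
    out = np.full((n_cols, n_cols), np.nan)  # diagonal stays NaN in this sketch
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            # Pairwise deletion: keep rows where both columns are observed
            mask = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            x, y = data[mask, i], data[mask, j]
            if x.size == 0:
                continue
            # Re-index the codes densely and count co-occurrences
            _, xi = np.unique(x, return_inverse=True)
            _, yi = np.unique(y, return_inverse=True)
            observed = np.zeros((xi.max() + 1, yi.max() + 1))
            np.add.at(observed, (xi, yi), 1)
            dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
            if dof == 0:
                continue  # test is undefined when either column is constant
            expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
            stat = ((observed - expected) ** 2 / expected).sum()
            value = stat if output == "chi2stat" else chi2_dist.sf(stat, dof)
            out[i, j] = out[j, i] = value
    return out


# Two coded columns; the last two rows each have one missing value and are
# dropped, leaving the 2x2 table [[2, 1], [1, 2]] whose statistic is 2/3.
codes = np.array(
    [[0, 0], [0, 0], [0, 1], [1, 1], [1, 1], [1, 0], [np.nan, 1], [0, np.nan]],
    dtype=float,
)
stats = nanchi2_reference(codes, output="chi2stat")
pvals = nanchi2_reference(codes, output="p-value")
```

A Cython version would presumably replace the Python loops, but the per-pair arithmetic would be the same.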
**Potential Code `doc/source/reference/api/pandas.DataFrame.chi2.rst`**
```rst
.. _pandas.DataFrame.chi2:

pandas.DataFrame.chi2
=====================

DataFrame.chi2(output='p-value', max_categories=40, verbose=False) -> DataFrame

Compute pairwise chi-square analysis of categorical columns, excluding NA/null values.

This method calculates the chi-square association between pairs of columns in a DataFrame, comparing categorical columns or those with a limited number of unique values (default: 40). The output can either be a matrix of p-values or chi-square statistics.

Parameters
----------
output : {'p-value', 'chi2stat'}, default 'p-value'
    Determines output format:
    * 'p-value': returns a matrix of p-values from chi-square tests.
    * 'chi2stat': returns a matrix of chi-square statistics. If `verbose=True`,
      each entry is a tuple (chi2_statistic, degrees_of_freedom, p-value).

max_categories : int, default 40
    Maximum number of unique values allowed for `object` and `int` data types to be included
    in the chi-square calculations. Columns with more than `max_categories` unique values are excluded.

verbose : bool, default False
    If True and `output="chi2stat"`, each entry in the matrix contains (chi2_statistic, degrees_of_freedom, p-value).

Returns
-------
DataFrame
    Symmetric chi-square matrix with pairwise comparisons between columns.

Raises
------
ValueError
    If the DataFrame contains no columns meeting the criteria for chi-square analysis.

Notes
-----
Only categorical data and `int`/`object` data types with at most `max_categories` unique
values are included. Identical columns return p-value=1.0 and chi2stat=0.0 for optimization.

Examples
--------
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "A": ["dog", "dog", "cat", "dog"],
...     "B": ["apple", "orange", "apple", "orange"],
...     "C": [1, 2, 1, 2]
... })
>>> df.chi2(output="p-value")
          A         B         C
A  1.000000  0.300000  0.200000
B  0.300000  1.000000  0.150000
C  0.200000  0.150000  1.000000

See Also
--------
pandas.DataFrame.corr : Compute pairwise correlation of columns.
pandas.DataFrame.corrwith : Compute pairwise correlation with another DataFrame or Series.
pandas.Series.chi2 : Compute chi-square association with another Series.

```
**Potential Code `doc/source/reference/api/pandas.Series.chi2.rst`**
```rst
.. _pandas.Series.chi2:

pandas.Series.chi2
==================

Series.chi2(other, output='p-value', max_categories=40, verbose=False) -> float or tuple

Compute chi-square association between this Series and another Series, excluding NA/null values.

This method calculates the chi-square association between two Series, comparing categorical values or those with a limited number of unique values (default: 40). The output can be the p-value, the chi-square statistic, or additional details if `verbose=True`.

Parameters
----------
other : Series
    The other Series with which to compute the chi-square statistic.
output : {'p-value', 'chi2stat'}, default 'p-value'
    Determines output format:
    * 'p-value': returns the p-value from the chi-square test.
    * 'chi2stat': returns the chi-square statistic.
max_categories : int, default 40
    Maximum number of unique values allowed for `object` and `int` data types to be included
    in the chi-square calculations. Series with more than `max_categories` unique values are excluded.
verbose : bool, default False
    If True and `output="chi2stat"`, returns a tuple with (chi2_statistic, degrees_of_freedom, p-value).

Returns
-------
float or tuple
    Chi-square test result. If `output="p-value"`, returns the p-value. 
    If `output="chi2stat"`, returns the chi-square statistic. If `verbose=True` 
    with `output="chi2stat"`, returns a tuple (chi2_statistic, degrees_of_freedom, p-value).

Raises
------
ValueError
    If the Series have incompatible lengths, unsupported data types, or excessive unique values.

Notes
-----
Only categorical data and `int`/`object` data types with at most `max_categories` unique
values are used. Identical Series return p-value=1.0 and chi2stat=0.0 for optimization.

Examples
--------
>>> import pandas as pd
>>> s1 = pd.Series(["dog", "dog", "cat", "dog"])
>>> s2 = pd.Series(["apple", "orange", "apple", "orange"])
>>> s1.chi2(s2, output="p-value")
0.300000

See Also
--------
pandas.Series.corr : Compute correlation with another Series.
pandas.DataFrame.chi2 : Compute pairwise chi-square association between columns of a DataFrame.

```
**Potential Code `pandas/tests/frame/methods/test_chi2.py`**
```Python
import numpy as np
import pytest
import pandas as pd
from pandas import DataFrame
import pandas._testing as tm

class TestDataFrameChi2:
    def test_chi2_basic(self):
        # Test basic functionality with categorical data
        df = DataFrame({
            "A": ["dog", "dog", "cat", "dog"],
            "B": ["apple", "orange", "apple", "orange"],
            "C": [1, 2, 1, 2]
        })
        result = df.chi2()
        assert result.shape == (3, 3)
        assert result.index.equals(df.columns)
        assert result.columns.equals(df.columns)
    
    def test_chi2_output_p_value(self):
        # Test output="p-value"
        df = DataFrame({
            "A": ["yes", "no", "yes", "yes"],
            "B": ["high", "low", "medium", "medium"],
            "C": [1, 2, 1, 3]
        })
        result = df.chi2(output="p-value")
        assert result.shape == (3, 3)
        assert result.loc["A", "B"] >= 0  # p-value range check

    def test_chi2_output_chi2stat(self):
        # Test output="chi2stat"
        df = DataFrame({
            "A": ["up", "down", "up", "down"],
            "B": ["high", "medium", "medium", "low"],
            "C": [1, 2, 2, 1]
        })
        result = df.chi2(output="chi2stat")
        assert result.shape == (3, 3)
        assert isinstance(result.loc["A", "B"], float)  # Check statistic is a float

    def test_chi2_max_categories(self):
        # Test max_categories threshold
        df = DataFrame({
            "A": ["cat" + str(i) for i in range(50)],  # Exceeds default max_categories of 40
            "B": ["type" + str(i) for i in range(50)]  # Also exceeds it, so no column qualifies
        })
        with pytest.raises(ValueError, match="No columns meet the criteria for chi-square analysis"):
            df.chi2()

    def test_chi2_na_handling(self):
        # Test handling of NaNs
        df = DataFrame({
            "A": ["yes", "no", np.nan, "yes"],
            "B": ["high", np.nan, "medium", "medium"],
            "C": [1, 2, 1, np.nan]
        })
        result = df.chi2(output="p-value")
        assert result.loc["A", "B"] >= 0  # p-value should be non-negative
        # The NaN upcasts "C" to float64, so the dtype filter excludes it
        assert "C" not in result.columns

    def test_chi2_identical_columns(self):
        # Test optimization for identical columns
        df = DataFrame({
            "A": ["dog", "dog", "cat", "dog"],
            "B": ["dog", "dog", "cat", "dog"],
            "C": [1, 2, 1, 2]
        })
        result = df.chi2(output="p-value")
        assert result.loc["A", "B"] == 1.0  # Identical columns should return p-value=1.0

    def test_chi2_non_categorical_data(self):
        # Test that non-categorical (float) columns are excluded rather than raising,
        # since DataFrame.chi2 filters columns by dtype before computing
        df = DataFrame({
            "A": [1.5, 2.5, 3.5, 4.5],  # Continuous numeric data
            "B": ["apple", "orange", "apple", "orange"],
            "C": ["yes", "no", "yes", "yes"]
        })
        result = df.chi2()
        assert list(result.columns) == ["B", "C"]
    
    def test_chi2_single_column(self):
        # Test single column DataFrame
        df = DataFrame({
            "A": ["dog", "dog", "cat", "dog"]
        })
        result = df.chi2()
        assert result.shape == (1, 1)
        assert result.loc["A", "A"] == 1.0  # Single column should return p-value=1.0

```
**Potential Code `pandas/tests/series/methods/test_chi2.py`**
```Python
import numpy as np
import pytest
import pandas as pd
from pandas import Series
import pandas._testing as tm

class TestSeriesChi2:
    def test_chi2_basic(self):
        # Basic functionality with categorical data
        s1 = Series(["dog", "dog", "cat", "dog"])
        s2 = Series(["apple", "orange", "apple", "orange"])
        result = s1.chi2(s2)
        assert isinstance(result, float)  # Expecting a single p-value

    def test_chi2_output_p_value(self):
        # Test output="p-value" explicitly
        s1 = Series(["yes", "no", "yes", "yes"])
        s2 = Series(["high", "low", "medium", "medium"])
        result = s1.chi2(s2, output="p-value")
        assert 0 <= result <= 1  # p-value should be within this range

    def test_chi2_output_chi2stat(self):
        # Test output="chi2stat"
        s1 = Series(["up", "down", "up", "down"])
        s2 = Series(["high", "medium", "medium", "low"])
        result = s1.chi2(s2, output="chi2stat")
        assert isinstance(result, float)  # Expecting chi-square statistic as a float

    def test_chi2_verbose_output(self):
        # Test verbose output for chi2stat
        s1 = Series(["yes", "no", "yes", "yes"])
        s2 = Series(["high", "low", "medium", "medium"])
        result = s1.chi2(s2, output="chi2stat", verbose=True)
        assert isinstance(result, tuple)  # Should return tuple in verbose mode
        assert len(result) == 3  # Tuple should contain (chi2_statistic, degrees_of_freedom, p-value)

    def test_chi2_max_categories(self):
        # Test max_categories threshold
        s1 = Series(["cat" + str(i) for i in range(50)])  # Exceeds default max_categories of 40
        s2 = Series(["type" + str(i % 3) for i in range(50)])
        with pytest.raises(ValueError, match="must have fewer than `max_categories` unique values"):
            s1.chi2(s2)

    def test_chi2_na_handling(self):
        # Test handling of NaNs
        s1 = Series(["yes", "no", np.nan, "yes"])
        s2 = Series(["high", np.nan, "medium", "medium"])
        result = s1.chi2(s2, output="p-value")
        assert 0 <= result <= 1 or np.isnan(result)  # Allow p-value or NaN

    def test_chi2_identical_series(self):
        # Test optimization for identical Series
        s1 = Series(["dog", "dog", "cat", "dog"])
        s2 = s1.copy()  # Identical Series
        result = s1.chi2(s2, output="p-value")
        assert result == 1.0  # Identical series should return p-value=1.0

    def test_chi2_non_categorical_data(self):
        # Test error handling for non-categorical data
        s1 = Series([1.5, 2.5, 3.5, 4.5])  # Continuous numeric data
        s2 = Series(["apple", "orange", "apple", "orange"])
        with pytest.raises(ValueError, match="must be of type 'int', 'object', or 'category'"):
            s1.chi2(s2)

    def test_chi2_mismatched_lengths(self):
        # Test error handling for mismatched Series lengths
        s1 = Series(["dog", "dog", "cat", "dog"])
        s2 = Series(["apple", "orange", "apple"])  # Mismatched length
        with pytest.raises(ValueError, match="Both Series must have the same length"):
            s1.chi2(s2)

```
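The tests above mostly check shapes, types, and ranges. Value-based tests would need reference numbers, and one way to obtain them today is from SciPy on a small hand-checkable table. The snippet below is a sketch of that idea using made-up data. It uses `correction=False` on the assumption that the proposed method reports the plain Pearson statistic, which the proposal does not specify.

```python
import math

import pandas as pd
from scipy.stats import chi2_contingency

s1 = pd.Series(["up", "up", "up", "down", "down", "down"])
s2 = pd.Series(["high", "high", "low", "low", "low", "high"])

# Observed table is [[1, 2], [2, 1]] (rows: down, up; columns: high, low)
table = pd.crosstab(s1, s2)

# correction=False disables the Yates continuity correction for 2x2 tables
stat, p_value, dof, expected = chi2_contingency(table, correction=False)

assert dof == 1
# Every expected count is 1.5, so the statistic is 4 * 0.25 / 1.5 = 2/3
assert math.isclose(stat, 2 / 3)
# With one degree of freedom the p-value equals erfc(sqrt(stat / 2))
assert math.isclose(p_value, math.erfc(math.sqrt(stat / 2)))
```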

### Alternative Solutions

Currently, to perform chi-square tests on pairs of categorical columns in a `DataFrame`, users can rely on a combination of the following libraries and approaches:

**Using SciPy's `chi2_contingency` Function**
   - Import `chi2_contingency` from `scipy.stats` and compute chi-square values from a contingency table built with `pd.crosstab()` for each pair of categorical columns.
   - **Example**:
     ```python
     import pandas as pd
     from scipy.stats import chi2_contingency
     import seaborn as sns

     # Load a dataset and calculate chi-square for a pair of columns
     df = sns.load_dataset('titanic')
     chi2_result = chi2_contingency(pd.crosstab(df['pclass'], df['embark_town']))
     ```
   - This approach requires manually constructing `pd.crosstab` tables for each pair of columns (or doing so in a loop), making it cumbersome for pairwise analysis across multiple columns. It also lacks an optimized and integrated way to produce pairwise matrices directly within Pandas.
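To make the loop-based workaround concrete, here is a minimal sketch of the pairwise version that users currently have to write themselves. The helper name `pairwise_chi2_pvalues` and the toy data are made up for illustration. The diagonal is left as NaN, and `chi2_contingency` is called with its defaults, which include a continuity correction for 2x2 tables.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def pairwise_chi2_pvalues(df: pd.DataFrame) -> pd.DataFrame:
    """Return a symmetric matrix of chi-square p-values for all column pairs."""
    cols = list(df.columns)
    # A column is not tested against itself here, so the diagonal stays NaN
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    for a, b in combinations(cols, 2):
        table = pd.crosstab(df[a], df[b])
        _, p_value, _, _ = chi2_contingency(table)
        out.loc[a, b] = out.loc[b, a] = p_value
    return out


# Small hand-made frame (not real data)
toy = pd.DataFrame({
    "pet": ["dog", "dog", "cat", "cat", "dog", "cat"],
    "fruit": ["apple", "apple", "pear", "pear", "apple", "pear"],
    "size": ["s", "l", "s", "l", "l", "s"],
})
pvals = pairwise_chi2_pvalues(toy)
```

This builds one crosstab per column pair in Python, which is the overhead a built-in method would aim to remove.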

**Other Third-Party Libraries**:
   - Other libraries such as `seaborn` or `statsmodels` facilitate chi-square tests and visualizations, which may weigh against implementing this in Pandas. However, the same can be said for correlation, which is available in many other libraries.

Built-in functionality would streamline categorical data analysis within Pandas, aligning with the goal of being a comprehensive tool for data manipulation and analysis.

### Additional Context

Searched for related issues, found none. However I may have missed them. Thanks to all in the world of Pandas for consideration, review, and efforts.
