Description
I am investigating a regression in the pandas test suite involving numexpr evaluation: some tests started failing with the recent release of numpy 2.3.0.
I have been able to trim it down to the following reproducible example. It does not yet entirely eliminate the pandas usage, but the final code giving the wrong result only involves numpy and numexpr; pandas is only used for creating the test data:
import numpy as np
import numexpr as ne
# creating the test data through pandas
from pandas import DataFrame
arr = np.random.default_rng(2).integers(1, 100, size=(10001, 4))
df = DataFrame(arr, columns=list("ABCD"))
other = df.copy() + 1
# extracting the numpy arrays
a = df._mgr.blocks[0].values
b = other._mgr.blocks[0].values
# trying to create the data with just numpy -> not yet reproducing it
# arr = np.random.default_rng(2).integers(1, 100, size=(10001, 4))
# a = arr.T
# b = a.copy() + 1
# equality using numpy
expected = a == b
# equality using numexpr
result = ne.evaluate("b == a", casting="safe")
print(f"numpy: {np.__version__}")
print(f"numexpr: {ne.__version__}")
# given "b = a + 1", we expect all False, i.e. a sum of 0
print(f"numpy eq: {expected.sum()}")
print(f"numexpr eq: {result.sum()}")
I can consistently reproduce the wrong output with numpy 2.3.0, and get correct results with the previous numpy 2.2. Both cases use numexpr 2.10, so not the just-released version.
$ mamba create -n test-py311-np22 python=3.11 numpy=2.2 pandas=2.2 numexpr=2.10
$ mamba create -n test-py311-np23 python=3.11 numpy=2.3 pandas=2.2 numexpr=2.10
$ mamba run -n test-py311-np22 python test_numexpr_eq_bug.py
numpy: 2.2.6
numexpr: 2.10.2
numpy eq: 0
numexpr eq: 0
$ mamba run -n test-py311-np23 python test_numexpr_eq_bug.py
numpy: 2.3.0
numexpr: 2.10.2
numpy eq: 0
numexpr eq: 51 # <--- the equality is giving True for some values
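To narrow this down further, it may help to look at where the mismatching elements sit rather than only how many there are. The following is a minimal sketch using only numpy. The helper name `mismatch_report` and the `*_demo` arrays are hypothetical stand-ins; in the reproducer above one would pass `expected` and `result` instead.

```python
import numpy as np


def mismatch_report(expected, result, limit=10):
    """Return the number of differing elements and the first few (row, col) indices."""
    diff = np.asarray(expected) != np.asarray(result)
    idx = np.argwhere(diff)  # indices of differing elements, in row-major order
    first = [tuple(int(i) for i in row) for row in idx[:limit]]
    return int(diff.sum()), first


# Synthetic stand-in data (NOT the arrays from the report): an all-False
# "expected" and a "result" with two spurious True values.
expected_demo = np.zeros((4, 8), dtype=bool)
result_demo = expected_demo.copy()
result_demo[1, 3] = True
result_demo[2, 7] = True

count, where = mismatch_report(expected_demo, result_demo)
print(f"mismatches: {count}, first positions: {where}")
```

If the positions in the real data cluster at regular intervals or near the end of a row, that could hint at a chunking or striding problem, though this is only a guess at this point.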
The arrays a and b have a different memory order, so I was thinking that might trigger the issue. But when trying to recreate the test data directly using numpy, I can't reproduce the issue. I will look further into what exactly pandas does with the arrays while creating the dataframes.
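One way to compare the two arrays is to print their layout flags and strides side by side. This is a minimal sketch, assuming "order" here means memory layout (C vs. Fortran contiguity). The helper `describe_layout` and the `a_like` / `b_like` arrays are hypothetical; they mirror the numpy-only attempt from the script, not the pandas-derived arrays.

```python
import numpy as np


def describe_layout(x):
    """Summarize the memory layout of a NumPy array."""
    return {
        "shape": x.shape,
        "dtype": str(x.dtype),
        "strides": x.strides,
        "c_contiguous": bool(x.flags["C_CONTIGUOUS"]),
        "f_contiguous": bool(x.flags["F_CONTIGUOUS"]),
        "owns_data": bool(x.flags["OWNDATA"]),
    }


# Stand-in arrays built with numpy only (assumption: these are not
# necessarily laid out the same way as the blocks pandas creates).
arr = np.random.default_rng(2).integers(1, 100, size=(10001, 4))
a_like = arr.T              # transposed view of a C-ordered array: F-contiguous
b_like = a_like.copy() + 1  # copy() defaults to order="C": C-contiguous

for name, x in [("a_like", a_like), ("b_like", b_like)]:
    print(name, describe_layout(x))
```

Running the same helper on the pandas-derived a and b and comparing against this output should show whether strides, dtype, or data ownership differ between the failing and non-failing cases.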