Description
When running the test suite on 32-bit x86, I'm getting the following test failures:
FAILED tests/test_array.py::test_dictionary_to_numpy - TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
FAILED tests/test_io.py::test_python_file_large_seeks - assert 5 == ((2 ** 32) + 5)
FAILED tests/test_io.py::test_memory_map_large_seeks - OSError: Read out of bounds (offset = 4294967301, size = 5) in file of size 10
FAILED tests/test_pandas.py::TestConvertStructTypes::test_from_numpy_nested - AssertionError: assert 8 == 12
FAILED tests/test_schema.py::test_schema_sizeof - assert 28 > 30
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_string - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_large_string - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_string_with_missing - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_categorical - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_empty_dataframe - OverflowError: Python int too large to convert to C ssize_t
Tracebacks
============================================================== FAILURES ===============================================================
______________________________________________________ test_dictionary_to_numpy _______________________________________________________
obj = array([13.7, 11. ]), method = 'take', args = (array([0, 1, 1, 0], dtype=int64),)
kwds = {'axis': None, 'mode': 'raise', 'out': None}, bound = <built-in method take of numpy.ndarray object at 0xeaca6ad0>
def _wrapfunc(obj, method, *args, **kwds):
bound = getattr(obj, method, None)
if bound is None:
return _wrapit(obj, method, *args, **kwds)
try:
> return bound(*args, **kwds)
E TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
args = (array([0, 1, 1, 0], dtype=int64),)
bound = <built-in method take of numpy.ndarray object at 0xeaca6ad0>
kwds = {'axis': None, 'mode': 'raise', 'out': None}
method = 'take'
obj = array([13.7, 11. ])
/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: TypeError
During handling of the above exception, another exception occurred:
def test_dictionary_to_numpy():
expected = pa.array(
["foo", "bar", None, "foo"]
).to_numpy(zero_copy_only=False)
a = pa.DictionaryArray.from_arrays(
pa.array([0, 1, None, 0]),
pa.array(['foo', 'bar'])
)
np.testing.assert_array_equal(a.to_numpy(zero_copy_only=False),
expected)
with pytest.raises(pa.ArrowInvalid):
# If this would be changed to no longer raise in the future,
# ensure to test the actual result because, currently, to_numpy takes
# for granted that when zero_copy_only=True there will be no nulls
# (it's the decoding of the DictionaryArray that handles the nulls and
# this is only activated with zero_copy_only=False)
a.to_numpy(zero_copy_only=True)
anonulls = pa.DictionaryArray.from_arrays(
pa.array([0, 1, 1, 0]),
pa.array(['foo', 'bar'])
)
expected = pa.array(
["foo", "bar", "bar", "foo"]
).to_numpy(zero_copy_only=False)
np.testing.assert_array_equal(anonulls.to_numpy(zero_copy_only=False),
expected)
with pytest.raises(pa.ArrowInvalid):
anonulls.to_numpy(zero_copy_only=True)
afloat = pa.DictionaryArray.from_arrays(
pa.array([0, 1, 1, 0]),
pa.array([13.7, 11.0])
)
expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
> np.testing.assert_array_equal(afloat.to_numpy(zero_copy_only=True),
expected)
a = <pyarrow.lib.DictionaryArray object at 0xeafe6ed0>
-- dictionary:
[
"foo",
"bar"
]
-- indices:
[
0,
1,
null,
0
]
afloat = <pyarrow.lib.DictionaryArray object at 0xeafe6fb0>
-- dictionary:
[
13.7,
11
]
-- indices:
[
0,
1,
1,
0
]
anonulls = <pyarrow.lib.DictionaryArray object at 0xeafe6e60>
-- dictionary:
[
"foo",
"bar"
]
-- indices:
[
0,
1,
1,
0
]
expected = array([13.7, 11. , 11. , 13.7])
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_array.py:823:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/array.pxi:1590: in pyarrow.lib.Array.to_numpy
???
/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:192: in take
return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
a = array([13.7, 11. ])
axis = None
indices = array([0, 1, 1, 0], dtype=int64)
mode = 'raise'
out = None
/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:68: in _wrapfunc
return _wrapit(obj, method, *args, **kwds)
args = (array([0, 1, 1, 0], dtype=int64),)
bound = <built-in method take of numpy.ndarray object at 0xeaca6ad0>
kwds = {'axis': None, 'mode': 'raise', 'out': None}
method = 'take'
obj = array([13.7, 11. ])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
obj = array([13.7, 11. ]), method = 'take', args = (array([0, 1, 1, 0], dtype=int64),)
kwds = {'axis': None, 'mode': 'raise', 'out': None}, wrap = <built-in method __array_wrap__ of numpy.ndarray object at 0xeaca6ad0>
def _wrapit(obj, method, *args, **kwds):
try:
wrap = obj.__array_wrap__
except AttributeError:
wrap = None
> result = getattr(asarray(obj), method)(*args, **kwds)
E TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
args = (array([0, 1, 1, 0], dtype=int64),)
kwds = {'axis': None, 'mode': 'raise', 'out': None}
method = 'take'
obj = array([13.7, 11. ])
wrap = <built-in method __array_wrap__ of numpy.ndarray object at 0xeaca6ad0>
/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:45: TypeError
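My reading of this one (a guess from the traceback, not a confirmed root cause): `ndarray.take` converts its indices to `np.intp`, which is int32 on a 32-bit build, and NumPy refuses to downcast the int64 indices under the `'safe'` rule. A minimal check, runnable on any platform:

```python
import numpy as np

# On a 32-bit build np.intp is int32, and a 'safe' cast from int64
# indices down to int32 is rejected -- exactly the error above.
print(np.can_cast(np.int64, np.int32, casting='safe'))  # False
print(np.can_cast(np.int64, np.int64, casting='safe'))  # True
```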
____________________________________________________ test_python_file_large_seeks _____________________________________________________
def test_python_file_large_seeks():
def factory(filename):
return pa.PythonFile(open(filename, 'rb'))
> check_large_seeks(factory)
factory = <function test_python_file_large_seeks.<locals>.factory at 0xe13b6de8>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:262:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
file_factory = <function test_python_file_large_seeks.<locals>.factory at 0xe13b6de8>
def check_large_seeks(file_factory):
if sys.platform in ('win32', 'darwin'):
pytest.skip("need sparse file support")
try:
filename = tempfile.mktemp(prefix='test_io')
with open(filename, 'wb') as f:
f.truncate(2 ** 32 + 10)
f.seek(2 ** 32 + 5)
f.write(b'mark\n')
with file_factory(filename) as f:
> assert f.seek(2 ** 32 + 5) == 2 ** 32 + 5
E assert 5 == ((2 ** 32) + 5)
E + where 5 = <bound method NativeFile.seek of <pyarrow.PythonFile closed=False own_file=False is_seekable=True is_writable=False is_readable=True>>(((2 ** 32) + 5))
E + where <bound method NativeFile.seek of <pyarrow.PythonFile closed=False own_file=False is_seekable=True is_writable=False is_readable=True>> = <pyarrow.PythonFile closed=False own_file=False is_seekable=True is_writable=False is_readable=True>.seek
f = <pyarrow.PythonFile closed=True own_file=False is_seekable=True is_writable=False is_readable=True>
file_factory = <function test_python_file_large_seeks.<locals>.factory at 0xe13b6de8>
filename = '/var/tmp/portage/dev-python/pyarrow-15.0.0/temp/test_ioj_p6zuld'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:49: AssertionError
_____________________________________________________ test_memory_map_large_seeks _____________________________________________________
def test_memory_map_large_seeks():
> check_large_seeks(pa.memory_map)
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:1140:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:51: in check_large_seeks
assert f.read(5) == b'mark\n'
f = <pyarrow.MemoryMappedFile closed=True own_file=False is_seekable=True is_writable=False is_readable=True>
file_factory = <cyfunction memory_map at 0xf228e778>
filename = '/var/tmp/portage/dev-python/pyarrow-15.0.0/temp/test_iozl2wxbou'
pyarrow/io.pxi:409: in pyarrow.lib.NativeFile.read
???
pyarrow/error.pxi:154: in pyarrow.lib.pyarrow_internal_check_status
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OSError: Read out of bounds (offset = 4294967301, size = 5) in file of size 10
pyarrow/error.pxi:91: OSError
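Both seek failures look consistent with 64-bit values being truncated to 32 bits somewhere in the I/O path (again my guess from the numbers): the PythonFile seek to 2\*\*32 + 5 returns 5, and the memory map believes the sparse file of size 2\*\*32 + 10 is only 10 bytes long. Each failing number is exactly a "mod 2\*\*32" truncation:

```python
# The seek position 2**32 + 5 comes back as 5, and the file size
# 2**32 + 10 is reported as 10 -- both are what you get when a 64-bit
# value is squeezed through a 32-bit integer.
print((2**32 + 5) % 2**32)   # 5
print((2**32 + 10) % 2**32)  # 10
```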
____________________________________________ TestConvertStructTypes.test_from_numpy_nested ____________________________________________
self = <pyarrow.tests.test_pandas.TestConvertStructTypes object at 0xeb535d90>
def test_from_numpy_nested(self):
# Note: an object field inside a struct
dt = np.dtype([('x', np.dtype([('xx', np.int8),
('yy', np.bool_)])),
('y', np.int16),
('z', np.object_)])
# Note: itemsize is not a multiple of sizeof(object)
> assert dt.itemsize == 12
E AssertionError: assert 8 == 12
E + where 8 = dtype([('x', [('xx', 'i1'), ('yy', '?')]), ('y', '<i2'), ('z', 'O')]).itemsize
dt = dtype([('x', [('xx', 'i1'), ('yy', '?')]), ('y', '<i2'), ('z', 'O')])
self = <pyarrow.tests.test_pandas.TestConvertStructTypes object at 0xeb535d90>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_pandas.py:2604: AssertionError
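The itemsize assertion here seems simply 64-bit-specific rather than a real failure: the `z` field is `np.object_`, i.e. a `PyObject*`, so its width follows the platform pointer size. With the packed layout NumPy uses by default that gives 2 + 2 + 4 = 8 on 32-bit versus the expected 2 + 2 + 8 = 12 on 64-bit. A platform-independent way to state the same invariant:

```python
import numpy as np

dt = np.dtype([('x', np.dtype([('xx', np.int8), ('yy', np.bool_)])),
               ('y', np.int16),
               ('z', np.object_)])
# 'z' stores a PyObject*, so the packed struct is 2 + 2 + sizeof(void*)
# bytes: 8 on a 32-bit build, 12 on a 64-bit one.
print(dt.itemsize == 4 + np.dtype(np.object_).itemsize)  # True
```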
_________________________________________________________ test_schema_sizeof __________________________________________________________
def test_schema_sizeof():
schema = pa.schema([
pa.field('foo', pa.int32()),
pa.field('bar', pa.string()),
])
> assert sys.getsizeof(schema) > 30
E assert 28 > 30
E + where 28 = <built-in function getsizeof>(foo: int32\nbar: string)
E + where <built-in function getsizeof> = sys.getsizeof
schema = foo: int32
bar: string
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_schema.py:684: AssertionError
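Likewise, the hardcoded `> 30` threshold appears to bake in 64-bit object layouts; the schema wrapper is a few pointer-widths smaller on 32-bit, so 28 just misses it. For illustration (assuming standard release CPython builds):

```python
import sys

# sys.getsizeof reports the C-level allocation, which shrinks with 4-byte
# pointers: a bare object() is 16 bytes on 64-bit CPython but 8 on 32-bit.
print(sys.getsizeof(object()))
```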
____________________________________________________ test_pandas_roundtrip_string _____________________________________________________
@pytest.mark.pandas
def test_pandas_roundtrip_string():
# See https://github.com/pandas-dev/pandas/issues/50554
if Version(pd.__version__) < Version("1.6"):
pytest.skip("Column.size() bug in pandas")
arr = ["a", "", "c"]
table = pa.table({"a": pa.array(arr)})
from pandas.api.interchange import (
from_dataframe as pandas_from_dataframe
)
pandas_df = pandas_from_dataframe(table)
> result = pi.from_dataframe(pandas_df)
arr = ['a', '', 'c']
pandas_df = a
0 a
1
2 c
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table = pyarrow.Table
a: string
----
a: [["a","","c"]]
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
allow_copy = True
df = a
0 a
1
2 c
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136: in _from_dataframe
batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
allow_copy = True
batches = []
chunk = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda1c61f0>
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda1c61f0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow
columns[name] = column_to_array(col, allow_copy)
allow_copy = True
col = <pandas.core.interchange.column.PandasColumn object at 0xda1c65b0>
columns = {}
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda1c61f0>
dtype = <DtypeKind.STRING: 21>
name = 'a'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
data = buffers_to_array(buffers, data_type,
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879523528, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1530035680, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1529980112, 'device': 'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
col = <pandas.core.interchange.column.PandasColumn object at 0xda1c65b0>
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
_ = (<DtypeKind.STRING: 21>, 8, 'u', '=')
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879523528, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1530035680, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1529980112, 'device': 'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
data_buff = PandasBuffer({'bufsize': 2, 'ptr': 3879523528, 'device': 'CPU'})
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
length = 3
offset = 0
offset_buff = PandasBuffer({'bufsize': 32, 'ptr': 1530035680, 'device': 'CPU'})
offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
validity_buff = PandasBuffer({'bufsize': 3, 'ptr': 1529980112, 'device': 'CPU'})
validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: Python int too large to convert to C ssize_t
pyarrow/io.pxi:1990: OverflowError
_________________________________________________ test_pandas_roundtrip_large_string __________________________________________________
@pytest.mark.pandas
def test_pandas_roundtrip_large_string():
# See https://github.com/pandas-dev/pandas/issues/50554
if Version(pd.__version__) < Version("1.6"):
pytest.skip("Column.size() bug in pandas")
arr = ["a", "", "c"]
table = pa.table({"a_large": pa.array(arr, type=pa.large_string())})
from pandas.api.interchange import (
from_dataframe as pandas_from_dataframe
)
if Version(pd.__version__) >= Version("2.0.1"):
pandas_df = pandas_from_dataframe(table)
> result = pi.from_dataframe(pandas_df)
arr = ['a', '', 'c']
pandas_df = a_large
0 a
1
2 c
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table = pyarrow.Table
a_large: large_string
----
a_large: [["a","","c"]]
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:189:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
allow_copy = True
df = a_large
0 a
1
2 c
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136: in _from_dataframe
batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
allow_copy = True
batches = []
chunk = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda103a10>
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda103a10>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow
columns[name] = column_to_array(col, allow_copy)
allow_copy = True
col = <pandas.core.interchange.column.PandasColumn object at 0xda1033d0>
columns = {}
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda103a10>
dtype = <DtypeKind.STRING: 21>
name = 'a_large'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
data = buffers_to_array(buffers, data_type,
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879522800, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1480303312, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1478277616, 'device': 'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
col = <pandas.core.interchange.column.PandasColumn object at 0xda1033d0>
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
_ = (<DtypeKind.STRING: 21>, 8, 'u', '=')
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879522800, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1480303312, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1478277616, 'device': 'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
data_buff = PandasBuffer({'bufsize': 2, 'ptr': 3879522800, 'device': 'CPU'})
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
length = 3
offset = 0
offset_buff = PandasBuffer({'bufsize': 32, 'ptr': 1480303312, 'device': 'CPU'})
offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
validity_buff = PandasBuffer({'bufsize': 3, 'ptr': 1478277616, 'device': 'CPU'})
validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: Python int too large to convert to C ssize_t
pyarrow/io.pxi:1990: OverflowError
______________________________________________ test_pandas_roundtrip_string_with_missing ______________________________________________
@pytest.mark.pandas
def test_pandas_roundtrip_string_with_missing():
# See https://github.com/pandas-dev/pandas/issues/50554
if Version(pd.__version__) < Version("1.6"):
pytest.skip("Column.size() bug in pandas")
arr = ["a", "", "c", None]
table = pa.table({"a": pa.array(arr),
"a_large": pa.array(arr, type=pa.large_string())})
from pandas.api.interchange import (
from_dataframe as pandas_from_dataframe
)
if Version(pd.__version__) >= Version("2.0.2"):
pandas_df = pandas_from_dataframe(table)
> result = pi.from_dataframe(pandas_df)
arr = ['a', '', 'c', None]
pandas_df = a a_large
0 a a
1
2 c c
3 NaN NaN
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table = pyarrow.Table
a: string
a_large: large_string
----
a: [["a","","c",null]]
a_large: [["a","","c",null]]
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:227:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
allow_copy = True
df = a a_large
0 a a
1
2 c c
3 NaN NaN
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136: in _from_dataframe
batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
allow_copy = True
batches = []
chunk = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda15b850>
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda15b850>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow
columns[name] = column_to_array(col, allow_copy)
allow_copy = True
col = <pandas.core.interchange.column.PandasColumn object at 0xda103210>
columns = {}
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda15b850>
dtype = <DtypeKind.STRING: 21>
name = 'a'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
data = buffers_to_array(buffers, data_type,
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879523744, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 40, 'ptr': 1469510752, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 4, 'ptr': 1475420176, 'device': 'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
col = <pandas.core.interchange.column.PandasColumn object at 0xda103210>
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
_ = (<DtypeKind.STRING: 21>, 8, 'u', '=')
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879523744, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 40, 'ptr': 1469510752, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 4, 'ptr': 1475420176, 'device': 'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
data_buff = PandasBuffer({'bufsize': 2, 'ptr': 3879523744, 'device': 'CPU'})
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
length = 4
offset = 0
offset_buff = PandasBuffer({'bufsize': 40, 'ptr': 1469510752, 'device': 'CPU'})
offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
validity_buff = PandasBuffer({'bufsize': 4, 'ptr': 1475420176, 'device': 'CPU'})
validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: Python int too large to convert to C ssize_t
pyarrow/io.pxi:1990: OverflowError
__________________________________________________ test_pandas_roundtrip_categorical __________________________________________________
@pytest.mark.pandas
def test_pandas_roundtrip_categorical():
if Version(pd.__version__) < Version("2.0.2"):
pytest.skip("Bitmasks not supported in pandas interchange implementation")
arr = ["Mon", "Tue", "Mon", "Wed", "Mon", "Thu", "Fri", "Sat", None]
table = pa.table(
{"weekday": pa.array(arr).dictionary_encode()}
)
from pandas.api.interchange import (
from_dataframe as pandas_from_dataframe
)
pandas_df = pandas_from_dataframe(table)
> result = pi.from_dataframe(pandas_df)
arr = ['Mon', 'Tue', 'Mon', 'Wed', 'Mon', 'Thu', 'Fri', 'Sat', None]
pandas_df = weekday
0 Mon
1 Tue
2 Mon
3 Wed
4 Mon
5 Thu
6 Fri
7 Sat
8 NaN
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table = pyarrow.Table
weekday: dictionary<values=string, indices=int32, ordered=0>
----
weekday: [ -- dictionary:
["Mon","Tue","Wed","Thu","Fri","Sat"] -- indices:
[0,1,0,2,0,3,4,5,null]]
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:257:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
allow_copy = True
df = weekday
0 Mon
1 Tue
2 Mon
3 Wed
4 Mon
5 Thu
6 Fri
7 Sat
8 NaN
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136: in _from_dataframe
batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
allow_copy = True
batches = []
chunk = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xd9e217f0>
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xd9e217f0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:186: in protocol_df_chunk_to_pyarrow
columns[name] = categorical_column_to_dictionary(col, allow_copy)
allow_copy = True
col = <pandas.core.interchange.column.PandasColumn object at 0xda180550>
columns = {}
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xd9e217f0>
dtype = <DtypeKind.CATEGORICAL: 23>
name = 'weekday'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:293: in categorical_column_to_dictionary
dictionary = column_to_array(cat_column)
allow_copy = True
cat_column = <pandas.core.interchange.column.PandasColumn object at 0xda1801d0>
categorical = {'categories': <pandas.core.interchange.column.PandasColumn object at 0xda1801d0>,
'is_dictionary': True,
'is_ordered': False}
col = <pandas.core.interchange.column.PandasColumn object at 0xda180550>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
data = buffers_to_array(buffers, data_type,
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 18, 'ptr': 3659006432, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 56, 'ptr': 1466456352, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 6, 'ptr': 1477427216, 'device': 'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
col = <pandas.core.interchange.column.PandasColumn object at 0xda1801d0>
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
_ = (<DtypeKind.STRING: 21>, 8, 'u', '=')
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 18, 'ptr': 3659006432, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 56, 'ptr': 1466456352, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 6, 'ptr': 1477427216, 'device': 'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
data_buff = PandasBuffer({'bufsize': 18, 'ptr': 3659006432, 'device': 'CPU'})
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
length = 6
offset = 0
offset_buff = PandasBuffer({'bufsize': 56, 'ptr': 1466456352, 'device': 'CPU'})
offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
validity_buff = PandasBuffer({'bufsize': 6, 'ptr': 1477427216, 'device': 'CPU'})
validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: Python int too large to convert to C ssize_t
pyarrow/io.pxi:1990: OverflowError
________________________________________________________ test_empty_dataframe _________________________________________________________
def test_empty_dataframe():
schema = pa.schema([('col1', pa.int8())])
df = pa.table([[]], schema=schema)
dfi = df.__dataframe__()
> assert pi.from_dataframe(dfi) == df
df = pyarrow.Table
col1: int8
----
col1: [[]]
dfi = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd98381d0>
schema = col1: int8
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:522:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
allow_copy = True
df = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd98381d0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:140: in _from_dataframe
batch = protocol_df_chunk_to_pyarrow(df)
allow_copy = True
batches = []
df = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd96e41b0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow
columns[name] = column_to_array(col, allow_copy)
allow_copy = True
col = <pyarrow.interchange.column._PyArrowColumn object at 0xd96a6650>
columns = {}
df = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd96e41b0>
dtype = <DtypeKind.INT: 0>
name = 'col1'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
data = buffers_to_array(buffers, data_type,
allow_copy = True
buffers = {'data': (PyArrowBuffer({'bufsize': 0, 'ptr': 4122363392, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 8, 'c', '=')),
'offsets': None,
'validity': None}
col = <pyarrow.interchange.column._PyArrowColumn object at 0xd96a6650>
data_type = (<DtypeKind.INT: 0>, 8, 'c', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
_ = (<DtypeKind.INT: 0>, 8, 'c', '=')
allow_copy = True
buffers = {'data': (PyArrowBuffer({'bufsize': 0, 'ptr': 4122363392, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 8, 'c', '=')),
'offsets': None,
'validity': None}
data_buff = PyArrowBuffer({'bufsize': 0, 'ptr': 4122363392, 'device': 'CPU'})
data_type = (<DtypeKind.INT: 0>, 8, 'c', '=')
describe_null = (<ColumnNullType.NON_NULLABLE: 0>, None)
length = 0
offset = 0
offset_buff = None
validity_buff = None
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: Python int too large to convert to C ssize_t
pyarrow/io.pxi:1990: OverflowError
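All of the interchange failures die at the same place: the `pa.foreign_buffer(data_buff.ptr, data_buff.bufsize, ...)` call reached via pyarrow/io.pxi:1990. Every failing `ptr` in the locals above (e.g. 3879523528) exceeds 2\*\*31 - 1, which is `sys.maxsize` (and the C ssize_t range) on a 32-bit build, so my guess is that buffers which happen to live in the upper 2 GiB of the address space overflow a ssize_t parameter:

```python
# On a 32-bit build sys.maxsize == 2**31 - 1, so a buffer address in the
# upper half of the 4 GiB address space cannot pass through a parameter
# declared as C ssize_t.
ssize_t_max_32bit = 2**31 - 1
ptr = 3879523528  # data buffer address from the first interchange traceback
print(ptr > ssize_t_max_32bit)  # True
```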
Full build & test log (2.5M): pyarrow.txt
This is Arrow 15.0.0 on Gentoo, running in an x86 systemd-nspawn container. I've built with `-O2 -march=pentium-m -mfpmath=sse -pipe` as compiler flags, to rule out i387-specific issues.
>>> pyarrow.show_info()
pyarrow version info
--------------------
Package kind : not indicated
Arrow C++ library version : 15.0.0
Arrow C++ compiler : GNU 13.2.1
Arrow C++ compiler flags : -O2 -march=pentium-m -mfpmath=sse -pipe
Arrow C++ git revision :
Arrow C++ git description :
Arrow C++ build type : relwithdebinfo
Platform:
OS / Arch : Linux x86_64
SIMD Level : avx2
Detected SIMD Level : avx2
Memory:
Default backend : system
Bytes allocated : 0 bytes
Max memory : 0 bytes
Supported Backends : system
Optional modules:
csv : Enabled
cuda : -
dataset : Enabled
feather : Enabled
flight : -
fs : Enabled
gandiva : -
json : Enabled
orc : -
parquet : Enabled
Filesystems:
GcsFileSystem : -
HadoopFileSystem : Enabled
S3FileSystem : -
Compression Codecs:
brotli : Enabled
bz2 : Enabled
gzip : Enabled
lz4_frame : Enabled
lz4 : Enabled
snappy : Enabled
zstd : Enabled
Some of these might be problems inside pandas. I'm going to file a bug about those test failures there in a minute and link it here afterwards.
Component(s)
Python