
[BUG-REPORT] Arrow columns in interchange dataframes have erroneous null behaviour #2083

@honno


Arrow columns in a Vaex dataframe seem to have incorrect null bitmasks, although it could very well be a problem in the specification or its implementation.

Maybe this is just an issue with vaex.dataframe_protocol's adoption of the interchange protocol, but in any case I'll use it in the example as it's what I'm familiar with 😅 The following example shows that the null mask you can infer from an interchange column ends up falsely marking non-null elements as null.

>>> import pyarrow as pa
>>> table = pa.Table.from_pydict({"foo_col": pa.array([7, 42])})
>>> import vaex
>>> df = vaex.from_arrow_table(table)
>>> df
#  foo_col
0        7
1       42
>>> protocol_df = df.__dataframe__()
>>> col = protocol_df.get_column(0)  # i.e. foo_col
>>> col.dtype
(<_DtypeKind.INT: 0>, 64, '<i8', '=')
>>> bufinfo = col.get_buffers()
>>> col.describe_null
(3, 0)  # i.e. a bitmask represents nulls, where False indicates a missing value
>>> validity_buf, validity_dtype = bufinfo["validity"]
>>> validity_dtype
(<_DtypeKind.BOOL: 20>, 8, '|b1', '|')
>>> import ctypes
>>> data_pointer = ctypes.cast(validity_buf.ptr, ctypes.POINTER(ctypes.c_bool))
>>> import numpy as np
>>> mask = np.ctypeslib.as_array(data_pointer, shape=(2,))
>>> mask
array([False, False])  # should be array([True, True])

The logic for getting mask is lifted from vaex.dataframe_protocol.buffer_to_ndarray(). Is there a chance it's doing something wrong?
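
For reference, here is a rough sketch of how a consumer might turn a raw validity buffer into an "is present" mask, depending on what describe_null reports. It is not Vaex's or pandas' code: validity_from_mask is a hypothetical helper, and it assumes the interchange protocol's null-kind codes (3 for a bitmask, 4 for a bytemask) and Arrow's least-significant-bit-first packing for bitmasks.

```python
import numpy as np

# Null-kind codes from the dataframe interchange protocol (assumed here).
USE_BITMASK = 3
USE_BYTEMASK = 4


def validity_from_mask(raw, null_kind, null_value, length):
    """Hypothetical helper: return a bool array where True means 'present'.

    raw        -- the validity buffer viewed as bytes
    null_kind  -- first element of describe_null
    null_value -- second element of describe_null (the flag that marks a null)
    length     -- number of elements in the column
    """
    raw = np.asarray(raw, dtype=np.uint8)
    if null_kind == USE_BITMASK:
        # One bit per element, least-significant bit first (Arrow convention).
        flags = np.unpackbits(raw, bitorder="little", count=length)
    elif null_kind == USE_BYTEMASK:
        # One byte per element.
        flags = raw[:length]
    else:
        raise ValueError(f"null kind {null_kind} does not use a mask")
    # An element is present when its flag differs from the null marker.
    return flags != null_value


# Two elements, both present, packed into a single byte as a bitmask.
bitmask_valid = validity_from_mask([0b00000011], USE_BITMASK, 0, 2)

# The bytes observed in the example above, read one byte per element.
bytemask_valid = validity_from_mask([0, 0], USE_BYTEMASK, 0, 2)
```

With a null value of 0, the first call marks both elements as present, while the second marks both as missing, which matches the array([False, False]) seen above.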

I say mask should be [True, True] because, assuming 0/False indicates a missing value, Vaex is currently reporting every value in df as null. I'm not familiar with Arrow, and have been assuming the specification of Arrow's null representation from the following code:

if kind in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL, _k.STRING):
    if self._col.dtype.is_arrow:
        # arrow arrays always allow for null values
        # where 0 encodes a null/missing value
        null = 3
        value = 0

This affects the dataframe interchange introduced to pandas in pandas-dev/pandas#46141, e.g.

>>> import pandas as pd
>>> df = pd.DataFrame({"foo_col": [7, 42]})
>>> df
   foo_col
0        7
1       42
>>> from vaex.dataframe_protocol import from_dataframe_to_vaex
>>> vaex_df = from_dataframe_to_vaex(df)
>>> vaex_df
#  foo_col
0        7
1       42
>>> from pandas.api.exchange import from_dataframe as pandas_from_dataframe
>>> roundtrip_df = pandas_from_dataframe(vaex_df)
>>> roundtrip_df
   foo_col
0      NaN
1      NaN
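
The NaN column above is what you would expect if the consumer trusts an all-False validity mask. A minimal sketch of that step, assuming the consumer promotes integers to float so that missing entries can be written as NaN (apply_validity is a hypothetical name, not pandas' actual implementation):

```python
import numpy as np


def apply_validity(data, valid):
    """Return a float copy of `data` with NaN wherever `valid` is False."""
    valid = np.asarray(valid, dtype=bool)
    out = np.asarray(data).astype("float64")  # ints cannot hold NaN, so promote
    out[~valid] = np.nan
    return out


data = np.array([7, 42], dtype="int64")

# Mask as read from the validity buffer in this report: everything "missing".
reported = apply_validity(data, [False, False])

# Mask this report argues for: everything present.
expected = apply_validity(data, [True, True])
```

Here reported comes out as all NaN, like roundtrip_df, while expected keeps 7 and 42.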

(I have a very WIP test suite for the interchange protocol at honno/dataframe-interchange-tests, where I originally found this bug.)

Vaex was built locally from source (upstream master) on Ubuntu 20.04. Let me know if there's any useful information I can provide!
