
Could NaNs not be counted as null #126

Open
@AlenkaF

Description


In pyarrow we differentiate between missing (null) values, which are tracked with a validity bitmask, and NaN float values.

From the dataframe interchange protocol specification we understand that one can use NaN to indicate missing values, but that this does not need to be the case (NaN can also be a valid, non-missing value):

@property
def describe_null(self) -> Tuple[int, Any]:
    """
    Return the missing value (or "null") representation the column dtype
    uses, as a tuple ``(kind, value)``.

    Kind:

        - 0 : non-nullable
        - 1 : NaN/NaT
        - 2 : sentinel value
        - 3 : bit mask
        - 4 : byte mask

    Value : if kind is "sentinel value", the actual value. If kind is a bit
    mask or a byte mask, the value (0 or 1) indicating a missing value. None
    otherwise.
    """
    pass
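
As a concrete illustration of the distinction (a minimal pyarrow snippet; Array.is_null accepts a nan_is_null flag in recent pyarrow versions):

import pyarrow as pa

# One bitmask-marked null and one NaN float value in the same array.
arr = pa.array([1.0, None, float("nan")])

print(arr.null_count)                 # 1 -- only the bitmask-marked null is counted
print(arr.is_null())                  # values [false, true, false]
print(arr.is_null(nan_is_null=True))  # values [false, true, true] -- NaN opted in explicitly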

There will be a discrepancy between pyarrow and pandas, for example, where NaN will be turned into a missing value. But we do not think it would be correct for pyarrow to change the null_count property, as the information about the difference would be lost for libraries that could benefit from it. The bitmask information and the null_count would also need to be made consistent with each other.
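
For illustration, in pandas' default NumPy-backed float columns NaN itself is the missing-value marker, so the two concepts collapse into one (a minimal snippet, not the interchange code path itself):

import numpy as np
import pandas as pd

# A NumPy-backed float Series has no separate validity mask:
# NaN doubles as the missing-value sentinel.
s = pd.Series([1.0, np.nan])
print(s.isna().sum())  # 1 -- the NaN is reported as missing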

Is there a way for a library to keep the behaviour of not treating NaNs as nulls?

(Related issue in the Arrow repo: apache/arrow#34774)
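
As a sketch of the behaviour we would like to keep, an Arrow-backed column could report a bit-mask null representation and take null_count straight from the validity bitmask, so the two stay consistent. The class below is hypothetical, not pyarrow's actual interchange implementation:

from typing import Any, Tuple

import pyarrow as pa


class ArrowBackedColumn:
    """Hypothetical interchange column wrapping a pyarrow array."""

    def __init__(self, array: pa.Array):
        self._array = array

    @property
    def describe_null(self) -> Tuple[int, Any]:
        # Arrow validity bitmasks use 0 for a missing slot, so the protocol
        # tuple is (3, 0): kind "bit mask", value 0 marks a missing value.
        return (3, 0)

    @property
    def null_count(self) -> int:
        # Count only bitmask-marked nulls; NaN floats remain valid values.
        return self._array.null_count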
