Description
Is your feature request related to a problem?
We are looking to improve compatibility between AnnData and xarray (see scverse/anndata#744), and categoricals are naturally on our roadmap. I think some sort of standard-use categorical array would be desirable. Something similar seems to have come up for netCDF, although my knowledge there is limited, so this issue may be more distinct than I realize. What comes of this issue may therefore kill two birds with one stone, or it may at least work towards a common solution that helps both use cases (AnnData and netCDF ENUM).
Describe the solution you'd like
The goal would be a standard-use categorical data type xarray container of some sort. I'm not sure what form this can take.
We have something functional here that inherits from ExplicitlyIndexedNDArrayMixin and returns pandas.CategoricalDtype. So let's say this implementation would be at least a conceptual starting point to work from (it also seems not dissimilar to what is done here for new CF types).
Some issues:
- I have no idea what a standard "return type" for an xarray categorical array should be (i.e., numpy with the categories applied, pandas, something custom, etc.). So I'm not sure whether using pandas.CategoricalDtype is acceptable, as I do in the linked implementation. Relatedly...
- I don't think using pandas.CategoricalDtype really helps with the already existing CF Enum need if you want the return type to be some sort of numpy array (although again, I'm not sure about the return type). As I understand it, though, the whole point of categoricals is to use integers as the base type and only show "strings" outwardly, i.e., in printing, the API for equality operations, accessors, etc., while the internals are based on integers. So I'm not really sure numpy is even an option here. Maybe we roll our own solution?
- I am not sure this is the right level at which to implement this (maybe it should be a Variable? I don't think so, but I am just a beginner here 😄)
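For illustration, here is a minimal sketch of the "integers inside, strings outside" idea from the list above. It uses only pandas and numpy, not any xarray container, and the variable names are purely illustrative. It assumes no missing values: pandas uses the code -1 for those, which the plain lookup at the end does not handle.

```python
import numpy as np
import pandas as pd

# A small categorical with three categories.
cat = pd.Categorical(["a", "b", "a", "b", "c"])

# Internally the data is a compact integer array plus a lookup table.
codes = np.asarray(cat.codes)            # integer codes, one per element
categories = np.asarray(cat.categories)  # the distinct labels

# Equality against a label can be evaluated on the integers alone.
target_code = cat.categories.get_loc("b")
mask = codes == target_code

# The string view is only materialized on demand.
# Assumption: no missing values (code -1), which would need masking first.
labels = categories[codes]

print(codes.tolist())
print(mask.tolist())
print(list(labels))
```

Whatever return type is chosen, a container would presumably need to carry both pieces (the codes and the category table) to preserve this behavior.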
It seems you may want, in addition to the array container, some sort of i/o functionality for this feature (so maybe some on-disk specification?).
Describe alternatives you've considered
I think there is some route via VariableCoder, as hinted here, i.e., using encode/decode. This would probably be more general purpose, as we could encode directly to other data types if using pandas is not desirable. Maybe this would be a way to support both netCDF and returning a pandas.CategoricalDtype (again, not sure what the netCDF return type should be for ENUM).
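To make the encode/decode idea concrete, here is a rough sketch of the round trip, written as two standalone functions rather than an actual VariableCoder subclass. The function names and the "categories" / "ordered" attribute keys are hypothetical and chosen only for this example. They are not an existing xarray, CF, or netCDF convention.

```python
import numpy as np
import pandas as pd


def encode_categorical(values):
    """Hypothetical helper: split categorical data into integer codes plus metadata."""
    cat = pd.Categorical(values)
    codes = np.asarray(cat.codes)
    # Assumed attribute names; a real on-disk spec would need to define these.
    attrs = {"categories": list(cat.categories), "ordered": bool(cat.ordered)}
    return codes, attrs


def decode_categorical(codes, attrs):
    """Hypothetical helper: rebuild the categorical from codes and metadata."""
    return pd.Categorical.from_codes(
        codes, categories=attrs["categories"], ordered=attrs["ordered"]
    )


original = pd.Categorical(["a", "b", "a", "b", "c"])
codes, attrs = encode_categorical(original)
restored = decode_categorical(codes, attrs)

print(codes.dtype, attrs)
print(restored)
```

Because the encoded form is just an integer array and a dictionary of attributes, the decode step could in principle target something other than pandas if that turns out to be preferable.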
Additional context
So just for reference, the current behavior of to_xarray with pandas.CategoricalDtype is object dtype from numpy:
import pandas as pd
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'b', 'c']})
df['cat'] = df['cat'].astype('category')
df.to_xarray()['cat']
# <xarray.DataArray 'cat' (index: 5)>
# array(['a', 'b', 'a', 'b', 'c'], dtype=object)
# Coordinates:
# * index (index) int64 0 1 2 3 4
And as stated in the netCDF issue, for that use case, the information about ENUM is lost (from what I can read).
Apologies if I'm missing something here! Feedback welcome! Sorry if this is a bit chaotic, just trying to cover my bases.