-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add basic support for categorical dtype #464
Conversation
@@ -354,6 +387,7 @@ pub fn list_from_series(data: ExSeries, env: Env) -> Result<Term, ExplorerError> | |||
DataType::Binary => { | |||
generic_binary_series_to_list(&data.resource, s.binary()?.downcast_iter(), env) | |||
} | |||
DataType::Categorical(Some(mapping)) => categorical_series_to_list(s, env, mapping), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tricky. I wonder if we should return the categories (as strings) or their indexes in the mapping. And I think term_from_value
should return the string.
iovec_from_series
should definitely return indexes though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After further thought: let's keep it as above. This will remain as is, term_from_value
will return binaries to, and iovec_from_series
will return the indexes. We will need a function to return the mappings either as a tuple or as a map. I will change to_iovec
and to_binary
docs to mention they work exclusively with fixed-width binaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will need a function to return the mappings either as a tuple or as a map
Do you have a name in mind for that function? We could also return a dataframe for that. WDYT?
Also, I added the branch in iovec_from_series
, but the original data is using UInt32
, which we want to avoid.
Should I cast it to Int64
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to avoid UInt32. But we want to avoid multiple representations for the same dtype. I.e. integers should have only a single representation. Therefore it is fine to return UInt32 for categorical types. :)
This is the first step to support categorical series.
130dfba
to
b25400e
Compare
This is the first step to support categorical series.
There are some limitations to what can be done in the Polars side, so I'm not sure yet how I can manipulate the categories, like adding or removing members, and how to define ordering.
And also maybe we should call it
:category
dtype. WDYT?This is related to #261