Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic support for categorical dtype #464

Merged
merged 7 commits into from
Jan 9, 2023

Conversation

philss
Copy link
Contributor

@philss philss commented Jan 8, 2023

This is the first step to support categorical series.

There are some limitations to what can be done in the Polars side, so I'm not sure yet how I can manipulate the categories, like adding or removing members, and how to define ordering.

And also maybe we should call it :category dtype. WDYT?

This is related to #261

lib/explorer/series.ex Outdated Show resolved Hide resolved
@@ -354,6 +387,7 @@ pub fn list_from_series(data: ExSeries, env: Env) -> Result<Term, ExplorerError>
DataType::Binary => {
generic_binary_series_to_list(&data.resource, s.binary()?.downcast_iter(), env)
}
DataType::Categorical(Some(mapping)) => categorical_series_to_list(s, env, mapping),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky. I wonder if we should return the categories (as strings) or their indexes in the mapping. And I think term_from_value should return the string.

iovec_from_series should definitely return indexes though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After further thought: let's keep it as above. This will remain as is, term_from_value will return binaries to, and iovec_from_series will return the indexes. We will need a function to return the mappings either as a tuple or as a map. I will change to_iovec and to_binary docs to mention they work exclusively with fixed-width binaries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need a function to return the mappings either as a tuple or as a map

Do you have a name in mind for that function? We could also return a dataframe for that. WDYT?

Also, I added the branch in iovec_from_series, but the original data is using UInt32, which we want to avoid.
Should I cast it to Int64?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to avoid UInt32. But we want to avoid multiple representations for the same dtype. I.e. integers should have only a single representation. Therefore it is fine to return UInt32 for categorical types. :)

@philss philss force-pushed the ps-add-categorical branch from 130dfba to b25400e Compare January 9, 2023 15:52
@philss philss marked this pull request as ready for review January 9, 2023 18:59
@philss philss merged commit 18d10ef into elixir-explorer:main Jan 9, 2023
@philss philss deleted the ps-add-categorical branch January 9, 2023 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants