This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Can we use array extensions to speed up string processing with dask #36

@birdsarah

Much of the analysis done on this dataset uses dask. Dask is excellent for distributed numerical computing but seems to struggle with strings.

Unfortunately, extension arrays can't be serialized (see pandas-dev/pandas#20612).

https://github.com/xhochy/fletcher is an array extension that adds string processing functionality.

Using some standard tasks for this dataset, such as collecting all domains or counting the number of script domains per location domain, compare the performance of Spark, Dask, and Dask with extension arrays.
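
To pin down what the two tasks compute, here is a minimal plain-Python reference sketch. It is not a Dask or Spark implementation, just something small to check their results against on a sample. The `(location, script_url)` row shape, the sample URLs, and the helper names are assumptions for illustration, and "domain" is taken to mean the full hostname rather than the registered domain.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_of(url):
    """Return the hostname of a URL ('' if none can be parsed)."""
    return urlparse(url).hostname or ""


def collect_domains(rows):
    """Task 1: the set of all domains seen in either column."""
    domains = set()
    for location, script_url in rows:
        domains.add(domain_of(location))
        domains.add(domain_of(script_url))
    domains.discard("")  # drop rows where no hostname could be parsed
    return domains


def script_domains_per_location_domain(rows):
    """Task 2: number of distinct script domains per location domain."""
    seen = defaultdict(set)
    for location, script_url in rows:
        seen[domain_of(location)].add(domain_of(script_url))
    return {loc: len(scripts) for loc, scripts in seen.items()}


# Hypothetical sample rows: (location, script_url)
sample_rows = [
    ("https://example.com/", "https://cdn.example.net/a.js"),
    ("https://example.com/page", "https://ads.example.org/b.js"),
    ("https://example.com/page", "https://cdn.example.net/c.js"),
    ("https://another.example/", "https://cdn.example.net/a.js"),
]

all_domains = collect_domains(sample_rows)
per_location = script_domains_per_location_domain(sample_rows)
```

Any Spark, Dask, or Dask-with-extension-arrays version of these tasks should agree with this reference on a small sample before its timings are compared.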

Labels: good first issue (Good for newcomers), research question (Outstanding questions that have not been investigated yet.)
