ENH: Series.str.get_dummies should defer to pd.get_dummies and pass thru args #19618

Open
@randomgambit

Description

Hello pandas team, and thanks for making this package better day after day.

I was using the str.get_dummies method on a dataframe column and noticed that, by default, the dummies are encoded as int64.

This seems very inefficient to me: I ran into a memory error when trying to get dummies for a dataframe with several million rows (and about 5k dummies). I had to create the dummies in chunks and use to_numeric() to downcast them to int8.
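For reference, the chunked workaround could look roughly like the sketch below. The helper name `get_dummies_int8`, the `|` separator, and the chunk size are assumptions for illustration, and `astype(np.int8)` stands in for the `to_numeric()` downcast mentioned above.

```python
import numpy as np
import pandas as pd


def get_dummies_int8(series, sep="|", chunk_size=100_000):
    """Build str.get_dummies output chunk by chunk, stored as int8.

    Hypothetical helper, not part of pandas.
    """
    pieces = []
    for start in range(0, len(series), chunk_size):
        chunk = series.iloc[start:start + chunk_size]
        # Each chunk is small, so the temporary int64 frame stays cheap.
        pieces.append(chunk.str.get_dummies(sep=sep).astype(np.int8))

    if not pieces:
        return pd.DataFrame(index=series.index)

    # Chunks may see different labels: align them on the union of columns,
    # filling labels that are absent from a chunk with 0.
    columns = sorted(set().union(*(p.columns for p in pieces)))
    pieces = [p.reindex(columns=columns, fill_value=0) for p in pieces]

    # Final astype guards against any upcast during alignment.
    return pd.concat(pieces).astype(np.int8)


s = pd.Series(["a|b", "b", None, "c|a"])
out = get_dummies_int8(s, chunk_size=2)
```

Only one chunk is ever held as int64 at a time, so peak memory is dominated by the int8 result rather than by a full int64 frame.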

Would it be possible to have the dummies natively in int8 format so that they take up very little space? In that case NaN would be coerced to 0, but that should be fine.
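A tiny illustration of the request (the sample data and variable names are made up): each 0/1 indicator needs one byte as int8 instead of eight as int64, and a missing value simply becomes a row of zeros.

```python
import numpy as np
import pandas as pd

# Toy data; the real case has millions of rows and ~5k labels.
s = pd.Series(["a|b", None, "b|c"])

dummies = s.str.get_dummies(sep="|")  # default integer dtype
small = dummies.astype(np.int8)       # same 0/1 values, one byte each

# Size of the underlying value arrays, ignoring the index.
bytes_default = dummies.to_numpy().nbytes
bytes_int8 = small.to_numpy().nbytes

# The missing entry (row 1) is encoded as all zeros.
missing_row = small.iloc[1].tolist()
```

With an int64 default, `bytes_default` is eight times `bytes_int8`, which is the saving asked for here.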

What do you think?
Thanks!

Labels: Categorical, Enhancement, Reshaping, Strings
