ENH: Series.str.get_dummies should defer to pd.get_dummies and pass thru args #19618
Open
Hello pandas team, and thanks for making this package better day after day.
I was using the str.get_dummies method on a dataframe and I realized that by default the dummies are coded as int64.
This seems very inefficient to me: I ran into a memory error when trying to get dummies for a dataframe with several million rows (and about 5k dummies). I had to create the dummies in chunks and use to_numeric() to coerce them to int8.
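For reference, here is a minimal sketch of that kind of chunked workaround. The helper name `str_get_dummies_int8`, the `chunk_size` default, and the sample data are made up for illustration, and it uses `astype("int8")` instead of `to_numeric()` because the former applies to a whole dataframe at once.

```python
import pandas as pd


def str_get_dummies_int8(s, sep="|", chunk_size=100_000):
    """Hypothetical helper: build str.get_dummies output chunk by chunk as int8."""
    parts = []
    for start in range(0, len(s), chunk_size):
        chunk = s.iloc[start:start + chunk_size]
        # Downcast each chunk right away so the wide int64 frame never
        # exists for more than chunk_size rows at a time.
        parts.append(chunk.str.get_dummies(sep=sep).astype("int8"))
    if not parts:
        # Empty input: nothing to chunk.
        return s.str.get_dummies(sep=sep).astype("int8")
    # A label that never occurs in a chunk has no column there, so align
    # every chunk on the union of labels and fill the gaps with 0.
    columns = sorted(set().union(*(p.columns for p in parts)))
    aligned = [
        p.reindex(columns=columns, fill_value=0).astype("int8") for p in parts
    ]
    return pd.concat(aligned)


s = pd.Series(["a|b", "b", None, "c|a"])
result = str_get_dummies_int8(s, chunk_size=2)
print(result.dtypes)
```

Rows with a missing value come out as all zeros, which is what str.get_dummies already does for them.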
Would it be possible to natively have the dummies in int8 format so that they take very little space? In that case NaN would be coerced to 0, but that should be fine.
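To give a rough idea of the sizes involved, here is a back-of-the-envelope estimate. The row count of 5 million is an assumption standing in for "several million", and the calculation only covers the dense values, not index or other overhead.

```python
def dense_size_gb(n_rows, n_cols, bytes_per_value):
    """Size of a dense n_rows x n_cols block in gigabytes (1 GB = 1e9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9


n_rows = 5_000_000  # assumed; the actual row count is only "several million"
n_dummies = 5_000   # "about 5k dummies"

int64_gb = dense_size_gb(n_rows, n_dummies, 8)  # int64 uses 8 bytes per value
int8_gb = dense_size_gb(n_rows, n_dummies, 1)   # int8 uses 1 byte per value

print(f"int64: {int64_gb:.0f} GB, int8: {int8_gb:.0f} GB")
```

Under these assumptions int8 is 8 times smaller than int64, although a dense result of this shape would still be large.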
What do you think?
Thanks!