Change `Series.to_dummies` to never include a "dummy null" column in the output #19325

MarcoGorelli · 2024-10-20T07:06:17Z

TL;DR: change the to_dummies behaviour so that it never creates a "dummy null" column - it's very easy for users to add it themselves later if they need it

Problem statement

Say you see:

>>> s.to_dummies()
shape: (5, 3)
┌───────┬───────┬────────┐
│ a_bar ┆ a_foo ┆ a_null │
│ ---   ┆ ---   ┆ ---    │
│ u8    ┆ u8    ┆ u8     │
╞═══════╪═══════╪════════╡
│ 0     ┆ 1     ┆ 0      │
│ 1     ┆ 0     ┆ 0      │
│ 0     ┆ 1     ┆ 0      │
│ 0     ┆ 0     ┆ 1      │
│ 1     ┆ 0     ┆ 0      │
└───────┴───────┴────────┘

There's currently no way of knowing whether this was produced by

pl.Series('a', ['foo', 'bar', 'foo', None, 'bar'])
pl.Series('a', ['foo', 'bar', 'foo', 'null', 'bar'])

as they produce the same output

Furthermore, to_dummies may produce duplicate column names #19096

Possible solutions

make separator required (and not allowed to be empty), and use a protected separator for null values. This addresses the known issues, but I think there's some downsides:

users can't pass separator='' (which is needed to match pandas' output, so would make migrating harder)
if the dummy null column were to be named just 'null' (ignoring the column name), then that would mean that pl.concat([feature_1.to_dummies(), feature_2.to_dummies()], how='horizontal') would raise
if the dummy null column where named with a pattern like f'{column_name}#null' (so, using '#' as protected separator), then that would probably not be clear to users
this would be a breaking change

add a dummy_null keyword argument, as suggested in dummy_null in Series.to_dummies #19095. Unfortunately, when set to True, it still doesn't solve the problem highlighted in this issue whereby one cannot tell if a column 'a_null' corresponds to 'null' values or missing values
add a dummy_null keyword argument which behaves like it does in pandas:

if True, then it always produces a dummy null column (even if there's no nulls)
if False, then it never produces a dummy null column (even if there are nulls)
This would also be a breaking change, and the dummy_null=True could still produce duplicate column names as in Series.to_dummies can create duplicate column names #19096

never produce a dummy null column. This would also be a breaking change, but I think it has some advantages over the others:

it keeps the function very simple and clear
if somebody wants to include a dummy null column, then all they need to do is write .with_columns(s.is_null().cast(pl.UInt8).alias(f'{s.name}_null'). Users would be free to name the dummy null column how they want, so the responsibility would be on them to choose a name which wouldn't produce duplicate column names

I think all solutions would represent a breaking change for at least some users, and I think solution 4 is the simplest

Pinging @coastalwhite as we'd discussed this, curious if you have further thoughts here - thanks 🙏

The text was updated successfully, but these errors were encountered:

s-banach · 2024-10-20T23:07:13Z

If someone puts the string "null" in a nullable string column, they deserve the trouble coming to them.

deanm0000 · 2024-10-21T14:24:33Z

make separator required (and not allowed to be empty), and use a protected separator for null values.

Why not make separator optional with a default to something like _?

What about quoting string "null" so pl.Series('a', ['foo', 'bar', 'foo', None, 'bar']) stays the same and produces a_bar ┆ a_foo ┆ a_null whereas pl.Series('a', ['foo', 'bar', 'foo', 'null', 'bar']) would produce a_bar ┆ a_foo ┆ a_"null"

I don't actually have a strong opinion on this it was just something I thought of.

MarcoGorelli · 2024-10-21T14:36:10Z

thanks for your thoughts - sure, but then the string 'null' would be handled differently from other strings. I think all strings should be handled the same way

If someone puts the string "null" in a nullable string column, they deserve the trouble coming to them.

😄 ok this did make me laugh - seriously though, all data types are nullable in Polars, and 'null' seems like a legit string value to have

In terms of enabling users to write robust, non-value-dependent code, I'm hoping that there's appetite for either option 3 or 4 (with a personal preference for 4, but I think they'd both be an improvement). I wouldn't mind option 1 either - I think that would also be an improvement on the status quo - but it's not clear to me what a good choice for the reserved separator should be

my use-case for this is to support Polars in Formulaic, where they currently use pd.get_dummies(..., dummy_na=False)

CangyuanLi · 2024-10-22T00:12:32Z

Sorry for the partial hijack, but since to_dummies is getting some attention, I would love for #17236 to be considered as well. It would mimic the options in sklearn's OneHotEncoder, see github.com/scikit-learn/scikit-learn/issues/11996

MarcoGorelli added needs decision Awaiting decision by a maintainer A-api Area: changes to the public API labels Oct 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change `Series.to_dummies` to never include a "dummy null" column in the output #19325

Change `Series.to_dummies` to never include a "dummy null" column in the output #19325

MarcoGorelli commented Oct 20, 2024

s-banach commented Oct 20, 2024

deanm0000 commented Oct 21, 2024

MarcoGorelli commented Oct 21, 2024 •

edited

Loading

CangyuanLi commented Oct 22, 2024

Change Series.to_dummies to never include a "dummy null" column in the output #19325

Change Series.to_dummies to never include a "dummy null" column in the output #19325

Comments

MarcoGorelli commented Oct 20, 2024

Problem statement

Possible solutions

s-banach commented Oct 20, 2024

deanm0000 commented Oct 21, 2024

MarcoGorelli commented Oct 21, 2024 • edited Loading

CangyuanLi commented Oct 22, 2024

Change `Series.to_dummies` to never include a "dummy null" column in the output #19325

Change `Series.to_dummies` to never include a "dummy null" column in the output #19325

MarcoGorelli commented Oct 21, 2024 •

edited

Loading