Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Series.to_dummies to never include a "dummy null" column in the output #19325

Open
MarcoGorelli opened this issue Oct 20, 2024 · 4 comments
Labels
A-api Area: changes to the public API needs decision Awaiting decision by a maintainer

Comments

@MarcoGorelli
Copy link
Collaborator

TL;DR: change the to_dummies behaviour so that it never creates a "dummy null" column - it's very easy for users to add it themselves later if they need it

Problem statement

Say you see:

>>> s.to_dummies()
shape: (5, 3)
┌───────┬───────┬────────┐
│ a_bara_fooa_null │
│ ---------    │
│ u8u8u8     │
╞═══════╪═══════╪════════╡
│ 010      │
│ 100      │
│ 010      │
│ 001      │
│ 100      │
└───────┴───────┴────────┘

There's currently no way of knowing whether this was produced by

  • pl.Series('a', ['foo', 'bar', 'foo', None, 'bar'])
  • pl.Series('a', ['foo', 'bar', 'foo', 'null', 'bar'])

as they produce the same output

Furthermore, to_dummies may produce duplicate column names #19096

Possible solutions

  1. make separator required (and not allowed to be empty), and use a protected separator for null values. This addresses the known issues, but I think there's some downsides:
  • users can't pass separator='' (which is needed to match pandas' output, so would make migrating harder)
  • if the dummy null column were to be named just 'null' (ignoring the column name), then that would mean that pl.concat([feature_1.to_dummies(), feature_2.to_dummies()], how='horizontal') would raise
  • if the dummy null column where named with a pattern like f'{column_name}#null' (so, using '#' as protected separator), then that would probably not be clear to users
  • this would be a breaking change
  1. add a dummy_null keyword argument, as suggested in dummy_null in Series.to_dummies #19095. Unfortunately, when set to True, it still doesn't solve the problem highlighted in this issue whereby one cannot tell if a column 'a_null' corresponds to 'null' values or missing values
  2. add a dummy_null keyword argument which behaves like it does in pandas:
  • if True, then it always produces a dummy null column (even if there's no nulls)
  • if False, then it never produces a dummy null column (even if there are nulls)
    This would also be a breaking change, and the dummy_null=True could still produce duplicate column names as in Series.to_dummies can create duplicate column names #19096
  1. never produce a dummy null column. This would also be a breaking change, but I think it has some advantages over the others:
  • it keeps the function very simple and clear
  • if somebody wants to include a dummy null column, then all they need to do is write .with_columns(s.is_null().cast(pl.UInt8).alias(f'{s.name}_null'). Users would be free to name the dummy null column how they want, so the responsibility would be on them to choose a name which wouldn't produce duplicate column names

I think all solutions would represent a breaking change for at least some users, and I think solution 4 is the simplest

Pinging @coastalwhite as we'd discussed this, curious if you have further thoughts here - thanks 🙏

@MarcoGorelli MarcoGorelli added needs decision Awaiting decision by a maintainer A-api Area: changes to the public API labels Oct 20, 2024
@s-banach
Copy link
Contributor

If someone puts the string "null" in a nullable string column, they deserve the trouble coming to them.

@deanm0000
Copy link
Collaborator

make separator required (and not allowed to be empty), and use a protected separator for null values.

Why not make separator optional with a default to something like _?

What about quoting string "null" so pl.Series('a', ['foo', 'bar', 'foo', None, 'bar']) stays the same and produces a_bar ┆ a_foo ┆ a_null whereas pl.Series('a', ['foo', 'bar', 'foo', 'null', 'bar']) would produce a_bar ┆ a_foo ┆ a_"null"

I don't actually have a strong opinion on this it was just something I thought of.

@MarcoGorelli
Copy link
Collaborator Author

MarcoGorelli commented Oct 21, 2024

thanks for your thoughts - sure, but then the string 'null' would be handled differently from other strings. I think all strings should be handled the same way

If someone puts the string "null" in a nullable string column, they deserve the trouble coming to them.

😄 ok this did make me laugh - seriously though, all data types are nullable in Polars, and 'null' seems like a legit string value to have


In terms of enabling users to write robust, non-value-dependent code, I'm hoping that there's appetite for either option 3 or 4 (with a personal preference for 4, but I think they'd both be an improvement). I wouldn't mind option 1 either - I think that would also be an improvement on the status quo - but it's not clear to me what a good choice for the reserved separator should be


my use-case for this is to support Polars in Formulaic, where they currently use pd.get_dummies(..., dummy_na=False)

@CangyuanLi
Copy link

Sorry for the partial hijack, but since to_dummies is getting some attention, I would love for #17236 to be considered as well. It would mimic the options in sklearn's OneHotEncoder, see github.com/scikit-learn/scikit-learn/issues/11996

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-api Area: changes to the public API needs decision Awaiting decision by a maintainer
Projects
None yet
Development

No branches or pull requests

4 participants