QST: "Dummy" is rooted in ableist language #35724

RollingStar · 2020-08-14T16:01:46Z

I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.

Question about pandas

Although extremely common in the industry, "dummy" has some unfortunate history. One current use is for substitutes - mannequins, stand-ins, etc. This use grew from its original definition, "mute person". Mute people are not substitutes or stand-ins and I would prefer Pandas to not contribute to this view. There are other words, like "indicator", for statistics.

Pandas currently uses "get_dummies" as a function name, with the documentation referencing "indicator" as a synonym.

Citations:

TomAugspurger · 2020-08-14T16:20:56Z

Thanks for opening an issue.

How should we balance this against the cost of changing it to something like get_indicators or onehot_encode (the deprecation warnings users would see, and need to update for)? I'm having trouble weighing the two in my head.

MarcoGorelli · 2020-08-16T15:13:10Z

Granted I'm punching above my weight by commenting on API design, but would it be possible to make get_indicators an alias of get_dummies, so it can be used whilst not breaking other people's code?

Given how painfully common it is to see warnings.filterwarnings("ignore"), I fear deprecation warnings would be ignored by many
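
The alias idea can be sketched in a few lines. This is only an illustration under stated assumptions: `get_dummies` below is a small pure-Python stand-in rather than the pandas function, `get_dummies_deprecated` is a hypothetical wrapper showing what a deprecation cycle might look like, and `FutureWarning` is an assumed choice of warning category.

```python
import warnings


def get_dummies(values):
    """Stand-in one-hot encoder (NOT the pandas implementation)."""
    categories = sorted(set(values))
    return {c: [int(v == c) for v in values] for c in categories}


# Plain alias: both names are bound to the same function object,
# so existing callers of get_dummies are unaffected.
get_indicators = get_dummies


def get_dummies_deprecated(values):
    """Hypothetical deprecated wrapper, shown for contrast with the alias."""
    warnings.warn(
        "get_dummies is deprecated; use get_indicators",
        FutureWarning,
        stacklevel=2,
    )
    return get_indicators(values)


# A blanket "ignore" filter hides the deprecation warning entirely.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore")
    get_dummies_deprecated(["a", "b"])
silenced = len(caught) == 0
```

With the plain alias nothing changes for existing code, while the last few lines show how a warning could go unseen when a blanket filter is in place.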

@galipremsagar

Fixes: #7031 This PR introduces support for array-like inputs in `cudf.get_dummies`. I think in the near future we will have to deprecate and adopt a new name for `get_dummies`: pandas-dev/pandas#35724 Authors: - GALI PREM SAGAR (@galipremsagar) Approvers: - Keith Kraus (@kkraus14) URL: #7181

galipremsagar · 2021-01-25T05:34:08Z

+1 for get_indicators

MarcoGorelli · 2022-02-09T13:34:20Z

@pandas-dev/pandas-core anyone have any thoughts/objections on going through a deprecation cycle to rename get_dummies to get_indicators in version 2.0?

simonjayhawkins · 2022-02-09T14:00:04Z

we should be consistent with the rest of the ecosystem. What are other projects doing?

MarcoGorelli · 2022-02-09T14:12:10Z

sklearn calls it OneHotEncoder

get_onehotencoding?

datapythonista · 2022-02-09T15:42:31Z

My understanding is that one hot encoding and dummies are almost the same, but not completely. OHE has a column per category in the results, while dummies has one less to avoid redundancy.

I'm +1 on using the actual names and avoiding the confusion that comes from finding equivalents. Unless there are reports of people or communities getting personally offended by our wording, which I don't think is the case.

But it's no big deal if the rest of the devs reach consensus on renaming. I just think we'd be wasting our users' time with a renaming, for a reason that I personally don't see as insulting or a problem to anyone (I may be wrong).

toobaz · 2022-02-09T15:49:14Z

I can't judge whether to drop "dummies" (English is not my mother tongue and I had never associated a pejorative effect with the "mannequin" meaning of "dummy"), but on what to replace it with, I'm pretty sure "OneHotEncoding" would sound weird to most people I know (social scientists). We'd rather go for get_booleans then, which at least is a term pandas users are likely to already be accustomed to (although I do see the downside that the returned dtype is int, not bool). Or even just "categories" (despite the returned dtype not being categorical :-D ). I would still consider "indicators" better than "OneHotEncoding".

By the way: I don't see references to ableism, or even just to controversy in naming, in Wikipedia ... any link to better understand the issue?

MarcoGorelli · 2022-02-09T16:03:50Z

> OHE has a column per category in the results, while dummies has one less to avoid redundancy.

By default, get_dummies also has one column per category. There is a drop_first argument, but the default value is False.
Likewise, OHE has one column per category by default, but has an argument drop with which you can drop the first value.
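
To make the column counts concrete, here is a minimal pure-Python sketch of the behaviour described above. The function `one_hot` is a hypothetical stand-in, not pandas or sklearn code; only the `drop_first` name and its default of `False` are taken from the description of `get_dummies`.

```python
def one_hot(values, drop_first=False):
    """Return {category: 0/1 column}; an illustrative stand-in only."""
    categories = sorted(set(values))
    if drop_first:
        # k categories -> k - 1 columns; the dropped one is implied
        # by all remaining columns being 0.
        categories = categories[1:]
    return {c: [int(v == c) for v in values] for c in categories}


colors = ["red", "green", "blue", "green"]

full = one_hot(colors)                      # default: one column per category
reduced = one_hot(colors, drop_first=True)  # one column fewer

print(list(full))     # ['blue', 'green', 'red']
print(list(reduced))  # ['green', 'red']
```

In pandas itself the corresponding call would be `pd.get_dummies(..., drop_first=True)`; the sketch only mirrors the column-count difference.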

> By the way: I don't see references to ableism, or even just to controversy in naming, in Wikipedia ... any link to better understand the issue?

For a start it's discouraged in the Google Developer documentation style guide

toobaz · 2022-02-09T16:10:13Z

> For a start it's discouraged in the Google Developer documentation style guide

(True, but without any references and any mention of ableism... and surrounded by dozens of discouraged and definitely not offensive terms, and ironically linking to the Wikipedia page with the discouraged name)

datapythonista · 2022-02-09T16:12:41Z

And I wouldn't take as a reference of morality a company in the business of mass surveillance, censorship, monopoly abuses, political interference and brainwashing. ;)

Dr-Irv · 2022-02-09T16:23:50Z

The term "dummy variable" is in wide use in how people learn about encoding categorical data and is not specific to software. The references on the Wikipedia article refer to a paper from 1957 that uses the term. It probably appears in statistics textbooks. SPSS uses the term in their documentation. IMHO, until the statistics/data science community at large decides to deprecate the language, it's not our responsibility to take the lead in doing so.

Having said that, having an alternate name such as get_indicators() is appropriate, but I think we should not deprecate get_dummies() and just leave it there, but no longer document it.

I also have to wonder how the publishers of the "XYZ for Dummies" series of books have handled this issue.

Finally I found it interesting to contrast the order of the definitions of the word "dummy" shown in these three references:
https://www.merriam-webster.com/dictionary/dummy
https://www.dictionary.com/browse/dummy
https://dictionary.cambridge.org/us/dictionary/english/dummy

For Merriam-Webster, the first category is related to not speaking or being stupid.
For dictionary.com, the first category is "a representation or copy of something, as for displaying to indicate appearance:"
For the Cambridge dictionary, the first category is "a large model of a human, especially one used to show clothes in a store"

One of the definitions reminded me that the term is also used as a word for a baby's pacifier.

MarcoGorelli · 2022-02-09T16:29:23Z

Hey @toobaz - what kind of reference are you looking for? You can find articles like this one which refers to it as an ableist term if you look. Likewise this one from HBR. I won't overload you with links though, as I'm not sure what you're looking for or what kind of source you'd accept

> I think we should not deprecate get_dummies() and just leave it there, but no longer document it.

Agreed, I'd suggested this at the top - rename it, but continue to silently support the current name and not break people's code

bashtage · 2022-02-09T16:31:41Z

I also feel that dummy variable is so widespread that renaming would create a lot of confusion. It is much more commonly taught than one hot encoding. The wiki article for one-hot mentions dummy in the first line - the reverse is not true and one-hot only makes it into a footnote.

I think there isn't a settled alternative to dummy, and until there is, no change should be made. Once the world converges on something that mostly stops using dummy variable, then that should be adopted. More or less how the master->main change worked in pandas.

toobaz · 2022-02-09T16:35:07Z

> You can find articles like this one which refers to it as an ableist term if you look. Likewise this one from HBR.

Are you sure? I can't find the word "dummy" in either of the two, let alone a reference to the "mannequin" meaning. But yes, these would otherwise have been somewhat better references than the Google documentation style guide.

MarcoGorelli · 2022-02-09T16:46:52Z

OK, they use "dumb", from which "dummy" comes (https://www.etymonline.com/word/dummy)

I'm OK with not doing this if others would prefer not to anyway, it just seemed like the moment to bring this up, as otherwise it'll be a while until 3.0

MarcoGorelli · 2022-02-13T10:03:55Z

Doesn't look like there's much support for renaming, so let's close for now to keep the queue down - the discussion can always be reopened in the future if necessary

Thanks anyway @RollingStar for the suggestion!

davidcavazos · 2022-08-25T19:43:59Z

From #48250 to keep the discussion in one place

The word "dummy" from the pd.get_dummies function can be offensive to some people and should be renamed.

It's marked as a word that should not be used by Google's inclusive language word list.

@TheNeuralBit commented:

A non-Google reference for "dummy" being non-inclusive: https://itconnect.uw.edu/guides-by-topic/identity-diversity-inclusion//inclusive-language-guide/

Why it’s problematic:
The origin of the word, “dummy,” is a person who cannot speak. Because the use of this word is often negatively associated with a disability, implying a person is worthless, ineffective or incapable, an alternative word should be used.

Some other sources which flag the use of the word "dummy":

Another document mentioning how it causes harm:

“Dummy” and similar terms stigmatize mental disabilities. The alternatives are clearer.

MarcoGorelli · 2022-08-25T19:47:21Z

Thanks @davidcavazos

Reopening then - perhaps we can discuss this in the next dev meeting (which btw anyone is welcome to attend)

davidcavazos · 2022-08-25T19:49:13Z

So far, some alternatives are:

get_indicators: Short and concise
get_indicator_variables: A little longer but more explicit
one_hot_encode: Different term, but explicit as well and used by other frameworks as well

Maybe we could open a vote to finalize the name.

davidcavazos · 2022-08-25T19:53:08Z

> Thanks @davidcavazos
>
> Reopening then - perhaps we can discuss this in the next dev meeting (which btw anyone is welcome to attend)

Thanks, I've added it to my calendar

TheNeuralBit · 2022-08-25T20:01:13Z

Another point I'd like to bring over from #48250 (comment):

My takeaway from [the discussion in #35724] is that adding a separate get_indicators (or some other agreed upon alternative) would be amenable. From there we could either:

Deprecate and ultimately remove get_dummies, or

Prefer get_indicators in documentation to nudge users there

It seems the former was rejected, but the latter could be acceptable. Could we pursue that approach?

jbrockmendel · 2022-08-25T20:09:18Z

> can be offensive to some people

Is there evidence on this?

> Some other sources which flag the use of the word "dummy":

I do not find these compelling. They also suggest replacing "normal" with "typical". Should scipy/statsmodels deprecate references to the Normal Distribution?

https://twitter.com/jbarro/status/1467250971361386505

“inclusive language” — that is, the creation of a long list of weird required adjustments to language, separating those who know and subscribe to all the latest rules from those who don’t — is not actually inclusive.

davidcavazos · 2022-08-25T20:43:17Z

Nobody expects anyone to know all the words, but there's also a long historical background of poor choices of words which convey a negative context or are sensitive to groups of people (like master, slave, kill, etc). Fortunately there are people who have invested the time of compiling these words into lists to make them more searchable. Many fall into gray areas, but there are some which make sense to change. That's why GitHub renamed the main branch name from master, even if that was pretty disruptive at the time.

attack68 · 2022-08-25T21:14:33Z

"Master and slave" is such an unequivocal and obvious corporate reputational risk that it had to be changed.

In my opinion, "dummy" in the context of dummy variables offers no offensive connotation. Dictionary definitions of dummy variables make no reference to it, the wikipedia article on dummy variables makes no reference and the widespread use of it in scientific papers suggests to me anyone finding that particular use offensive in that context is overly sensitive. I consider the language to have evolved.

I am -1 on changing for the sole purpose of sensibility. I am +0.5 on the other suggestions to add function names that are synchronised with other libraries.

toobaz · 2022-08-25T21:46:37Z

I agree with @attack68, and let me add that "master and slave" is computer science jargon that programmers are agreeing to transition away from, including pandas programmers.

"Dummy" is established, technical jargon from statistics that pandas has adopted from, not imposed to, its users, who are mostly not involved in its development. We do not decide how statisticians (or whoever follows a statistics class) talk, and as of now, I have no doubt that our users will find "get_dummy" more understandable than the alternatives.

Now, I would never say "let's stick to what this mass of people do, whatever the harm we cause to users". But as mentioned by others, there is no indication that get_dummies is causing harm to any group/community. The technical use is derived from a meaning unrelated to disabilities ("mannequin") that itself has been well established in common parlance for almost two centuries.

Dr-Irv · 2022-08-25T21:52:14Z

> We do not decide how statisticians (or whoever follows a statistics class) talk, and as of now, I have no doubt that our users will find "get_dummy" more understandable than the alternatives.

I said something similar above: #35724 (comment) The usage is well-established in the statistics literature and in packages like SAS and SPSS.

As a compromise, I'd like to suggest that we should create a get_indicators() method that is the same as get_dummies(), document get_indicators(), but leave get_dummies() in the API and just remove any documentation of it.

Since in Britain, a baby's pacifier is called a dummy, maybe this suggestion will pacify those who object to the current method name.

toobaz · 2022-08-25T22:00:53Z

> The usage is well-established in the statistics literature and in packages like SAS and SPSS.

... and Stata, and R... the latter goes as far as providing a dummify function.

> As a compromise, I'd like to suggest that we should create a get_indicators() method that is the same as get_dummies(), document get_indicators(), but leave get_dummies() in the API and just remove any documentation of it.

If we decided to go this route, the get_dummies() docstring should at least be "See get_indicators()"
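
A minimal sketch of that arrangement, assuming a pure-Python stand-in for the encoder (neither function below is pandas code, and `get_indicators` is only the name proposed in this thread):

```python
def get_indicators(values):
    """Return one 0/1 indicator column per distinct value.

    This is the documented entry point in the sketch.
    """
    categories = sorted(set(values))
    return {c: [int(v == c) for v in values] for c in categories}


def get_dummies(values):
    """See get_indicators()."""
    # Kept only for backwards compatibility; no warning is emitted,
    # so existing code keeps running unchanged.
    return get_indicators(values)
```

Calling `help(get_dummies)` would then show only the pointer to the documented name, while both functions return identical results.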

mroeschke · 2022-08-25T22:58:22Z

Noting that 1.5 just added from_dummies, so that method would need the same treatment as well: https://pandas.pydata.org/docs/dev/whatsnew/v1.5.0.html#from-dummies
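
The reason the two names are coupled is that one operation undoes the other. A rough pure-Python sketch of that round trip follows; `to_indicators` and `from_indicators` are hypothetical stand-ins and do not reproduce the pandas signatures or options.

```python
def to_indicators(values):
    """Encode a list of labels as {category: 0/1 column}."""
    categories = sorted(set(values))
    return {c: [int(v == c) for v in values] for c in categories}


def from_indicators(columns):
    """Invert to_indicators: recover the label of each row."""
    if not columns:
        return []
    n_rows = len(next(iter(columns.values())))
    labels = []
    for i in range(n_rows):
        # Exactly one column should be "hot" in each row.
        hot = [c for c, col in columns.items() if col[i] == 1]
        if len(hot) != 1:
            raise ValueError(
                f"row {i} has {len(hot)} active indicators, expected 1"
            )
        labels.append(hot[0])
    return labels


original = ["b", "a", "c", "a"]
restored = from_indicators(to_indicators(original))
```

Whatever name is chosen for the encoding function, the decoding function would presumably need a matching one so the pair stays recognisable.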

jreback · 2022-08-25T23:05:08Z

-0.5 on any change; as indicated this is a very common term

not completely averse though as this is a one hot encoding operation ; we could name similarly to sklearn

kennknowles · 2022-08-26T01:07:27Z

I'm someone who isn't deeply involved with statistics or whatever realm this odd use of the term comes from. So at first glance it doesn't even make sense. By far the more common usage is "placeholder". So for people like me, which I think is most people in this case, the term is also esoteric or misleading, even aside from insensitivity. That's probably why the official docs immediately clarify with an alternate term "indicator" that is more common and sensible. Adding get_indicators and leaving get_dummies undocumented just for backwards-compatibility will improve the library for everyone.

kennknowles · 2022-08-26T01:12:15Z

When this was brought up to me, I had to look up what it did, and was surprised at what a terrible name this is for the function. But from this thread I do understand it is stats jargon. So my take is just an external view, that this particular piece of jargon is exceptionally badly chosen and there are multiple better choices in even broader use.

toobaz · 2022-08-26T07:06:45Z

> By far the more common usage is "placeholder".

What are you basing your statements on? 4 different software packages were named from which people often move to pandas, and they all use "dummy". Sklearn has "one hot encoding". Then for sure pandas isn't perfectly equivalent to any of these, but I don't know anyone or anything that uses "placeholder". Wikipedia, in the page "dummy variable" (yes) does provide 6 alternatives: "indicator variable" is the first, "placeholder" is not one of them.

bashtage · 2022-08-26T07:12:38Z

In stats, I would say in order of commonality (with 3 and 4 being much rarer than 1 and 2):

1. dummy
2. indicator
3. binary
4. dichotomous

In ML, one hot encoding is common, although this describes the method used to create the dichotomous values rather than the variable itself.

bashtage · 2022-08-26T07:14:25Z

I feel like some of the confusion is based on the usage of dummy in comp sci, which is often a simple version of something complex. The usage of dummy in statistics is not the same, and IME the intent of the word dummy in the context of the statistics is not the same as it is in comp sci.

kennknowles · 2022-08-26T14:11:37Z

Yes, the wikipedia page on the stats use of the term lists the stats use of the term first :-)

I'm referring to the use of the term beyond stats, just to offer an outside perspective FWIW. The "comp sci" use is much more widespread than computing, in my experience. But I'm certainly not advocating for that use, either. It is also insensitive and not descriptive.

RollingStar added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Aug 14, 2020

TomAugspurger added API Design and removed Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Aug 14, 2020

MarcoGorelli added the Needs Discussion Requires discussion from core team before further action label Aug 15, 2020

galipremsagar mentioned this issue Jan 21, 2021

[REVIEW] Add support for array-like inputs in cudf.get_dummies rapidsai/cudf#7181

Merged

beckernick mentioned this issue Feb 17, 2021

[BUG] get_dummies fails in dask-cudf due to dask categorical type checking rapidsai/cudf#7111

Closed

mroeschke added the Enhancement label Aug 10, 2021

MarcoGorelli closed this as completed Feb 13, 2022

MarcoGorelli mentioned this issue Aug 25, 2022

ENH: Rename get_dummies to more inclusive language #48250

Closed

3 tasks

MarcoGorelli reopened this Aug 25, 2022

gdalle mentioned this issue Jul 5, 2023

New package: FixedRNGs v0.0.1 JuliaRegistries/General#86858

Closed

MyreMylar mentioned this issue Nov 4, 2023

Reduce amount of usage of "dummy" in codebase to minimum pygame-community/pygame-ce#2547

Merged

kafitzgerald mentioned this issue Nov 20, 2023

More inclusive language NCAR/geocat-examples#570

Open

d33bs mentioned this issue Mar 19, 2024

Add linear model results from concat plates WayScience/nf1_schwann_cell_painting_data#40

Merged

zachaysan mentioned this issue Apr 24, 2024

chore(saas/hubspot): create contacts with default domain Flagsmith/flagsmith#3830

Merged

5 tasks

BCerki mentioned this issue May 13, 2024

1222 vitest widgets bcgov/cas-registration#1605

Merged
