QST: "Dummy" is rooted in ableist language #35724

Open
2 tasks done
RollingStar opened this issue Aug 14, 2020 · 36 comments
Labels
API Design Enhancement Needs Discussion Requires discussion from core team before further action

Comments

@RollingStar

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.


Question about pandas

Although extremely common in the industry, "dummy" has some unfortunate history. One current use is for substitutes - mannequins, stand-ins, etc. This use grew from its original definition, "mute person". Mute people are not substitutes or stand-ins and I would prefer Pandas to not contribute to this view. There are other words, like "indicator", for statistics.

Pandas currently uses "get_dummies" as a function name, with the documentation referencing "indicator" as a synonym.

Citations:

  1. https://www.etymonline.com/word/dummy
  2. https://www.etymonline.com/word/dumb
  3. https://www.etymonline.com/word/indicator
@RollingStar RollingStar added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Aug 14, 2020
@TomAugspurger
Contributor

Thanks for opening an issue.

How should we balance this against the cost of changing it to something like get_indicators or onehot_encode (the deprecation warnings users would see, and need to update for)? I'm having trouble weighing the two in my head.

@TomAugspurger TomAugspurger added API Design and removed Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Aug 14, 2020
@MarcoGorelli MarcoGorelli added the Needs Discussion Requires discussion from core team before further action label Aug 15, 2020
@MarcoGorelli
Member

Granted I'm punching above my weight by commenting on API design, but would it be possible to make get_indicators an alias of get_dummies, so it can be used whilst not breaking other people's code?

Given how painfully common it is to see warnings.filterwarnings("ignore"), I fear deprecation warnings would be ignored by many
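
A minimal sketch of the alias idea above, assuming the hypothetical name `get_indicators` (not a pandas API): binding a second name to the same function leaves existing callers untouched. The second half illustrates the worry about ignored warnings; `legacy_get_dummies` is likewise a made-up shim, not anything pandas ships.

```python
import warnings

import pandas as pd

# Hypothetical alias: a second name bound to the same function object.
get_indicators = pd.get_dummies

colors = pd.Series(["red", "green", "red"])  # illustrative data
same_result = get_indicators(colors).equals(pd.get_dummies(colors))


def legacy_get_dummies(*args, **kwargs):
    """Made-up deprecation shim, shown only to illustrate the warning path."""
    warnings.warn(
        "legacy_get_dummies is deprecated; use get_indicators instead",
        FutureWarning,
        stacklevel=2,
    )
    return pd.get_dummies(*args, **kwargs)


# A blanket filter, as often seen in user code, hides the warning entirely.
with warnings.catch_warnings(record=True) as caught:
    warnings.filterwarnings("ignore")
    legacy_get_dummies(colors)

print(same_result, len(caught))
```

With the blanket filter in place nothing is recorded, which is the scenario in which a deprecation cycle would go unnoticed.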

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Jan 21, 2021
Fixes: #7031

This PR introduces support for array-like inputs in `cudf.get_dummies`. I think in the near future we will have to deprecate `get_dummies` and adopt a new name: pandas-dev/pandas#35724

Authors:
  - GALI PREM SAGAR (@galipremsagar)

Approvers:
  - Keith Kraus (@kkraus14)

URL: #7181
@galipremsagar

+1 for get_indicators

@MarcoGorelli
Member

@pandas-dev/pandas-core anyone have any thoughts/objections on going through a deprecation cycle to rename get_dummies to get_indicators in version 2.0?

@simonjayhawkins
Member

we should be consistent with the rest of the ecosystem. What are other projects doing?

@MarcoGorelli
Member

sklearn calls it OneHotEncoder

get_onehotencoding?

@datapythonista
Member

My understanding is that one-hot encoding and dummies are almost the same, but not completely. OHE has a column per category in the results, while dummies has one less to avoid redundancy.

I'm +1 on using the actual names and avoiding the confusion that comes from finding equivalents, unless there are reports of people or communities being personally offended by our wording, which I don't think is the case.

But no big deal if the rest of the devs reach consensus on renaming. I just think we'd be wasting our users' time with a renaming for a reason that I personally don't see as insulting or a problem to anyone (I may be wrong).

@toobaz
Member

toobaz commented Feb 9, 2022

Can't judge whether to drop "dummies" (English is not my mother tongue and I had never associated a pejorative meaning with the "mannequin" sense of "dummy"), but as for what to replace it with, I'm pretty sure "OneHotEncoding" would sound weird to most people I know (social scientists). We'd rather go for get_booleans, which at least is a term pandas users are likely to already be accustomed to (although I do see the downside that the returned dtype is int, not bool). Or even just "categories" (despite the returned dtype not being categorical :-D ). I would still consider "indicators" better than "OneHotEncoding".
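
On the dtype point, a small sketch with made-up data: `get_dummies` accepts a `dtype` argument, so the output type can be requested explicitly rather than relying on the default (which, as far as I know, has not been the same in every pandas version).

```python
import pandas as pd

grades = pd.Series(["a", "b", "a"], name="grade")  # illustrative data

as_bool = pd.get_dummies(grades, dtype=bool)  # boolean indicator columns
as_int = pd.get_dummies(grades, dtype=int)    # 0/1 integer indicator columns

print(as_bool.dtypes.tolist())
print(as_int["a"].tolist())  # [1, 0, 1]
```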

By the way: I don't see references to ableism, or even just to controversy in naming, in Wikipedia ... any link to better understand the issue?

@MarcoGorelli
Member

OHE has a column per category in the results, while dummies has one less to avoid redundancy.

By default, get_dummies also has one column per category. There is a drop_first argument, but the default value is False.
Likewise, OHE has one column per category by default, but has an argument drop with which you can drop the first value.
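
To make the defaults concrete on the pandas side, a short sketch with made-up data: by default `get_dummies` emits one column per category, and `drop_first=True` removes the first level.

```python
import pandas as pd

sizes = pd.Series(["small", "medium", "large", "small"])  # illustrative data

full = pd.get_dummies(sizes)                      # default: drop_first=False
reduced = pd.get_dummies(sizes, drop_first=True)  # drops the first category level

print(list(full.columns))     # ['large', 'medium', 'small']
print(list(reduced.columns))  # ['medium', 'small']
```

The levels come out in sorted order, so "large" is the one dropped here.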

By the way: I don't see references to ableism, or even just to controversy in naming, in Wikipedia ... any link to better understand the issue?

For a start it's discouraged in the Google Developer documentation style guide

@toobaz
Member

toobaz commented Feb 9, 2022

For a start it's discouraged in the Google Developer documentation style guide

(True, but without any references and any mention of ableism... and surrounded by dozens of discouraged and definitely not offensive terms, and ironically linking to the Wikipedia page with the discouraged name)

@datapythonista
Member

And I wouldn't take as a reference of morality a company in the business of mass surveillance, censorship, monopoly abuses, political interference and brainwashing. ;)

@Dr-Irv
Contributor

Dr-Irv commented Feb 9, 2022

The term "dummy variable" is in wide use in how people learn about encoding categorical data and is not specific to software. The references on the Wikipedia article refer to a paper from 1957 that uses the term. It probably appears in statistics textbooks. SPSS uses the term in their documentation. IMHO, until the statistics/data science community at large decides to deprecate the language, it's not our responsibility to take the lead in doing so.

Having said that, having an alternate name such as get_indicators() is appropriate, but I think we should not deprecate get_dummies() and just leave it there, but no longer document it.

I also have to wonder how the publishers of the "XYZ for Dummies" series of books have handled this issue.

Finally I found it interesting to contrast the order of the definitions of the word "dummy" shown in these three references:
https://www.merriam-webster.com/dictionary/dummy
https://www.dictionary.com/browse/dummy
https://dictionary.cambridge.org/us/dictionary/english/dummy

For Merriam-Webster, the first category is related to not speaking or being stupid.
For dictionary.com, the first category is "a representation or copy of something, as for displaying to indicate appearance:"
For the Cambridge dictionary, the first category is "a large model of a human, especially one used to show clothes in a store"

One of the definitions reminded me that the term is also used as a word for a baby's pacifier.

@MarcoGorelli
Member

Hey @toobaz - what kind of reference are you looking for? You can find articles like this one which refers to it as an ableist term if you look. Likewise this one from HBR. I won't overload you with links though, as I'm not sure what you're looking for or what kind of source you'd accept

I think we should not deprecate get_dummies() and just leave it there, but no longer document it.

Agreed, I'd suggested this at the top - rename it, but continue to silently support the current name and not break people's code

@bashtage
Contributor

bashtage commented Feb 9, 2022

I also feel that dummy variable is so widespread that renaming would create a lot of confusion. It is much more commonly taught than one-hot encoding. The wiki article for one-hot mentions dummy in the first line - the reverse is not true and one-hot only makes it into a footnote.

I think there isn't a settled alternative to dummy, and until there is, no change should be made. Once the world converges on something that mostly stops using dummy variable, then that should be adopted. More or less how the master->main change worked in pandas.

@toobaz
Member

toobaz commented Feb 9, 2022

You can find articles like this one which refers to it as an ableist term if you look. Likewise this one from HBR.

Are you sure? I can't find the word "dummy" in either of the two, much less a reference to the "mannequin" meaning. But yes, these would otherwise have been somewhat better references than the Google documentation style guide.

@MarcoGorelli
Member

OK, they use "dumb", from which "dummy" comes (https://www.etymonline.com/word/dummy)

I'm OK with not doing this if others would prefer not to. It just seemed like the moment to bring this up, as otherwise it'll be a while until 3.0

@MarcoGorelli
Member

Doesn't look like there's much support for renaming, so let's close for now to keep the queue down - the discussion can always be reopened in the future if necessary

Thanks anyway @RollingStar for the suggestion!

@davidcavazos

From #48250 to keep the discussion in one place

The word "dummy" from the pd.get_dummies function can be offensive to some people and should be renamed.

It's marked as a word that should not be used by Google's inclusive language word list.

@TheNeuralBit commented:

A non-Google reference for "dummy" being non-inclusive: https://itconnect.uw.edu/guides-by-topic/identity-diversity-inclusion//inclusive-language-guide/

Why it’s problematic:
The origin of the word, “dummy,” is a person who cannot speak. Because the use of this word is often negatively associated with a disability, implying a person is worthless, ineffective or incapable, an alternative word should be used.

Some other sources which flag the use of the word "dummy":

Another document mentioning how it causes harm:

“Dummy” and similar terms stigmatize mental disabilities. The alternatives are clearer.

@MarcoGorelli
Member

Thanks @davidcavazos

Reopening then - perhaps we can discuss this in the next dev meeting (which btw anyone is welcome to attend)

@MarcoGorelli MarcoGorelli reopened this Aug 25, 2022
@davidcavazos

davidcavazos commented Aug 25, 2022

So far, some alternatives are:

  • get_indicators: Short and concise
  • get_indicator_variables: A little longer but more explicit
  • one_hot_encode: Different term, but explicit as well and used by other frameworks as well

Maybe we could open a vote to finalize the name.

@davidcavazos

Thanks @davidcavazos

Reopening then - perhaps we can discuss this in the next dev meeting (which btw anyone is welcome to attend)

Thanks, I've added it to my calendar

@TheNeuralBit
Contributor

Another point I'd like to bring over from #48250 (comment):

My takeaway from [the discussion in #35724] is that adding a separate get_indicators (or some other agreed upon alternative) would be amenable. From there we could either:

  • Deprecate and ultimately remove get_dummies, or
  • Prefer get_indicators in documentation to nudge users there

It seems the former was rejected, but the latter could be acceptable. Could we pursue that approach?

@jbrockmendel
Member

can be offensive to some people

Is there evidence on this?

Some other sources which flag the use of the word "dummy":

I do not find these compelling. They also suggest replacing "normal" with "typical". Should scipy/statsmodels deprecate references to the Normal Distribution?

https://twitter.com/jbarro/status/1467250971361386505

“inclusive language” — that is, the creation of a long list of weird required adjustments to language, separating those who know and subscribe to all the latest rules from those who don’t — is not actually inclusive.

@davidcavazos

davidcavazos commented Aug 25, 2022

Nobody expects anyone to know all the words, but there's also a long historical background of poor choices of words which convey a negative context or are sensitive to groups of people (like master, slave, kill, etc). Fortunately there are people who have invested the time of compiling these words into lists to make them more searchable. Many fall into gray areas, but there are some which make sense to change. That's why GitHub renamed the main branch name from master, even if that was pretty disruptive at the time.

@attack68
Contributor

"Master and slave" is such an unequivocal and obvious corporate reputational risk that it had to be changed.

In my opinion, "dummy" in the context of dummy variables offers no offensive connotation. Dictionary definitions of dummy variables make no reference to it, the wikipedia article on dummy variables makes no reference and the widespread use of it in scientific papers suggests to me anyone finding that particular use offensive in that context is overly sensitive. I consider the language to have evolved.

I am -1 on changing for the sole purpose of sensitivity. I am +0.5 on adding other function names if they are synchronised with other libraries.

@toobaz
Member

toobaz commented Aug 25, 2022

I agree with @attack68, and let me add that "master and slave" is computer science jargon that programmers, including pandas programmers, are agreeing to transition away from.

"Dummy" is established technical jargon from statistics that pandas has adopted from, not imposed on, its users, who are mostly not involved in its development. We do not decide how statisticians (or whoever follows a statistics class) talk, and as of now, I have no doubt that our users will find "get_dummy" more understandable than the alternatives.

Now, I would never say "let's stick to what this mass of people do, whatever the harm we cause to users". But as mentioned by others, there is no indication that get_dummies is causing harm to any group/community. The technical use is derived from a meaning unrelated to disabilities ("mannequin") that has itself been well established in common parlance for almost two centuries.

@Dr-Irv
Contributor

Dr-Irv commented Aug 25, 2022

We do not decide how statisticians (or whoever follows a statistics class) talk, and as of now, I have no doubt that our users will find "get_dummy" more understandable than the alternatives.

I said something similar above: #35724 (comment) The usage is well-established in the statistics literature and in packages like SAS and SPSS.

As a compromise, I'd like to suggest that we should create a get_indicators() method that is the same as get_dummies(), document get_indicators(), but leave get_dummies() in the API and just remove any documentation of it.

Since in Britain, a baby's pacifier is called a dummy, maybe this suggestion will pacify those who object to the current method name.

@toobaz
Member

toobaz commented Aug 25, 2022

The usage is well-established in the statistics literature and in packages like SAS and SPSS.

... and Stata, and R... the latter goes as far as providing a dummify function.

As a compromise, I'd like to suggest that we should create a get_indicators() method that is the same as get_dummies(), document get_indicators(), but leave get_dummies() in the API and just remove any documentation of it.

If we decided to go this route, the get_dummies() docstring should at least be "See get_indicators()"
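
A rough sketch of what that compromise could look like, written as standalone functions rather than as pandas internals. Both names below are local to the example; `get_indicators` is the hypothetical new name discussed in this thread, not an existing pandas function.

```python
import pandas as pd


def get_indicators(data, **kwargs):
    """Convert categorical variables into indicator (0/1) columns.

    Hypothetical documented entry point; it simply forwards to
    ``pandas.get_dummies``.
    """
    return pd.get_dummies(data, **kwargs)


def get_dummies(data, **kwargs):
    """See get_indicators()."""
    # Legacy name kept for backwards compatibility: no warning, minimal docs.
    return get_indicators(data, **kwargs)


df = pd.DataFrame({"pet": ["cat", "dog", "cat"]})  # illustrative data
print(get_dummies(df).equals(get_indicators(df)))  # True
```

Existing code calling the old name keeps working unchanged, while the documentation points readers at the new one.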

@mroeschke
Member

Noting that 1.5 just added from_dummies, so that method would need the same treatment as well: https://pandas.pydata.org/docs/dev/whatsnew/v1.5.0.html#from-dummies
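
For context, a minimal round trip through the two functions, assuming pandas 1.5 or later (where `from_dummies` is available) and made-up data:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red"]})  # illustrative data

encoded = pd.get_dummies(df, dtype=int)      # columns: color_green, color_red
decoded = pd.from_dummies(encoded, sep="_")  # split column names on "_"

print(list(encoded.columns))
print(decoded["color"].tolist())  # ['red', 'green', 'red']
```

Any rename of `get_dummies` would presumably want a matching name for this inverse operation.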

@jreback
Contributor

jreback commented Aug 25, 2022

-0.5 on any change; as indicated this is a very common term

not completely averse though as this is a one hot encoding operation ; we could name similarly to sklearn

@kennknowles

I'm someone who isn't deeply involved with statistics or whatever realm this odd use of the term comes from. So at first glance it doesn't even make sense. By far the more common usage is "placeholder". So for people like me, which I think is most people in this case, the term is also esoteric or misleading, even aside from insensitivity. That's probably why the official docs immediately clarify with an alternate term "indicator" that is more common and sensible. Adding get_indicators and leaving get_dummies undocumented just for backwards-compatibility will improve the library for everyone.

@kennknowles

When this was brought up to me, I had to look up what it did, and was surprised at what a terrible name this is for the function. But from this thread I do understand it is stats jargon. So my take is just an external view, that this particular piece of jargon is exceptionally badly chosen and there are multiple better choices in even broader use.

@toobaz
Member

toobaz commented Aug 26, 2022

By far the more common usage is "placeholder".

What are you basing your statements on? 4 different software packages were named from which people often move to pandas, and they all use "dummy". Sklearn has "one hot encoding". Then for sure pandas isn't perfectly equivalent to any of these, but I don't know anyone or anything that uses "placeholder". Wikipedia, in the page "dummy variable" (yes) does provide 6 alternatives: "indicator variable" is the first, "placeholder" is not one of them.

@bashtage
Contributor

In stats, I would say in order of commonality (with 3 and 4 being much rarer than 1 and 2):

  1. dummy
  2. indicator
  3. binary
  4. dichotomous

In ML, one-hot encoding is common, although this is a description of the method used to create the dichotomous values rather than of the variable.

@bashtage
Contributor

I feel like some of the confusion is based on the usage of dummy in comp sci, which is often a simple version of something complex. The usage of dummy in statistics is not the same, and IME the intent of the word dummy in the context of the statistics is not the same as it is in comp sci.

@kennknowles

Yes, the wikipedia page on the stats use of the term lists the stats use of the term first :-)

I'm referring to the use of the term beyond stats, just to offer an outside perspective FWIW. The "comp sci" use is much more widespread than computing, in my experience. But I'm certainly not advocating for that use, either. It is also insensitive and not descriptive.
