QST: "Dummy" is rooted in ableist language #35724
Comments
Thanks for opening an issue. How should we balance this against the cost of changing it to something like |
Granted I'm punching above my weight by commenting on API design, but would it be possible to make Given how painfully common it is to see |
Fixes: #7031 This PR introduces support for array-like inputs in `cudf.get_dummies`. I think in the near future we will have to deprecate `get_dummies` and adopt a new name for it: pandas-dev/pandas#35724 Authors: - GALI PREM SAGAR (@galipremsagar) Approvers: - Keith Kraus (@kkraus14) URL: #7181
+1 for |
@pandas-dev/pandas-core anyone have any thoughts/objections on going through a deprecation cycle to rename |
we should be consistent with the rest of the ecosystem. What are other projects doing? |
sklearn calls it OneHotEncoder
|
My understanding is that one-hot encoding and dummies are almost the same, but not completely. OHE has a column per category in the results, while dummies has one less to avoid redundancy. I'm +1 on using the actual names and avoiding confusion by finding equivalents, unless there are reports of people or communities being personally offended by our wording, which I don't think is the case. But it's no big deal if the rest of the devs reach a consensus on renaming. I just think we'd be wasting our users' time with a renaming for a reason I personally don't see as insulting or a problem to anyone (I may be wrong). |
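To make the distinction in the comment above concrete, here is a minimal plain-Python sketch of the two encodings. The helper name `encode` and the sample data are hypothetical and this is not pandas' implementation. Note that pandas' `get_dummies` takes a `drop_first` parameter that defaults to `False`, so by default it keeps all k columns; dropping one column is the convention the comment attributes to "dummies".

```python
def encode(values, drop_first=False):
    """Return {category: [0/1, ...]} indicator columns for `values`.

    Hypothetical helper for illustration only.
    """
    categories = sorted(set(values))
    if drop_first:
        # Drop the first (reference) category: it is implied whenever
        # all remaining indicator columns are 0.
        categories = categories[1:]
    return {cat: [int(v == cat) for v in values] for cat in categories}


colors = ["red", "green", "blue", "green"]

one_hot = encode(colors)                    # one column per category (k columns)
dummies = encode(colors, drop_first=True)   # k - 1 columns, "blue" is the reference

print(list(one_hot))   # ['blue', 'green', 'red']
print(list(dummies))   # ['green', 'red']
```

With k columns every row sums to 1, which is the redundancy the k - 1 form removes.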
Can't judge on whether to do drop "dummies" (English is not my mother tongue and I had never associated a pejorative effect to the "mannequin" meaning of "dummy"), but on what to replace it with, I'm pretty sure "OneHotEncoding" would sound weird to most people I know (social scientists). We'd rather then go for By the way: I don't see references to ableism, or even just to controversy in naming, in Wikipedia ... any link to better understand the issue? |
By default,
For a start it's discouraged in the Google Developer documentation style guide |
(True, but without any references and any mention of ableism... and surrounded by dozens of discouraged and definitely not offensive terms, and ironically linking to the Wikipedia page with the discouraged name) |
And I wouldn't take as a reference of morality a company in the business of mass surveillance, censorship, monopoly abuses, political interference and brainwashing. ;) |
The term "dummy variable" is in wide use in how people learn about encoding categorical data and is not specific to software. The references on the Wikipedia article refer to a paper from 1957 that uses the term. It probably appears in statistics textbooks. SPSS uses the term in their documentation. IMHO, until the statistics/data science community at large decides to deprecate the language, it's not our responsibility to take the lead in doing so. Having said that, having an alternate name such as I also have to wonder how the publishers of the "XYZ for Dummies" series of books have handled this issue. Finally I found it interesting to contrast the order of the definitions of the word "dummy" shown in these three references: For Merriam-Webster, the first category is related to not speaking or being stupid. One of the definitions reminded me that the term is also used as a word for a baby's pacifier. |
Hey @toobaz - what kind of reference are you looking for? You can find articles like this one which refers to it as an ableist term if you look. Likewise this one from HBR. I won't overload you with links though, as I'm not sure what you're looking for or what kind of source you'd accept
Agreed, I'd suggested this at the top - rename it, but continue to silently support the current name and not break people's code |
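The "rename but keep the old name working" idea above can be sketched as follows. Everything here is an assumption for illustration: `get_indicators` is a hypothetical new name (the thread does not settle on one), its body is a stand-in rather than pandas' code, and `alias` is a made-up helper. The `warn=True` variant shows what a deprecation cycle would add on top of a silent alias.

```python
import functools
import warnings


def get_indicators(values):
    """Hypothetical new name; the body is a stand-in, not pandas' implementation."""
    categories = sorted(set(values))
    return {cat: [int(v == cat) for v in values] for cat in categories}


def alias(new_func, old_name, warn=False):
    """Expose `new_func` under `old_name`, optionally warning on each call."""

    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        if warn:
            warnings.warn(
                f"{old_name} is deprecated; use {new_func.__name__} instead",
                FutureWarning,
                stacklevel=2,
            )
        return new_func(*args, **kwargs)

    wrapper.__name__ = old_name
    return wrapper


# Silent alias: existing code keeps working and sees no warning.
get_dummies = alias(get_indicators, "get_dummies")

# Deprecation-cycle variant: same result, but callers are told to migrate.
get_dummies_deprecated = alias(get_indicators, "get_dummies", warn=True)
```

The silent alias costs users nothing, while the warning variant only makes sense if the old name is eventually going to be removed.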
I also feel that "dummy variable" is so widespread that renaming would create a lot of confusion. It is much more commonly taught than one-hot encoding. The wiki article for one-hot mentions dummy in the first line; the reverse is not true, and one-hot only makes it into a footnote. I think there isn't a settled alternative to dummy, and until there is, no change should be made. Once the world converges on something and mostly stops using "dummy variable", then that should be adopted. More or less how the master->main change worked in pandas. |
Are you sure? I can't find the word "dummy" in either of the two, let alone a reference to the "mannequin" meaning. But yes, these would otherwise have been somewhat better references than the Google documentation style guide. |
OK, they use "dumb", from which "dummy" derives (https://www.etymonline.com/word/dummy). I'm OK with not doing this if others would prefer not to; it just seemed like the moment to bring this up, as otherwise it'll be a while until 3.0 |
Doesn't look like there's much support for renaming, so let's close for now to keep the queue down - the discussion can always be reopened in the future if necessary Thanks anyway @RollingStar for the suggestion! |
The word "dummy" from the pd.get_dummies function can be offensive to some people and should be renamed. It's marked as a word that should not be used by Google's inclusive language word list. @TheNeuralBit commented:
Some other sources which flag the use of the word "dummy":
Another document mentioning how it causes harm:
|
Thanks @davidcavazos Reopening then - perhaps we can discuss this in the next dev meeting (which btw anyone is welcome to attend) |
So far, some alternatives are:
Maybe we could open a voting to finalize the name. |
Thanks, I've added it to my calendar |
Another point I'd like to bring over from #48250 (comment):
|
Is there evidence on this?
I do not find these compelling. They also suggest replacing "normal" with "typical". Should scipy/statsmodels deprecate references to the Normal Distribution? https://twitter.com/jbarro/status/1467250971361386505
|
Nobody expects anyone to know all the words, but there's also a long historical background of poor choices of words which convey a negative context or are sensitive to groups of people (like master, slave, kill, etc). Fortunately there are people who have invested the time of compiling these words into lists to make them more searchable. Many fall into gray areas, but there are some which make sense to change. That's why GitHub renamed the main branch name from master, even if that was pretty disruptive at the time. |
"Master and slave" is such an unequivocal and obvious corporate reputational risk that it had to be changed. In my opinion, "dummy" in the context of dummy variables carries no offensive connotation. Dictionary definitions of dummy variables make no reference to it, the Wikipedia article on dummy variables makes no reference to it, and its widespread use in scientific papers suggests to me that anyone finding that particular use offensive in that context is overly sensitive. I consider the language to have evolved. I am -1 on changing for the sole purpose of sensibility. I am +0.5 on the other suggestions to add function names that are synchronised with other libraries. |
I agree with @attack68, and let me add that "master and slave" is computer science jargon that programmers, including pandas programmers, are agreeing to transition away from. "Dummy" is established, technical jargon from statistics that pandas has adopted from, not imposed on, its users, who are mostly not involved in its development. We do not decide how statisticians (or whoever follows a statistics class) talk, and as of now, I have no doubt that our users will find "get_dummy" more understandable than the alternatives. Now, I would never say "let's stick to what this mass of people do, whatever the harm we cause to users". But as mentioned by others, there is no indication that |
I said something similar above: #35724 (comment) The usage is well-established in the statistics literature and in packages like SAS and SPSS. As a compromise, I'd like to suggest that we should create a Since in Britain, a baby's pacifier is called a dummy, maybe this suggestion will pacify those who object to the current method name. |
... and Stata, and R... the latter goes as far as providing a
If we decided to go this route, the |
Noting that 1.5 just added |
-0.5 on any change; as indicated, this is a very common term. I'm not completely averse, though, as this is a one-hot encoding operation; we could name it similarly to sklearn |
I'm someone who isn't deeply involved with statistics or whatever realm this odd use of the term comes from. So at first glance it doesn't even make sense. By far the more common usage is "placeholder". So for people like me, which I think is most people in this case, the term is also esoteric or misleading, even aside from insensitivity. That's probably why the official docs immediately clarify with an alternate term "indicator" that is more common and sensible. Adding |
When this was brought up to me, I had to look up what it did, and was surprised at what a terrible name this is for the function. But from this thread I do understand it is stats jargon. So my take is just an external view, that this particular piece of jargon is exceptionally badly chosen and there are multiple better choices in even broader use. |
What are you basing your statements on? 4 different software packages were named from which people often move to pandas, and they all use "dummy". Sklearn has "one hot encoding". Then for sure pandas isn't perfectly equivalent to any of these, but I don't know anyone or anything that uses "placeholder". Wikipedia, in the page "dummy variable" (yes) does provide 6 alternatives: "indicator variable" is the first, "placeholder" is not one of them. |
In stats, I would say in order of commonality (with 3 and 4 being much rarer than 1 and 2):
In ML, one-hot encoding is common, although this describes the method used to create the dichotomous values rather than the variable itself. |
I feel like some of the confusion is based on the usage of dummy in comp sci, which is often a simple version of something complex. The usage of dummy in statistics is not the same, and IME the intent of the word dummy in the context of the statistics is not the same as it is in comp sci. |
Yes, the wikipedia page on the stats use of the term lists the stats use of the term first :-) I'm referring to the use of the term beyond stats, just to offer an outside perspective FWIW. The "comp sci" use is much more widespread than computing, in my experience. But I'm certainly not advocating for that use, either. It is also insensitive and not descriptive. |
I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.
Question about pandas
Although extremely common in the industry, "dummy" has some unfortunate history. One current use is for substitutes - mannequins, stand-ins, etc. This use grew from its original definition, "mute person". Mute people are not substitutes or stand-ins and I would prefer Pandas to not contribute to this view. There are other words, like "indicator", for statistics.
Pandas currently uses "get_dummies" as a function name, with the documentation referencing "indicator" as a synonym.
Citations: