This repository contains all of the code and data necessary to replicate the results in the main article and the appendix of our paper, *When do Word Embeddings Accurately Reflect Surveys on our Beliefs about People?*. If you use the new data we collected, please cite our paper as follows:
@inproceedings{joseph_when_2020,
  title = {When do {Word} {Embeddings} {Accurately} {Reflect} {Surveys} on our {Beliefs} about {People}?},
  booktitle = {Proceedings of the 58th {Annual Meeting} of the {Association for Computational Linguistics} ({ACL} 2020)},
  author = {Joseph, Kenneth and Morgan, Jonathan H.},
  year = {2020}
}
Note: We also use data from several other papers! If you use their data or ideas, please cite them as referenced below!
- `generate_embedding_measures.ipynb` - This Python Jupyter notebook generates all of the embedding-based measures of beliefs used in the paper.
- `paper_results.R` - Generates all results presented in the main portion of the paper, as well as Figures 5-9 in the appendix.
- `survey_data_statistics_for_appendix.R` - Generates the plots that summarize the survey data in the appendix (Figures 1-4). Probably the best place to start if you're just interested in our survey data.
We use several different survey datasets: two that we collected ourselves and three that were collected by others. We describe our own data in detail here, and then point to the requisite references for the data collected by others.
The file `our_belief_data_clean.csv` contains one row per annotator/dimension/identity combination. See the paper for full details on the questions, annotator demographics, etc. Its columns are:
- `qname` - A question identifier. You likely will not need this.
- `qtype` - The type of dimension; either `affective`, `association`, or `trait`.
- `dimension` - Which of the 17 different dimensions of stereotype was rated by this individual.
- `ecode` - An anonymous identifier for specific annotators.
- `value` - The raw value returned on the original scale.
- `rescaled_value` - The raw value rescaled to [0,1] according to the minimum/maximum of the original scale.
- `identity` - The social identity being rated.
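For example, here is a minimal pandas sketch (not a script shipped with this repo; it assumes the repo root as your working directory) of how you might aggregate the per-annotator ratings into a mean score for each identity/dimension pair:

```python
import pandas as pd

# Load the cleaned belief survey data (assumes the repo root as working directory).
beliefs = pd.read_csv("our_belief_data_clean.csv")

# Average the rescaled annotator ratings for each identity on each dimension.
mean_ratings = (
    beliefs.groupby(["identity", "dimension"])["rescaled_value"]
    .mean()
    .reset_index()
)
print(mean_ratings.head())
```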
The file `our_labeling_data.csv` contains each question and response posed to annotators for our identity labeling task. See the paper for more details! Its columns are:
- `Responseid` - Unique (anonymized) ID of the respondent
- `Questionid` - Unique ID for the question being asked
- `Questiontype` - Type of question: either "IsA" or "SeenWith"
- `Query` - The identity presented in the question text
- `Answer1` - The first answer presented for this question
- `Answer2` - The second answer presented for this question
- `Answer3` - The third answer presented for this question
- `Answer4` - The fourth answer presented for this question
- `Answer5` - The fifth answer presented for this question
- `Answer` - The answer selected by the survey respondent
- `Embeddeddata` - Irrelevant
- `Startdate` - Time the respondent started the survey
- `Enddate` - Time the respondent ended the survey
- `Gender` - Gender of the respondent
- `Age` - Age of the respondent
- `Hispanic` - Is this respondent of Hispanic descent?
- `Race1` - Race/Ethnicity of the respondent
- `Race2` - Optional answer for an additional Race/Ethnicity of the respondent
- `Borninus` - Was the respondent born in the U.S.?
- `Percentinus` - What percentage of the respondent's life has been lived in the U.S.?
- `Wherelivedlongest` - Where has the respondent lived the longest?
- `Wherelivedrecently` - Where has the respondent lived most recently?
- `Education` - Level of education of the respondent
- `Political` - Political leaning of the respondent
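As a quick illustration of working with this file (again a sketch, assuming pandas and the repo root as your working directory), you could split the two question types and tabulate which answers respondents chose for each query identity:

```python
import pandas as pd

# Load the identity labeling responses (assumes the repo root as working directory).
labels = pd.read_csv("our_labeling_data.csv")

# Separate the two question types described above.
is_a = labels[labels["Questiontype"] == "IsA"]
seen_with = labels[labels["Questiontype"] == "SeenWith"]

# Count how often each answer was chosen for each query identity in "IsA" questions.
answer_counts = (
    is_a.groupby(["Query", "Answer"]).size().sort_values(ascending=False)
)
print(answer_counts.head())
```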
The file `garg_mturk_stereotypes.csv` is a direct copy of the MTurk data from Garg et al. (2018). If you use this data, please cite their paper as follows:
@article{garg_word_2018,
title = {Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes},
author = {Garg, Nikhil and Schiebinger, Londa and Jurafsky, Dan and Zou, James},
year = {2018},
month = apr,
volume = {115},
pages = {E3635--E3644},
journal = {Proceedings of the National Academy of Sciences},
language = {en},
number = {16},
pmid = {29615513}
}
The file `FullCleanUGAData.dta` is a Stata file that contains results from an EPA (evaluation, potency, activity) study conducted by Smith-Lovin and Robinson (2015). If you use this data, please cite their work as follows:
@article{smith-lovin_interpreting_2015,
title = {Interpreting and {{Responding}} to {{Events}} in {{Arabic Culture}}},
journal = {Final Report to Office of Naval Research, Grant N00014-09-1-0556},
author = {{Smith-Lovin}, L. and Robinson, Dawn T.},
year = {2015}
}
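You do not need Stata itself to inspect this file; pandas reads `.dta` files directly. A minimal sketch (assuming the repo root as your working directory):

```python
import pandas as pd

# pandas can read Stata .dta files directly, so Stata itself is not required.
uga = pd.read_stata("FullCleanUGAData.dta")
print(uga.shape)
print(uga.columns.tolist()[:10])  # peek at the first few column names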
The directory `personality-bias_survey` is a direct copy of the raw data directory from the personality-based surveys of Agarwal et al. (2019). If you use this data, please cite their paper as follows:
@inproceedings{agarwal_word_2019,
title = {Word {{Embeddings}} ({{Also}}) {{Encode Human Personality Stereotypes}}},
booktitle = {Proceedings of the {{Eighth Joint Conference}} on {{Lexical}} and {{Computational Semantics}} (*{{SEM}} 2019)},
author = {Agarwal, Oshin and Durup\i{}nar, Funda and Badler, Norman I. and Nenkova, Ani},
year = {2019},
month = jun,
pages = {205--211},
language = {en-us}
}
We use embeddings from the following four sources:
We convert them to more efficient formats using the `save_embeds.py` utility from Hila Gonen's GitHub repository.
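We have not reproduced that script here, but a rough, hypothetical sketch of the same kind of conversion using gensim (the filenames below are placeholders, not files shipped with this repo) looks like:

```python
from gensim.models import KeyedVectors

# Load embeddings stored in the slow-to-parse word2vec text format.
# "vectors.txt" is a placeholder, not a file shipped with this repo.
vectors = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# Save in gensim's native format, which loads much faster on subsequent runs.
vectors.save("vectors.kv")

# Later, reload the efficient version.
vectors = KeyedVectors.load("vectors.kv")
```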
The embedding we use can be downloaded from here (note: the tar is ~16GB). Pull this tar file down and extract it at the command line, or just run the relevant code to download and untar it in `generate_embedding_measures.ipynb`.
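If you would rather do the extraction from Python than from the shell, a minimal sketch (the filename below is a placeholder for whatever you downloaded):

```python
import tarfile

# "embeddings.tar" is a placeholder for the downloaded tar file.
with tarfile.open("embeddings.tar") as tf:
    tf.extractall(path="embeddings")  # extracts everything into ./embeddings/
```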