This talk is about visualizing how language differs across groups and categories.
Two Python libraries I've developed, Scattertext and AgeFromName, are heavily used in the talk.
Please install them and, if you have time, star them on GitHub:
`$ pip install scattertext agefromname`
It's news you can use: six Python 3.6 Jupyter notebooks showing how to visualize the following data:
- Introduction to Scattertext
- Analyzing President Trump's Twitter stream before and after the election
- Using Scattertext to visualize Emoji usage by gender and heritage on Twitter
- Using Scattertext to visualize Emoji usage by language (e.g., English or Spanish) on Twitter
- Visualizing how Emojis vary by nationality
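
To give a rough sense of what "language differing across categories" means numerically, here is a minimal sketch using only the Python standard library. It scores each term with a smoothed log-odds ratio between two groups of documents. This is an illustrative stand-in, not Scattertext's own scoring method, and the function name `log_odds_by_category`, the whitespace tokenization, and the toy documents are all assumptions made up for this example.

```python
import math
from collections import Counter


def log_odds_by_category(docs_a, docs_b, smoothing=1.0):
    """Return {term: smoothed log-odds ratio}.

    Positive scores mean the term is relatively more frequent in docs_a,
    negative scores mean it is relatively more frequent in docs_b.
    (Hypothetical helper for illustration; not part of Scattertext.)
    """
    # Naive lowercase whitespace tokenization, chosen only for brevity.
    counts_a = Counter(w for doc in docs_a for w in doc.lower().split())
    counts_b = Counter(w for doc in docs_b for w in doc.lower().split())
    vocab = set(counts_a) | set(counts_b)

    # Add-k smoothing so terms unseen in one group do not produce log(0).
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)

    scores = {}
    for term in vocab:
        p_a = (counts_a[term] + smoothing) / total_a
        p_b = (counts_b[term] + smoothing) / total_b
        scores[term] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    return scores


# Made-up toy documents for two categories.
group_a = ["love this pizza", "pizza night again"]
group_b = ["love this salad", "salad for lunch again"]

scores = log_odds_by_category(group_a, group_b)
for term in sorted(scores, key=scores.get, reverse=True):
    print(f"{term:>6}  {scores[term]:+.3f}")
```

Terms near the top of the printed ranking are more characteristic of the first group and terms near the bottom of the second. A scatterplot of per-category frequencies conveys the same contrast visually.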
Imputing demographics from text is a long-standing problem in computational linguistics. Argamon et al. (2009) provide an overview of earlier psycholinguistic work on language and demography by Pennebaker and Argamon.
Bergsma et al. (2013) built highly accurate race, gender, language and ethnicity classifiers, all using a Twitter user's first and last names as input.
Lee et al. (2017) use lists of Olympic athletes to train an RNN to predict nationality and ethnicity from an athlete's name. Ethnicity is imputed from nationality in this work: the majority ethnicity of each nation is treated as the label, which makes the labels less reliable for ethnically diverse countries.
Nguyen et al. (2013) predict age from the content of a Twitter user's tweets. Users were manually annotated for age range and gender. Annotators looked through a user's social media profiles (e.g., their LinkedIn profiles) to determine the user's approximate age and gender. They found that the language used in a user's tweets is predictive of life stage, and that language varied by gender.
Montasser et al. (2017) accurately predict the gender, race, and ethnicity demographics of a US census tract, block, block group, or county from bag-of-words features extracted from the content of geotagged tweets.
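
As a rough illustration of what "bag-of-words features from geotagged tweets" can look like, the sketch below pools token counts per geographic unit using only the standard library. It is not the pipeline from Montasser and Kifer's paper. The region identifiers, the regex tokenizer, the helper names, and the example tweets are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and pull out simple word-like tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words_by_region(geotagged_tweets):
    """Aggregate token counts per region.

    geotagged_tweets: iterable of (region_id, tweet_text) pairs, where
    region_id stands in for a tract, block group, or county identifier.
    Returns a dict mapping region_id -> Counter of token counts.
    """
    bags = defaultdict(Counter)
    for region_id, text in geotagged_tweets:
        bags[region_id].update(tokenize(text))
    return bags


# Made-up tweets with hypothetical region identifiers.
tweets = [
    ("tract_001", "Coffee downtown, great coffee!"),
    ("tract_001", "Traffic downtown again"),
    ("tract_002", "Hiking the ridge trail"),
]

bags = bag_of_words_by_region(tweets)
print(bags["tract_001"].most_common(2))
```

Each region's `Counter` can then be turned into a feature vector over a shared vocabulary and fed to whatever regression or classification model predicts the region's demographics.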
Shlomo Argamon, Moshe Koppel, James W. Pennebaker, Jonathan Schler. Automatically Profiling the Author of an Anonymous Text. CACM. 2009. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.136.9952&rep=rep1&type=pdf
Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, David Yarowsky. Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter. NAACL-HLT 2013. http://www.aclweb.org/anthology/N/N13/N13-1121.pdf
Omar Montasser, Daniel Kifer. Predicting Demographics of High-Resolution Geographies with Geotagged Tweets. AAAI. 2017.
Jinhyuk Lee, Hyunjae Kim, Miyoung Ko, Donghee Choi, Jaehoon Choi, Jaewoo Kang. Name Nationality Classification with Recurrent Neural Networks. IJCAI-17. 2017. https://www.ijcai.org/proceedings/2017/0289.pdf
Dong Nguyen, Rilana Gravel, Dolf Trieschnigg, Theo Meder. “How Old Do You Think I Am?”: A Study of Language and Age in Twitter. ICWSM. 2013. http://www.dongnguyen.nl/publications/nguyen-icwsm2013.pdf