Team Members
- Stephanie Ramos Gomez
- Kelsey Anderson
- Jesica Ramirez Toscano
- Summary
- Set up
- Data Collection and Pre-processing
- Data Exploration
- Model Building
- bi-RNN
- CNN
- Logistic Regression
- Model Interpretability: Integrated gradients
- Final Paper and Acknowledgements
Media bias has been a topic of interest for policymakers and politicians, particularly since the 2020 US presidential election. Society appears to be increasingly polarized and hostile in online interactions. Online spaces have become the target of criticism and fears about cyberbullying, trolling, fake news, and the potential for vulnerable individuals to become violently radicalized. With this in mind, we built different supervised machine learning models to test whether the text extracted from tweets posted by news media accounts is predictive of the polarization in the comments they receive. We used logistic regression, recurrent neural networks, and convolutional neural networks to make our predictions. Using these models and data extracted from Twitter, we were able to predict polarization in comments with 65% accuracy. Our research aims to contribute to creating a more civil and less polarized space for discourse on social media.
The following modules were used for this analysis:
torch
nltk
gensim
sklearn
vaderSentiment
networkx
pandas
numpy
matplotlib
seaborn
snscrape
captum
To pull, clean, and run models using Twitter data scraped with snscrape, run, in order:
- tweet_scraping.ipynb
- Clean tweets.ipynb
- topics_twitter.ipynb
- sentiment_scores.ipynb
- Model of Choice:
- Log Reg.ipynb
- rnn_model.ipynb
- cnn_development.ipynb
To pull, clean and run models using downloaded Kaggle NYT articles data, run, in order:
- download data from New York Times Comments on Kaggle
- Data Cleaning.ipynb
- topics_articles.ipynb
- sentiment_scores.ipynb
- Model of Choice:
- Log Reg.ipynb
- rnn_model.ipynb
- cnn_development.ipynb
The following files are used in data collection and text preprocessing.
lines of code: 850
Uses snscrape to retrieve original posts from the NYT, FoxNews, and Reuters Twitter feeds. The file name and date ranges for the search can be specified in the notebook parameters. After gathering original tweets, it attempts to retrieve all subsequent replies based on the ConversationID of each original tweet.
Saves one file for all original tweets (`data/tweets_MMYYYY.csv`) and a separate file for each news source's replies (`data/replies_MMYYYY_nyt.csv`, `data/replies_MMYYYY_fox.csv`, and `data/replies_MMYYYY_reu.csv`).
Warning: This notebook can take nearly 12 hours to complete a data pull for a single one-month span.
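As a rough illustration of this step, here is a minimal sketch using snscrape's Python API; the query strings, tweet attribute names, limits, and file path are assumptions for illustration, not the notebook's exact parameters:

```python
# Illustrative only: pull original tweets from a news account with snscrape,
# then fetch replies belonging to the same conversation thread.
import pandas as pd
import snscrape.modules.twitter as sntwitter

def scrape_originals(handle, since, until, limit=500):
    """Collect original tweets from one account over a date range."""
    query = f"from:{handle} since:{since} until:{until}"
    rows = []
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= limit:
            break
        rows.append({
            "id": tweet.id,
            "date": tweet.date,
            "text": tweet.content,              # attribute name varies by snscrape version
            "conversation_id": tweet.conversationId,
        })
    return pd.DataFrame(rows)

def scrape_replies(conversation_id, limit=200):
    """Collect replies to one original tweet via its conversation id."""
    query = f"conversation_id:{conversation_id} filter:replies"
    rows = []
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= limit:
            break
        rows.append({"conversation_id": conversation_id, "text": tweet.content})
    return pd.DataFrame(rows)

originals = scrape_originals("nytimes", "2021-01-01", "2021-02-01")
originals.to_csv("data/tweets_012021.csv", index=False)  # placeholder output path
```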
lines of code: 564
lines of code: 93
Same as Clean tweets.ipynb, but built to accept Kaggle NYT data.
lines of code: 677
Combines replies into a measure of sentiment variance, merges with the original post dataset, and outputs the result for use in our machine learning models. Can accept either the Kaggle NYT dataset or a Twitter dataset by commenting/uncommenting parameters at the beginning of the notebook.
We attempted numerous ways to bin the sentiment scoring data; the current state uses a simplified count method. Please read our project paper for more details on our methodology and experiments with this.
This notebook uses VADER to determine and score positive, negative and neutral tweet sentiment.
Warning: If `replies_SUFFIX.csv` is large, evaluating sentiment can take considerable time. The average time for our full dataset is approximately 20-30 minutes.
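As a rough illustration of this step, here is a minimal sketch that scores replies with VADER and summarizes the spread of scores per conversation; the file path, column names, and the simple median-split binning are assumptions for illustration (our actual binning is described in the project paper):

```python
# Illustrative only: score reply sentiment with VADER and summarize
# the spread of compound scores within each conversation.
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

replies = pd.read_csv("data/replies_012021_nyt.csv")  # placeholder path

# The compound score ranges from -1 (most negative) to +1 (most positive).
replies["compound"] = replies["text"].astype(str).map(
    lambda t: analyzer.polarity_scores(t)["compound"]
)

# One row per original tweet: mean sentiment and its variance across replies.
per_conversation = (
    replies.groupby("conversation_id")["compound"]
    .agg(mean_sentiment="mean", sentiment_variance="var", n_replies="count")
    .reset_index()
)

# A simple binary polarization target: high-variance threads vs. the rest.
threshold = per_conversation["sentiment_variance"].median()
per_conversation["vaderCat"] = (
    per_conversation["sentiment_variance"] > threshold
).astype(int)
```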
The following files were used in data exploration. Topic modeling files can be included in the data cleaning pipeline, or skipped, as desired. Topic modeling was originally intended as a way to filter the data fed into the models; however, due to the limited quantity of data, it is not currently used by any of our models.
lines of code: 1451
lines of code: 1980
Same as topics_twitter.ipynb, but built to accept Kaggle NYT data.
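Since gensim is among the listed modules, here is a rough sketch of LDA-style topic modeling over cleaned tweet text; the preprocessing, number of topics, file path, and column names are assumptions for illustration, not necessarily what topics_twitter.ipynb does:

```python
# Illustrative only: fit an LDA topic model over cleaned tweet text with gensim.
import pandas as pd
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

tweets = pd.read_csv("data/tweets_012021.csv")  # placeholder path
tokenized = [simple_preprocess(t) for t in tweets["text"].astype(str)]

dictionary = corpora.Dictionary(tokenized)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop rare/ubiquitous tokens
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=5, random_state=0)
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)
```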
lines of code: 527
lines of code: 171+
Trains and tests an LSTM RNN model on the desired dataset.
The data file can be specified in the notebook parameters, but it requires a cleaned text feature column named `text` and a binary target column called `vaderCat`. If your target column is labelled differently, this will need to be adjusted inside the `train_test_datasets.py` sub-module.
lines of code: 81 of total 171
This section is found near the end of the notebook. It produces an estimation of feature importance over the test dataset after the model is trained.
- train_test_datasets.py
lines of code: 120
Imports and splits the full dataset into train.csv, validate.csv and test.csv to be received by the dataloaders internal to rnn_model.ipynb. Based on CAPP 30255 Homework 4.
- lstm.py
lines of code: 53
Contains the PyTorch RNN model object specifications. This is modularized so other model specifications could be tested; the best performing specifications are saved here. Based on Fake News Detection. See the sketch after this list for an illustrative model definition.
- evaluate.py
lines of code: 199
Contains training and test evaluation functions used in rnn_model.ipynb. Based on Fake News Detection.
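As referenced above, here is an illustrative sketch of a bidirectional LSTM classifier in the spirit of lstm.py; the layer sizes, vocabulary size, and handling of the hidden states are assumptions, not the saved best-performing specification:

```python
# Illustrative only: a bidirectional LSTM text classifier.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # *2 for the two directions

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) of integer word indices
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        # Concatenate the final forward and backward hidden states.
        final = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.fc(final)

model = BiLSTMClassifier(vocab_size=20000)
logits = model(torch.randint(0, 20000, (4, 50)))  # dummy batch of 4 sequences
```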
lines of code: 412 + 109 (IG section)
- dataclassCNN.py
lines of code: 35
File to create a Custom Data Class and Collate Function for PyTorch (see the sketch below).
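A minimal sketch of a custom PyTorch Dataset and padding collate function of the kind dataclassCNN.py provides; the field names, padding index, and dummy data are illustrative assumptions:

```python
# Illustrative only: a custom Dataset and a collate function that pads
# variable-length token sequences into a single batch tensor.
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class TweetDataset(Dataset):
    def __init__(self, token_id_lists, labels):
        self.token_id_lists = token_id_lists  # list of lists of token ids
        self.labels = labels                  # list of 0/1 targets (e.g. vaderCat)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return (
            torch.tensor(self.token_id_lists[idx], dtype=torch.long),
            torch.tensor(self.labels[idx], dtype=torch.long),
        )

def collate_batch(batch, pad_idx=0):
    sequences, labels = zip(*batch)
    padded = pad_sequence(sequences, batch_first=True, padding_value=pad_idx)
    return padded, torch.stack(labels)

dataset = TweetDataset([[4, 8, 15], [16, 23], [42]], [1, 0, 1])
loader = DataLoader(dataset, batch_size=2, collate_fn=collate_batch)
```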
lines of code: 927
To get a sense of the feature importance in the logistic regression model, we ordered the coefficients by magnitude. The code is at the end of the logistic regression notebook (an illustrative sketch follows the list below):
- Log Reg.ipynb
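As a rough illustration of that ranking, here is a sketch assuming a TF-IDF bag-of-words representation feeding sklearn's LogisticRegression; the vectorizer choice, file path, and column names are assumptions, not necessarily what Log Reg.ipynb uses:

```python
# Illustrative only: fit a logistic regression on text features and rank
# coefficients by absolute magnitude to inspect feature importance.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/model_input.csv")  # placeholder path with text / vaderCat columns
X_train, X_test, y_train, y_test = train_test_split(
    df["text"].astype(str), df["vaderCat"], test_size=0.2, random_state=0
)

vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)
print("test accuracy:", clf.score(X_test_vec, y_test))

# Order features by the magnitude of their coefficients.
coefs = clf.coef_.ravel()
order = np.argsort(np.abs(coefs))[::-1]
terms = np.array(vectorizer.get_feature_names_out())
for term, weight in zip(terms[order][:20], coefs[order][:20]):
    print(f"{term:20s} {weight:+.3f}")
```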
Integrated Gradients is an interpretability algorithm used to understand how neural networks work. The algorithm, proposed by Sundararajan et al. (2017), attributes the prediction of a neural network to its input features.
The code to apply integrated gradients (feature importance) is at the end of the notebook for each model; an illustrative sketch follows the list below:
- rnn_model.ipynb
- cnn_development.ipynb
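As referenced above, here is an illustrative sketch of applying Captum's LayerIntegratedGradients over a text model's embedding layer; the tiny stand-in model and dummy inputs are assumptions for illustration, not the notebooks' actual models:

```python
# Illustrative only: attribute a prediction to input tokens with integrated
# gradients, computed over the embedding layer so scores map back to tokens.
import torch
import torch.nn as nn
from captum.attr import LayerIntegratedGradients

class TinyTextClassifier(nn.Module):
    """Small stand-in model: embedding -> mean pool -> linear."""
    def __init__(self, vocab_size=20000, embed_dim=50, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        return self.fc(self.embedding(token_ids).mean(dim=1))

model = TinyTextClassifier()
model.eval()

def forward_logit(token_ids):
    return model(token_ids)[:, 1]   # logit for the "polarized" class

lig = LayerIntegratedGradients(forward_logit, model.embedding)

token_ids = torch.randint(1, 20000, (1, 50))   # one dummy sequence
baseline = torch.zeros_like(token_ids)          # all-padding baseline
attributions = lig.attribute(token_ids, baselines=baseline, n_steps=50)

# Sum over embedding dimensions to get one importance score per token.
token_importance = attributions.sum(dim=-1).squeeze(0)
print(token_importance.shape)  # torch.Size([50])
```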
The research, analysis, and results of this project are documented in Predicting_Polarization.pdf. We want to thank Professor Amitabh Chaudhary for his incredible support and feedback. Also, huge thanks to the Teaching Assistants, Jenny Long and Rosa Zhou.