Wrangling and analysis of Tweets from WeRateDogs (@dogrates) and visualization of insights with Python in Jupyter Notebook. The project was motivated by, and thus focuses on, the data wrangling process of gathering, assessing, and cleaning data. Various methods, including programmatic approaches such as querying the Twitter API with Python's Tweepy package, were used to collect Tweets and relevant metadata.
- conda 4.6.3 or similar versions
- python 3.7.2 (or python 3)
- Packages
- pandas
- requests
- os
- tweepy 3.7.0
```bash
git clone git://github.com/tweepy/tweepy.git
cd tweepy
python setup.py install
```
- timeit.default_timer
- json
- numpy
- copy
- datetime
- matplotlib.pyplot
- seaborn
- statsmodels.api
- TextBlob
```bash
pip install -U textblob
python -m textblob.download_corpora
```
`twitter_api.py` file
- The file returns a Twitter API wrapper for querying Twitter's API in Section 3 of Gathering Data with the Tweet IDs obtained from the first dataset in Section 1 of Gathering Data.
- The file was imported to the notebook and was not tracked in the repository in order to prevent disclosure of private keys and tokens.
```python
import tweepy

def twitter_api():
    # Keys and Tokens
    consumer_key = 'CONSUMER KEY'
    consumer_secret = 'CONSUMER SECRET'
    access_token = 'ACCESS TOKEN'
    access_secret = 'ACCESS SECRET'
    # OAuthHandler instance equipped with an access token for OAuth authentication
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    # Twitter API wrapper
    return tweepy.API(auth_handler=auth, wait_on_rate_limit=True,
                      wait_on_rate_limit_notify=True)
```
- Content of the first part of the project, including all code blocks for data wrangling, is documented in the Jupyter Notebook file `analyze_tweet_1_wrangle.ipynb`. The HTML file `analyze_tweet_part1_wrangle.html` was published from this notebook file.
- The three raw datasets gathered in the first step of data wrangling and the clean version of the master dataset obtained after the last step of data wrangling are all available in the `\data` directory.
  - Raw Datasets
    - `twitter_archive_enhanced.csv`
    - `image-predictions.tsv`
    - `tweet_json.txt`
  - Merged Dataset: `twitter_archive_master.csv`
- Raw Datasets
- Enhanced Twitter Archive
- Udacity was provided with the WeRateDogs Twitter archive, which contains basic tweet data for all 5,000+ tweets.
- Udacity enhanced this original dataset by extracting dog ratings, dog names, and dog "stage" from the tweets' text data.
- The enhanced Twitter archive was made available as `twitter_archive_enhanced.csv` for manual download and is assigned to the object `df_archive`.
- Image Predictions
- Udacity ran the images included in the tweets from the enhanced Twitter archive through a neural network and obtained the top three predictions of each dog's breed.
- The `image-predictions.tsv` file hosted on Udacity's server is programmatically downloaded by submitting a request to its URL with the Requests library and is assigned to the object `df_image`, as sketched below.
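A minimal sketch of this download step. The URL shown is the commonly used Udacity-hosted location for this file and the local `data` directory mirrors the repository layout; both are assumptions rather than details taken from the notebook.

```python
import os

import pandas as pd
import requests

# URL assumed to be the Udacity-hosted location of the file; the notebook may
# use a different address.
url = ('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/'
       '599fd2ad_image-predictions/image-predictions.tsv')

response = requests.get(url)
response.raise_for_status()

# Save the raw bytes next to the other raw datasets
file_path = os.path.join('data', 'image-predictions.tsv')
with open(file_path, 'wb') as file:
    file.write(response.content)

# Load the tab-separated file into a DataFrame
df_image = pd.read_csv(file_path, sep='\t')
```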
- Additional Tweet Data
- Additional tweet data that were omitted during the process of enhancing the Twitter archive are gathered by using Python's Tweepy library to query Twitter's API.
- The JSON data of each tweet is dumped into the `tweet_json.txt` file.
- Only the retweet and favorite counts for each tweet are extracted and assigned to the object `df_json` (see the sketch below).
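A sketch of how this querying and extraction might look when combined with the `twitter_api()` wrapper shown earlier. The timing with `timeit.default_timer` mirrors the dependency listed above; file paths and the exact fields kept are assumptions.

```python
import json
from timeit import default_timer as timer

import pandas as pd
import tweepy

from twitter_api import twitter_api  # local, untracked module described above

api = twitter_api()

# Tweet IDs come from the enhanced archive gathered in the first step
df_archive = pd.read_csv('data/twitter_archive_enhanced.csv')

start = timer()
with open('data/tweet_json.txt', 'w') as file:
    for tweet_id in df_archive['tweet_id']:
        try:
            # 'extended' mode returns the full, untruncated tweet text
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(tweet._json, file)
            file.write('\n')
        except tweepy.TweepError:
            # Skip tweets that have been deleted or are otherwise unavailable
            pass
print('Elapsed time (s):', timer() - start)

# Keep only the retweet and favorite counts for each tweet
rows = []
with open('data/tweet_json.txt') as file:
    for line in file:
        data = json.loads(line)
        rows.append({'tweet_id': data['id'],
                     'retweet_count': data['retweet_count'],
                     'favorite_count': data['favorite_count']})
df_json = pd.DataFrame(rows)
```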
- Each of the three datasets gathered above is assessed for quality and tidiness issues.
- Only those observations from reviewing the datasets that render cleaning necessary in the next section are documented.
- Each assessment from the Assessing Data section is addressed in three sequential steps: define, code, and test.
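For illustration, one define-code-test cycle might look like the sketch below. The specific issue shown (casting the archive's timestamp column to datetime) is a hypothetical example, not necessarily one of the documented assessments.

```python
import pandas as pd

df_archive = pd.read_csv('data/twitter_archive_enhanced.csv')

# Define: convert the 'timestamp' column from string to datetime.

# Code
df_archive_clean = df_archive.copy()
df_archive_clean['timestamp'] = pd.to_datetime(df_archive_clean['timestamp'])

# Test
assert pd.api.types.is_datetime64_any_dtype(df_archive_clean['timestamp'])
```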
- The clean versions of the three datasets are merged to create `df_archive_master`, which is stored as a separate `.csv` file, `twitter_archive_master.csv`.
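A minimal sketch of this merge, assuming the cleaned frames are named `df_archive_clean`, `df_image_clean`, and `df_json_clean` and share a `tweet_id` column; those names are assumptions for illustration.

```python
# Merge the three cleaned datasets on tweet_id; an inner join keeps only
# tweets present in all three sources.
df_archive_master = (df_archive_clean
                     .merge(df_image_clean, on='tweet_id')
                     .merge(df_json_clean, on='tweet_id'))

# Store the merged master dataset alongside the raw files
df_archive_master.to_csv('data/twitter_archive_master.csv', index=False)
```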
- Content of the second part of the project, including all code blocks for analysis, is documented in the Jupyter Notebook file `analyze_tweet_2_eda.ipynb`. The HTML file `analyze_tweet_part2_eda.html` was published from this notebook file.
- Investigations of the following four topics are discussed.
- Time of the Day when WeRateDogs (@dogrates) Shows Most Activity
- Each day of the week is divided into 24 one-hour increments.
- Tweet activity is quantified by the percentage of tweets in each increment for a given day.
- Bar plots, box plots and statistics (mean, standard deviation, quartiles) of the 24 percentages for each day are presented.
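A sketch of how these hourly percentages might be computed from the master dataset, assuming a `timestamp` column (the column name is an assumption).

```python
import pandas as pd

df = pd.read_csv('data/twitter_archive_master.csv', parse_dates=['timestamp'])

# Label each tweet with its weekday and one-hour increment
df['weekday'] = df['timestamp'].dt.day_name()
df['hour'] = df['timestamp'].dt.hour

# Percentage of each weekday's tweets falling into each hour of the day
counts = df.groupby(['weekday', 'hour']).size()
hourly_pct = counts / counts.groupby(level='weekday').transform('sum') * 100

print(hourly_pct.unstack(fill_value=0).round(1))
```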
- Correlation between Favorite Counts and Retweet Counts
- The correlation between the two variables is discussed with a scatter plot and their correlation coefficient.
- Results from linear regression are applied to further investigate the correlation.
- R-squared value and its significance are presented.
- Linear fit is built and added to the scatter plot.
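A sketch of the correlation coefficient, the linear fit with `statsmodels.api` (listed among the packages above), and the overlaid scatter plot; the column names are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('data/twitter_archive_master.csv')

# Pearson correlation coefficient between the two counts
print('r =', df['favorite_count'].corr(df['retweet_count']))

# Ordinary least squares fit: favorite_count ~ retweet_count
X = sm.add_constant(df['retweet_count'])
model = sm.OLS(df['favorite_count'], X).fit()
print('R-squared:', model.rsquared)
print(model.pvalues)

# Scatter plot with the fitted line overlaid
plt.scatter(df['retweet_count'], df['favorite_count'], alpha=0.3)
plt.plot(df['retweet_count'], model.predict(X), color='red')
plt.xlabel('Retweet count')
plt.ylabel('Favorite count')
plt.show()
```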
- Comparison of Dog Ratings and Sentiment of Tweets
- Numeric dog ratings are categorized into low, medium, and high ratings.
- Polarity scores for the text data of the Tweets are obtained from sentiment analysis.
- Three histograms of the polarity scores, one for each category of dog ratings, are presented.
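A sketch of the sentiment scoring and rating categorization, assuming `text`, `rating_numerator`, and `rating_denominator` columns; the bin edges used for the low/medium/high categories are illustrative, not the notebook's actual cut points.

```python
import matplotlib.pyplot as plt
import pandas as pd
from textblob import TextBlob

df = pd.read_csv('data/twitter_archive_master.csv')

# Polarity score in [-1, 1] from TextBlob's sentiment property
df['polarity'] = df['text'].apply(lambda t: TextBlob(t).sentiment.polarity)

# Categorize numeric ratings into low / medium / high (illustrative bin edges)
rating = df['rating_numerator'] / df['rating_denominator']
df['rating_level'] = pd.cut(rating, bins=[0, 1.0, 1.2, float('inf')],
                            labels=['low', 'medium', 'high'])

# One polarity histogram per rating category
for level, group in df.groupby('rating_level'):
    if group.empty:
        continue
    plt.figure()
    group['polarity'].hist(bins=20)
    plt.title(f'Tweet polarity, {level} ratings')
plt.show()
```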
- Accuracy and Precision of Predicting Dog's Breeds from Images
- The performance of the neural network in recognizing images and predicting each dog's breed is assessed with both statistics and visualizations.
- Mean proportion of predictions with dog breeds for each level of prediction
- Histogram for the distribution of confidence levels for each level of prediction and its center (median)
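A sketch of these two summaries, assuming the image predictions file uses the usual `p1`/`p2`/`p3`, `p*_conf`, and `p*_dog` column layout; the column names are an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

df_image = pd.read_csv('data/image-predictions.tsv', sep='\t')

for level in ['p1', 'p2', 'p3']:
    is_dog = df_image[f'{level}_dog']    # True when the prediction is a dog breed
    conf = df_image[f'{level}_conf']     # confidence of that prediction

    # Mean proportion of dog-breed predictions and median confidence per level
    print(f'{level}: proportion of dog-breed predictions = {is_dog.mean():.3f}, '
          f'median confidence = {conf.median():.3f}')

    # Distribution of confidence levels for this prediction level
    plt.figure()
    conf.hist(bins=20)
    plt.title(f'Confidence levels for {level}')
plt.show()
```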
Jong Min (Jay) Lee [jmlee5629@gmail.com]
- This project was completed as a mandatory requirement for the Data Wrangling unit from the Data Analyst Nanodegree program at Udacity.
- Step-by-step guidance from the Get Started page on the Twitter Developers site was referenced to create an App, generate keys and tokens, and query the Twitter API.
- Tweepy documentation was referenced to find and understand the methods from the Tweepy package, along with their specifications and arguments, that were applicable to querying the Twitter API and gathering JSON data.
- Assessment and cleaning of untidy data were motivated by the extensive nature of this process in data analysis (Dasu and Johnson 2003) and the framework for tidying data (Wickham 2014).
- Dasu, T., & Johnson, T. (2003). Exploratory Data Mining and Data Cleaning. John Wiley & Sons.
- Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10). doi:10.18637/jss.v059.i10
- TextBlob documentation was referenced to create a `TextBlob` object from text data and obtain the polarity score from its `sentiment` property.