Predict which Tweets are about real disasters and which ones are not
Abstract:
Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g., disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter. However, it’s not always clear whether a person’s words are actually announcing a disaster.
In this notebook, I've built a machine learning model that predicts which Tweets are about real disasters and which ones aren’t.
*I've also created two notebooks: the first is commented, fully detailed, and easy to read and train; the second is a cleaner, functional implementation of the first.
Dataset:
The dataset used in this notebook is available on Kaggle. [link]
The notebook performs the following steps:
1) EDA: a brief look at the dataset with some graphs
2) Clean data in two steps:
Remove duplicated tweets
Find similar tweets and drop them
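As a rough illustration of the cleaning step, here is a minimal sketch that drops exact and near-duplicate tweets using the standard library's `difflib.SequenceMatcher`. The function name, the whitespace/lowercase normalization, and the 0.9 similarity threshold are assumptions made for this example; the notebook may measure similarity differently.

```python
from difflib import SequenceMatcher


def drop_duplicates_and_near_duplicates(tweets, threshold=0.9):
    """Keep the first occurrence of each tweet; drop exact and near duplicates.

    `threshold` is a hypothetical cut-off on SequenceMatcher's ratio (0..1).
    """
    kept = []  # list of (original_text, normalized_text)
    for text in tweets:
        # Normalize case and whitespace so trivial variants compare equal.
        normalized = " ".join(text.lower().split())
        is_similar = any(
            SequenceMatcher(None, normalized, other).ratio() >= threshold
            for _, other in kept
        )
        if not is_similar:
            kept.append((text, normalized))
    return [original for original, _ in kept]


tweets = [
    "Forest fire near La Ronge Sask. Canada",
    "Forest fire near La Ronge Sask. Canada",   # exact duplicate
    "Forest fire near La Ronge Sask, Canada!",  # near duplicate
    "I love fruits",
]
unique_tweets = drop_duplicates_and_near_duplicates(tweets)
```

The pairwise comparison is quadratic in the number of tweets, so it is only practical for small datasets like this one.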
3) Extract some features such as:
Length of tweets
Count of words in tweets
Count of numbers in tweets
Count of sentences in tweets
Count of hashtags in tweets
Text of hashtags
Count of mentions in tweets
Text of Mentions
Count of links in tweets
Word count per tweet length
Punctuation count per tweet length
Uppercase letters count per tweet length
MinMaxScaling for numeric columns
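The features listed above can be sketched with plain regular expressions. This is only an illustration: the regex patterns, the dictionary keys, and the choice to strip links before counting numbers and sentences are assumptions, not necessarily what the notebook does.

```python
import re
import string

URL_PATTERN = r"https?://\S+"


def extract_features(tweet):
    """Compute simple count and ratio features for a single tweet."""
    # Strip links first so dots and digits inside URLs are not miscounted.
    text_without_links = re.sub(URL_PATTERN, " ", tweet)
    hashtags = re.findall(r"#(\w+)", tweet)
    mentions = re.findall(r"@(\w+)", tweet)
    sentences = [s for s in re.split(r"[.!?]+", text_without_links) if s.strip()]
    length = len(tweet)
    denom = max(length, 1)  # avoid division by zero on empty tweets
    word_count = len(tweet.split())
    return {
        "length": length,
        "word_count": word_count,
        "number_count": len(re.findall(r"\d+", text_without_links)),
        "sentence_count": len(sentences),
        "hashtag_count": len(hashtags),
        "hashtags": hashtags,
        "mention_count": len(mentions),
        "mentions": mentions,
        "link_count": len(re.findall(URL_PATTERN, tweet)),
        "words_per_length": word_count / denom,
        "punctuation_per_length": sum(c in string.punctuation for c in tweet) / denom,
        "uppercase_per_length": sum(c.isupper() for c in tweet) / denom,
    }


def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


features = extract_features("Fire in #Paris! 3 people hurt @BBC http://t.co/abc")
scaled_lengths = min_max_scale([10, 20, 30])
```

In a real pipeline the scaling bounds would be computed on the training set only and then reused for the test set.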
4) Preprocess tweets:
Lowercase tweets
Remove URLs
Remove Punctuation
Remove short words (2 characters or fewer)
Remove Stopwords
Lemmatization
One-hot encode the keyword column (dummy variables)
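The text-cleaning steps above might look roughly like the sketch below. It covers only the tweet text, not the keyword column. The tiny `STOPWORDS` set is a hypothetical stand-in for a full stopword list, and `lemmatize` defaults to an identity function so the example runs without extra downloads; a real lemmatizer (for example from NLTK) could be passed in instead.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {"the", "and", "are", "was", "for", "with", "this", "that", "from"}


def preprocess(tweet, stopwords=STOPWORDS, lemmatize=lambda word: word):
    """Apply the cleaning steps in the order listed above."""
    text = tweet.lower()                                   # lowercase
    text = re.sub(r"https?://\S+", " ", text)              # remove URLs
    text = text.translate(
        str.maketrans("", "", string.punctuation)          # remove punctuation
    )
    tokens = [t for t in text.split() if len(t) > 2]       # drop words of <= 2 chars
    tokens = [t for t in tokens if t not in stopwords]     # drop stopwords
    return " ".join(lemmatize(t) for t in tokens)          # lemmatize (pluggable)


cleaned = preprocess("The fires are spreading in #California!! http://t.co/xyz")
```

URLs are removed before punctuation so that the `://` and dots in a link do not leave stray fragments behind.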
5) TF-IDF:
TF-IDF on tweets
TF-IDF on text of hashtags
TF-IDF on text of mentions
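One way to build the three TF-IDF blocks is to fit a separate scikit-learn `TfidfVectorizer` per text field and stack the results side by side. This is a sketch under assumptions: the toy data, the default vectorizer settings, and the use of `scipy.sparse.hstack` to combine the blocks are choices made for this example.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data: one entry per tweet in each list (empty string = no hashtag/mention).
tweets = ["forest fire near town", "love this sunny day", "earthquake hits city"]
hashtags = ["wildfire", "", "earthquake"]
mentions = ["", "friend", "bbc"]

# A separate vectorizer per field keeps the three vocabularies independent.
tweet_vectorizer = TfidfVectorizer()
hashtag_vectorizer = TfidfVectorizer()
mention_vectorizer = TfidfVectorizer()

tweet_tfidf = tweet_vectorizer.fit_transform(tweets)
hashtag_tfidf = hashtag_vectorizer.fit_transform(hashtags)
mention_tfidf = mention_vectorizer.fit_transform(mentions)

# Concatenate the sparse blocks column-wise into one feature matrix.
features = hstack([tweet_tfidf, hashtag_tfidf, mention_tfidf]).tocsr()
```

Keeping separate vectorizers means a word such as "earthquake" gets distinct columns depending on whether it appears in the tweet body or as a hashtag.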
6) Train and test models:
GradientBoostingClassifier
NaiveBayes
LogisticRegression
SVM
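The four model families can be compared with a loop like the one below. It is a minimal sketch on toy data: the specific variants (`MultinomialNB` for Naive Bayes, `LinearSVC` for SVM), the default hyperparameters, the F1 metric, and the 75/25 split are all assumptions, since the list above names only the model families.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy, already-cleaned tweets; 1 = real disaster, 0 = not a disaster.
texts = [
    "forest fire spreading near town", "earthquake destroys city centre",
    "flood warning issued river rising", "wildfire evacuation ordered tonight",
    "tornado damages homes county", "explosion reported downtown building",
    "hurricane makes landfall coast", "landslide blocks highway rescue underway",
    "love this sunny day", "new album totally on fire",
    "traffic was a disaster today lol", "my exam results blew up my plans",
    "great movie last night", "pizza party with friends",
    "this game is the bomb", "flooded with emails this morning",
]
labels = [1] * 8 + [0] * 8

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit TF-IDF on the training split only to avoid leaking test vocabulary.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = f1_score(y_test, predictions, zero_division=0)
```

With only 16 toy tweets the resulting scores are not meaningful; the loop only shows how the models can be trained and evaluated on the same features.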
With this approach, I got a score of approximately 0.8 on the main Kaggle test set.