Natural Language Processing with Disaster Tweets

Predict which Tweets are about real disasters and which ones are not


Abstract:

Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g., disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter. However, it’s not always clear whether a person’s words are actually announcing a disaster.
In this notebook, I've built a machine learning model that predicts which Tweets are about real disasters and which ones aren’t.
*I've also created two notebooks: one is commented, fully detailed, and easy to read and train; the other is a cleaner, functional implementation of the first.

Dataset:

The dataset used in this notebook is available on Kaggle. [link]

In this notebook, I've performed the following steps:

1) EDA: a brief look at the dataset with some graphs

2) Clean data in two steps:

  • Remove duplicated tweets
  • Find similar tweets and drop them
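
The two cleaning steps above can be sketched in a single pass. This is a minimal illustration, not the notebook's actual code: it assumes similarity is measured with `difflib.SequenceMatcher` from the standard library, and both the function name `drop_similar_tweets` and the `threshold` value of 0.9 are hypothetical choices.

```python
from difflib import SequenceMatcher


def drop_similar_tweets(tweets, threshold=0.9):
    """Keep the first occurrence of each tweet; drop exact and near duplicates.

    `threshold` is an assumed similarity cutoff, not a value from the notebook.
    """
    kept = []             # original texts that survive
    kept_normalized = []  # lowercased, whitespace-collapsed copies for comparison
    for text in tweets:
        normalized = " ".join(text.lower().split())
        # An exact duplicate has ratio 1.0, so it is caught by the same check.
        if any(
            SequenceMatcher(None, normalized, seen).ratio() >= threshold
            for seen in kept_normalized
        ):
            continue
        kept.append(text)
        kept_normalized.append(normalized)
    return kept


tweets = [
    "Forest fire near La Ronge Sask. Canada",
    "Forest fire near La Ronge Sask. Canada",   # exact duplicate
    "Forest fire near La Ronge Sask. Canada!",  # near duplicate
    "I love fruits",
]
cleaned = drop_similar_tweets(tweets)
print(cleaned)
```

Every tweet is compared against all tweets kept so far, so the cost grows quadratically with the dataset size; that is acceptable for a few thousand tweets but would need a faster method on larger data.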

3) Extract some features such as:

  • Length of tweets
  • Count of words in tweets
  • Count of numbers in tweets
  • Count of sentences in tweets
  • Count of hashtags in tweets
  • Text of hashtags
  • Count of mentions in tweets
  • Text of mentions
  • Count of links in tweets
  • Word count per tweet length
  • Punctuation count per tweet length
  • Uppercase letters count per tweet length
  • Min-max scaling for numeric columns
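
As a rough sketch of how the features listed above could be computed for one tweet, the example below uses only the standard library. The regular expressions, the dictionary keys, and the helper names `extract_features` and `min_max_scale` are assumptions made for illustration; the notebook may define these features differently (the sentence count in particular is a crude split on `.`, `!` and `?`).

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+")


def extract_features(text):
    """Return a dict of simple count and ratio features for one tweet."""
    length = len(text)
    safe_length = max(length, 1)  # avoid division by zero on empty tweets
    words = text.split()
    hashtags = re.findall(r"#(\w+)", text)
    mentions = re.findall(r"@(\w+)", text)
    links = URL_PATTERN.findall(text)
    numbers = re.findall(r"\b\d+\b", text)
    # Strip links first so the dots inside URLs are not counted as sentence ends.
    without_links = URL_PATTERN.sub(" ", text)
    sentences = [s for s in re.split(r"[.!?]+", without_links) if s.strip()]
    punctuation = sum(ch in string.punctuation for ch in text)
    uppercase = sum(ch.isupper() for ch in text)
    return {
        "length": length,
        "word_count": len(words),
        "number_count": len(numbers),
        "sentence_count": len(sentences),
        "hashtag_count": len(hashtags),
        "hashtag_text": " ".join(hashtags),
        "mention_count": len(mentions),
        "mention_text": " ".join(mentions),
        "link_count": len(links),
        "words_per_length": len(words) / safe_length,
        "punctuation_per_length": punctuation / safe_length,
        "uppercase_per_length": uppercase / safe_length,
    }


def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


tweet = "BREAKING: 3 homes lost. Stay safe! #wildfire @CalFire http://t.co/abc"
features = extract_features(tweet)
print(features)
print(min_max_scale([10, 20, 30]))
```

The hashtag and mention texts are kept as separate strings so that they can be vectorized on their own later.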

4) Process tweets:

  • Lowercase tweets
  • Remove URLs
  • Remove punctuation
  • Remove short words (<= 2 characters)
  • Remove stopwords
  • Lemmatization
  • One-hot encode the keyword column (get dummies)
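
The text-cleaning steps above could look roughly like the following sketch. It is an illustration under stated assumptions: `STOPWORDS` is a tiny hypothetical set standing in for a full stopword list, and `lemmatize` is a placeholder parameter that defaults to doing nothing, since real lemmatization typically needs an external library. The keyword encoding is not shown here.

```python
import re
import string

# Tiny illustrative stopword set (an assumption); a real list is much longer.
STOPWORDS = {"the", "and", "are", "for", "was", "with", "this", "that"}
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def process_tweet(text, stopwords=STOPWORDS, lemmatize=lambda word: word):
    """Lowercase, strip URLs and punctuation, then filter and lemmatize tokens."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)         # remove URLs before punctuation
    text = text.translate(PUNCTUATION_TABLE)  # remove punctuation characters
    tokens = [
        word for word in text.split()
        if len(word) > 2 and word not in stopwords  # drop words of <= 2 chars
    ]
    return " ".join(lemmatize(word) for word in tokens)


raw = "Forest fire near La Ronge Sask. Canada http://t.co/x1 #wildfire"
processed = process_tweet(raw)
print(processed)
```

URLs are removed before punctuation on purpose: once the punctuation is stripped, a link no longer matches the URL pattern and would leave fragments behind.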

5) TF-IDF:

  • TF-IDF on tweets
  • TF-IDF on text of hashtags
  • TF-IDF on text of mentions
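
One way to combine the three TF-IDF representations is to fit a separate vectorizer per text field and stack the results side by side. The sketch below assumes scikit-learn's `TfidfVectorizer` with default settings and `scipy.sparse.hstack`; the README does not say which library or parameters the notebook uses, and the sample data is made up.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample rows: processed tweet text, hashtag text, mention text.
tweets = [
    "forest fire near ronge sask canada",
    "love fruits summer",
    "flood warning issued river town",
]
hashtags = ["wildfire", "summer", "flood warning"]
mentions = ["calfire", "", "weatherservice"]

# One vectorizer per field, so each field gets its own vocabulary.
tweet_vectorizer = TfidfVectorizer()
hashtag_vectorizer = TfidfVectorizer()
mention_vectorizer = TfidfVectorizer()

X = hstack([
    tweet_vectorizer.fit_transform(tweets),
    hashtag_vectorizer.fit_transform(hashtags),
    mention_vectorizer.fit_transform(mentions),
]).tocsr()

print(X.shape)
```

Keeping the vocabularies separate means a word used as a hashtag is treated as a different feature from the same word in the tweet body. On the test set, each fitted vectorizer would be applied with `transform` rather than `fit_transform`.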

6) Train and test models:

  • GradientBoostingClassifier
  • NaiveBayes
  • LogisticRegression
  • SVM
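
A comparison of the four model families could be set up along these lines. This is a sketch on made-up toy data, not the notebook's training code: `MultinomialNB` and `LinearSVC` are assumed stand-ins for "NaiveBayes" and "SVM", all hyperparameters are scikit-learn defaults, and F1 is an assumed metric since the README does not name one.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data: 1 = real disaster, 0 = not a disaster.
texts = [
    "forest fire spreading near town", "earthquake damages several buildings",
    "flood warning issued for river", "wildfire evacuation ordered tonight",
    "tornado destroys homes in county", "hurricane makes landfall on coast",
    "this new song is fire", "my exam was a disaster",
    "love the summer weather", "traffic was terrible today",
    "that movie blew my mind", "great pizza place downtown",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

models = {
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "NaiveBayes": MultinomialNB(),
    "LogisticRegression": LogisticRegression(),
    "SVM": LinearSVC(),
}

scores = {}
for name, classifier in models.items():
    # Each pipeline fits its own TF-IDF vocabulary on the training split only.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    scores[name] = f1_score(y_test, predictions, zero_division=0)

for name, score in scores.items():
    print(f"{name}: F1 = {score:.2f}")
```

With only twelve toy examples the printed scores are not meaningful; the point is the structure, where every model is trained and evaluated on the same split so the results are comparable.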

With this approach, I got a score of approximately 0.8 on the main test set on Kaggle.