Sentiment Analysis with Tone Detection

Overview

This project performs sentiment analysis on tweets mentioning major U.S. airlines, classifying them as Positive, Neutral, or Negative. It also categorizes negative tweets by specific reasons such as "late flight" or "rude service." The model is a Naïve Bayes classifier, evaluated on Accuracy, Precision, Recall, and F1-score. SMOTE is applied to address class imbalance.

The system also includes tone detection (sarcasm) to flag nuanced tones, and it provides real-time predictions for tweets entered by the user. Tweets are preprocessed with techniques such as stopword removal, URL stripping, and emoji handling before they are analyzed.

Objective

The primary objective of this project is to identify the sentiment of tweets posted by Twitter users. The program trains a model that classifies tweets as Positive, Neutral, or Negative with high accuracy. This analysis helps airlines understand customer feedback in real time and improve their services accordingly.

Motivation

Understanding the sentiment behind trending hashtags on Twitter can be challenging: some hashtags trend for positive reasons, others because of negative feedback. This project aims to provide a data-driven method for analyzing customer feedback in real time, helping airlines respond to negative comments promptly. The insights gained can also be used to improve operational efficiency, enhance customer experience, and protect brand value.

Dataset

Source of Data:

Kaggle Dataset: The dataset used in this project is titled "Twitter US Airline Sentiment" and was downloaded directly from Kaggle.

Dataset Details:

Number of Tweets: 14,640
Sentiment Categories: Positive, Neutral, Negative
Additional Features:
- text: The content of the tweet.
- airline_sentiment: Sentiment classification of the tweet.
- negative_reason: Reasons for negative sentiment (if applicable).
- Metadata such as tweet_created, airline, and user_timezone.

Challenges:

Class Imbalance: The dataset exhibits a significant class imbalance, with negative tweets dominating. To address this, we applied SMOTE (Synthetic Minority Oversampling Technique) to balance the classes.
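
To illustrate the idea behind SMOTE, the sketch below creates one synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. It is a stdlib-only illustration, not the project's code: the function name `smote_sample`, the parameter `k`, and the toy points are assumptions. In practice a library implementation (for example the `SMOTE` class in imbalanced-learn) would normally be used.

```python
import math
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point from a list of minority-class vectors."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Find the k nearest minority neighbours of the base point (excluding itself).
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Place the new point at a random position on the segment base -> neighbour.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Toy 2-D feature vectors standing in for minority-class tweets (hypothetical data).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]]
synthetic = smote_sample(minority, k=2, rng=random.Random(42))
print(synthetic)
```

Because each synthetic point lies between two real minority points, oversampling adds variety rather than exact duplicates. When combined with cross-validation, oversampling is usually applied to the training folds only, so that evaluation folds contain real tweets.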

Methodology

Data Cleaning

To ensure high-quality input for the model, the following preprocessing steps were applied:

Stopword Removal: Common words like "the", "is", "in" were removed.
URL Removal: URLs were stripped from the text.
Punctuation Removal: Punctuation marks like periods, commas, and exclamation marks were removed.
HTML Tag Removal: HTML tags were removed from retweets or quoted content.
Username Removal: Mentions of usernames (e.g., @username) were removed.
Emoji Removal: Emojis were removed to simplify text processing.
Text Abbreviation Expansion: Abbreviations like "can't" were expanded to "cannot".
Number Removal: Numbers were removed as they generally don't contribute to sentiment.
Handling Repeated Characters: Words with repeated characters (e.g., "hiiiiii") were normalized.
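
The cleaning steps above can be sketched with a handful of regular expressions. This is a minimal illustration under stated assumptions, not the notebook's actual code: the function name `clean_tweet`, the tiny `STOPWORDS` and `CONTRACTIONS` tables, the lowercasing step, and the choice to collapse repeated characters to a single character are all hypothetical.

```python
import re

# Tiny illustrative subsets; a real run would use full stopword and contraction lists.
STOPWORDS = {"the", "is", "in", "a", "an", "and", "to", "was"}
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}


def clean_tweet(text):
    """Apply the cleaning steps in an order that keeps them from interfering."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"@\w+", " ", text)                    # @username mentions
    for short, full in CONTRACTIONS.items():             # expand abbreviations
        text = text.replace(short, full)                 # (straight apostrophes only)
    text = re.sub(r"\d+", " ", text)                     # numbers
    text = re.sub(r"(.)\1{2,}", r"\1", text)             # "hiiiiii" -> "hi"
    text = re.sub(r"[^a-z\s]", " ", text)                # punctuation, emojis, symbols
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)


raw = "@united I can't believe the flight is delayed!!! http://t.co/xyz \U0001F621 sooooo late"
print(clean_tweet(raw))  # -> i cannot believe flight delayed so late
```

Abbreviations are expanded before punctuation is stripped; otherwise "can't" would lose its apostrophe and no longer match the lookup table.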

Exploratory Data Analysis (EDA)

Key insights from EDA include:

Sentiment Distribution:
- Negative: 9,178 tweets
- Neutral: 3,099 tweets
- Positive: 2,363 tweets
- The dataset is heavily skewed towards negative tweets, necessitating the use of SMOTE for balancing.
Tweet Length Analysis:
- Average tweet length: 67 characters.
- Tweet lengths are unimodal and roughly symmetric, with no significant outliers.
Hashtag Analysis:
- Common hashtags like #fail, #help, and airline-specific names were identified, indicating user concerns about delays and service quality.
Word Frequency Analysis:
- Frequent words like "delayed," "thank," and "service" were identified, reflecting common complaints or expressions of gratitude.
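
As a quick check on the sentiment distribution above, the class shares and the majority-to-minority ratio can be computed directly from the reported counts. The variable names are illustrative.

```python
# Tweet counts per class, as reported in the EDA section.
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"total tweets: {total}")          # 14640, matching the dataset size
for label, share in shares.items():
    print(f"{label:<9}{share:6.1%}")     # 62.7%, 21.2%, 16.1%
print(f"majority/minority ratio: {imbalance_ratio:.2f}")  # 3.88
```

Roughly 63% of the tweets are negative, and there are almost four negative tweets for every positive one, which is the skew SMOTE is meant to offset.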

Feature Engineering

Feature Scaling: Tweet length and polarity were standardized using a scaler.
Mutual Information: Applied to identify the most informative features for sentiment classification.
TF-IDF Vectorization: Used to convert text data into numerical feature vectors.
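
To make the TF-IDF step concrete, here is a small stdlib-only sketch that weights each term by its count in the tweet and by how rare it is across tweets, then L2-normalises each vector. The function name `tfidf` and the sample tweets are assumptions. The smoothed IDF formula shown is the one scikit-learn's `TfidfVectorizer` uses by default, but whether the notebook uses that class or those defaults is not stated in this README.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Already-cleaned toy tweets (hypothetical data).
docs = ["flight delayed again", "thank you great service", "flight service delayed"]
vectors = tfidf(docs)
print(vectors[0])
```

In the first tweet, "again" receives a higher weight than "flight" because it appears in only one document, which is the behaviour that lets rare, informative words stand out.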

Model Performance

Three models were evaluated using stratified K-Fold cross-validation with SMOTE to address class imbalance:

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| MultinomialNB | 84.60% | 80.01% | 76.28% | 77.63% |
| GaussianNB | 50.71% | 47.36% | 50.61% | 45.52% |
| KNN | 29.60% | 54.10% | 48.27% | 30.78% |

Multinomial Naïve Bayes (MultinomialNB) achieved the highest performance across all metrics, making it the chosen model for deployment.
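
For reference, the sketch below shows one common way to compute these four metrics for a three-class problem: accuracy over all predictions, and precision, recall, and F1 averaged over the classes (macro averaging). The README does not state which averaging scheme was used, so macro averaging is an assumption here, as are the function name `classification_metrics` and the toy labels.

```python
def classification_metrics(y_true, y_pred, labels=(0, 1, 2)):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append((precision, recall, f1))
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(c[0] for c in per_class) / n,
        "recall": sum(c[1] for c in per_class) / n,
        "f1": sum(c[2] for c in per_class) / n,
    }


# Toy labels: 0 = Negative, 1 = Neutral, 2 = Positive (hypothetical predictions).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With per-class averaging, the averaged F1 is generally not the harmonic mean of the averaged precision and recall, which would be consistent with the F1 values reported in the table.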

Real-Time Prediction Workflow

Input Preprocessing:
- User input is cleaned using the same techniques applied during training (stopword removal, URL removal, etc.).
Feature Vectorization:
- The preprocessed text is transformed into numerical feature vectors using TF-IDF vectorization.
Sentiment Prediction:
- The MultinomialNB model predicts the sentiment of the input text as one of three classes: Negative (0), Neutral (1), or Positive (2).
Tone Detection:
- A separate sarcasm detection model is applied to identify potentially sarcastic inputs.
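
The four workflow steps can be sketched as a single function that chains them together. Everything here is a hypothetical outline: `predict_tweet`, its parameters, and the stand-in lambdas are illustrative names, and the real cleaner, vectorizer, and trained models would be passed in their place. Only the label encoding (0, 1, 2) comes from the workflow description.

```python
# Class encoding as described in the workflow.
LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}


def predict_tweet(text, clean, vectorize, sentiment_model, sarcasm_model):
    """Run one tweet through cleaning, vectorization, sentiment, and tone steps."""
    cleaned = clean(text)                         # 1. input preprocessing
    features = vectorize(cleaned)                 # 2. feature vectorization
    sentiment = LABELS[sentiment_model(features)] # 3. sentiment prediction
    # 4. tone detection; assumed here to work on the cleaned text with its own features
    tone = "Sarcastic" if sarcasm_model(cleaned) else "Not Sarcastic"
    return {"sentiment": sentiment, "tone": tone}


# Stand-in components so the sketch runs without trained models.
result = predict_tweet(
    "The flight was delayed, but the crew was very polite.",
    clean=lambda s: s.lower().strip("."),
    vectorize=lambda s: s.split(),
    sentiment_model=lambda feats: 2 if "polite" in feats else 0,
    sarcasm_model=lambda s: False,
)
print(result)  # {'sentiment': 'Positive', 'tone': 'Not Sarcastic'}
```

Passing the components in as arguments keeps the sketch independent of any particular library and makes it clear that inference must reuse the same cleaning and vectorization that were fitted during training.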

How to Use

Clone the Repository:
```
git clone https://github.com/yourusername/sentiment-analysis-airline-tweets.git
```
Make sure the Sarcasm (Sarcasm_Headlines_Dataset.json) and Twitter (Tweets.csv) datasets are available to the notebook (.ipynb file) so that the code runs without errors.

Interact with the Model

Enter a tweet or text for sentiment analysis.
The system will preprocess the input and return the sentiment prediction along with the probability distribution.

Example:

Input:
"The flight was delayed, but the crew was very polite."

Output:

Sentiment Verdict: Positive
Tone Prediction: Not Sarcastic

Exit

Type "quit" to exit the application.
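
The interaction described above amounts to a simple read-predict-print loop that stops on "quit". The sketch below is an assumption about how such a loop could look, not the notebook's code: `run_prompt`, the prompt text, and the injected `read`/`write` functions are hypothetical, and the demo feeds scripted inputs instead of waiting on the keyboard.

```python
def run_prompt(predict, read=input, write=print):
    """Ask for tweets until the user types 'quit', printing each verdict."""
    while True:
        text = read("Enter a tweet (or 'quit' to exit): ").strip()
        if text.lower() == "quit":
            break
        if not text:
            continue  # ignore empty input and ask again
        result = predict(text)
        write(f"Sentiment Verdict: {result['sentiment']}")
        write(f"Tone Prediction: {result['tone']}")


# Scripted demo: a fixed prediction function and canned inputs (hypothetical).
scripted = iter(["Great crew today!", "quit"])
lines = []
run_prompt(
    predict=lambda text: {"sentiment": "Positive", "tone": "Not Sarcastic"},
    read=lambda prompt: next(scripted),
    write=lines.append,
)
print("\n".join(lines))
```

Calling `run_prompt(predict)` with the defaults would use the real `input` and `print`, giving the interactive behaviour described in this section.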

Limitations

Class Imbalance: SMOTE balances the classes only by adding synthetic samples; the underlying data remains dominated by negative tweets, which could still affect model performance.
Sarcasm Detection: Sarcasm is difficult to detect accurately, and the current model may not always capture nuanced tones.
Temporal Trends: The dataset is from 2015, and trends in customer sentiment may have evolved since then.

References

Lopamudra Dey, Sanjay Chakraborty, Anuraag Biswas, Beepa Bose, Sweta Tiwari, "Sentiment Analysis of Review Datasets Using Naïve Bayes' and K-NN Classifier", International Journal of Information Engineering and Electronic Business (IJIEEB), Vol.8, No.4, pp.54-62, 2016.
Yohanssen Pratama et al 2019 J. Phys.: Conf. Ser. 1175 012102
M.Govindarajan, December 2013, Sentiment Analysis of Movie Reviews using Hybrid

Images

The repository README includes the following figures:

- Tweet Analysis
- Reason Distribution
- Confusion Matrix (average across 5 folds)
- Model Performance Comparison
- Key Metrics
- User Interaction
