1) Word Translation (Machine Translation)

Goal:

Build a program that translates English words to French

Approach:

The word embeddings data for English and French words
Translations
LSH and document search
Looking up the tweets
Approximate K-NN
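
The translation and approximate K-NN steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the 2-D vectors are made-up toy embeddings, the matrix R is fitted with ordinary least squares (an assumption, since the README does not say how the mapping is learned), and the names nearest_neighbor, hash_vector, and translate are hypothetical helpers.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_neighbor(vector, candidates):
    """Return the word in `candidates` (word -> vector) most similar to `vector`."""
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))

def hash_vector(vector, planes):
    """LSH bucket id: one bit per plane, set when the vector lies on its positive side.

    `planes` has shape (dim, n_planes); each column is a plane's normal vector.
    """
    bits = (vector @ planes) >= 0
    return int(sum(2 ** i for i, bit in enumerate(bits) if bit))

# Toy embeddings (hypothetical values, NOT the real word2vec vectors).
en = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
fr = {"chat": np.array([0.0, 1.0]), "chien": np.array([-1.0, 0.0])}

# Fit R so that X @ R is as close as possible to Y for known translation pairs.
pairs = [("cat", "chat"), ("dog", "chien")]
X = np.vstack([en[e] for e, _ in pairs])
Y = np.vstack([fr[f] for _, f in pairs])
R, *_ = np.linalg.lstsq(X, Y, rcond=None)

def translate(word):
    """Map an English vector into the French space and pick the closest French word."""
    return nearest_neighbor(en[word] @ R, fr)

print(translate("cat"), translate("dog"))
print(hash_vector(np.array([1.0, -1.0]), np.eye(2)))
```

With real embeddings, hash_vector would be used to group vectors into buckets so that the nearest-neighbor search only compares against candidates in the same bucket instead of the whole vocabulary.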

Data:

The English embeddings come from the Google Code Archive word2vec project (look for GoogleNews-vectors-negative300.bin.gz),
and the French embeddings come from cross_lingual_text_classification.

2) Auto-correct System

Goal:

Implement a model that corrects misspelled words that are one or two edits away from a known word

Approach:

Data Preprocessing
String Manipulations
Combining the edits
Minimum Edit distance (dynamic programming)
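
The string manipulations and the dynamic-programming distance listed above might look roughly like this. It is a sketch under assumptions: the README does not state the edit costs, so all costs default to 1 (classic Levenshtein) and can be overridden, and the function names and the tiny vocabulary are hypothetical.

```python
import string

def edit_one_letter(word, letters=string.ascii_lowercase):
    """All strings one delete, switch, replace, or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    switches = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return (deletes | switches | replaces | inserts) - {word}

def edit_two_letters(word):
    """All strings reachable by applying two single edits in a row."""
    return {e2 for e1 in edit_one_letter(word) for e2 in edit_one_letter(e1)}

def get_corrections(word, vocab):
    """Prefer the word itself, then one-edit candidates, then two-edit candidates."""
    if word in vocab:
        return {word}
    return (edit_one_letter(word) & vocab) or (edit_two_letters(word) & vocab)

def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=1):
    """Fill a (len(source)+1) x (len(target)+1) table of cheapest edit costs."""
    m, n = len(source), len(target)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + del_cost      # delete every source character
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins_cost      # insert every target character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r = 0 if source[i - 1] == target[j - 1] else rep_cost
            D[i][j] = min(D[i - 1][j] + del_cost,
                          D[i][j - 1] + ins_cost,
                          D[i - 1][j - 1] + r)
    return D[m][n]

vocab = {"day", "data", "they"}               # hypothetical toy vocabulary
print(get_corrections("dat", vocab))
print(min_edit_distance("intention", "execution"))
```

Setting rep_cost=2 treats a replacement as a delete plus an insert, which is a common alternative weighting.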

3) Multi-Class Classification and Rating Prediction of Yelp Dataset (Kaggle's Yelp Business Rating Prediction competition)

Goal:

Predict the star rating of a review using only the review text

Approach:

Split data into training and testing sets, using the review text as the only feature and the star rating as the response
Use CountVectorizer to create document-term matrices from X_train and X_test
Use multinomial Naive Bayes to predict the star rating for the reviews, then calculate the accuracy and print the confusion matrix
Calculate the null accuracy (classification accuracy that could be achieved by always predicting the most frequent class)
Evaluate the classification model by examining its "false positives" and "false negatives"
Calculate which 10 tokens are the most predictive of 5-star reviews, and which 10 tokens are the most predictive of 1-star reviews
Repeat the model-building process using all reviews for 5-class classification (instead of the binary 1-star vs. 5-star task)
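
The binary (1-star vs. 5-star) steps above can be sketched with scikit-learn as follows. The reviews here are invented toy sentences, not Yelp data, and the variable names are only illustrative, so the numbers this prints say nothing about the results reported below.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the review text (feature) and star rating (response).
X_train = ["great food and friendly staff",
           "terrible service and rude staff",
           "amazing pizza great prices",
           "awful food terrible experience"]
y_train = [5, 1, 5, 1]
X_test = ["great pizza", "friendly staff great prices", "rude and awful"]
y_test = [5, 5, 1]

# Learn the vocabulary on the training text only, then reuse it for the test text.
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# Fit multinomial Naive Bayes on the document-term matrix and predict ratings.
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[1, 5])  # rows: true, columns: predicted

# Null accuracy: the share of the most frequent class in the test set.
null_accuracy = Counter(y_test).most_common(1)[0][1] / len(y_test)

print(accuracy, null_accuracy)
print(cm)
```

Comparing accuracy against null_accuracy shows how much the model adds over always predicting the most frequent class.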

Results:

Binary Classification:

Accuracy: 92% (null accuracy 82%)
False positive: Model is reacting to the words "good", "impressive", "nice"
False negative: Model is reacting to the words "complain", "crowds", "rushing", "pricey", "scum"

5-Class Classification:

Accuracy: 47%
47% accuracy is quite impressive, given that humans would also have a hard time precisely identifying the star rating for these reviews
Precision: 54%
Recall: 30%
F1 score: 38%

Interpretation:

Confusion matrix comments: Almost all 4-star and 5-star reviews are classified as 4 or 5 stars, but the model has a hard time telling the two apart. 1-star, 2-star, and 3-star reviews are most commonly classified as 4 stars, probably because that is the predominant class in the training data.
Classification report comments: Class 1 has low recall, meaning the model has a hard time detecting 1-star reviews, but high precision, meaning that when the model predicts a review is 1-star, it is usually correct. Class 5 has high recall and precision, probably because 5-star reviews use polarized language and because the model has many observations to learn from.
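
To make the precision and recall readings above concrete, here is a small sketch that derives both per class from a confusion matrix. The 3x3 matrix is a made-up example chosen to show a low-recall, high-precision class; it is not the project's actual confusion matrix, and per_class_precision_recall is a hypothetical helper.

```python
def per_class_precision_recall(cm):
    """cm[i][j] = number of items with true class i predicted as class j."""
    n = len(cm)
    precision, recall = [], []
    for k in range(n):
        true_positive = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))   # column sum
        actual_k = sum(cm[k])                           # row sum
        precision.append(true_positive / predicted_k if predicted_k else 0.0)
        recall.append(true_positive / actual_k if actual_k else 0.0)
    return precision, recall

# Hypothetical matrix: most items of every class are predicted as the last class.
toy_cm = [[2, 1, 7],
          [0, 3, 7],
          [0, 1, 19]]

precision, recall = per_class_precision_recall(toy_cm)
print(precision[0], recall[0])
```

In this toy matrix the first class is rarely predicted, so its recall is low, but every prediction of it is correct, so its precision is high.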

4) Sentiment Analysis on Tweets (Binary Classification)

Goal:

Given a tweet, decide if it has a positive sentiment or a negative one

Approach:

Naive Bayes and Logistic Regression

Data:

The NLTK twitter_samples corpus contains subsets of 5,000 positive tweets and 5,000 negative tweets, plus the full set of 10,000 tweets.
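
As a rough illustration of the Naive Bayes half of the approach, the sketch below trains on a handful of invented tweets using word counts with add-one (Laplace) smoothing. The whitespace tokenization, the smoothing choice, and the function names are assumptions for the example; the notebooks may preprocess and score tweets differently.

```python
import math
from collections import Counter

def train_naive_bayes(tweets, labels):
    """Return (log_prior, log_likelihood) for labels 1 = positive, 0 = negative."""
    pos_counts, neg_counts = Counter(), Counter()
    for tweet, label in zip(tweets, labels):
        target = pos_counts if label == 1 else neg_counts
        target.update(tweet.lower().split())
    vocab = set(pos_counts) | set(neg_counts)
    n_pos, n_neg = sum(pos_counts.values()), sum(neg_counts.values())
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = len(labels) - d_pos
    log_prior = math.log(d_pos / d_neg)
    # Log ratio of smoothed word probabilities: > 0 leans positive, < 0 negative.
    log_likelihood = {
        w: math.log((pos_counts[w] + 1) / (n_pos + len(vocab)))
           - math.log((neg_counts[w] + 1) / (n_neg + len(vocab)))
        for w in vocab
    }
    return log_prior, log_likelihood

def predict(tweet, log_prior, log_likelihood):
    """Sum the log ratios of known words; unseen words contribute nothing."""
    score = log_prior + sum(log_likelihood.get(w, 0.0) for w in tweet.lower().split())
    return 1 if score > 0 else 0

# Invented toy tweets, not taken from twitter_samples.
tweets = ["i am happy today", "what a great day",
          "i am sad today", "this is a terrible day"]
labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_naive_bayes(tweets, labels)
print(predict("happy great day", log_prior, log_likelihood))
print(predict("sad terrible", log_prior, log_likelihood))
```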


MerEsf/NLP_Text_Mining
