
A simple Bernoulli Naive Bayes model achieved 94% accuracy on the IMDB sentiment analysis task. Despite its simplicity, it performed exceptionally well.


farshad257/NLP_IMDB


🔧 NLP Pipeline for Text Classification

This notebook builds an NLP pipeline for text classification on the IMDB dataset. It covers text preprocessing, feature extraction, and the training and evaluation of several machine learning classifiers.


🚀 Project Overview

This project demonstrates a complete NLP workflow: it takes raw text, cleans and vectorizes it, and classifies it with classical machine learning models.

The notebook covers each stage, from basic text cleaning to training and comparing classification algorithms.


✔️ Features and Workflow

Here's what this project offers:

✅ 1. Text Preprocessing

  • Lowercasing and punctuation removal
  • Tokenization and stopwords filtering
  • Lemmatization using spaCy
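
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `preprocess`, the tiny `STOPWORDS` set, and the optional `lemmatize` hook are all assumptions made for the example. The spaCy call is shown only in a comment so the sketch runs without the model installed.

```python
import string

# Small illustrative stopword list (an assumption; a real pipeline would
# likely use a fuller list such as the one shipped with spaCy).
STOPWORDS = {"a", "an", "the", "and", "is", "was", "it", "this", "of", "to", "in"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stopwords=STOPWORDS, lemmatize=None):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords,
    and optionally map each remaining token through `lemmatize`."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    tokens = no_punct.split()
    kept = [tok for tok in tokens if tok not in stopwords]
    if lemmatize is not None:
        kept = [lemmatize(tok) for tok in kept]
    return kept


# Optional lemmatization with spaCy (needs the en_core_web_sm model):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   lemmatize = lambda tok: nlp(tok)[0].lemma_

print(preprocess("This movie was GREAT, and the acting is superb!"))
```

Lemmatizing tokens one at a time, as in the commented hook, discards sentence context; running spaCy over the full review and reading `token.lemma_` from each token is usually the better choice.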

📊 2. Feature Extraction

  • Bag of Words (BoW)
  • TF-IDF Vectorization
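
As a rough sketch of the two feature representations, the snippet below vectorizes a toy corpus with scikit-learn's `CountVectorizer` (Bag of Words) and `TfidfVectorizer`. The three example documents are invented for illustration, and the notebook's actual vectorizer settings are not known, so defaults are assumed here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical; stands in for cleaned IMDB reviews).
docs = [
    "great movie great acting",
    "terrible movie",
    "great fun",
]

# Bag of Words: each column is a vocabulary term, each cell a raw count.
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)

# TF-IDF: counts reweighted by inverse document frequency, then each row
# is L2-normalized (scikit-learn's default behavior).
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# Columns are ordered alphabetically by term.
print(sorted(bow.vocabulary_))
print(X_bow.toarray())
print(X_tfidf.toarray().round(2))
```

Both vectorizers return sparse matrices, which matters for a corpus the size of IMDB; `toarray()` is used here only to make the toy output readable.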

💻 3. Modeling

  • Multiple ML models tested:
    • Multinomial Naive Bayes
    • Support Vector Machine (SVM)
    • Logistic Regression
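
One way to set up the three model families is to wrap each in a scikit-learn pipeline together with a vectorizer, as sketched below. The helper `build_models`, the toy reviews, and the choice of `LinearSVC` for the SVM are assumptions for illustration; the notebook may use a different SVM variant or different settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled reviews (hypothetical): 1 = positive, 0 = negative.
texts = [
    "wonderful film",
    "great acting loved it",
    "great wonderful story",
    "awful film",
    "terrible acting hated it",
    "terrible awful story",
]
labels = [1, 1, 1, 0, 0, 0]


def build_models():
    """Return one vectorizer + classifier pipeline per model family."""
    return {
        "MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
        "LinearSVC": make_pipeline(TfidfVectorizer(), LinearSVC()),
        "LogisticRegression": make_pipeline(
            TfidfVectorizer(), LogisticRegression(max_iter=1000)
        ),
    }


models = build_models()
for name, model in models.items():
    model.fit(texts, labels)
    print(name, model.predict(["wonderful great", "awful terrible"]))
```

Keeping the vectorizer inside the pipeline means it is fitted only on the data passed to `fit`, which avoids leaking vocabulary statistics from held-out reviews.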

📈 4. Evaluation

  • Train-test splitting
  • Accuracy and classification reports
  • Model comparison and selection
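
The evaluation loop could look something like the sketch below: split the data, fit each candidate, report accuracy and a classification report, then pick the highest-scoring model. The toy data, the 25% test size, and the two candidates shown are assumptions; the same loop works for any set of scikit-learn estimators.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled reviews (hypothetical): 1 = positive, 0 = negative.
positive = ["great film", "wonderful acting", "loved the story", "great wonderful cast"]
negative = ["awful film", "terrible acting", "hated the story", "awful terrible cast"]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# Stratified split keeps the class balance identical in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

candidates = {
    "MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "LogisticRegression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    scores[name] = accuracy_score(y_test, preds)
    print(f"{name}: accuracy={scores[name]:.2f}")
    # zero_division=0 silences warnings when a class gets no predictions.
    print(classification_report(y_test, preds, zero_division=0))

# Model selection: keep the candidate with the highest test accuracy.
best_name = max(scores, key=scores.get)
print("Best model:", best_name)
```

With only a handful of toy examples the scores themselves are meaningless; the point is the shape of the comparison, which carries over unchanged to the full dataset.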

📁 Dataset

IMDB movie reviews, used here for sentiment analysis.


🔍 Dependencies

Make sure to install the following dependencies before running the notebook:

pip install pandas numpy matplotlib seaborn scikit-learn spacy
python -m spacy download en_core_web_sm

📂 How to Use

  1. Clone this repository or download the notebook.
  2. Open the notebook in JupyterLab or Google Colab.
  3. Execute each cell sequentially.
  4. Analyze the final performance metrics and results.

🌟 Highlights

  • Clean, modular code with comments and visualizations.
  • Easy to extend with deep learning or other NLP models.
  • Suitable for binary or multiclass text classification tasks.

💡 Potential Improvements

  • Integrate deep learning using transformers (BERT, RoBERTa)
  • Add hyperparameter tuning with GridSearchCV
  • Deploy as an API using Flask or FastAPI
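
For the hyperparameter tuning item, a minimal `GridSearchCV` sketch might look like this. The pipeline step names (`tfidf`, `clf`), the parameter grid, and the toy data are all assumptions chosen for illustration, not settings taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy labeled reviews (hypothetical): 1 = positive, 0 = negative.
texts = [
    "great film", "wonderful acting", "loved the story", "great wonderful cast",
    "awful film", "terrible acting", "hated the story", "awful terrible cast",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy dataset is tiny; 5 or more folds is more usual.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 2))
```

Because the vectorizer sits inside the pipeline, it is refitted on each training fold, so the cross-validated score is not inflated by vocabulary from the validation fold.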

🙋 Author

Developed with precision and passion by:

Farshad Tofighi [farshad257]
📧 Email: farshadtfgh@gmail.com

If you use this project or find it helpful, feel free to reach out for collaboration, discussion, or feedback.


📜 License

This project is released under the MIT License – feel free to use, adapt, and share!

