A comprehensive implementation of traditional and modern NLP techniques, demonstrating the evolution from classical machine learning to transformer-based approaches.
This project implements NLP techniques across three major areas: text preprocessing and vectorization, traditional machine learning models, and modern transformer-based architectures using Hugging Face. The implementations use three datasets: children's books, movie reviews, and movie metadata with director information.
├── section3_assignments_working.ipynb   # Text preprocessing and vectorization
├── section4_assignments_working.ipynb   # Traditional ML NLP (VADER, Naive Bayes, NMF)
└── section7_assignments_working.ipynb   # Modern NLP with transformers
Dataset: Children's books (`childrens_books.csv`)
Preprocessing Pipeline:
- Text normalization with pandas: lowercase conversion, Unicode character removal (`\xa0`), punctuation stripping
- Tokenization and lemmatization with spaCy (`en_core_web_sm`)
- Custom preprocessing function `clean_and_normalize` from `maven_text_preprocessing` (sketched below)
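A minimal sketch of this pipeline; the real `clean_and_normalize` lives in `maven_text_preprocessing` and may differ in detail, so treat this as an illustrative stand-in:

```python
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_and_normalize(text: str) -> str:
    """Lowercase, replace non-breaking spaces (U+00A0), keep alphabetic lemmas."""
    text = text.lower().replace("\xa0", " ")
    doc = nlp(text)
    # token.is_alpha drops punctuation and digits in one pass
    return " ".join(token.lemma_ for token in doc if token.is_alpha)

books = pd.read_csv("childrens_books.csv")
books["clean_description"] = books["Description"].apply(clean_and_normalize)
```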
Vectorization:
- CountVectorizer: Implemented with stop word removal and minimum document frequency threshold (10%)
- TF-IDF Vectorizer: Applied with identical parameters for comparison
- Generated document-term matrices and identified top 10 most/least common terms
- Visualized term frequencies using horizontal bar charts with matplotlib
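A sketch of the vectorization step, reusing the `books` DataFrame from the preprocessing example above (the exact parameters beyond `min_df=0.1` are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Identical parameters on both vectorizers so the matrices are comparable
count_vec = CountVectorizer(stop_words="english", min_df=0.1)
tfidf_vec = TfidfVectorizer(stop_words="english", min_df=0.1)

count_dtm = count_vec.fit_transform(books["clean_description"])
tfidf_dtm = tfidf_vec.fit_transform(books["clean_description"])

# Top 10 most common terms by total count across the corpus
totals = count_dtm.sum(axis=0).A1  # .A1 flattens the (1, n_terms) matrix
terms = count_vec.get_feature_names_out()
top10 = sorted(zip(terms, totals), key=lambda t: t[1], reverse=True)[:10]
```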
Dataset: Movie reviews with ratings, genres, directors, and critic consensus (`movie_reviews.csv`)
- Applied VADER (Valence Aware Dictionary and sEntiment Reasoner) to movie descriptions
- Extracted compound sentiment scores ranging from -1 to +1
- Identified movies with highest positive sentiment (e.g., "Breakthrough": 0.9915) and most negative sentiment (e.g., "Charlie Says": -0.9706)
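A minimal sketch of the VADER scoring step, assuming the column names listed in the dataset description below (`Title`, `Description`):

```python
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
movies = pd.read_csv("movie_reviews.csv")

# polarity_scores returns neg/neu/pos plus a normalized 'compound' score in [-1, 1]
movies["sentiment"] = movies["Description"].apply(
    lambda text: analyzer.polarity_scores(text)["compound"]
)

print(movies.nlargest(5, "sentiment")[["Title", "sentiment"]])
```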
- Target Variable: Director gender prediction (male/female)
- Features: Cleaned and normalized movie descriptions using spaCy
- Models Implemented:
- Multinomial Naive Bayes
- Logistic Regression
- Vectorization with TF-IDF (min_df=0.1)
- Model comparison using accuracy scores and classification reports
- Identified movies most likely to be directed by women based on model predictions
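A sketch of the classification setup under stated assumptions: `clean_description` is the spaCy-normalized column from earlier, `Director Gender` matches the dataset's feature list, and the train/test split parameters are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X = TfidfVectorizer(stop_words="english", min_df=0.1).fit_transform(
    movies["clean_description"]
)
y = movies["Director Gender"]  # binary target: male/female

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for model in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    print(type(model).__name__)
    print(classification_report(y_test, model.predict(X_test)))
```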
- Algorithm: Non-Negative Matrix Factorization (NMF)
- Parameters: 6 components, 500 max iterations, random_state=42
- Preprocessing: TF-IDF vectorization with min_df=0.02, max_df=0.2
- Generated interpretable topic labels:
- Family films
- True stories
- Friends narratives
- Award winners
- Adventure
- Horror
- Custom `display_topics` function to extract top 10 terms per topic (sketched below)
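A sketch of the NMF setup using the parameters listed above; `display_topics` here is an illustrative stand-in for the notebook's helper:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words="english", min_df=0.02, max_df=0.2)
dtm = tfidf.fit_transform(movies["clean_description"])

nmf = NMF(n_components=6, max_iter=500, random_state=42)
doc_topics = nmf.fit_transform(dtm)  # document-topic weights

def display_topics(model, feature_names, n_top=10):
    """Print the highest-weighted terms for each NMF component."""
    for idx, component in enumerate(model.components_):
        top = component.argsort()[::-1][:n_top]
        print(f"Topic {idx}: " + ", ".join(feature_names[i] for i in top))

display_topics(nmf, tfidf.get_feature_names_out())
```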
Datasets:
- Movie reviews with VADER sentiment scores (`movie_reviews_sentiment.csv`)
- Children's books (`childrens_books.csv`)
All transformer models configured with Metal Performance Shaders (MPS) device acceleration on Apple Silicon.
- Compared transformer-based sentiment analysis against VADER baseline
- Pipeline implementation for zero-setup inference
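A minimal sketch of the pipeline abstraction with MPS acceleration; no specific model is named in this section, so this relies on the pipeline's default sentiment checkpoint:

```python
from transformers import logging, pipeline

logging.set_verbosity_error()  # silence model-loading warnings

# device="mps" selects Metal Performance Shaders on Apple Silicon;
# use device=-1 (CPU) or device=0 (CUDA) elsewhere.
sentiment = pipeline("sentiment-analysis", device="mps")

result = sentiment("A heartwarming story with a breathtaking finale.")[0]
print(result["label"], result["score"])
```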
- Model: `dbmdz/bert-large-cased-finetuned-conll03-english`
- Aggregation Strategy: `simple`
- Applied to children's book descriptions
- Extracted person entities (PER) and filtered to exclude authors
- Generated unique list of character names from book descriptions
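A sketch of the NER step; the sample sentence is invented, and the author filter is indicated only in a comment since its exact logic isn't shown here:

```python
from transformers import pipeline

ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",  # merges subword tokens into whole entities
    device="mps",
)

entities = ner("Max sails to the island where the wild things live.")
# Keep person entities; author filtering would compare against the Author column
characters = {e["word"] for e in entities if e["entity_group"] == "PER"}
```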
- Model: `facebook/bart-large-mnli`
- Categories: Adventure & fantasy, animals & nature, mystery, humor, non-fiction
- Applied to children's book descriptions without training data
- Validated classification results through manual spot-checking
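A sketch of the zero-shot step with the categories listed above (lowercased here as candidate labels):

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device="mps",
)

labels = ["adventure & fantasy", "animals & nature", "mystery", "humor", "non-fiction"]
result = classifier(books.loc[0, "Description"], candidate_labels=labels)
print(result["labels"][0], result["scores"][0])  # top category and its score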
- Model: `facebook/bart-large-cnn`
- Parameters: `min_length=10`, `max_length=50`, `early_stopping=True`, `length_penalty=0.8`
- Generated abstractive summaries of book descriptions
- Example: "Where the Wild Things Are" description (78 words) → summary (33 words)
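A sketch of the summarization call with the parameters listed above, again assuming the `books` DataFrame from earlier:

```python
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device="mps",
)

summary = summarizer(
    books.loc[0, "Description"],
    min_length=10,
    max_length=50,
    early_stopping=True,
    length_penalty=0.8,
)[0]["summary_text"]
```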
- Model: `sentence-transformers/all-MiniLM-L6-v2`
- Technique: Feature extraction to generate 384-dimensional embeddings
- Computed cosine similarity between "Harry Potter and the Sorcerer's Stone" and all books
- Identified top 5 most similar books:
- Harry Potter and the Sorcerer's Stone (1.0000)
- Harry Potter and the Prisoner of Azkaban (0.8726)
- Harry Potter and the Chamber of Secrets (0.8554)
- The Witches (0.7991)
- The Wonderful Wizard of Oz (0.7885)
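A sketch of the similarity search. Plain mean pooling over token embeddings is used here for brevity; the official sentence-transformers pooling also weights by the attention mask, so scores may differ slightly. A default integer index on `books` is assumed:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

extractor = pipeline(
    "feature-extraction",
    model="sentence-transformers/all-MiniLM-L6-v2",
    device="mps",
)

# Mean-pool token embeddings into one 384-dim vector per description
embeddings = np.vstack(
    [np.mean(extractor(text)[0], axis=0) for text in books["Description"]]
)

idx = books.index[books["Title"] == "Harry Potter and the Sorcerer's Stone"][0]
scores = cosine_similarity(embeddings[idx : idx + 1], embeddings)[0]
print(books.iloc[scores.argsort()[::-1][:5]][["Title"]])
```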
- spaCy 3.8.0: Tokenization, lemmatization, linguistic annotations
- transformers (Hugging Face): Pre-trained transformer models and pipelines
- vaderSentiment: Rule-based sentiment analysis
- scikit-learn:
  - Vectorization: `CountVectorizer`, `TfidfVectorizer`
  - Models: `MultinomialNB`, `LogisticRegression`, `NMF`
  - Metrics: `cosine_similarity`
- PyTorch: Backend for transformer models
- pandas: Data manipulation and analysis
- numpy: Numerical operations and array handling
- matplotlib: Data visualization
| Task | Model | Source |
|---|---|---|
| NER | BERT-large-cased CoNLL03 | dbmdz/bert-large-cased-finetuned-conll03-english |
| Zero-Shot | BART-large MNLI | facebook/bart-large-mnli |
| Summarization | BART-large CNN | facebook/bart-large-cnn |
| Embeddings | MiniLM-L6-v2 | sentence-transformers/all-MiniLM-L6-v2 |
Three separate conda environments were used:
# Text preprocessing environment
conda create -n nlp_preprocessing python=3.12
conda activate nlp_preprocessing
pip install pandas spacy matplotlib
python -m spacy download en_core_web_sm
# Traditional ML environment
conda create -n nlp_machine_learning python=3.12
conda activate nlp_machine_learning
pip install pandas scikit-learn vaderSentiment spacy
# Transformers environment
conda create -n nlp_transformers python=3.12
conda activate nlp_transformers
pip install pandas transformers torch
VADER and transformer-based approaches showed different strengths:
- VADER: Fast, interpretable, good for social media text
- Transformers: More nuanced understanding, better context handling
Logistic Regression and Naive Bayes both achieved competitive accuracy in predicting director gender from movie descriptions, suggesting that linguistic patterns in the descriptions correlate with director gender.
NMF successfully identified coherent topics from unlabeled movie descriptions, with clear thematic separation between genres (family films vs. horror, true stories vs. adventure).
Sentence transformers effectively captured semantic similarity, correctly identifying Harry Potter sequels as most similar to the first book, followed by other magical adventure stories (The Witches, Wizard of Oz).
- Lowercase normalization before vectorization
- Unicode character handling (`\xa0` removal)
- Punctuation stripping
- Stop word removal for dimensionality reduction
- Minimum document frequency thresholds to eliminate rare terms
- Used both Count and TF-IDF vectorization for comparison
- TF-IDF consistently provided better features for classification tasks
- Document frequency thresholds (min_df, max_df) crucial for performance
- All transformers configured with `logging.set_verbosity_error()` to reduce output
- MPS device acceleration on Apple Silicon: `device='mps'`
- Pipeline abstraction for simplified inference
- Aggregation strategies in NER to merge subword tokens
- 100 books ranked by popularity
- Features: Title, Author, Year, Rating (1-5), Description
- Publication years: 1947-2014
- Average rating: 4.3/5.0
- 160 movies from 2019
- Features: Title, Rating, Genre, Release Date, Description, Directors, Director Gender, Tomatometer Rating, Audience Rating, Critics Consensus
- Includes VADER sentiment scores in later versions
- Director gender distribution provides binary classification target
- spaCy processing: Minimal (CPU sufficient)
- VADER sentiment: Extremely fast (rule-based)
- Transformer inference: GPU/MPS recommended for batch processing
- Feature extraction: Most computationally intensive (384-dim embeddings for 100+ documents)
- Document-term matrices kept sparse for efficiency
- Transformer models loaded individually (1-2GB each)
- Embedding matrices: ~150KB for 100 documents (100 × 384 float32 values)
All notebooks use `pandas.set_option('display.max_colwidth', None)` for full text display and include inline visualizations with matplotlib.
Processing order:
- Data loading
- Text cleaning
- Vectorization/Model application
- Analysis and visualization
- Results interpretation
All implementations use Python 3.12.11 via miniforge3 (Apple Silicon).
These implementations are based on assignments from the Natural Language Processing in Python course, covering the progression from foundational text processing to state-of-the-art transformer architectures.