Clickbait Detector: An AI-Powered Headline Classification System

Clickbait Detector is a machine learning project that classifies news headlines as "Clickbait" or "Non-Clickbait" using natural language processing (NLP). It combines web scraping, a transformer-based NLP model, real-time data fetching, data visualization, and an interactive web application in a single pipeline. It was developed as a portfolio piece to demonstrate AI, data science, and full-stack development skills.

Project Overview

The Clickbait Detector uses a fine-tuned DistilBERT model to distinguish sensationalist clickbait from legitimate news headlines. The system chains together tools for data collection, model training, real-time analysis, and user interaction. Key features include:

Web Scraping: Dynamically extracts headlines from news websites using Selenium and BeautifulSoup.
Real-Time Data Fetching: Retrieves trending headlines via the News API for up-to-date analysis.
NLP Model: Employs a DistilBERT transformer model for precise headline classification.
Interactive Web App: Built with Streamlit, allowing users to predict headlines, correct labels, and visualize insights.
Data Visualization: Generates word clouds and label distribution charts using Plotly and WordCloud.
Dataset Management: Combines scraped, corrected, and original data into a unified dataset with deduplication.
Demo on YouTube: Link

Features

Headline Classification: Predicts whether a headline is clickbait or non-clickbait with confidence scores using a fine-tuned DistilBERT model.
Dynamic Web Scraping: Collects headlines from sites like BBC, Reuters, and BuzzFeed, with robust error handling and site-specific selectors.
Trending News Analysis: Fetches and classifies trending headlines in real-time using the News API.
User Interaction: Allows users to input custom headlines, correct model predictions, and save corrections via a Streamlit app.
Exploratory Data Analysis (EDA): Visualizes dataset insights with interactive bar charts and word clouds.
Automated Workflow: Orchestrates scraping, dataset updates, model training, and app launch through a single script.
Robust Error Handling: Includes comprehensive logging to ensure reliability across components.
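
The confidence score mentioned under Headline Classification is commonly the softmax probability of the predicted class. The sketch below shows that calculation on a pair of raw model outputs (logits). It is an illustration only: the label order in `LABELS` and the helper names are assumptions, not taken from the project's code.

```python
import math

# Assumed order of the two output logits; the trained model's own
# id-to-label mapping may differ.
LABELS = ("Non-Clickbait", "Clickbait")


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return (label, confidence) for one headline's pair of logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


if __name__ == "__main__":
    label, confidence = label_with_confidence([-1.2, 2.3])
    print(f"{label} ({confidence:.1%})")
```

With a real model, the logits would come from the classifier's forward pass; everything after that step stays the same.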

Technology Stack

The Clickbait Detector leverages Python-based ML/NLP (DistilBERT, PyTorch, scikit-learn), web scraping tools (Selenium, BeautifulSoup, requests), and data processing libraries (pandas, NumPy). It provides an interactive Streamlit app with visualizations (Plotly, WordCloud), integrates with the News API, and uses supporting tools for logging, workflow, and system setup.

Project Structure

The project is organized into modular scripts, each handling a specific component of the pipeline:

main.py: Orchestrates the entire workflow, from scraping to launching the Streamlit app.
scrape_headlines.py: Scrapes headlines from news websites using Selenium and BeautifulSoup.
update_dataset.py: Combines scraped and corrected headlines into the main dataset, ensuring no duplicates.
clickbait_detector.py: Trains and deploys the DistilBERT model for headline classification.
fetch_trending.py: Fetches trending headlines via News API and predicts their labels.
eda.py: Generates word clouds and label distribution charts for dataset insights.
app.py: Runs the Streamlit app for user interaction and visualization.

Data Files:

headlines_dataset.csv: Main dataset containing headlines and labels.
scraped_headlines.csv: Temporary storage for scraped headlines.
corrected_headlines.csv: Stores user-corrected labels from the Streamlit app.
trending_headlines.csv: Contains fetched trending headlines with predicted labels.
clickbait_wordcloud.png & non_clickbait_wordcloud.png: Visualizations of headline content.
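
The way these CSV files feed into one deduplicated dataset can be sketched as follows. This is a standard-library illustration, not the project's update_dataset.py (which may well use pandas). It assumes all three files share the headline,label columns, and that user corrections should override earlier labels for the same headline. The function names and the normalization rule are hypothetical.

```python
import csv
from pathlib import Path


def normalize(headline):
    """Key for duplicate detection: ignores case and extra whitespace."""
    return " ".join(headline.lower().split())


def merge_rows(*sources):
    """Merge lists of {'headline', 'label'} rows; later sources win on duplicates."""
    merged = {}
    for rows in sources:
        for row in rows:
            headline = (row.get("headline") or "").strip()
            if headline:
                merged[normalize(headline)] = {
                    "headline": headline,
                    "label": row.get("label") or "",
                }
    return list(merged.values())


def read_rows(path):
    """Read a headline,label CSV; a missing file counts as empty."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def write_rows(path, rows):
    """Write rows back out with a headline,label header."""
    with Path(path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["headline", "label"])
        writer.writeheader()
        writer.writerows(rows)


def update_dataset(main="headlines_dataset.csv",
                   scraped="scraped_headlines.csv",
                   corrected="corrected_headlines.csv"):
    """Fold scraped and corrected rows into the main file; return the row count."""
    # Corrections are read last so that they override earlier labels.
    rows = merge_rows(read_rows(main), read_rows(scraped), read_rows(corrected))
    write_rows(main, rows)
    return len(rows)
```

Reading the corrected file last is a design choice: a dictionary keyed on the normalized headline keeps one row per headline, and the most recent source decides its label.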

Installation

To run the Clickbait Detector locally, follow these steps:

Clone the Repository:

git clone https://github.com/chaw-thiri/clickbait-detector.git
cd clickbait-detector

Install Dependencies: Ensure Python 3.8+ is installed, then install required packages:

pip install -r requirements.txt

Or manually install:

pip install streamlit pandas numpy scikit-learn torch transformers requests beautifulsoup4 selenium webdriver_manager wordcloud matplotlib plotly

Prepare the Dataset: Ensure headlines_dataset.csv exists in the project root with at least some initial data (format: headline,label).
Obtain a News API Key:
- Sign up at News API to get an API key.
- Update main.py and fetch_trending.py with your API key, or leave it blank to fall back to the local trending_headlines.csv.
Run the Project: Execute the main workflow script:
```
python main.py
```
This will:
- Scrape headlines from news websites.
- Update the dataset.
- Train the model (if needed).
- Fetch trending headlines.
- Generate EDA visuals.
- Launch the Streamlit app (accessible at http://localhost:8501).
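
The sequence above can be sketched as a small runner that calls each script in order and stops at the first failure. This is a hedged illustration rather than the actual contents of main.py: `PIPELINE` and `run_pipeline` are hypothetical names, and the "if needed" check for training is assumed to live inside clickbait_detector.py itself.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

# Script names come from the Project Structure section; the order
# follows the list of steps above.
PIPELINE = [
    [sys.executable, "scrape_headlines.py"],
    [sys.executable, "update_dataset.py"],
    [sys.executable, "clickbait_detector.py"],
    [sys.executable, "fetch_trending.py"],
    [sys.executable, "eda.py"],
    # Blocks until the Streamlit app is stopped.
    [sys.executable, "-m", "streamlit", "run", "app.py"],
]


def run_pipeline(steps=PIPELINE, runner=subprocess.run):
    """Run each step in order.

    Returns -1 if every step succeeds, otherwise the index of the
    first step that failed.
    """
    for index, command in enumerate(steps):
        log.info("Running: %s", " ".join(command))
        try:
            runner(command, check=True)
        except (subprocess.CalledProcessError, OSError) as exc:
            log.error("Step failed: %s", exc)
            return index
    return -1
```

Calling `run_pipeline()` from the project root would execute the full workflow. The `runner` parameter exists so the sequence can be exercised in tests without launching any real process.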

Usage

Streamlit App

The Streamlit app provides an intuitive interface for interacting with the Clickbait Detector:

Predict a Headline: Enter a headline to get a clickbait/non-clickbait prediction with a confidence score.
Fetch Trending Headlines: Input a News API key to retrieve and classify trending news, or use the local trending_headlines.csv.
Correct Predictions: Adjust model predictions and save corrections to corrected_headlines.csv.
View Dataset Insights: Explore label distribution charts and word clouds for clickbait and non-clickbait headlines.
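
The Fetch Trending Headlines behavior (use the API when a key is supplied, otherwise read the local file) can be sketched as below. The request goes to News API's top-headlines endpoint through the standard library. The function names, the `country` and `pageSize` defaults, and the `headline` column name in trending_headlines.csv are assumptions made for illustration, not details confirmed by the project.

```python
import csv
import json
import urllib.parse
import urllib.request

TOP_HEADLINES_URL = "https://newsapi.org/v2/top-headlines"


def build_url(api_key, country="us", page_size=20):
    """Build the request URL for the top-headlines endpoint."""
    query = urllib.parse.urlencode(
        {"country": country, "pageSize": page_size, "apiKey": api_key}
    )
    return f"{TOP_HEADLINES_URL}?{query}"


def fetch_from_api(api_key):
    """Return headline titles from the API response (makes a network call)."""
    with urllib.request.urlopen(build_url(api_key), timeout=10) as response:
        payload = json.load(response)
    return [a["title"] for a in payload.get("articles", []) if a.get("title")]


def read_local(path="trending_headlines.csv"):
    """Read headlines from the local CSV; the 'headline' column is an assumption."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["headline"] for row in csv.DictReader(f) if row.get("headline")]


def load_trending(api_key, path="trending_headlines.csv", fetch=fetch_from_api):
    """Use the API when a key is given; otherwise, or on failure, use the local file."""
    if api_key and api_key.strip():
        try:
            return fetch(api_key.strip())
        except (OSError, ValueError, KeyError):
            pass  # network, HTTP, or JSON error: fall through to the local file
    return read_local(path)
```

Each returned headline would then be passed to the classifier. Falling back to the CSV on any request error keeps the app usable offline or when the key is rejected.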

Command-Line Usage

Individual scripts can be run for specific tasks:

Scrape headlines: python scrape_headlines.py
Update dataset: python update_dataset.py
Train model: python clickbait_detector.py
Fetch trending headlines: python fetch_trending.py
Generate EDA visuals: python eda.py

