Welcome to the Clickbait Detector, a machine learning project that classifies news headlines as "Clickbait" or "Non-Clickbait" using natural language processing (NLP) techniques. The project integrates web scraping, a transformer-based NLP model, real-time data fetching, data visualization, and an interactive web application into a single pipeline. It was developed as a portfolio piece to demonstrate expertise in AI, data science, and full-stack development.
The Clickbait Detector leverages a fine-tuned DistilBERT model to analyze headlines with high accuracy, distinguishing sensationalist clickbait from legitimate news. The system integrates a pipeline of tools for data collection, model training, real-time analysis, and user interaction, making it a comprehensive showcase of modern AI and web technologies. Key features include:
- Web Scraping: Dynamically extracts headlines from news websites using Selenium and BeautifulSoup.
- Real-Time Data Fetching: Retrieves trending headlines via the News API for up-to-date analysis.
- NLP Model: Employs a DistilBERT transformer model for precise headline classification.
- Interactive Web App: Built with Streamlit, allowing users to predict headlines, correct labels, and visualize insights.
- Data Visualization: Generates word clouds and label distribution charts using Plotly and WordCloud.
- Dataset Management: Combines scraped, corrected, and original data into a unified dataset with deduplication.
- Demo on YouTube: Link
- Headline Classification: Predicts whether a headline is clickbait or non-clickbait with confidence scores using a fine-tuned DistilBERT model.
- Dynamic Web Scraping: Collects headlines from sites like BBC, Reuters, and BuzzFeed, with robust error handling and site-specific selectors.
- Trending News Analysis: Fetches and classifies trending headlines in real-time using the News API.
- User Interaction: Allows users to input custom headlines, correct model predictions, and save corrections via a Streamlit app.
- Exploratory Data Analysis (EDA): Visualizes dataset insights with interactive bar charts and word clouds.
- Automated Workflow: Orchestrates scraping, dataset updates, model training, and app launch through a single script.
- Robust Error Handling: Includes comprehensive logging to ensure reliability across components.
The Clickbait Detector leverages Python-based ML/NLP tools (DistilBERT, PyTorch, scikit-learn), web scraping tools (Selenium, BeautifulSoup, requests), and data processing libraries (pandas, NumPy). It provides an interactive Streamlit app with visualizations (Plotly, WordCloud), integrates with the News API, and uses supporting tools for logging, workflow, and system setup.
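The confidence score mentioned in the feature list is typically obtained by applying softmax to the classifier's two output logits. The sketch below shows only that step in plain Python. The label order (index 0 for Non-Clickbait, index 1 for Clickbait) and the function name are assumptions for illustration; in the project the logits would come from the fine-tuned DistilBERT model rather than being hard-coded.

```python
import math

# Assumed label order; the project's actual mapping may differ.
LABELS = ("Non-Clickbait", "Clickbait")


def label_with_confidence(logits):
    """Turn raw classifier logits into (label, confidence) via softmax."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


if __name__ == "__main__":
    # Hypothetical logits standing in for the model's output on one headline.
    label, confidence = label_with_confidence([-1.2, 2.3])
    print(f"{label} ({confidence:.1%})")
```

The confidence here is simply the softmax probability of the winning class, so it always lies between 0.5 and 1.0 for a two-class model.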
The project is organized into modular scripts, each handling a specific component of the pipeline:
- main.py: Orchestrates the entire workflow, from scraping to launching the Streamlit app.
- scrape_headlines.py: Scrapes headlines from news websites using Selenium and BeautifulSoup.
- update_dataset.py: Combines scraped and corrected headlines into the main dataset, ensuring no duplicates.
- clickbait_detector.py: Trains and deploys the DistilBERT model for headline classification.
- fetch_trending.py: Fetches trending headlines via News API and predicts their labels.
- eda.py: Generates word clouds and label distribution charts for dataset insights.
- app.py: Runs the Streamlit app for user interaction and visualization.
Data Files:
- headlines_dataset.csv: Main dataset containing headlines and labels.
- scraped_headlines.csv: Temporary storage for scraped headlines.
- corrected_headlines.csv: Stores user-corrected labels from the Streamlit app.
- trending_headlines.csv: Contains fetched trending headlines with predicted labels.
- clickbait_wordcloud.png & non_clickbait_wordcloud.png: Visualizations of headline content.
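To illustrate how the original, scraped, and corrected headlines might be combined into headlines_dataset.csv without duplicates, here is a minimal sketch using only the standard library. It assumes every file has headline and label columns, that duplicates are detected on lowercased, whitespace-normalized headline text, and that later sources (such as user corrections) override earlier ones. The function names are hypothetical, and update_dataset.py may apply different rules.

```python
import csv
import io


def normalize(headline):
    """Key used for duplicate detection: lowercase, collapsed whitespace."""
    return " ".join(headline.lower().split())


def merge_headlines(*sources):
    """Merge iterables of {'headline', 'label'} rows; later sources win."""
    merged = {}
    for rows in sources:
        for row in rows:
            text = row["headline"].strip()
            if not text:
                continue  # skip rows with an empty headline
            merged[normalize(text)] = {"headline": text, "label": row["label"]}
    return list(merged.values())


def read_rows(csv_text):
    """Parse CSV text with a 'headline,label' header into row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the CSV files; the rows are made up.
    original = read_rows("headline,label\nMarkets close higher,0\n")
    scraped = read_rows("headline,label\nYou won't believe this trick,1\n")
    corrected = read_rows("headline,label\nYou won't believe this trick,0\n")
    for row in merge_headlines(original, scraped, corrected):
        print(row["headline"], row["label"])
```

Passing the corrected rows last means a user correction replaces the label that was scraped or predicted for the same headline.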
To run the Clickbait Detector locally, follow these steps:
- Clone the Repository:
  git clone https://github.com/chaw-thiri/clickbait-detector.git
  cd clickbait-detector
- Install Dependencies: Ensure Python 3.8+ is installed, then install the required packages:
  pip install -r requirements.txt
  Or install them manually:
  pip install streamlit pandas torch transformers requests beautifulsoup4 selenium webdriver_manager wordcloud matplotlib plotly
- Prepare the Dataset: Ensure headlines_dataset.csv exists in the project root with at least some initial data (format: headline,label).
- Obtain a News API Key:
  - Sign up at News API to get an API key.
  - Update main.py and fetch_trending.py with your API key, or leave it blank to use the local fallback.
- Run the Project: Execute the main workflow script:
  python main.py
  This will:
  - Scrape headlines from news websites.
  - Update the dataset.
  - Train the model (if needed).
  - Fetch trending headlines.
  - Generate EDA visuals.
  - Launch the Streamlit app (accessible at http://localhost:8501).
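The sequence above can be pictured as a small runner that invokes each script in order and logs the outcome. This is a sketch under assumptions, not the contents of main.py: the step list mirrors the script names given earlier, stopping at the first failure is an illustrative choice, the "train only if needed" check is omitted, and the `runner` parameter exists only so the logic can be exercised without launching anything.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("pipeline")

# Assumed order, mirroring the steps listed above.
STEPS = [
    [sys.executable, "scrape_headlines.py"],
    [sys.executable, "update_dataset.py"],
    [sys.executable, "clickbait_detector.py"],
    [sys.executable, "fetch_trending.py"],
    [sys.executable, "eda.py"],
    [sys.executable, "-m", "streamlit", "run", "app.py"],
]


def run_command(command):
    """Run one command and return its exit code."""
    return subprocess.run(command).returncode


def run_pipeline(steps=STEPS, runner=run_command):
    """Run steps in order; stop at the first non-zero exit code."""
    completed = []
    for command in steps:
        log.info("running: %s", " ".join(command))
        code = runner(command)
        if code != 0:
            log.error("step failed (exit code %d): %s", code, " ".join(command))
            break
        completed.append(command)
    return completed
```

Calling `run_pipeline()` from the project root would execute the real scripts; passing a stub as `runner` makes it possible to check the control flow in isolation.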
The Streamlit app provides an intuitive interface for interacting with the Clickbait Detector:
- Predict a Headline: Enter a headline to get a clickbait/non-clickbait prediction with a confidence score.
- Fetch Trending Headlines: Input a News API key to retrieve and classify trending news, or use the local trending_headlines.csv.
- Correct Predictions: Adjust model predictions and save corrections to corrected_headlines.csv.
- View Dataset Insights: Explore label distribution charts and word clouds for clickbait and non-clickbait headlines.
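The choice between a live News API call and the local trending_headlines.csv could be handled as sketched below. The endpoint and the `articles`/`title` response fields follow News API's top-headlines format. The function names, the `country=us` parameter, the 10-second timeout, and the assumption that the CSV has a headline column are illustrative and not taken from fetch_trending.py.

```python
import csv
import json
import urllib.parse
import urllib.request

NEWS_API_URL = "https://newsapi.org/v2/top-headlines"


def extract_titles(payload):
    """Pull non-empty article titles out of a decoded News API response."""
    articles = payload.get("articles") or []
    return [a["title"].strip() for a in articles if a.get("title")]


def fetch_from_api(api_key, country="us"):
    """Request top headlines; raises on network or HTTP errors."""
    query = urllib.parse.urlencode({"country": country, "apiKey": api_key})
    with urllib.request.urlopen(f"{NEWS_API_URL}?{query}", timeout=10) as resp:
        return extract_titles(json.load(resp))


def load_local(path="trending_headlines.csv"):
    """Read headlines from the local CSV (assumed 'headline' column)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row["headline"] for row in csv.DictReader(handle)]


def get_trending(api_key="", fetch=fetch_from_api, fallback=load_local):
    """Use the API when a key is given; otherwise, or on failure, go local."""
    if api_key:
        try:
            return fetch(api_key)
        except (OSError, ValueError):
            pass  # network/HTTP error or malformed JSON: fall back to local data
    return fallback()
```

Keeping `fetch` and `fallback` as parameters lets the fallback behavior be tested without network access or a real API key.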
Individual scripts can be run for specific tasks:
- Scrape headlines: python scrape_headlines.py
- Update dataset: python update_dataset.py
- Train model: python clickbait_detector.py
- Fetch trending headlines: python fetch_trending.py
- Generate EDA visuals: python eda.py
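As a rough picture of the numbers behind the label distribution chart and the word clouds, the sketch below counts labels and the most frequent words per label using the standard library. It stops short of plotting (the project uses Plotly and WordCloud for that), and the sample rows, the 0/1 label encoding, the function names, and the tiny stop-word list are all made-up assumptions.

```python
import re
from collections import Counter

# Deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "and", "this", "you"}


def label_distribution(rows):
    """Count how many headlines carry each label."""
    return Counter(label for _, label in rows)


def top_words(rows, label, n=3):
    """Most common non-stop-words among headlines with the given label."""
    counts = Counter()
    for headline, row_label in rows:
        if row_label != label:
            continue
        words = re.findall(r"[a-z']+", headline.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up (headline, label) pairs; 1 = clickbait, 0 = non-clickbait.
    sample = [
        ("You won't believe this trick", 1),
        ("This trick will change your life", 1),
        ("Central bank holds interest rates", 0),
    ]
    print(label_distribution(sample))
    print(top_words(sample, label=1, n=1))
```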