readme = """
This project is a machine learning-based text classification system that predicts whether a news article is real or fake.
I explored different models and preprocessing techniques, and selected Naive Bayes as the final model based on its performance on unseen test samples.
In an era of misinformation, it's important to develop tools that can help detect fake news articles. This project aims to classify news articles as either fake or real using natural language processing (NLP) and machine learning.
The dataset used is a combination of:
Fake.csvTrue.csv
Source: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset
Each news article is labeled:
0β Fake1β Real
- Python
- NLTK (for text preprocessing)
- Scikit-learn
- TfidfVectorizer
- Naive Bayes Classifier
- Matplotlib & Seaborn (for data visualization)
The following preprocessing steps were applied:
- Lowercasing text
- Removing punctuation and numbers
- Removing stopwords
- Stemming using
PorterStemmer - TF-IDF vectorization (top 5000 features)
I initially tested both:
- Logistic Regression
- Multinomial Naive Bayes (Final choice)
Although Logistic Regression gave good accuracy, Naive Bayes gave better generalization on real-world samples.
- Accuracy: ~98.8%
- Evaluation:
- Precision, Recall, F1-score
- Confusion Matrix
- Successfully tested on real and fake news examples
test_news = """In a press briefing held at the National Press Club, the finance minister
announced a 12% increase in education budget for the fiscal year 2025."""
print(predict_news(test_news))
# π¨ Real News- Clone the repo or download the
.ipynbnotebook - Install dependencies (NLTK, scikit-learn, pandas)
- Download stopwords via NLTK
- Run all cells in Jupyter/Colab
Rabeeha Kamran
Undergraduate | Machine Learning Enthusiast
β¨ Dreaming big, building small β¨