Project for CS5362 Data Mining, Spring 2019
Group Members:
- Mahdokht Afravi
- Jonathan Avila
- Cristian Ayub
- Gerardo Cervantes
We used the publicly available FakeNewsCorpus dataset, which consists of articles labelled as one of 11 types. We used articles labelled as 'fake', 'reliable', and 'conspiracy' to extract features. We convert the articles to numerical data using a sparse matrix representation of term frequencies. With this feature set, we apply clustering algorithms to the dataset and are able to find useful information about the data. We also train a linear regression model to predict whether an article is 'fake' or 'reliable', with modest results. Our project site, hosted on GitHub Pages, contains the source code, a PDF copy of the presentation slides, and a PDF copy of this report.
How to Run
This section describes how to run each script in a Python environment set up with the prerequisites listed below.
`data_filter.py` reads `news_cleaned_2018_02_13.csv` and writes rows matching the article types supplied with `-article_types`. For a complete list of article types (tags), see the Fake News Corpus repository on GitHub.
For example, to write 'fake' articles and 'reliable' articles into `fake.csv` and `reliable.csv` respectively:

```
data_filter.py -article_types fake reliable
```
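The filtering step itself is simple. Below is a minimal sketch of the idea, assuming the corpus CSV has a `type` column and using pandas to stream the large file in chunks; the actual `data_filter.py` may differ in its details.

```python
import pandas as pd

article_types = ['fake', 'reliable']
first_chunk = True

# Stream the large corpus file in chunks to keep memory use low.
for chunk in pd.read_csv('news_cleaned_2018_02_13.csv', chunksize=10000):
    for t in article_types:
        # Keep only rows whose 'type' column matches the requested tag
        # and append them to the per-type output file.
        chunk[chunk['type'] == t].to_csv(
            f'{t}.csv',
            mode='w' if first_chunk else 'a',
            header=first_chunk,
            index=False,
        )
    first_chunk = False
```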
`data_preprocessing.py` creates a sparse matrix of document word frequencies. The default vocabulary size is 40,000. For example:

```
data_preprocessing.py -filename="fake.csv" -article_limit=1000 -vocabulary_size=20000
```
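To illustrate the representation, here is a minimal sketch using scikit-learn's `CountVectorizer`, assuming the article text is in a `content` column; `max_features` plays the role of `-vocabulary_size`. The actual script may build the matrix differently (for example, it uses NLTK stopwords rather than scikit-learn's built-in English list used here).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load a limited number of articles, mirroring -article_limit.
articles = pd.read_csv('fake.csv', nrows=1000)

# Build a sparse document-by-term matrix of raw term frequencies,
# keeping only the 20,000 most frequent words (-vocabulary_size).
vectorizer = CountVectorizer(max_features=20000, stop_words='english')
term_freq = vectorizer.fit_transform(articles['content'].astype(str))

print(term_freq.shape)  # (number of articles, vocabulary size)
```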
The DBSCAN script runs DBSCAN and calculates each outlier's distance to the nearest cluster as a measure of cluster fit. It reports on the clusters found and on noise articles. The file names are hardcoded to a specific directory (`D:\dm_dataset\`) where all files must be located in order to run this script.
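A rough sketch of this step, assuming the `term_freq` matrix from the preprocessing sketch above (the `eps` and `min_samples` values here are illustrative, not the script's):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances_argmin_min

# Cluster the articles; DBSCAN labels noise (outlier) articles as -1.
clustering = DBSCAN(eps=3.0, min_samples=5).fit(term_freq)
labels = clustering.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f'{n_clusters} clusters, {n_noise} noise articles')

# Cluster fit: distance from each outlier to the nearest core sample.
outliers = term_freq[labels == -1]
_, distances = pairwise_distances_argmin_min(outliers, clustering.components_)
print('mean outlier distance to nearest cluster:', distances.mean())
```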
The k-means script runs k-means and reports random articles from each cluster for manual analysis. The file names are hardcoded to a specific directory (`D:\dm_dataset\`) where all files must be located in order to run this script. The value of k is also hardcoded.
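A minimal sketch of this step, again assuming `term_freq` and an illustrative value of k:

```python
import numpy as np
from sklearn.cluster import KMeans

K = 3  # hardcoded in the script; 3 here is only illustrative
kmeans = KMeans(n_clusters=K, random_state=0).fit(term_freq)

rng = np.random.default_rng(0)
for cluster in range(K):
    members = np.where(kmeans.labels_ == cluster)[0]
    # Sample a few random articles from each cluster for manual review.
    sample = rng.choice(members, size=min(5, len(members)), replace=False)
    print(f'cluster {cluster}: article indices {sample.tolist()}')
```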
The linear regression script trains a linear regression model and predicts whether an article is fake or reliable. The file names are hardcoded to a specific directory (`D:\dm_dataset\`) where all files must be located in order to run this script.
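A minimal sketch of the approach, assuming `term_freq` and a parallel array `article_labels` of 'fake'/'reliable' strings; thresholding the continuous output at 0.5 is an assumption here, and the script's actual decision rule may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# article_labels: assumed array of 'fake'/'reliable' strings,
# parallel to the rows of term_freq.
y = (np.asarray(article_labels) == 'reliable').astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    term_freq, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Threshold the continuous prediction to get a class label.
pred = (model.predict(X_test) >= 0.5).astype(int)
print('accuracy:', np.mean(pred == y_test))
```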
The naive Bayes script runs naive Bayes. This script is a draft and remains incomplete for this project.
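For reference, a completed version would likely resemble scikit-learn's multinomial naive Bayes over the same term-frequency features; a minimal sketch, reusing the assumed `term_freq` and `y` from the linear regression sketch above:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(
    term_freq, y, test_size=0.2, random_state=0)

# Multinomial naive Bayes works directly on term-frequency counts.
nb = MultinomialNB().fit(X_train, y_train)
print('accuracy:', nb.score(X_test, y_test))
```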
Prerequisites

- scipy installed
- nltk installed, with the following resources downloaded:

```python
nltk.download('stopwords')
nltk.download('punkt')
```
Resources
Fake News Corpus is available on GitHub.
Downloadables
Visit the Releases page of the project on GitHub to download a ZIP of the source code, the report (as a PDF), and the presentation slides (as a PDF).