Project for CS5362 Data Mining, Spring 2019
Group Members:
- Mahdokht Afravi
- Jonathan Avila
- Cristian Ayub
- Gerardo Cervantes
We used the publicly available FakeNewsCorpus dataset, which consists of articles labelled as one of 11 types. We used articles labelled as 'fake', 'reliable', and 'conspiracy' to extract features. We convert the articles to numerical data using a sparse matrix representation of term frequencies. With this feature set, we apply clustering algorithms to the dataset and are able to find useful information about the data. We also train a linear regression model to predict whether an article is 'fake' or 'reliable', with modest results. Our project site, hosted on GitHub Pages, contains the source code, a PDF copy of the presentation slides, and a PDF copy of this report.
How to Run
This section describes how to run each script in a Python environment set up with the prerequisites listed below.
`data_filter.py` reads `news_cleaned_2018_02_13.csv` and writes rows matching the article types supplied with `-article_types`. For a complete list of article types (tags), see the Fake News Corpus repository on GitHub.
For example, to write 'fake' articles and 'reliable' articles into `fake.csv` and `reliable.csv` respectively:

```
data_filter.py -article_types fake reliable
```
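The filtering step itself is simple. Below is a minimal sketch of the idea, assuming the corpus CSV has a `type` column and using pandas to stream the large file in chunks; the actual `data_filter.py` may differ in its details.

```python
import pandas as pd

article_types = ['fake', 'reliable']
first_chunk = True

# Stream the large corpus file in chunks to keep memory use low.
for chunk in pd.read_csv('news_cleaned_2018_02_13.csv', chunksize=10000):
    for t in article_types:
        # Keep only rows whose 'type' column matches the requested tag
        # and append them to the per-type output file.
        chunk[chunk['type'] == t].to_csv(
            f'{t}.csv',
            mode='w' if first_chunk else 'a',
            header=first_chunk,
            index=False,
        )
    first_chunk = False
```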
`data_preprocessing.py` creates a sparse matrix of document word frequencies. The default vocabulary size is 40,000. For example:

```
data_preprocessing.py -filename="fake.csv" -article_limit=1000 -vocabulary_size=20000
```
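To illustrate the representation, here is a minimal sketch using scikit-learn's `CountVectorizer`, assuming the article text is in a `content` column; `max_features` plays the role of `-vocabulary_size`. The actual script may build the matrix differently (for example, it uses NLTK stopwords rather than scikit-learn's built-in English list used here).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load a limited number of articles, mirroring -article_limit.
articles = pd.read_csv('fake.csv', nrows=1000)

# Build a sparse document-by-term matrix of raw term frequencies,
# keeping only the 20,000 most frequent words (-vocabulary_size).
vectorizer = CountVectorizer(max_features=20000, stop_words='english')
term_freq = vectorizer.fit_transform(articles['content'].astype(str))

print(term_freq.shape)  # (number of articles, vocabulary size)
```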
The DBSCAN script runs DBSCAN and calculates each outlier's distance to the nearest cluster as a measure of cluster fit. It reports on the clusters found and on noise articles. The file names are hardcoded to a specific directory (`D:\dm_dataset\`) where all files must be located in order to run this script.
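A rough sketch of this step, assuming the `term_freq` matrix from the preprocessing sketch above (the `eps` and `min_samples` values here are illustrative, not the script's):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances_argmin_min

# Cluster the articles; DBSCAN labels noise (outlier) articles as -1.
clustering = DBSCAN(eps=3.0, min_samples=5).fit(term_freq)
labels = clustering.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f'{n_clusters} clusters, {n_noise} noise articles')

# Cluster fit: distance from each outlier to the nearest core sample.
outliers = term_freq[labels == -1]
_, distances = pairwise_distances_argmin_min(outliers, clustering.components_)
print('mean outlier distance to nearest cluster:', distances.mean())
```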
The k-means script runs k-means and reports random articles from each cluster for manual analysis. The file names are hardcoded to a specific directory (`D:\dm_dataset\`) where all files must be located in order to run this script. The value of k is also hardcoded.
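A minimal sketch of this step, again assuming `term_freq` and an illustrative value of k:

```python
import numpy as np
from sklearn.cluster import KMeans

K = 3  # hardcoded in the script; 3 here is only illustrative
kmeans = KMeans(n_clusters=K, random_state=0).fit(term_freq)

rng = np.random.default_rng(0)
for cluster in range(K):
    members = np.where(kmeans.labels_ == cluster)[0]
    # Sample a few random articles from each cluster for manual review.
    sample = rng.choice(members, size=min(5, len(members)), replace=False)
    print(f'cluster {cluster}: article indices {sample.tolist()}')
```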
The linear regression script trains a linear regression model and predicts whether an article is fake or reliable. The file names are hardcoded to a specific directory (`D:\dm_dataset\`) where all files must be located in order to run this script.
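A minimal sketch of the approach, assuming `term_freq` and a parallel array `article_labels` of 'fake'/'reliable' strings; thresholding the continuous output at 0.5 is an assumption here, and the script's actual decision rule may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# article_labels: assumed array of 'fake'/'reliable' strings,
# parallel to the rows of term_freq.
y = (np.asarray(article_labels) == 'reliable').astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    term_freq, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Threshold the continuous prediction to get a class label.
pred = (model.predict(X_test) >= 0.5).astype(int)
print('accuracy:', np.mean(pred == y_test))
```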
The naive Bayes script runs naive Bayes. This script is a draft and remains incomplete for this project.
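For reference, a completed version would likely resemble scikit-learn's multinomial naive Bayes over the same term-frequency features; a minimal sketch, reusing the assumed `term_freq` and `y` from the linear regression sketch above:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(
    term_freq, y, test_size=0.2, random_state=0)

# Multinomial naive Bayes works directly on term-frequency counts.
nb = MultinomialNB().fit(X_train, y_train)
print('accuracy:', nb.score(X_test, y_test))
```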
Prerequisites

- scipy installed
- nltk installed, with the following resources downloaded:

```python
nltk.download('stopwords')
nltk.download('punkt')
```
Resources
Fake News Corpus is available on GitHub.
Downloadables
Visit the Releases page of the project on GitHub to download a ZIP of the source code, the report (as a PDF), and the presentation slides (as a PDF).