- a binary classifier (Logistic Regression + SVM) to mark emails as spam or not spam
Download the Apache Spark 2.4.6 distribution pre-built for Apache Hadoop 2.7.
- unpack the archive
- set the `$SPARK_HOME` environment variable from inside the unpacked directory: `export SPARK_HOME=$(pwd)`
- navigate to `PyCharm → Preferences ... → Project spark-demo → Project Structure → Add Content Root` in the main menu
- select all `.zip` files from `$SPARK_HOME/python/lib`
- click apply and save changes
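
Outside PyCharm, the effect of adding those archives as content roots can be approximated by putting them on `sys.path` before `pyspark` is imported. The following is a minimal sketch and not part of the original project: the helper names `spark_lib_zips` and `add_spark_to_path` are hypothetical, and it assumes the argument is the directory that `$SPARK_HOME` points to.

```python
import glob
import os
import sys


def spark_lib_zips(spark_home):
    """Return the sorted .zip archives found under <spark_home>/python/lib."""
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    return sorted(glob.glob(pattern))


def add_spark_to_path(spark_home):
    """Prepend Spark's Python sources and bundled archives to sys.path.

    Hypothetical helper: it mirrors the PyCharm content-root step, it is not
    something the Spark distribution provides.
    """
    entries = [os.path.join(spark_home, "python")] + spark_lib_zips(spark_home)
    # Insert in reverse so the final order on sys.path matches `entries`.
    for entry in reversed(entries):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    return entries
```

A script could call `add_spark_to_path(os.environ["SPARK_HOME"])` once at startup, before its first `import pyspark`.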
- navigate to `Run → Edit Configurations → + → Python` in the main menu
- select `email_spam_filter.py` for `Script`
- name it `email_spam_filter`
- set the environment variables `PYSPARK_PYTHON=python3`, `PYTHONPATH=$SPARK_HOME/python`, and `PYTHONUNBUFFERED=1`
- the training data `nospam_training.txt`, `spam_training.txt`, as well as the testing data `nospam_testing.txt`, `spam_testing.txt` need to be under `../spam-datasets/*.txt` relative to the script path
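
A quick pre-flight check can catch a misplaced dataset before Spark starts. The sketch below is an illustration only: `missing_datasets` is a hypothetical helper, it assumes the last file is named `spam_testing.txt`, and it reads "relative to the script path" as "one level above the directory containing the script".

```python
from pathlib import Path

# File names taken from the step above; the directory layout is an assumption.
EXPECTED_FILES = (
    "nospam_training.txt",
    "spam_training.txt",
    "nospam_testing.txt",
    "spam_testing.txt",
)


def missing_datasets(script_path, names=EXPECTED_FILES):
    """Return the expected dataset files that are absent from ../spam-datasets."""
    # Resolve ../spam-datasets against the directory that holds the script.
    data_dir = Path(script_path).resolve().parent.parent / "spam-datasets"
    return [name for name in names if not (data_dir / name).is_file()]
```

Calling `missing_datasets(__file__)` at the top of the script returns an empty list when all four files are in place.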
- click `Run → Run 'email_spam_filter'` in the main menu
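
For orientation, here is a rough sketch of what a Logistic Regression + SVM spam classifier can look like with the DataFrame-based `pyspark.ml` API. It is not the contents of `email_spam_filter.py`, which may well use a different API or feature pipeline. The function names, the one-email-per-line assumption, the hashed bag-of-words features, and the hyperparameters are all illustrative choices.

```python
def label_lines(lines, label):
    """Pair each non-empty line (assumed to be one email) with a numeric label."""
    return [(line.strip(), float(label)) for line in lines if line.strip()]


def train_and_score(spark, train_rows, test_rows):
    """Fit Logistic Regression and a linear SVM; return test-set AUC per model.

    `spark` is an existing SparkSession; rows are (text, label) tuples such as
    those produced by label_lines.
    """
    # Imported here so label_lines stays usable without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LinearSVC, LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import HashingTF, Tokenizer

    train = spark.createDataFrame(train_rows, ["text", "label"])
    test = spark.createDataFrame(test_rows, ["text", "label"])

    # Assumed feature pipeline: whitespace tokens hashed into a fixed-size vector.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing = HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 12)
    # The evaluator reports area under the ROC curve by default.
    evaluator = BinaryClassificationEvaluator(labelCol="label")

    classifiers = [
        ("logistic_regression", LogisticRegression(maxIter=20)),
        ("linear_svm", LinearSVC(maxIter=20)),
    ]
    scores = {}
    for name, clf in classifiers:
        model = Pipeline(stages=[tokenizer, hashing, clf]).fit(train)
        scores[name] = evaluator.evaluate(model.transform(test))
    return scores
```

Spam lines would be labeled `1` and non-spam lines `0` with `label_lines`, then concatenated into the training and testing row lists passed to `train_and_score`.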