I recently completed a course on PySpark, and this notebook is my first attempt at working with it and learning how it can be used for EDA and machine learning.
PySpark is an interface for Apache Spark in Python that allows you to write Spark applications using Python APIs and is helpful for working with real-time and large-scale data.
This project is based on the Titanic dataset from the Titanic ML challenge on Kaggle. The task is to build a machine learning model that predicts whether a passenger survived, based on data such as socio-economic class, age, and gender.
The evaluation metric for this model is the accuracy score, i.e., the percentage of passengers whose outcome is predicted correctly.
This is a binary classification problem, and the classes used for predictions are 1 for survived and 0 for deceased.
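To make the metric concrete, here is a minimal sketch of how accuracy can be computed for 0/1 labels. It uses plain Python rather than PySpark, and the function name `accuracy_score` and the label lists are hypothetical values chosen only for illustration; they are not taken from the Titanic data or from the notebook.

```python
def accuracy_score(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    # Count positions where the predicted class equals the actual class.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = survived, 0 = deceased.
actual = [1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0]

acc = accuracy_score(actual, predicted)
print(f"Accuracy: {acc:.2%}")  # Accuracy: 80.00%
```

Four of the five illustrative predictions match, so the score is 0.8. In PySpark, the same quantity can be obtained with `MulticlassClassificationEvaluator(metricName="accuracy")`.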
I used PySpark for exploratory data analysis, data cleansing, and building three models: logistic regression, a random forest classifier, and a gradient-boosted tree classifier (GBTClassifier).
- PySpark
This notebook is also available on Kaggle.
Luís Fernando Torres