suvasishm/spark-ml-model-kickstart

Spark ML Model Kickstart

Initial Setup

  1. PySpark In Jupyter
  2. Install NumPy: $ pip3 install numpy
!! Exception: Java gateway process exited before sending its port number !!

If the above error occurs, make sure $JAVA_HOME is set, either globally or for the project.
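
A quick way to check this from a notebook cell before starting Spark is sketched below. It uses only the Python standard library; the helper name `java_home_problem` is made up for illustration and is not part of PySpark.

```python
import os
from pathlib import Path


def java_home_problem(env=os.environ):
    """Return a short description of what is wrong with JAVA_HOME, or None."""
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    # A usable JDK/JRE has a java executable under <JAVA_HOME>/bin.
    bin_dir = Path(java_home) / "bin"
    if not any((bin_dir / name).is_file() for name in ("java", "java.exe")):
        return f"no java executable found under {bin_dir}"
    return None


print(java_home_problem() or "JAVA_HOME looks usable")
```

If a problem is reported, one option is to set `os.environ["JAVA_HOME"]` to the local Java installation directory in the notebook before the Spark session is created.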

What I did here

titanic_survival_model

Build a model to predict survival on the Titanic using only the training data (data/training.csv). A subset of the training data is used to train the model, and the remaining data is used as test data.

The notebook contains the following steps:

  1. Load the training data
  2. Prepare the dataset for Spark ML library
  3. Split the dataset into training_data and test_data
  4. Build the model and fit it to the training dataset
  5. Run the model on the test dataset to get the predictions
  6. Finally, calculate the accuracy of the predictions
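
The six steps could look roughly like the sketch below. This is not the notebook's actual code: the column names assume the public Kaggle Titanic CSV layout, and the random forest classifier, the 80/20 split, and the function name `train_and_score` are assumptions chosen for illustration.

```python
# Column names follow the public Kaggle Titanic CSV (assumption); adjust them
# if data/training.csv differs.
LABEL_COL = "Survived"
NUMERIC_COLS = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
CATEGORICAL_COL = "Sex"


def train_and_score(spark, csv_path="data/training.csv", train_fraction=0.8, seed=42):
    """Return (fitted_model, accuracy) for a hold-out split of one CSV file."""
    # Imported here so this module can be loaded without starting a JVM.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # 1. Load the training data.
    raw = spark.read.csv(csv_path, header=True, inferSchema=True)

    # 2. Prepare the dataset: keep the columns used here and drop rows with
    #    missing values, because VectorAssembler rejects nulls by default.
    data = raw.select(LABEL_COL, CATEGORICAL_COL, *NUMERIC_COLS).dropna()

    # 3. Split into training and test subsets.
    training_data, test_data = data.randomSplit(
        [train_fraction, 1.0 - train_fraction], seed=seed
    )

    # 4. Build the model and fit it to the training subset.
    indexer = StringIndexer(inputCol=CATEGORICAL_COL, outputCol="SexIndex")
    assembler = VectorAssembler(
        inputCols=NUMERIC_COLS + ["SexIndex"], outputCol="features"
    )
    classifier = RandomForestClassifier(labelCol=LABEL_COL, featuresCol="features")
    model = Pipeline(stages=[indexer, assembler, classifier]).fit(training_data)

    # 5. Predict on the held-out subset.
    predictions = model.transform(test_data)

    # 6. Accuracy is the fraction of rows whose prediction equals the label.
    evaluator = MulticlassClassificationEvaluator(
        labelCol=LABEL_COL, predictionCol="prediction", metricName="accuracy"
    )
    return model, evaluator.evaluate(predictions)
```

In a notebook cell, something like `model, accuracy = train_and_score(SparkSession.builder.getOrCreate())` would run all six steps, with `SparkSession` imported from `pyspark.sql`.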

titanic_kaggle_prediction

This is similar to the model above. The only difference is that two separate datasets are used: the training data (data/training.csv) trains the model, and that model then predicts the survival of the passengers in the test data (data/test.csv).

The notebook contains the following steps:

  1. Load the training & test data
  2. Prepare both the datasets for Spark ML library
  3. Build the model and fit it to the training dataset
  4. Run the model on the test dataset to get the predictions
  5. Submit the predictions to Kaggle (score: 0.77033)
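
A minimal sketch of this flow follows, again under assumptions rather than as the notebook's actual code: the columns follow the public Kaggle Titanic layout, the classifier is a random forest, missing Age and Fare values are filled with Spark's `Imputer`, and the function name `predict_for_kaggle` and the output directory `submission` are hypothetical.

```python
def predict_for_kaggle(spark, train_path="data/training.csv",
                       test_path="data/test.csv", out_dir="submission"):
    """Train on one CSV, predict on another, and write a Kaggle-style CSV."""
    # Imported here so this module can be loaded without starting a JVM.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import Imputer, StringIndexer, VectorAssembler
    from pyspark.sql.functions import col

    # 1. Load the training and test data.
    train = spark.read.csv(train_path, header=True, inferSchema=True)
    test = spark.read.csv(test_path, header=True, inferSchema=True)

    # 2. Prepare both datasets with the same stages. Kaggle expects one
    #    prediction per test passenger, so missing Age/Fare values are
    #    imputed (mean by default) instead of dropping rows.
    imputer = Imputer(
        inputCols=["Age", "Fare"], outputCols=["AgeFilled", "FareFilled"]
    )
    indexer = StringIndexer(
        inputCol="Sex", outputCol="SexIndex", handleInvalid="keep"
    )
    assembler = VectorAssembler(
        inputCols=["Pclass", "SibSp", "Parch", "AgeFilled", "FareFilled", "SexIndex"],
        outputCol="features",
    )
    classifier = RandomForestClassifier(labelCol="Survived", featuresCol="features")

    # 3. Build the model and fit it to the full training dataset.
    model = Pipeline(stages=[imputer, indexer, assembler, classifier]).fit(train)

    # 4. Predict survival for the passengers in the test data.
    predictions = model.transform(test)

    # 5. Keep the two columns Kaggle asks for and write them out as CSV.
    submission = predictions.select(
        "PassengerId", col("prediction").cast("int").alias("Survived")
    )
    submission.coalesce(1).write.csv(out_dir, header=True, mode="overwrite")
    return submission
```

Spark writes a directory containing a single part file here, so the CSV to upload would be the `part-*.csv` file inside `out_dir`.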

Ref:

  1. https://www.kaggle.com/c/titanic/overview
  2. https://towardsdatascience.com/your-first-apache-spark-ml-model-d2bb82b599dd
