- PySpark In Jupyter
- Install NumPy:
$ pip3 install numpy
If the above error occurs, make sure $JAVA_HOME is set, either globally or for the project.
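One quick way to confirm the setting from inside a notebook is to inspect the environment before starting Spark. The snippet below is a minimal sketch; the helper name `check_java_home` is hypothetical and not part of this project.

```python
import os


def check_java_home(env=os.environ):
    """Return (ok, message) describing whether JAVA_HOME points to an existing directory."""
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return False, f"JAVA_HOME points to a missing directory: {java_home}"
    return True, f"JAVA_HOME = {java_home}"


# Run this in a cell before creating the SparkSession.
ok, message = check_java_home()
print(message)
```

If the check fails, one option for a per-project setting is to assign `os.environ["JAVA_HOME"]` to the local JDK directory in the first notebook cell, before any Spark session is created.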
Build a model to predict survival on the Titanic based on the training data (data/training.csv) only. A subset of the training data is used to train the model, and the remaining data is used as test data.
The notebook contains the following steps:
- Load the training data
- Prepare the dataset for the Spark ML library
- Split the dataset into training_data and test_data
- Build the model and fit it to the training dataset
- Run the model on the test dataset to get the predictions
- Finally, calculate the accuracy of the predictions
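The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the column names follow the standard Kaggle Titanic schema, `LogisticRegression` is a stand-in for whichever model the notebook uses, and the function names, the 80/20 split, and the seed are hypothetical.

```python
# Column names assume the standard Kaggle Titanic schema; adjust them to the actual CSV.
LABEL_COL = "Survived"
NUMERIC_COLS = ["Pclass", "Age", "SibSp", "Parch", "Fare"]


def build_pipeline():
    """Index the Sex column, assemble a feature vector, and attach a classifier."""
    # Imported here so the definitions load even where PySpark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    indexer = StringIndexer(inputCol="Sex", outputCol="SexIndex", handleInvalid="keep")
    assembler = VectorAssembler(inputCols=NUMERIC_COLS + ["SexIndex"], outputCol="features")
    # LogisticRegression is an illustrative choice, not necessarily the notebook's model.
    classifier = LogisticRegression(featuresCol="features", labelCol=LABEL_COL)
    return Pipeline(stages=[indexer, assembler, classifier])


def run_split_experiment(csv_path="data/training.csv", train_fraction=0.8, seed=42):
    """Train on one part of the training file and return accuracy on the rest."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("titanic-split").getOrCreate()

    # Load the training data.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Prepare the dataset: keep the needed columns and drop rows with missing values.
    df = df.select(LABEL_COL, "Sex", *NUMERIC_COLS).na.drop()

    # Split the dataset into training_data and test_data.
    training_data, test_data = df.randomSplit(
        [train_fraction, 1.0 - train_fraction], seed=seed
    )

    # Build the model and fit it to the training dataset.
    model = build_pipeline().fit(training_data)

    # Run the model on the test dataset to get the predictions.
    predictions = model.transform(test_data)

    # Calculate the accuracy of the predictions.
    evaluator = MulticlassClassificationEvaluator(
        labelCol=LABEL_COL, predictionCol="prediction", metricName="accuracy"
    )
    return evaluator.evaluate(predictions)
```

Calling `run_split_experiment()` from a notebook cell returns the accuracy as a float between 0 and 1. Dropping rows with missing values keeps the sketch short; imputing them instead would retain more training data.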
This is similar to the above model. The only difference is that two separate datasets are used here: the training data (data/training.csv) trains the model, and that model then predicts the survival of the passengers in the test data (data/test.csv).
The notebook contains the following steps:
- Load the training and test data
- Prepare both datasets for the Spark ML library
- Build the model and fit it to the training dataset
- Run the model on the test dataset to get the predictions
- Submit the predictions to Kaggle (score: 0.77033)
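A rough sketch of this second workflow is shown below. It rests on several assumptions that the notebook may not share: the standard Kaggle Titanic column names, a submission file with `PassengerId` and `Survived` columns, `LogisticRegression` as a stand-in model, and the hypothetical names `predict_test_set` and `output_dir`.

```python
# Column names assume the standard Kaggle Titanic schema; adjust them to the actual CSVs.
LABEL_COL = "Survived"
NUMERIC_COLS = ["Pclass", "Age", "SibSp", "Parch", "Fare"]


def predict_test_set(train_path="data/training.csv", test_path="data/test.csv",
                     output_dir="predictions"):
    """Fit on the full training file, predict the test file, and write a CSV."""
    # Imported here so the definitions load even where PySpark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("titanic-submission").getOrCreate()

    # Load the training and test data.
    train_df = spark.read.csv(train_path, header=True, inferSchema=True)
    test_df = spark.read.csv(test_path, header=True, inferSchema=True)

    # Prepare both datasets: fill missing Age and Fare values with the training means,
    # so that no test row is dropped and every passenger gets a prediction.
    means = train_df.select(
        F.mean("Age").alias("Age"), F.mean("Fare").alias("Fare")
    ).first().asDict()
    train_df = train_df.na.fill(means)
    test_df = test_df.na.fill(means)

    # Build the model and fit it to the training dataset.
    indexer = StringIndexer(inputCol="Sex", outputCol="SexIndex", handleInvalid="keep")
    assembler = VectorAssembler(inputCols=NUMERIC_COLS + ["SexIndex"], outputCol="features")
    # LogisticRegression is an illustrative choice, not necessarily the notebook's model.
    classifier = LogisticRegression(featuresCol="features", labelCol=LABEL_COL)
    model = Pipeline(stages=[indexer, assembler, classifier]).fit(train_df)

    # Run the model on the test dataset to get the predictions.
    predictions = model.transform(test_df)

    # Keep only the columns assumed for the submission file and write them out.
    submission = predictions.select(
        "PassengerId", F.col("prediction").cast("int").alias(LABEL_COL)
    )
    submission.coalesce(1).write.csv(output_dir, header=True, mode="overwrite")
    return submission
```

Spark writes the result as a part file inside `output_dir` rather than as a single named CSV, so the file usually needs renaming before upload. Imputing with the training means, instead of dropping rows, reflects the fact that the test set has no label column and every row needs a prediction.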