
Titanic: Spark and Machine Learning from Disaster

License: GPL-3.0

About

Titanic: Spark and Machine Learning from Disaster is a project for Data Intensive Computing (ID2221), a Data Science course held at KTH Royal Institute of Technology (Period 1, 2023/2024).

Professor Amir Hossein Payberah

The Team

Table of contents

Specifications

The goal of this project is to process and query data about the historical Titanic wreck and to implement a predictive analysis that classifies which of the passengers on board would survive the tragedy, based on various attributes (name, age, gender, social status, etc.). The Spark structure for pre-processing and for querying the data used in model training is the main output of this project.

The initial project proposal can be found here.
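As a rough illustration, here is a minimal sketch in Scala of the kind of Spark pre-processing and training pipeline described above. It assumes the column names of the classic Titanic dataset (Survived, Pclass, Sex, Age, Fare) and uses logistic regression as a placeholder classifier; the actual notebook main.scala may pre-process and model the data differently.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Load the Titanic data registered as the "dataset" table
// (see "Running the project" below for how the table is created).
val raw = spark.table("dataset")

// Minimal pre-processing: drop rows with a missing age and
// index the categorical Sex column into a numeric column.
val cleaned = raw.na.drop(Seq("Age"))
val sexIndexer = new StringIndexer()
  .setInputCol("Sex")
  .setOutputCol("SexIndex")

// Assemble the feature vector; skip rows with remaining invalid values.
val assembler = new VectorAssembler()
  .setInputCols(Array("Pclass", "SexIndex", "Age", "Fare"))
  .setOutputCol("features")
  .setHandleInvalid("skip")

// Train a classifier on the survival label.
val lr = new LogisticRegression()
  .setLabelCol("Survived")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(sexIndexer, assembler, lr))
val Array(train, test) = cleaned.randomSplit(Array(0.8, 0.2), seed = 42)
val model = pipeline.fit(train)
model.transform(test).select("Survived", "prediction").show(5)
```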

Documentation

A complete report on the classification algorithm can be found here.

Running the project

The project was developed on the Databricks cloud data platform, using the free Databricks Community Edition license. Once you have signed up and logged into the platform, first enter the Data Science and Engineering section, open the Compute menu, and create and start your own server cluster.

Once the server cluster is running, go to the Workspace menu, create a new folder for the project, and import into it the notebook main.scala from our repository via the option Import > File. Then open the notebook, open the Connect tab, and attach it to the cluster by selecting the cluster you created.

Next, in the Data menu, import the file dataset.csv from our repository via the option Create Table > Upload File. Once you have selected dataset.csv, press Create Table with UI and select the cluster you created above. In the table preview window, enable First row is header and Infer schema, and rename the table to dataset. Pressing Create Table then gives you access to the data needed to run the notebook.
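After the table is created, the notebook can access it by name. Here is a minimal sketch, assuming the table was named dataset as above; the DBFS path in the alternative is the default Databricks upload location and may differ in your workspace.

```scala
// Read the table registered through "Create Table with UI".
val dataset = spark.table("dataset")
dataset.printSchema()
dataset.show(5)

// Alternative: read the uploaded CSV directly from DBFS,
// reproducing the "first row is header" and "infer schema" options.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/FileStore/tables/dataset.csv")
```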

Finally, return to the notebook main.scala and run the whole project by pressing the Run All button.

Software used

Databricks Community Edition - main development environment

GitKraken - Git client

OneDrive - file sharing

Overleaf - LaTeX editor

Visual Studio Code - Markdown editor
