
Titanic: Spark and Machine Learning from Disaster

License: GPL-3.0

About

Titanic: Spark and Machine Learning from Disaster is a project for Data Intensive Computing (ID2221), a Data Science course held at KTH Royal Institute of Technology (Period 1, 2023/2024).

Professor Amir Hossein Payberah

The Team

Table of contents

Specifications

The goal of this project is to process and query data about the historical Titanic wreck and to implement a predictive analysis that classifies which of the passengers on board would survive the tragedy, based on various attributes (name, age, gender, social status, etc.). The Spark structure for pre-processing and for querying the data used in model training is the main output of this project.

The initial project proposal can be found here.
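As a rough illustration, here is a minimal sketch in Scala of the kind of Spark pre-processing and training pipeline described above. It assumes the column names of the classic Titanic dataset (Survived, Pclass, Sex, Age, Fare) and uses logistic regression as a placeholder classifier; the actual notebook main.scala may pre-process and model the data differently.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Load the Titanic data registered as the "dataset" table
// (see "Running the project" below for how the table is created).
val raw = spark.table("dataset")

// Minimal pre-processing: drop rows with a missing age and
// index the categorical Sex column into a numeric column.
val cleaned = raw.na.drop(Seq("Age"))
val sexIndexer = new StringIndexer()
  .setInputCol("Sex")
  .setOutputCol("SexIndex")

// Assemble the feature vector; skip rows with remaining invalid values.
val assembler = new VectorAssembler()
  .setInputCols(Array("Pclass", "SexIndex", "Age", "Fare"))
  .setOutputCol("features")
  .setHandleInvalid("skip")

// Train a classifier on the survival label.
val lr = new LogisticRegression()
  .setLabelCol("Survived")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(sexIndexer, assembler, lr))
val Array(train, test) = cleaned.randomSplit(Array(0.8, 0.2), seed = 42)
val model = pipeline.fit(train)
model.transform(test).select("Survived", "prediction").show(5)
```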

Documentation

A complete report on the classification algorithm can be found here.

Running the project

The project was developed on the Databricks cloud data platform, using the free Databricks Community Edition license. Once you have signed up and logged into the platform, first enter the Data Science and Engineering section, open the Compute menu, and create and start your own server cluster.

Once the server cluster is running, go to the Workspace menu, create a new folder for the project, and import into it the notebook main.scala from our repository via the option Import > File. Then open the notebook, open the Connect tab, and attach it to the cluster by selecting the cluster you created.

Next, in the Data menu, import the file dataset.csv from our repository via the option Create Table > Upload File. Once you have selected dataset.csv, press Create Table with UI and select the cluster you created above. In the table preview window, enable First row is header and Infer schema, and rename the table to dataset. Pressing Create Table then gives you access to the data needed to run the notebook.
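After the table is created, the notebook can access it by name. Here is a minimal sketch, assuming the table was named dataset as above; the DBFS path in the alternative is the default Databricks upload location and may differ in your workspace.

```scala
// Read the table registered through "Create Table with UI".
val dataset = spark.table("dataset")
dataset.printSchema()
dataset.show(5)

// Alternative: read the uploaded CSV directly from DBFS,
// reproducing the "first row is header" and "infer schema" options.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/FileStore/tables/dataset.csv")
```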

Finally, return to the notebook main.scala and run the whole project by pressing the Run All button.

Software used

Databricks Community Edition - main development environment

GitKraken - Git client

OneDrive - file sharing

Overleaf - LaTeX editor

Visual Studio Code - Markdown editor
