
titanic_tutorial

Manlio Morini edited this page Apr 6, 2019 · 64 revisions

Machine learning from disaster

RMS Titanic departing Southampton on 10 April 1912

As a starting point for understanding the framework, we'll consider the contest launched by Kaggle in 2012, which asks participants to analyse what sorts of people were likely to survive the sinking of the RMS Titanic.

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children and the upper class.

Dataset

The original training dataset has 11 features (passenger id, passenger class, name, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, passenger fare, cabin, port of embarkation) but we are going to use a derived training set (titanic_train.csv).

Modifications are due to:

  • DATA FORMAT COMPLIANCE. The Vita training set format is quite simple. Think of the file as a table, with each row representing one example and commas separating columns. The first column contains the class the example falls into, while the remaining columns are features (see dataset format for a complete description). In order to comply, we had to:

    • remove the header row (NO HEADER ROW is allowed);
    • remove PassengerId column (first column of the dataset must represent the value of the example);
    • map Survived feature to a string: 0 => "no", 1 => "yes" (if the first column is a string this is a classification model).
  • DATA PREPROCESSING. Preprocessing, though not mandatory, is often fundamental to improving performance. For instance, in titanic_train.csv:

    • Name feature has been removed;
    • missing data have been filled in with plausible values.
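
The compliance steps listed above can be sketched as a small stand-alone conversion routine. This is only an illustration, not the tool used to build titanic_train.csv: the function names are hypothetical, the input column order (PassengerId, Survived, Pclass, Name, ...) is assumed to be the Kaggle one, missing values are left untouched, and whether strings must be quoted in the output should be checked against the dataset format description. The caller is expected to skip the header row.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits one CSV line into fields, honouring double-quoted fields
// (needed because passenger names contain commas).
std::vector<std::string> split_csv(const std::string &line)
{
  std::vector<std::string> fields(1);
  bool quoted = false;

  for (std::size_t i = 0; i < line.size(); ++i)
  {
    const char c = line[i];

    if (c == '"')
    {
      if (quoted && i + 1 < line.size() && line[i + 1] == '"')
      {
        fields.back() += '"';  // escaped quote inside a quoted field
        ++i;
      }
      else
        quoted = !quoted;
    }
    else if (c == ',' && !quoted)
      fields.emplace_back();   // start a new field
    else
      fields.back() += c;
  }

  return fields;
}

// Hypothetical helper: converts one data row from the assumed Kaggle layout
// into "class first" layout, dropping PassengerId (column 0) and Name
// (column 3) and mapping Survived 0/1 to no/yes.
std::string to_vita_row(const std::string &kaggle_row)
{
  const auto f = split_csv(kaggle_row);
  assert(f.size() > 3);

  std::string out = (f[1] == "1") ? "yes" : "no";
  for (std::size_t i = 2; i < f.size(); ++i)
    if (i != 3)
      out += "," + f[i];

  return out;
}
```

For a row such as 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S the helper yields one class column followed by nine feature columns, which is the shape reported as "features: 9" in the log further down.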

Start to code

This is the basic code for a search:

#include <iostream>

#include "kernel/vita.h"

int main()
{
  vita::src_problem titanic("titanic_train.csv",  // training set
                            vita::src_problem::default_symbols);

  vita::src_search<> s(titanic);
  const auto summary(s.run());                    // go searching
  std::cout << summary.best.solution << '\n';     // print search result
}

All the Vita classes and functions are placed into the vita namespace.

Now compile and run the example (for convenience the above code is in the examples/titanic01.cc file):

$ make titanic01

... compiling ...

$ cd examples
$ ./titanic01

you should see something like (actual values will differ):

[INFO] Reading dataset titanic_train.csv...
[INFO] ....dataset read. Examples: 891, categories: 4, features: 9, classes: 2
[INFO] Setting up default symbol set...
[INFO] ...symbol set ready. Symbols: 19
[INFO] Number of layers set to 1
[INFO] Population size set to 1880
Run 0.     0 (  0%): fitness (-534.676)
Run 0.     0 (  0%): fitness (-447.057)
Run 0.     0 (  0%): fitness (-248.699)
Run 0.     0 (  1%): fitness (-248.699)
Run 0.     0 ( 39%): fitness (-248.699)
...
...
Run 0.    97 ( 57%): fitness (-185.236)
Run 0.    98 ( 47%): fitness (-185.232)
[INFO] Elapsed time: 90.819s
[INFO] Training fitness: (-185.232)

[00,0] SIFE "male" "male" [22,0] [74,0]
[22,0] FMOD 1.0 [52,0]
[52,0] SIFE X8 "E" [55,0] [59,0]
[55,0] FMOD [69,0] [75,0]
[59,0] SIFE X2 "female" [83,0] [72,0]
[69,0] FSUB [80,0] X4
[72,0] FDIV [74,0] 9.0
[74,0] FADD X3 [87,0]
[75,0] FMUL [77,0] X4
[77,0] FMOD X3 3.0
[80,0] FLN 9.0
[83,0] FMOD X6 X4
[87,0] FADD [91,0] X4
[91,0] SIFE "male" "female" 9.0 X4

At last we have a program to classify training examples! Not too difficult, but what's going on under the hood?

Line by line description


  vita::src_problem titanic("titanic_train.csv",  // training set
                            vita::src_problem::default_symbols);

The src_problem class is a specialization of the problem class for symbolic regression and classification tasks. The constructor reads a dataset (the titanic_train.csv file) and automatically sets up the standard dictionary of symbols (vita::src_problem::default_symbols) for classification tasks (the assembly-like SIFE, FADD, FMUL... instructions).


vita::src_search<> s(titanic);

The search of solutions is entirely driven by the master vita::src_search class. It uses titanic to access data and specific parameters/constraints.

src_search is a class template: its template parameters allow choosing among various search algorithms. For now we stick with the default (<>).


const auto summary(s.run());

The run method performs the search and returns a summary including the best program found (it can be an individual or a team, depending on the src_search template parameters). Since we do not specify any environment parameter (e.g. population size, number of runs...) they're automatically tuned (e.g. see the information line [INFO] Population size set to 1880).


std::cout << summary.best.solution << '\n';

If you take a look at the solution you can see terminals (X2, X3, X4 are features from the dataset; "female", "male", 9.0, 3.0 are constants) and functions (FMOD, SIFE, FADD, FMUL).

The solution can be simplified: in the listing above, SIFE "male" "male" [22,0] [74,0] always evaluates to [22,0] and FLN 9.0 is a constant (in the same way, FABS 6.0 is just 6.0, as is SIFE "T" "T" 6.0 X1). Redundant code of this kind is typical of evolutionary algorithms (it's a sort of protection against dangerous mutations). The framework provides some functions to carry out part of this simplification but, in general, user intervention is worthwhile before the system is put into production.
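
The redundancy just described can be illustrated with a small stand-alone sketch. It assumes that SIFE means "if the first two string arguments are equal, yield the third argument, otherwise the fourth", which is what simplifying SIFE "T" "T" 6.0 X1 to 6.0 implies. The sife and redundant helpers are hypothetical and not part of Vita.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Hypothetical stand-in for the SIFE instruction (assumed semantics):
// yields `then` when the two strings are equal, `otherwise` if not.
double sife(const std::string &a, const std::string &b,
            double then, double otherwise)
{
  return a == b ? then : otherwise;
}

// Mirrors SIFE "T" "T" (FABS 6.0) X1: the two literals are identical, so
// the result never depends on x1 and the expression collapses to 6.0.
double redundant(double x1)
{
  return sife("T", "T", std::fabs(6.0), x1);
}
```

Whatever value is passed as x1, redundant returns 6.0, so the whole expression could be replaced by that constant without changing the behaviour of the program.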

The framework has autonomously chosen a basic symbol set for the classification task (see setup_symbols method in kernel/src/problem.cc).

Fitness

Fitness is a scalar/vector value assigned to an individual which reflects how well the individual solves the task.

In the literature there are four measures of fitness:

  • raw fitness
  • standardized fitness
  • adjusted fitness
  • normalized fitness

but we are mostly interested in the first two.

The raw fitness is the measurement of fitness that is stated in the natural terminology of the problem itself, so the better value may be either smaller or larger.

For example in an optimal control problem, one may be trying to minimize some cost measure, so a lesser value of raw fitness is better.

Because raw fitness is so loosely defined, what is calculated and used in Vita is the standardized fitness. The only requirement for standardized fitness is that bigger values represent better individuals (this may differ in other frameworks).

Many of the fitness functions available in Vita (see class evaluator and the many specializations) define the optimal fitness as scalar value 0 / vector (0, ... 0) and use negative values for sub-optimal solutions but this is not mandatory (e.g. see examples/example6.cc for a different range).

In the titanic01 output above, fitness improves from (-534.676) to (-185.232), moving toward 0. Usually the increase is steady until the search finds the optimal solution or gets stuck in a local optimum (with specific algorithms enabled, e.g. DSS, you'd see periods of steady increase followed by a 'fall').

It's clear that, except in the simplest cases, fitness alone is not enough to understand the strengths and flaws of a candidate solution.
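
The sign convention described in this section can be made concrete with a minimal sketch. It is not Vita's evaluator code: the function names and the choice of "sum of absolute errors" as raw fitness are assumptions made for illustration. The raw measure is smaller-is-better; negating it gives a standardized fitness where 0 is optimal and bigger values are better.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Raw fitness (illustrative choice): sum of absolute errors.
// Smaller is better, 0 means a perfect match.
double raw_error(const std::vector<double> &predicted,
                 const std::vector<double> &expected)
{
  assert(predicted.size() == expected.size());

  double sum = 0.0;
  for (std::size_t i = 0; i < predicted.size(); ++i)
    sum += std::fabs(predicted[i] - expected[i]);

  return sum;
}

// Standardized fitness: negate the raw error so that bigger is better
// and the optimum is 0.
double standardized_fitness(const std::vector<double> &predicted,
                            const std::vector<double> &expected)
{
  return -raw_error(predicted, expected);
}
```

With this convention a perfect individual scores 0 and every imperfect one scores below 0, which is the pattern visible in the negative fitness values of the search log.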

Some changes to code

int main()
{
  vita::src_problem titanic("titanic_train.csv",
                            vita::src_problem::default_symbols);

  vita::src_search<> s(titanic, vita::metric_flags::accuracy);  // <-- 1
  const auto summary(s.run());

  std::cout << summary.best.solution << '\n'
            << summary.best.score.accuracy << '\n';             // <-- 2
}

(the above code is in the examples/titanic02.cc file)

There are two changes:

  1. vita::src_search<> s(titanic, vita::metric_flags::accuracy)

    We're asking the program to perform accuracy calculation during the search. Additional metrics aren't enabled by default since many of them require expensive computations.

  2. The program now prints the accuracy of the classification.

So:

$ make titanic02

... compiling ...

$ cd examples
$ ./titanic02
[INFO] Number of layers set to 1
[INFO] Population size set to 1880
Run 0.     0 (  0%): fitness (-534.676)
...
...
Run 0.    97 ( 57%): fitness (-185.236)
Run 0.    98 ( 47%): fitness (-185.232)
[INFO] Elapsed time: 158.849s          
[INFO] Training fitness: (-185.232)
[INFO] Training accuracy: 82.0426%
[00,0] SIFE "male" "male" [22,0] [74,0]
[22,0] FMOD 1.0 [52,0]
[52,0] SIFE X8 "E" [55,0] [59,0]
[55,0] FMOD [69,0] [75,0]
[59,0] SIFE X2 "female" [83,0] [72,0]
[69,0] FSUB [80,0] X4
[72,0] FDIV [74,0] 9.0
[74,0] FADD X3 [87,0]
[75,0] FMUL [77,0] X4
[77,0] FMOD X3 3.0
[80,0] FLN 9.0
[83,0] FMOD X6 X4
[87,0] FADD [91,0] X4
[91,0] SIFE "male" "female" 9.0 X4

0.820426

The candidate solution has an accuracy ≅ 82.04%.

How good is the candidate solution?

There are some points about accuracy you should consider:

  • it's just one of many evaluation metrics and is sometimes quite misleading but, since this is an introduction, we'll consider higher accuracy a better result... at least accuracy seems more understandable than fitness;
  • for simplicity it's calculated on the training set. This is far from optimal (Vita also supports validation strategies).
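
For reference, accuracy here is simply the fraction of examples whose predicted class matches the actual one. The following stand-alone sketch shows the calculation; the accuracy function below is a hypothetical helper, not the routine Vita uses internally.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Fraction of examples whose predicted label equals the actual label.
double accuracy(const std::vector<std::string> &predicted,
                const std::vector<std::string> &actual)
{
  assert(predicted.size() == actual.size() && !actual.empty());

  std::size_t correct = 0;
  for (std::size_t i = 0; i < actual.size(); ++i)
    if (predicted[i] == actual[i])
      ++correct;

  return static_cast<double>(correct) / static_cast<double>(actual.size());
}
```

As a sanity check on the figure printed by titanic02: with 891 training examples, 731 correct predictions would give 731 / 891 ≅ 0.820426.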

How to find a better candidate solution?

Tweaking the search parameters is the usual first approach.

Searching longer (more generations), increasing the number of runs and increasing the population size are some ways of trading time for quality (this can be done via the environment class):

int main()
{
  vita::src_problem titanic("titanic_train.csv",
                            vita::src_problem::default_symbols);

  titanic.env.mep.code_length =  130;  // <-- 1
  titanic.env.individuals     = 3000;  // <-- 2
  titanic.env.generations     =  200;  // <-- 3

  vita::src_search<> s(titanic, vita::metric_flags::accuracy);
  const auto summary(s.run(5));        // <-- 4

  std::cout << summary.best.solution << '\n'
            << summary.best.score.accuracy << '\n';
}

(the above code is in the examples/titanic03.cc file)

  1. titanic.env.mep.code_length = 130;

    I.e. longer individuals. There is more space for a complex solution but also a higher risk of over-fitting.

  2. titanic.env.individuals = 3000;

    A larger population probably permits a broader exploration of the search space (large populations require more computing power / memory).

  3. titanic.env.generations = 200;

    More time to improve a candidate solution but also higher risk of over-fitting and a possible waste of resources in case of premature convergence.

  4. s.run(5)

    We want a five-run search.

$ make titanic03

... compiling ...

$ cd examples
$ ./titanic03
[INFO] Number of layers set to 1
Run 0.     0 (  0%): fitness (-248.699)
...
...
Run 4.   182 ( 46%): fitness (-180.113)
Run 4.   187 ( 23%): fitness (-178.686)
[INFO] Elapsed time: 626.656s          
[INFO] Training fitness: (-178.686)
[INFO] Training accuracy: 82.716%
[000,0] FLN [020,0]
[020,0] FMUL [071,0] [043,0]
[043,0] FLN [057,0]
[057,0] FSUB [083,0] [058,0]
[058,0] FSUB 9.0 [115,0]
[071,0] FADD [101,0] [072,0]
[072,0] FDIV [084,0] [102,0]
[083,0] SIFE "male" X2 [100,0] [099,0]
[084,0] SIFE "B" X8 X7 X4
[099,0] FMUL X1 [116,0]
[100,0] SIFE "D" X8 X5 X3
[101,0] FADD [104,0] [107,0]
[102,0] SIFE X8 "E" [115,0] [112,0]
[104,0] FMUL [109,0] [118,0]
[107,0] FADD 8.0 [109,0]
[109,0] FADD X5 8.0
[112,0] SIFE "male" X2 [119,0] X7
[115,0] FMUL X1 X4
[116,0] FABS X5
[118,0] FMUL X1 7.0
[119,0] FSUB 6.0 7.0

0.837262

As expected, the search now spans multiple runs (i.e. independent evolution cycles).

The solution of every run is checked against the training set and the best of the candidate solutions is selected as "winner".
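
The "best of several runs" selection can be sketched in a few lines. This is a simplified illustration under stated assumptions: the run_result record and best_run function are hypothetical, and the winner is assumed to be the run with the highest standardized fitness on the training set (the framework may use additional criteria).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical record describing the outcome of one independent run.
struct run_result
{
  unsigned run;      // run index
  double fitness;    // standardized fitness: bigger is better
};

// Returns the run whose best individual has the highest fitness.
run_result best_run(const std::vector<run_result> &runs)
{
  assert(!runs.empty());

  return *std::max_element(runs.begin(), runs.end(),
                           [](const run_result &a, const run_result &b)
                           { return a.fitness < b.fitness; });
}
```

Because the runs are independent, this selection costs nothing beyond the runs themselves, but the winner is still chosen on training data and so inherits the caveats about over-fitting mentioned earlier.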

In the example it has an accuracy ≅ 83.73% and looks like a marginally better model, although further analysis is required to confirm that the improvement is real.

How to make predictions

Now let's backtrack a little bit. Somehow we have obtained an interesting individual, the accuracy and/or other performance measurements are promising and we would like to use it to make predictions.

Example examples/titanic04.cc explains how to proceed:

const auto model(s.lambdify(summary.best.solution));
const auto example(random::element(titanic.data()));
const auto result(model->tag(example));

std::cout << "Correct class: " << label(example)
          << "   Prediction: " << result.label
          << "   Sureness: " << result.sureness << '\n';

The vita::src_search class provides the lambdify(individual) member function, which returns a functor usable for symbolic regression and classification tasks.

The key definition is:

const auto model(s.lambdify(summary.best.solution));

The model object is a std::unique_ptr<some strange object> smart pointer, but its exact type isn't very important. What matters is that it accepts an example (dataframe::example) and returns the predicted class ((*model)(example)).

Even better, it has the tag member function:

const auto result(model->tag(example));

result is a struct containing the predicted class (result.label) and the sureness of the prediction (result.sureness, varies in the [0, 1] interval):

std::cout << "Correct class: " << label(example)
          << "   Prediction: " << result.label
          << "   Sureness: " << result.sureness << '\n';

The model can be made persistent:

std::stringstream ss;
serialize::save(ss, model);

// ... and reload it when needed.
const auto model2(serialize::lambda::load(ss, titanic.sset));
const auto result2(model2->tag(example));
std::cout << "   Prediction: " << result2.label
          << "   Sureness: " << result2.sureness << '\n';
assert(result2.label == result.label);

The individual at the core of the model can be exported in several languages:

std::cout << "\nC LANGUAGE\n" << std::string(40, '-') << '\n'
          << out::print_format(out::c_language_f) << summary.best.solution
          << "\n\nGRAPHVIZ FORMAT\n" << std::string(40, '-') << '\n'
          << out::print_format(out::graphviz_f) << summary.best.solution
          << "\n\nLIST (DEBUG) FORMAT\n" << std::string(40, '-') << '\n'
          << out::print_format(out::list_f) << summary.best.solution
          << '\n';

but in this case that's not so useful: the individual acts as a discriminant function, so it's just one part of the classification model.
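
To see why a discriminant function alone is not a complete classifier, consider the sketch below. It shows one simple way of turning a numeric discriminant value into a class label plus a sureness in [0, 1]. This is purely illustrative and is not claimed to be the strategy Vita implements: the tagged struct, the classify function, the zero threshold and the logistic squashing are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Hypothetical result type, mirroring the label / sureness pair in the text.
struct tagged
{
  std::string label;
  double sureness;   // in [0, 1]
};

// Maps the raw output of a discriminant function to a class:
// positive values mean "yes", negative values mean "no"; values far from
// zero give a sureness close to 1.
tagged classify(double discriminant)
{
  const double p = 1.0 / (1.0 + std::exp(-discriminant));  // in (0, 1)

  if (p >= 0.5)
    return {"yes", 2.0 * p - 1.0};

  return {"no", 1.0 - 2.0 * p};
}
```

The exported C or Graphviz form of the individual corresponds only to the computation of the discriminant value; the mapping from that value to a label and a sureness is the extra piece supplied by the model object.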
