titanic_tutorial
As a starting point for understanding the framework, we'll consider the contest launched by Kaggle in 2012, which asks for an analysis of what sorts of people were likely to survive the sinking of the RMS Titanic.
On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children and the upper class.
The original training dataset has 11 features (passenger id, passenger class, name, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, passenger fare, cabin, port of embarkation) but we are going to use a derived training set (`titanic_train.csv`).
Modifications are due to:

- DATA FORMAT COMPLIANCE. The VITA training set format is quite simple. Think of the file as a table, with each row representing one example and commas separating columns. The first column contains the class the example falls into, while the additional columns are features (see dataset format for a complete description). In order to comply, we had to:
  - remove the header row (NO HEADER ROW is allowed);
  - remove the `PassengerId` column (the first column of the dataset must represent the value of the example);
  - map the `Survived` feature to a string: 0 => `"no"`, 1 => `"yes"` (if the first column is a string, this is a classification model).
- DATA PREPROCESSING. Preprocessing, though not mandatory, is often fundamental to improve performance. For instance, in `titanic_train.csv`:
  - the `Name` feature has been removed;
  - missing data have been filled in with plausible values.
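For concreteness, here is a hypothetical excerpt of what rows of `titanic_train.csv` could look like after these modifications (the values come from the well-known Kaggle dataset, but the exact column order and the filled-in values of the derived file may differ):

```
"no",3,"male",22,1,0,"A/5 21171",7.25,"","S"
"yes",1,"female",38,1,0,"PC 17599",71.2833,"C85","C"
```

Note the class label in the first column and the absence of a header row.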
This is the basic code for a search:
#include "kernel/vita.h"
int main()
{
vita::src_problem titanic("titanic_train.csv", // training set
vita::src_problem::default_symbols);
vita::src_search<> s(titanic);
const auto summary(s.run()); // go searching
std::cout << summary.best.solution << '\n'; // print search result
}
All the Vita classes and functions are placed in the `vita` namespace.
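As a plain C++ aside (nothing Vita-specific), a using-directive lets you drop the prefix in the snippets below:

```c++
using namespace vita;  // `src_problem`, `src_search`, ... now in scope

src_problem titanic("titanic_train.csv", src_problem::default_symbols);
src_search<> s(titanic);
```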
Now compile and execute the example (for your convenience the above code is in the examples/titanic01.cc file):
```sh
$ make examples/titanic01
... compiling ...
$ cd build
$ ./titanic01
```
You should see something like this (actual values will differ):
```
[INFO] Reading dataset titanic_train.csv...
[INFO] ....dataset read. Examples: 891, categories: 4, features: 9, classes: 2
[INFO] Setting up default symbol set...
[INFO] ...symbol set ready. Symbols: 19
[INFO] Number of layers set to 1
[INFO] Population size set to 1880
Run 0. 0 ( 0%): fitness (-534.676)
Run 0. 0 ( 0%): fitness (-447.057)
Run 0. 0 ( 0%): fitness (-248.699)
Run 0. 0 ( 1%): fitness (-248.699)
Run 0. 0 ( 39%): fitness (-248.699)
...
...
Run 0. 97 ( 57%): fitness (-185.236)
Run 0. 98 ( 47%): fitness (-185.232)
[INFO] Elapsed time: 90.819s
[INFO] Training fitness: (-185.232)
[00,0] SIFE "male" "male" [22,0] [74,0]
[22,0] FMOD 1.0 [52,0]
[52,0] SIFE X8 "E" [55,0] [59,0]
[55,0] FMOD [69,0] [75,0]
[59,0] SIFE X2 "female" [83,0] [72,0]
[69,0] FSUB [80,0] X4
[72,0] FDIV [74,0] 9.0
[74,0] FADD X3 [87,0]
[75,0] FMUL [77,0] X4
[77,0] FMOD X3 3.0
[80,0] FLN 9.0
[83,0] FMOD X6 X4
[87,0] FADD [91,0] X4
[91,0] SIFE "male" "female" 9.0 X4
```
At last we have a program to classify training examples! Not too difficult, but what's going on under the hood?
```c++
vita::src_problem titanic("titanic_train.csv",  // training set
                          vita::src_problem::default_symbols);
```
The `src_problem` class is a specialization of the `problem` class for symbolic regression and classification tasks. The constructor reads a dataset (the `titanic_train.csv` file) and automatically sets up the standard symbol dictionary for classification tasks.
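Loading can fail (e.g. a missing or malformed file). As examples/titanic03.cc below does, you can check the problem object before starting the search:

```c++
// `src_problem` converts to `false` when the dataset wasn't read
// correctly (the same guard appears later in examples/titanic03.cc).
if (!titanic)
  return EXIT_FAILURE;
```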
```c++
vita::src_search<> s(titanic);
```
The search for solutions is driven entirely by the master `vita::src_search` class, which uses `titanic` to access data and specific parameters/constraints. `src_search` is a template class: its template parameters allow you to choose among various search algorithms. For now we stick with the default (`<>`).
```c++
const auto summary(s.run());
```
The `run` method performs the search and returns a summary including the best program found (it can be an individual or a team, depending on the `src_search` template parameters).
Since we do not specify any environment parameter (e.g. population size, number of runs...), they're automatically tuned (see the output line `[INFO] Population size set to 1880`).
```c++
std::cout << summary.best.solution << '\n';
```
If you take a look at the solution, you can see terminals (`X2`, `X3`, `X4`... are features from the dataset; `"female"`, `"male"`, `3.0`, `9.0` are constants) and functions (`FMOD`, `SIFE`, `FADD`, `FMUL`...).
The solution can be simplified (e.g. `FABS 6.0` is just `6.0`, and so is `SIFE "T" "T" 6.0 X1`). This is typical of evolutionary algorithms (it's a sort of protection against dangerous mutations). The framework provides some functions to carry out part of this simplification but, in general, user intervention is worthwhile before the system is put into production.
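To make this concrete, here is a hypothetical hand-translation of the first nodes of the solution above into plain C++, assuming `SIFE a b x y` reads "if a equals b then x else y" (consistent with the simplification example) and `FMOD` is the floating-point modulo:

```c++
#include <cmath>

// Hand-translation of:
//   [00,0] SIFE "male" "male" [22,0] [74,0]
//   [22,0] FMOD 1.0 [52,0]
// "male" == "male" is always true, so node [00,0] collapses to
// node [22,0]: a first, easy simplification.
double node_00(double node_52)
{
  return std::fmod(1.0, node_52);  // FMOD 1.0 [52,0]
}
```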
The framework has autonomously chosen a basic symbol set for the classification task (see the `setup_symbols` method in kernel/src/problem.cc).
Fitness is a scalar/vector value assigned to an individual which reflects how well the individual solves the task.
In the literature there are four measures of fitness:
- raw fitness
- standardized fitness
- adjusted fitness
- normalized fitness
but we are mostly interested in the first two.
The raw fitness is the measurement of fitness stated in the natural terminology of the problem itself, so a better value may be either smaller or larger.
For example, in an optimal control problem one may be trying to minimize some cost measure, so a lesser value of raw fitness is better.
Because raw fitness is so loosely defined, what is calculated and used in Vita is the standardized fitness. The only requirement for standardized fitness is that bigger values represent better individuals (this may differ in other frameworks).
Many of the fitness functions available in Vita (see the `evaluator` class and its many specializations) define the optimal fitness as the value `0` / vector `(0, ... 0)` and use negative values for sub-optimal solutions, but this is not mandatory (e.g. see examples/example6.cc for a different range).
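For intuition, here is a minimal sketch of a standardized fitness in this spirit. It's a hypothetical measure (not Vita's actual evaluator) where `0` is the optimum and bigger is better:

```c++
#include <string>
#include <vector>

struct example { /* features... */ std::string label; };

// Hypothetical standardized fitness: 0 for a perfect classifier,
// increasingly negative as more training examples are misclassified.
template<class Classifier>
double standardized_fitness(const Classifier &model,
                            const std::vector<example> &training)
{
  double errors(0.0);

  for (const auto &ex : training)
    if (model(ex) != ex.label)
      ++errors;

  return -errors;  // closer to 0 is better
}
```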
In the above output, fitness increases from `fitness (-534.676)` to `fitness (-185.232)`, moving toward `0`. Usually it's a steady increase until the search finds the optimal solution (or gets stuck in a local optimum). Enabling specific algorithms (e.g. DSS) you'd see periods of steady increase followed by a 'fall'.
The solution found has a fitness of `-185.232` on the training set (Vita also supports validation strategies).
Is this a good solution? How to improve?
```c++
#include "kernel/vita.h"

int main()
{
  vita::src_problem titanic("titanic_train.csv");

  vita::src_search<> s(titanic, vita::metric_flags::accuracy);  // <-- 1

  const auto summary(s.run(10));  // <-- 2

  std::cout << summary.best.solution << '\n'
            << summary.best.score.accuracy << '\n';  // <-- 3
}
```
(the above code is in the examples/titanic02.cc file)
There are three changes:

- `vita::src_search<> s(titanic, vita::metric_flags::accuracy)`: the program must perform the accuracy calculation during the search. This isn't enabled by default since many metrics require expensive computations.
- `s.run(10)`: we want a ten-run search.
- the program now prints the accuracy of the classification.
So:
```sh
$ make examples/titanic02
... compiling ...
$ cd build
$ ./titanic02
```
```
[INFO] DSS set to 1
[INFO] Number of layers set to 6
[INFO] Population size set to 313
[INFO] Validation percentage set to 20%
Run 0. 0 ( 0%): fitness (-163.5)
...
...
Run 9. 100 ( 24%): fitness (-117.757)
Run 9. 100 ( 29%): fitness (-112.619)
[INFO] Elapsed time: 6.06s
[INFO] Validation fitness: (-67.2574)
[INFO] Validation accuracy: 79.2135%
[00,0] SIFE X2 "female" X4 [19,0]
[19,0] FMOD [53,0] [71,0]
[53,0] FADD [56,0] [90,0]
[56,0] SIFE "" "C" [63,0] [71,0]
[63,0] FDIV X4 [64,0]
[64,0] FMOD 5.0 4.0
[71,0] FADD X3 [86,0]
[86,0] SIFE "G" X8 9.0 [87,0]
[87,0] FSUB [88,0] 9.0
[88,0] FMUL X4 5.0
[90,0] SIFE X2 "female" 5.0 7.0
0.803371
```
As expected, the search now spans ten different runs (i.e. ten independent evolution cycles).
The solution of every run is checked against the validation set and the best of the ten individuals found is selected as the 'winner' (in the example it has an accuracy ≅ 80.34%).
Now we want a better result. Of course accuracy is just one of many evaluation metrics and it can sometimes be quite misleading, but since this is an introduction, we'll consider higher accuracy a better result.
What options do we have? Tweaking the search parameters is the usual first approach.
Searching longer (more generations), increasing the number of runs and increasing the population size are some ways of trading time for quality (this can be done via the `environment` class).
```c++
#include "kernel/vita.h"
#include <cstdlib>

int main()
{
  vita::src_problem titanic("titanic_train.csv");

  if (!titanic)
    return EXIT_FAILURE;

  titanic.env.mep.code_length = 130;  // <-- 1
  titanic.env.individuals = 1000;     // <-- 2
  titanic.env.generations = 200;      // <-- 3

  vita::src_search<> s(titanic, vita::metric_flags::accuracy);

  const auto summary(s.run(10));

  std::cout << summary.best.solution << '\n'
            << summary.best.score.accuracy << '\n';
}
```
(the above code is in the examples/titanic03.cc file)
- `titanic.env.mep.code_length = 130;`: 30% longer individuals. There is more space for a complex solution but also a higher risk of over-fitting.
- `titanic.env.individuals = 1000;`: a population about three times larger. Probably a better exploration of the search space (more computing power / memory is required to support a larger population).
- `titanic.env.generations = 200;`: more time to improve a candidate solution, but also a higher risk of over-fitting and a possible waste of resources in case of premature convergence.
```sh
$ make examples/titanic03
... compiling ...
$ cd build
$ ./titanic03
```
```
[INFO] DSS set to 1
[INFO] Number of layers set to 6
[INFO] Validation percentage set to 20%
Run 0. 0 ( 0%): fitness (-137)
...
...
Run 9. 200 ( 35%): fitness (-110.844)
Run 9. 200 ( 49%): fitness (-105.997)
[INFO] Elapsed time: 51.217s
[INFO] Validation fitness: (-63.6899)
[INFO] Validation accuracy: 83.7079%
[000,0] SIFE X2 "female" [015,0] [122,0]
[015,0] FADD X1 [035,0]
[035,0] FMOD [037,0] X1
[037,0] FMOD [065,0] [059,0]
[059,0] FABS X7
[065,0] SIFE "S" X9 [093,0] [124,0]
[093,0] FABS 8.0
[122,0] SIFE "E" X8 [125,0] [124,0]
[124,0] FMOD 6.0 [128,0]
[125,0] FMOD X3 6.0
[128,0] FADD X3 X3
0.837079
```
It seems a marginally better model (accuracy is now 83.71%), though further analysis is required to confirm an effective improvement.
Now let's backtrack a little. Somehow we have obtained an interesting individual: the accuracy and/or other performance measurements are promising and we would like to use it to make predictions.
The examples/titanic04.cc example shows how to proceed:
```c++
const auto model(s.class_lambdify(summary.best.solution));

const auto example(random::element(titanic.data()));
const auto result(model->tag(example));

std::cout << "Correct class: " << label(example)
          << "   Prediction: " << result.label
          << "   Sureness: " << result.sureness << '\n';
```
The `vita::src_search` class has the `lambdify(individual)` member function (here we use its classification-specific variant, `class_lambdify`), which returns a functor exploitable for symbolic regression and classification tasks.
The key definition is:
```c++
const auto model(s.class_lambdify(summary.best.solution));
```
The `model` object is a `std::unique_ptr<some strange object>` smart pointer, but this isn't very important. What matters is that it accepts an example (a `dataframe::example`) and gives the predicted class (`(*model)(example)`).
Even better, it has the `tag` member function:

```c++
const auto result(model->tag(example));
```

`result` is a `struct` containing the predicted class (`result.label`) and the sureness of the prediction (`result.sureness`, which varies in the `[0, 1]` interval).
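Putting it all together, here is a minimal sketch of how one could score the model over the whole dataset. This is hypothetical code: it only reuses the calls shown above and assumes `titanic.data()` is an iterable range (as `random::element(titanic.data())` suggests):

```c++
// Hypothetical accuracy check, reusing tag() / label() from above.
unsigned correct(0), total(0);

for (const auto &ex : titanic.data())
{
  ++total;
  if (model->tag(ex).label == label(ex))
    ++correct;
}

std::cout << "Manual accuracy: "
          << static_cast<double>(correct) / total << '\n';
```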