Hi,
I'm currently working on a student project that aims to evaluate the relevance of ProbLog (and probabilistic programming in general) to machine learning. To do so, I'm trying to solve the Kaggle challenge "Titanic: Machine Learning from Disaster". I did not find any forum where I could ask my questions about ProbLog, so I thought I would give it a shot here.
I'm really not sure about my approach, and I ran into multiple issues while trying to implement my classifier.
First variant
I initially used only discrete features (Sex and Passenger Class) to implement a simpler version of the classifier. I use a Python script to generate the input files and to handle ProbLog's output.
% sex(S) : Male (0) or Female (1)
t(_)::sex(0);t(_)::sex(1).
% pclass(P) : Class 1, 2 or 3
t(_)::pclass(1);t(_)::pclass(2);t(_)::pclass(3).
% person(ID, Sex, PassengerClass) : The passenger with the given id has the given sex and passenger class
person(X, S, PClass) :- sex(S), pclass(PClass).
% survived(+PassengerId, Survived) : the given passenger survived if Survived equals 1
t(_)::survived(X, 1); t(_)::survived(X, 0) :- person(X,S,P).
The input files are generated from a "training set" and are structured as follows:
evidence(person(1, 1, 1)). % Passenger with id 1 is a Woman in first class
evidence(survived(1, 1)). % Passenger with id 1 survived
---
% More evidence
I then use ProbLog's lfi mode to generate a learned model.
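For reference, the generation step of my script looks roughly like the sketch below. It is simplified: the `rows` list and the function names are placeholders I made up for this post, and only the first row matches the example above.

```python
# Hypothetical training rows: (passenger_id, sex, pclass, survived).
# The first row matches the example above; the second is a made-up placeholder.
rows = [(1, 1, 1, 1), (2, 0, 3, 0)]


def evidence_block(pid, sex, pclass, survived):
    """Return the two evidence facts for one passenger."""
    return (
        f"evidence(person({pid}, {sex}, {pclass})).\n"
        f"evidence(survived({pid}, {survived}))."
    )


def evidence_file(rows):
    """Join one block per passenger, with '---' lines between the blocks."""
    return "\n---\n".join(evidence_block(*row) for row in rows) + "\n"


print(evidence_file(rows))
```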
Using the model
I'm currently using a second file to classify the data from my test set, as follows:
:-consult('learned_model.pl'). % Load the learned model
person(863,1,1). % Add passenger with ID 863 who is a woman in first class to the persons list
query(survived(863, 1)). % Query whether the passenger survived
% More of the above, for each passenger of the test set
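This file is also produced by my Python script. A minimal sketch of that step, assuming the test passengers are available as (id, sex, class) tuples; `query_file` is a hypothetical helper name, not something from ProbLog:

```python
def query_file(passengers, model_path="learned_model.pl"):
    """Build the classification file: one fact and one query per passenger."""
    lines = [f":-consult('{model_path}')."]
    for pid, sex, pclass in passengers:
        lines.append(f"person({pid},{sex},{pclass}).")
        lines.append(f"query(survived({pid}, 1)).")
    return "\n".join(lines) + "\n"


# Passenger 863 is the example used above.
print(query_file([(863, 1, 1)]))
```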
I initially ran the model using ProbLog's sample mode, but I wasn't satisfied with the results. I couldn't find in the docs how learned models are supposed to be used, but I found out that ProbLog's mpe mode gives good (and consistent) results. I am, however, not really sure whether this is the intended use of learned models.
Second variant
While reading the tutorial a second time, I also found out that I could make the learnable probabilities of the 'survived' predicate depend on each of the 'person' predicate's arguments, like so:
% sex(S) : Male (0) or Female (1)
t(_)::sex(0);t(_)::sex(1).
% pclass(P) : Class 1, 2 or 3
t(_)::pclass(1);t(_)::pclass(2);t(_)::pclass(3).
% person(ID, Sex, PassengerClass) characterizes a titanic passenger
% with their passenger id, sex and passenger class
person(_ID, Sex, PassengerClass) :-
sex(Sex),
pclass(PassengerClass).
% survived(+PassengerId, Survived) : the given passenger survived if Survived equals 1
t(_, Sex, PassengerClass)::survived(PassengerId, 1); t(_, Sex, PassengerClass)::survived(PassengerId, 0) :-
person(PassengerId, Sex, PassengerClass).
This variant gave me a more coherent learned model, one that actually made use of the given features.
Third variant
I realized while reading the documentation that ProbLog doesn't support continuous values, so I tried to work around this limitation for my use case. I had three options in mind:
- Divide the passengers in age groups
- "Discretize" the values by rounding them to the nearest integer value
- Use one of ProbLog's extensions (namely DC-ProbLog)
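In my Python script, the first two options amount to something like the sketch below. The group boundaries in `AGE_BOUNDS` are hypothetical values chosen only for illustration, and both function names are made up for this post.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bounds of groups 0, 1 and 2;
# group 3 covers everyone aged 60 or more.
AGE_BOUNDS = [13, 20, 60]


def age_group(age):
    """Option 1: map an age in years to a group index between 0 and 3."""
    return bisect_right(AGE_BOUNDS, age)


def rounded_age(age):
    """Option 2: round to the nearest integer.

    Note that Python's round() sends exact halves to the nearest even integer.
    """
    return round(age)


print(age_group(22.0), rounded_age(22.4))
```

Each passenger's `age(...)` value in the evidence would then come from one of these two functions.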
When trying out my first and second options, I noticed that learning the model became much slower (from ~30 seconds to several minutes). I used annotated disjunctions like so:
% First option
t(_)::age(0);t(_)::age(1);t(_)::age(2);t(_)::age(3).
% Second option
t(_)::age(0);t(_)::age(1);t(_)::age(2);t(_)::age(3);%more of the same%t(_)::age(80).
I also altered my initial model this way:
[...]
% person(ID, Sex, PassengerClass, Age) characterizes a titanic passenger
% with their passenger id, sex, passenger class and age
person(_ID, Sex, PassengerClass,Age) :-
sex(Sex),
pclass(PassengerClass),
age(Age).
t(_, Sex, PassengerClass, Age)::survived(PassengerId, 1); t(_, Sex, PassengerClass, Age)::survived(PassengerId, 0) :-
person(PassengerId, Sex, PassengerClass, Age).
However, the second option (having about 80 variants for the age predicate) was too slow to even use (no results after an hour).
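To get a feel for the size difference, I counted the distinct (Sex, PassengerClass, Age) combinations that the survived clause can be instantiated with. I'm assuming here, from my reading of the tutorial, that each combination gets its own learnable parameter; I don't know whether this count is what actually drives the running time. `parameter_combinations` is just a throwaway helper.

```python
from itertools import product


def parameter_combinations(sexes, pclasses, ages):
    """Count the distinct (sex, pclass, age) triples."""
    return sum(1 for _ in product(sexes, pclasses, ages))


sexes = range(2)        # 0 and 1
pclasses = range(1, 4)  # 1, 2 and 3

print(parameter_combinations(sexes, pclasses, range(4)))   # four age groups: 24
print(parameter_combinations(sexes, pclasses, range(81)))  # ages 0 to 80: 486
```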
My question here is the following: have I reached a technical limitation, or am I using ProbLog's features in a non-optimal way?
Also, is there any documentation on DC-ProbLog aside from the official thesis?