This is an updated version of our initial submission to the Fragile Families Challenge. It is mostly written in Python, though the imputation script is written in R.
See maltemoeser/ffc-R for an example using R only.
- Clone this repository
- Create a new Python virtual environment:
virtualenv --python=/usr/bin/python2.7 venv
- Activate the virtualenv:
source venv/bin/activate
- Install requirements:
pip install -r requirements.txt
- Add missing directories:
  - logs
  - predictions
  - data (put the FFC data files here)
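If you prefer to create the missing directories from Python, a minimal sketch could look like the following. The helper name `ensure_directories` and the `base_dir` parameter are illustrative assumptions and are not part of this repository; only the three directory names come from the list above.

```python
import os

# Directory names taken from the setup steps above.
REQUIRED_DIRS = ("logs", "predictions", "data")


def ensure_directories(base_dir="."):
    """Create any missing required directories under base_dir.

    Hypothetical helper (not part of the repository). Returns the list of
    paths that were newly created. Written without exist_ok so that it also
    runs under Python 2.7.
    """
    created = []
    for name in REQUIRED_DIRS:
        path = os.path.join(base_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created
```

Calling `ensure_directories()` from the repository root creates whichever of the three directories do not exist yet; calling it a second time creates nothing.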
To obtain the data set, you will need to apply to the FFC and agree to its terms of service. Note that the terms of service forbid us from providing the data to you directly.
You can use the imputation/impute.R script to create an imputed version of the FFC data.
Note that removing highly correlated columns requires at least 8 GB of free memory and might remove columns that are of interest to you if you follow a social scientist's approach.
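To illustrate why this step is memory-hungry, here is a rough Python sketch of one common way to drop highly correlated columns. It is not a port of imputation/impute.R, which may work differently; the function name `drop_highly_correlated` and the 0.95 threshold are assumptions chosen for illustration. The key point is that the full correlation matrix holds `n_features ** 2` values at 8 bytes each, so memory use grows quadratically with the number of columns.

```python
import numpy as np


def drop_highly_correlated(X, threshold=0.95):
    """Return a boolean keep-mask over the columns of X.

    Hypothetical helper: for every pair of columns whose absolute Pearson
    correlation exceeds `threshold`, the later column is dropped.
    """
    # Full feature-by-feature correlation matrix (n_features x n_features).
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n_features = corr.shape[0]
    keep = np.ones(n_features, dtype=bool)
    for i in range(n_features):
        if not keep[i]:
            continue  # column i was already dropped, so it cannot veto others
        # Mark every later column that is too strongly correlated with column i.
        too_similar = corr[i, i + 1:] > threshold
        keep[i + 1:] &= ~too_similar
    return keep


# Small synthetic example: the second column is almost a multiple of the first.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X_demo = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=100), b])
keep_mask = drop_highly_correlated(X_demo)
X_reduced = X_demo[:, keep_mask]
```

Which column of a correlated pair gets dropped depends purely on column order here, which is exactly how a variable you care about can disappear; inspect the mask before discarding anything.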
Make sure to check the FFC website for imputation scripts in other languages.
The code provides multiple options for feature selection, including Lasso and Elastic Net regression as well as recursive feature elimination.
We provide the necessary boolean masks in the featuremasks folder, but you are free to create your own.
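If you want to build your own mask, the following sketch shows the general idea using Lasso regression from scikit-learn. It is an illustration under stated assumptions, not the code used to produce the masks in featuremasks: the function name `lasso_feature_mask`, the `alpha` value, and the synthetic data are all made up for this example, and the on-disk format of the provided masks is not shown here.

```python
import numpy as np
from sklearn.linear_model import Lasso


def lasso_feature_mask(X, y, alpha=0.1):
    """Fit a Lasso model and return a boolean mask of the selected features.

    Hypothetical helper: a feature counts as selected when its fitted
    coefficient is non-zero.
    """
    model = Lasso(alpha=alpha)
    model.fit(X, y)
    return model.coef_ != 0


# Synthetic stand-in data (not the FFC data): only the first two columns drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

mask = lasso_feature_mask(X, y)
X_selected = X[:, mask]  # keep only the columns the mask marks as True
```

A larger `alpha` shrinks more coefficients to zero and therefore selects fewer features. The same boolean-indexing step (`X[:, mask]`) applies to a mask obtained in any other way, for example from Elastic Net or recursive feature elimination.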
Check out data.py for a convenient way to load the data.
You can modify existing classification and regression methods or add your own in predictions.py.
To use XGBoost, follow the official XGBoost installation instructions instead of installing it with pip.