This is a collaborative predictive modeling project built on the ballet framework.
The Fragile Families Challenge (FFC) is a recent attempt to better connect the social science research community to new tools in data science and machine learning. This challenge aimed to spur the development of predictive models for life outcomes from data collected as part of the Fragile Families and Child Wellbeing Study (FFCWS), which collects detailed longitudinal records on a set of disadvantaged children and their families. Organizers released anonymized and merged data on a set of 4,242 families with data collected from the birth of the child until age 9. Participants in the challenge were then tasked with predicting six life outcomes of the child or family at age 15: child grade point average, child grit, household eviction, household material hardship, primary caregiver layoff, and primary caregiver participation in job training. The FFC was run over a four-month period in 2017 and received 160 submissions from social scientists, machine learning practitioners, students, and others.
In this project, we ask: by collaborating rather than competing, can we develop impactful solutions to the FFC? Participants in the FFC were competing against each other to produce the best-performing models, at the expense of collaboration across teams.
Your task is to create and submit feature definitions to our shared project that help us in predicting these key life outcomes.
Are you interested in joining the collaboration?
- Apply for access to the dataset and then register yourself with us.
- Read/skim the Ballet Contributor Guide.
- Read/skim the Ballet Feature Engineering Guide.
- Learn more about the Fragile Families dataset.
- Read/skim the data documentation.
- Skim additional resources.
- Browse the currently accepted features in the contributed features directory (`src/fragile_families/features/contrib`).
- Launch an interactive Jupyter Lab session to hack on this repository:
The data underlying the Fragile Families Challenge, which we are using in this collaboration, is sensitive and requires registration to access.
If you are already authorized to access the data, you can look over Data Documentation below.
You must apply to Princeton's Office of Population Research (OPR) for access to the Fragile Families Challenge dataset.
The Fragile Families Challenge dataset contains sensitive information. You should keep this dataset secure and protect the privacy of the individuals, and abide by the data access agreement which requires you not to share your copy of the dataset.
Once you have been granted access to the data by Princeton OPR (or if you already had access to the data from prior research), you must register with us to join the collaboration. (This is step 7 in the instructions above, so don't repeat it if you already filled out the form.)
Your AWS access key ID/secret will be automatically detected from standard locations (such as environment variables or credentials files).
If you are working in a notebook without access to other methods of configuration (such as using Assemblé) you can do the following in a code cell:
```python
import os
os.environ['AWS_ACCESS_KEY_ID'] = 'your access key id'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'your secret access key'
```
Alternatively, if you are working locally, you can create a new AWS profile in `~/.aws/credentials`:

```ini
[bff]
aws_access_key_id = your-access-key-id
aws_secret_access_key = your-secret-access-key
```
Then you can use this profile when you are developing features for this project by exporting the environment variable `AWS_PROFILE=bff` (or using the `os.environ` approach shown above).
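For instance, in a notebook you could select the profile before any data-loading code runs. A minimal sketch (assuming you named the profile `bff` as above):

```python
import os

# Select the AWS profile defined in ~/.aws/credentials.
# Equivalent to `export AWS_PROFILE=bff` in your shell; must run
# before any code that reads AWS credentials.
os.environ['AWS_PROFILE'] = 'bff'
```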
The full challenge dataset contains a "background" table of 4,242 rows (one per child in the study) and 12,942 columns.
The "train" split contains 2,121 rows (half of the background set) and 7 additional columns:

- `challengeID`: A unique numeric identifier for each child.
- Six outcome variables (each variable name links to a blog post about that variable):
  - Continuous variables: `gpa`, `grit`, `materialHardship`
  - Binary variables: `eviction`, `layoff`, `jobTraining`

These six outcome variables are the outcomes that we are trying to predict.
💡 For the purpose of validating feature contributions, we will focus on the `materialHardship` prediction problem. However, we want our feature definitions to be useful for all six prediction problems.
You can load the train split as follows:

```python
from ballet import b

X_df, y_df = b.api.load_data()
```
The other half of the rows are reserved for the "leaderboard" and "test" splits. We will use the leaderboard split to validate feature contributions. We will not look at the test split until the end of the collaboration.
If you'd like, you can load the full "background" dataset which includes the rows from the train, leaderboard, and test splits combined, but excluding the target columns.
```python
from fragile_families.load_data import load_background

background_df = load_background()
```
(This section is adapted from here)
To use the data, it may be useful to know something about what each variable (column) represents. (See also the full documentation.)
The background variables were collected in 5 waves.
- Wave 1: Collected in the hospital at the child's birth.
- Wave 2: Collected at approximately child age 1.
- Wave 3: Collected at approximately child age 3.
- Wave 4: Collected at approximately child age 5.
- Wave 5: Collected at approximately child age 9.
Note that wave numbers are not the same as child ages. The variable names and survey documentation are organized by wave number.
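The wave-to-age mapping above can be captured in a small lookup helper. This is an illustrative snippet for your own analysis code, not part of the project API:

```python
# Approximate child age (in years) at each data collection wave,
# per the list above. Wave 1 was collected at the child's birth.
WAVE_TO_AGE = {1: 0, 2: 1, 3: 3, 4: 5, 5: 9}

def approx_child_age(wave: int) -> int:
    """Return the approximate child age for a survey wave (1-5)."""
    return WAVE_TO_AGE[wave]
```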
Predictor variables are identified by a prefix and a question number. The prefix indicates the survey in which a question was collected. This is useful because the documentation is organized by survey. For instance, the variable `m1a4` refers to the **m**other interview in wave **1**, question **a4**.
- The prefix `c` in front of any variable indicates variables constructed from other responses. For instance, `cm4b_age` is **c**onstructed from the **m**other wave **4** interview, and captures the child's age (**b**aby's **a**ge).
- `m1`, `m2`, `m3`, `m4`, `m5`: Questions asked of the child's **m**other in waves **1** through **5**.
- `f1`, `f2`, `f3`, `f4`, `f5`: Questions asked of the child's **f**ather in waves **1** through **5**.
- `hv3`, `hv4`, `hv5`: Questions asked in the **h**ome **v**isit in waves **3**, **4**, and **5**.
- `p5`: Questions asked of the **p**rimary caregiver in wave **5**.
- `k5`: Questions asked of the child (**k**id) in wave **5**.
- `ffcc`: Questions asked in various **c**hild **c**are provider surveys in wave 3.
- `kind`: Questions asked of the **kind**ergarten teacher in wave 4.
- `t5`: Questions asked of the **t**eacher in wave **5**.
- `n5`: Questions asked of the **n**on-parental caregiver in wave **5**.
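The naming convention for the single-letter prefixes can be decoded mechanically. Below is a hypothetical helper (not part of the project API) that splits a variable name into respondent, wave, and question; it covers only the `prefix + wave digit + question` pattern, so names like `ffcc_*` and `kind_*` are not handled:

```python
import re

# Respondent codes per the prefix list above. `cm`/`cf` cover the most
# common constructed-variable prefixes; this table is illustrative only.
RESPONDENTS = {
    'cm': 'constructed, mother interview',
    'cf': 'constructed, father interview',
    'm': 'mother interview',
    'f': 'father interview',
    'hv': 'home visit',
    'p': 'primary caregiver',
    'k': 'child (kid)',
    't': 'teacher',
    'n': 'non-parental caregiver',
}

def parse_variable(name):
    """Split a name like 'm1a4' into (respondent, wave, question).

    Returns None for names that don't follow the prefix+wave pattern.
    """
    match = re.match(r'([a-z]+)(\d)(.*)', name)
    if not match:
        return None
    prefix, wave, question = match.groups()
    if prefix not in RESPONDENTS:
        return None
    return RESPONDENTS[prefix], int(wave), question
```

For example, `parse_variable('m1a4')` yields the mother interview, wave 1, question `a4`.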
We expose the full machine-readable codebook, which you can use during feature development.

```python
from fragile_families.load_data import load_codebook

codebook_df = load_codebook()
```
We also wrap the ffmetadata API for our own use in feature development. The metadata API returns more detailed metadata than is available in the codebook. See here for details on the filter operations and see here for an explanation of the resulting metadata.
```python
import fragile_families.analysis.metadata as metadata

metadata.info('m1a4')
metadata.search({'name': 'label', 'op': 'like', 'val': '%school%'})
# metadata.searchinfo combines the two methods
```
The metadata search shows results from the most up-to-date metadata available. In some cases, this reflects changes since the 2017 challenge, so variables that appear in the metadata search may not appear in the dataset and vice versa. When a variable has been renamed, its `old_name` attribute is set to the previous name. For example, `kind_a2` was renamed to `t4a2`.
If the `metadata.info` method receives an error from the metadata API due to a missing variable, it will automatically retry by first searching for a variable with that old name and then getting info for that variable. You can disable this behavior with `retry_with_old_name=False`.
A feature development partition describes a set of inputs for a data scientist to focus on in engineering features in this project. For example, the set of all questions asked during Wave 1 of the survey is a partition.
If you'd like to focus your effort in feature development, check out the existing partitions, which are tracked in issues under the `feature-partition` label. Comment on the issue with the response `me` to "claim" it. It's okay for multiple people to claim one partition, but in that case, make sure you stay in touch directly or via the project chat, or follow each other's accepted (and rejected) feature contributions.
If you'd like to suggest a new partition, see #31.
In this project, feature contributions are validated to ensure that they are positively contributing to our shared feature engineering pipeline. One part of this validation is called "feature acceptance" validation: does the performance of our ML pipeline improve when the new feature is added? We run the feature through two feature accepters: the `MutualInformationAccepter` and the `VarianceThresholdAccepter`. Based on the parameters set in our `ballet.yml` configuration file, a feature definition is accepted if it meets two criteria:
- the variance of each of its feature columns is greater than a threshold (set to 0.05), i.e. `Var(z_i) > 0.05 ∀ z_i ∈ z`, where the `z_i` are the columns of feature `z`.
- the mutual information of the feature values with the target on the held-out leaderboard dataset split is greater than a threshold (set to 0.001), i.e. `I(z; y) > 0.001`.
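The two criteria can be sketched in plain Python. This is an illustration of the logic, not the actual accepter implementations: the real `MutualInformationAccepter` uses a continuous mutual information estimator on the leaderboard split, while the sketch below uses a simple plug-in estimate for discrete values:

```python
from collections import Counter
from math import log
from statistics import pvariance

# Thresholds per the ballet.yml configuration described above.
VARIANCE_THRESHOLD = 0.05
MI_THRESHOLD = 0.001

def variance_ok(feature_columns):
    """VarianceThresholdAccepter-style check: every column of the
    feature must have variance above the threshold."""
    return all(pvariance(col) > VARIANCE_THRESHOLD for col in feature_columns)

def mutual_information(xs, ys):
    """Plug-in mutual information estimate for discrete values:
    I(X; Y) = sum p(x, y) * log(p(x, y) / (p(x) * p(y)))."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

A constant feature column fails the variance check, and a feature column that is independent of the target contributes (near-)zero mutual information, so both would be rejected.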
Want to chat about the project, compare ideas, or debug features with other collaborators? Join either of our two chat rooms:
If you think a question might have been answered before, check out the Ballet FAQ.
If you think you found a bug with Ballet, please open an issue and mention that you are working on the predict-life-outcomes project.
- FF Data and Documentation
- FF metadata homepage
- To see detailed metadata for a variable, you can use the variables endpoint in your browser, like so (just replace the variable name): http://metadata.fragilefamilies.princeton.edu/variables/cf1cohm
- metadata_app
- ffmetadata-py
- Machine readable codebook
- Data dictionary in Excel
- Missing data in the challenge
- Outcomes