This repo contains DoubleML-Serverless, a prototype implementation of distributed double machine learning on a serverless infrastructure using AWS Lambda. A detailed discussion of this prototype can be found in the paper "Distributed Double Machine Learning with a Serverless Architecture" (Kurz, 2021). DoubleML-Serverless extends the Python package DoubleML for serverless cloud computing. DoubleML is available via PyPI https://pypi.org/project/DoubleML and on GitHub https://github.com/DoubleML/doubleml-for-py. The Python package DoubleML was introduced in "DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python" (Bach et al., 2022), and detailed documentation and a user guide for the package are available at https://docs.doubleml.org.
To install, download the latest source code from GitHub via
git clone git@github.com:DoubleML/doubleml-serverless.git
cd doubleml-serverless
Then install the package from source using pip in editable mode.
pip install --editable .
As an alternative to installing from source, released versions of the DoubleML-Serverless package are available as .whl files from GitHub Releases.
After downloading the wheel, the package can be installed with pip (replace XXX with the downloaded package version).
pip install -U DoubleML-Serverless-XXX-py3-none-any.whl
To use AWS Lambda for estimating double machine learning models, a deployment in your AWS account is necessary. The corresponding serverless application consists of the following components:
- An AWS Lambda function called LambdaCVPredict (the source code is taken from this repository https://github.com/DoubleML/doubleml-serverless/blob/main/aws_lambda_app/lambda_functions/cv_predict.py).
- A layer providing the Python libraries scikit-learn, pandas and numpy together with their dependencies.
- An S3 bucket for the data transfer (it can optionally be created, or an existing bucket is used).
- A role for the execution of the Lambda function LambdaCVPredict, which consists of the AWS-managed AWSLambdaBasicExecutionRole policy plus read access to the S3 bucket for data transfer.
There are two options for deployment:
1. A version of DoubleML-Serverless is available in the AWS Serverless Application Repository: https://serverlessrepo.aws.amazon.com/applications/eu-central-1/839779594349/doubleml-serverless. It can be deployed by clicking on the Deploy button.
2. The second option for deployment is based on the AWS Serverless Application Model (AWS SAM).
2.1 Set up the AWS SAM CLI as described here: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-getting-started.html
2.2 To deploy the application, use the following commands (for more information see https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html)
cd aws_lambda_app
sam build
sam deploy --guided
Estimating a Partially Linear Regression Model with Double Machine Learning and Serverless Scaling Using AWS Lambda
To demonstrate the functionality of DoubleML-Serverless, we revisit the Pennsylvania Reemployment Bonus experiment and estimate the effect of providing a cash bonus on the unemployment duration, as studied in Chernozhukov et al. (2018). This example is also discussed in the paper accompanying the DoubleML-Serverless package (Kurz, 2021).
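For orientation, the partially linear regression model in the formulation of Chernozhukov et al. (2018) can be written as below. The symbols are the usual textbook notation and not identifiers from the package; in the bonus example, Y corresponds to the column inuidur1, D to the treatment column tg, and X to the remaining covariates.

```latex
\begin{aligned}
Y &= D\,\theta_0 + g_0(X) + \zeta, & \mathbb{E}[\zeta \mid D, X] &= 0, \\
D &= m_0(X) + V, & \mathbb{E}[V \mid X] &= 0.
\end{aligned}
```

Here \theta_0 is the effect of interest, while g_0 and m_0 are nuisance functions, which is why the example passes two learners (ml_g and ml_m) to the model class.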
We first load the data using functionalities from the DoubleML package.
from doubleml.datasets import fetch_bonus
df_bonus = fetch_bonus('DataFrame')
The class DoubleMLDataS3 serves as data-backend for DoubleML-Serverless model classes.
It inherits from the DoubleML class DoubleMLData.
We initialize an object of the class DoubleMLDataS3 for the bonus data and upload it to the S3 bucket doubleml-serverless-data, which is used for the data transfer to AWS Lambda.
from doubleml_serverless import DoubleMLDataS3
dml_data_bonus = DoubleMLDataS3(
    'doubleml-serverless-data', 'bonus_data.csv',
    df_bonus,
    y_col='inuidur1',
    d_cols='tg',
    x_cols=['female', 'black', 'othrace',
            'dep1', 'dep2', 'q2', 'q3',
            'q4', 'q5', 'q6', 'agelt35',
            'agegt54', 'durable', 'lusd', 'husd'])
dml_data_bonus.store_and_upload_to_s3()
To estimate the nuisance functions we use a random forest regressor which averages over 500 trees. We further apply repeated cross-fitting with 5 folds and 100 repetitions/splits.
from doubleml_serverless import DoubleMLPLRServerless
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
ml = RandomForestRegressor(n_estimators=500)
ml_g = clone(ml)
ml_m = clone(ml)
dml_lambda_plr_bonus = DoubleMLPLRServerless(
    'LambdaCVPredict', 'eu-central-1',
    dml_data_bonus, ml_g, ml_m,
    n_folds=5, n_rep=100)
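To give a feeling for why repeated cross-fitting lends itself to parallel execution, the following minimal sketch enumerates the independent fit-and-predict tasks implied by 5 folds, 100 repetitions and two nuisance learners. It uses only the standard library; the helper names draw_fold_splits and enumerate_tasks and the tiny sample size are hypothetical and chosen for illustration. It does not reproduce how DoubleML-Serverless actually draws its sample splits or schedules Lambda invocations.

```python
import random


def draw_fold_splits(n_obs, n_folds, n_rep, seed=0):
    """Draw n_rep random partitions of range(n_obs) into n_folds test folds."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_rep):
        idx = list(range(n_obs))
        rng.shuffle(idx)
        # Fold k takes every n_folds-th shuffled index, so fold sizes
        # differ by at most one observation.
        folds = [sorted(idx[k::n_folds]) for k in range(n_folds)]
        splits.append(folds)
    return splits


def enumerate_tasks(splits, learners=("ml_g", "ml_m")):
    """List one task per (repetition, fold, learner) combination."""
    tasks = []
    for i_rep, folds in enumerate(splits):
        for i_fold, test_idx in enumerate(folds):
            for learner in learners:
                tasks.append((i_rep, i_fold, learner, len(test_idx)))
    return tasks


# n_obs=20 is an illustrative toy size, not the size of the bonus data.
splits = draw_fold_splits(n_obs=20, n_folds=5, n_rep=100)
tasks = enumerate_tasks(splits)
print(len(tasks))  # 5 folds * 100 repetitions * 2 learners
```

Each task only needs the data, a learner and one train/test split, so the tasks do not depend on each other and could in principle run concurrently.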
To estimate the model locally we can call dml_lambda_plr_bonus.fit().
Estimation on AWS Lambda is achieved via dml_lambda_plr_bonus.fit_aws_lambda().
Note that you will be charged for all used resources in the AWS account you deployed the serverless application to.
dml_lambda_plr_bonus.fit_aws_lambda()
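Since the resources used are billed, a rough back-of-the-envelope estimate can be useful before running large jobs. The sketch below assumes the general AWS Lambda billing model of compute time in GB-seconds plus a per-request charge. The function name estimate_lambda_cost is hypothetical, and all numbers are placeholders rather than measurements or current AWS prices; check the AWS pricing page for your region before relying on such an estimate.

```python
import math


def estimate_lambda_cost(n_invocations, avg_duration_s, memory_mb,
                         price_per_gb_second, price_per_request=0.0):
    """Rough cost estimate: billed GB-seconds plus a per-request charge."""
    gb_seconds = n_invocations * avg_duration_s * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second + n_invocations * price_per_request


# Placeholder inputs (assumptions for illustration only).
cost = estimate_lambda_cost(
    n_invocations=1000,       # e.g. one invocation per fit task
    avg_duration_s=10.0,      # assumed average runtime per invocation
    memory_mb=1024,           # assumed configured memory
    price_per_gb_second=0.00002,
)
print(f"{cost:.4f}")
```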
A summary of the estimation result is available via the property dml_lambda_plr_bonus.summary.
Some metrics about the estimation on AWS Lambda can be obtained via the property dml_lambda_plr_bonus.aws_lambda_metrics.
If you use the DoubleML-Serverless package, a citation is highly appreciated:
Kurz, M. S. (2021). Distributed Double Machine Learning with a Serverless Architecture. In Companion of the ACM/SPEC International Conference on Performance Engineering (ICPE '21). Association for Computing Machinery, New York, NY, USA, 27–33. doi:10.1145/3447545.3451181.
BibTeX entry:
@inproceedings{kurz2021DoublemlServerless,
author = {Kurz, Malte S.},
title = {Distributed Double Machine Learning with a Serverless Architecture},
year = {2021},
isbn = {9781450383318},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3447545.3451181},
doi = {10.1145/3447545.3451181},
abstract = {This paper explores serverless cloud computing for double machine learning. Being based on repeated cross-fitting, double machine learning is particularly well suited to exploit the high level of parallelism achievable with serverless computing. It allows to get fast on-demand estimations without additional cloud maintenance effort. We provide a prototype Python implementation DoubleML-Serverless for the estimation of double machine learning models with the serverless computing platform AWS Lambda and demonstrate its utility with a case study analyzing estimation times and costs.},
booktitle = {Companion of the ACM/SPEC International Conference on Performance Engineering},
pages = {27--33},
numpages = {7},
keywords = {machine learning, causal machine learning, serverless computing, distributed computing, AWS Lambda, function-as-a-service (FAAS)},
location = {Virtual Event, France},
series = {ICPE '21}
}
Funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) is acknowledged – Project Number 431701914.
Bach, P., Chernozhukov, V., Kurz, M. S., and Spindler, M. (2022). DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python. Journal of Machine Learning Research, 23(53): 1-6. https://www.jmlr.org/papers/v23/21-0862.html.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68. doi:10.1111/ectj.12097.
Kurz, M. S. (2021). Distributed Double Machine Learning with a Serverless Architecture. In Companion of the ACM/SPEC International Conference on Performance Engineering (ICPE '21). Association for Computing Machinery, New York, NY, USA, 27–33. doi:10.1145/3447545.3451181.