This project consists of the development, deployment and monitoring of machine learning models following MLOps best practices:
- Experiment tracking
- Model registry
- Workflow orchestration
- Containerization
- Cloud computing
- Infrastructure as Code
- Continuous integration, continuous deployment and continuous training
- Software engineering practices: code quality, code tests, pre-commit
- Model monitoring
The application will be able to predict the duration of a taxi trip in the city of Chicago.
- mlops-taxi-prediction
- Dataset
- Technologies and tools
- 1. Requirements and setup (in the developer machine)
- 3. Install dependencies
- 4. Install pre-commit hooks
- 5. Infrastructure
- 6. ML project lifecycle: Developing
- 7. ML project lifecycle: ML pipeline
- 8. ML project lifecycle: Tests and CI
- 9. ML project lifecycle: CD - Model deployment
- 10. Front-end app
- 11. Monitoring
- Notes
- Bugs
- ToDo
- Improvements
- Useful commands and snippets
The dataset used is deliberately simple (the important matter here is MLOps): the Chicago Taxi Trips dataset, used to predict the trip duration. https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew
The dataset is not available as files, so the data has to be requested through the API, which is also a more flexible approach. In this project I have limited the amount of data for each month, since the accuracy of the models is not the priority. You can easily change this if you wish.
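For illustration, a minimal sketch of such a capped monthly download, assuming the dataset's public SODA endpoint and its trip_start_timestamp field (the project's downloader.py is the authoritative implementation):

```python
from datetime import date

import pandas as pd
import requests

API_URL = "https://data.cityofchicago.org/resource/wrvz-psew.json"  # SODA endpoint

def download_month(year: int, month: int, limit: int = 10_000) -> pd.DataFrame:
    """Fetch at most `limit` trips whose start time falls in the given month."""
    start = date(year, month, 1)
    end = date(year + (month == 12), month % 12 + 1, 1)
    params = {
        "$where": f"trip_start_timestamp >= '{start}T00:00:00'"
                  f" AND trip_start_timestamp < '{end}T00:00:00'",
        "$limit": limit,
    }
    response = requests.get(API_URL, params=params, timeout=60)
    response.raise_for_status()
    return pd.DataFrame(response.json())

df = download_month(2022, 2)  # a capped sample of February 2022
```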
The goal is to implement a maturity level between 3 and 4 according to Microsoft (https://docs.microsoft.com/es-es/azure/architecture/example-scenario/mlops/mlops-maturity-model). In future projects, I will incorporate data version control and a feature store.
- Experiment tracking: MLflow
- Model registry: MLflow
- Workflow orchestration: Prefect
- Containerization: Docker
- Cloud computing: AWS
- Model monitoring: Evidently
- Software engineering practices: code quality (isort, black, pylint), testing (pytest), pre-commit
- Continuous integration, continuous deployment: GitHub Actions
- Continuous training: Evidently + Prefect
- Infrastructure as Code: Terraform
I have used IaC to deploy a development EC2 machine, with the required setup including dependencies. So you only need to download Terraform to your machine and build the infrastructure, including the development machine.
- Terraform will be used to build the infrastructure as code. Terraform requires a state file to manage the infrastructure. This may be local or remote. In this case, we will use a remote state file, stored in S3, so the S3 bucket has to be created manually.
- Download the Terraform executable from https://www.terraform.io/downloads to the ./infrastructure/deployment directory.
  - Windows: just save it to the ./infrastructure/deployment directory.
  - Linux: follow the instructions on https://www.terraform.io/downloads
- Create the Terraform backend bucket to keep the Terraform state. Note: bucket names shall be unique; choose your location accordingly. Note: the bucket is private, but anyone with appropriate permissions can grant public access to objects.
```bash
aws s3api create-bucket --bucket [your bucket name] --create-bucket-configuration LocationConstraint=eu-west-1
```
- Follow the instructions in the Infrastructure section. Do not test the infrastructure yet.
- Upload AWS credentials to your GitHub repo: you will need to upload the AWS credentials to your GitHub repo in order to run the CI/CD workflows. Please follow the instructions in this video: https://youtu.be/xkTWF9c33mU?list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK&t=1036
- Open an SSH connection to the instance: you will find the SSH key under the infrastructure folder with the name ec2_ssh_key_dev-chicago_taxi.pem. After you open the SSH session, you can test the infrastructure as explained in the Infrastructure section.
However, if you wish to set up a machine by yourself, please install these tools:
- Python 3.9 (recommended to install Anaconda: https://www.anaconda.com/)
- Docker
- docker-compose
- Git (if you are using Windows: https://gitforwindows.org/)
- Git Bash (only if you are using Windows: https://gitforwindows.org/)
- GitHub account with AWS secrets set up in the repository (the one forked by you, see below), to be used in CI/CD
- AWS account with permissions to create infrastructure
- AWS access key (id and secret)
- AWS cli (command line interface)
- Download and install the AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
  - Windows:
```bash
msiexec.exe /i https://awscli.amazonaws.com/AWSCLIV2.msi
aws --version
aws-cli/2.4.24 Python/3.8.8 Windows/10 exe/AMD64 prompt/off
```
  - Linux (unzip required):
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
```
- Configure your AWS account with the AWS CLI:
```bash
aws configure
AWS Access Key ID [None]: [your aws key id]
AWS Secret Access Key [None]: [your aws secret access key]
Default region name [None]: eu-west-1
Default output format [None]:
```
Check with aws sts get-caller-identity.
- Follow the instructions in the Terraform section. Do not forget to create the Terraform backend bucket.
- Fork and clone the repo, and prepare the AWS credentials to work with GitHub:
Go to https://github.com/MarcosMJD/mlops-chicago-taxi and fork the repo. Then clone the forked repo to your machine:
```bash
git clone https://github.com/MarcosMJD/mlops-chicago-taxi
```
Follow the instructions in the "Upload AWS credentials to your github repo" section.
- Go to the sources directory and run:
```bash
pip install --upgrade pip
pip install pipenv
pipenv install --dev
pipenv shell
```
Note that the Pipfile lists the specific versions in use.
Go to the root directory of the repo and run:
```bash
pre-commit install
git add .pre-commit-config.yaml
```
The last step is needed because the .git folder is not cloned and pre-commit hooks live there.
For the sake of simplicity, only one set of infrastructure will be created: the staging infrastructure. The production infrastructure may be created in the same way (currently not implemented) by changing the resource names. Hence, the CI/CD workflows run against the staging infrastructure, triggered by pull requests or pushes to the develop branch.
The following picture depicts the architecture:
Note: blue color means AWS service.
Common layer:
- S3 bucket will store both Prefect artifacts and MLflow artifacts (including the models for each experiment).
- RDS Postgres database is used by MLflow to store the metadata of experiments.
Development layer:
- MLflow server runs on its own EC2 instance.
- Development machine.
- Used to perform all tasks explained in this document.
- Used to run the Prefect agent for the deployment of the training pipeline with Prefect.
- Used to run batch monitoring with Evidently.
Training pipeline:
- Prefect Orion runs on its own EC2 instance.
Production layer: A serverless solution is used:
- Lambda function, launched from the Lambda image stored in the ECR registry, that loads the model from S3 and makes predictions based on requests coming from the API gateway.
- API gateway provides the API for the client application.
- ECR stores the Lambda Docker image that the Lambda service uses to launch the container.
Important
It turns out that Windows uses CRLF line endings in text files, but Linux uses LF.
If main.tf has CRLF line endings, the CI/CD workflows will recreate the instances, because CI/CD converts the endings to LF and Terraform then sees that user_data has changed. To avoid this behavior, this repo has already been configured with git config core.autocrlf false.
Please note that the editor has to be set up to use LF in the main.tf file (e.g. in VS Code, click on CRLF in the status bar to change it to LF).
If you wish to set up this behavior for all repos (not recommended), use git config --global core.autocrlf false.
The same applies to the files that trigger the building of the Lambda image: chicago_taxi_prediction.py, Dockerfile and model_service.py. Use LF in these files as well.
On Windows, use Git Bash for this step.
- Go to the infrastructure\deployment directory, edit stg.tfvars and modify the variable s3_bucket_name_suffix. This variable is used to create the name of the S3 bucket, so the suffix will avoid conflicts with existing buckets in your AWS zone.
- Edit the main.tf file in the infrastructure directory and modify the backend location for Terraform (the name of the S3 bucket you have created manually before):
```hcl
backend "s3" {
  bucket = "chicago-taxi-tfstate-mmjd"  # <- Use your own bucket name
}
```
- Build the infrastructure:
```bash
cd infrastructure
./terraform.exe init
./terraform.exe plan --var-file=stg.tfvars
./terraform.exe apply --var-file=stg.tfvars
yes
```
Terraform will automatically create the Lambda image and upload it to ECR (before creating the Lambda function). For the creation of the Lambda image, the files in the production folder will be used.
In order to have the Terraform output variables exported in the shell for later use, run the following command in the root directory of the repo.
In Git Bash (Windows):
```bash
./setup_dev_windows_gitbash.sh
```
In Linux:
```bash
./setup_dev_linux.sh
```
Please note that this script also configures the Prefect libraries to connect to the server API with the following command:
```bash
prefect config set PREFECT_API_URL="http://<prefect-external-ip>:8080/api"
```
At this point, the Lambda function loads a dummy model, since we have not run any experiment.
Go to the sources/tests directory and run:
```bash
python ./test_api_gateway_lambda.py
```
It should answer with a prediction of 40 (the sum of the pick-up and drop-off location ids, 8 and 32 respectively).
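That value comes from the dummy model; a hypothetical illustration of its behavior (the real implementation lives in model_service.py):

```python
# The dummy model simply adds the two location ids, so 8 + 32 = 40
class DummyModel:
    def predict(self, features: dict) -> float:
        return float(features["pickup_community_area"]) + float(
            features["dropoff_community_area"]
        )

assert DummyModel().predict(
    {"pickup_community_area": "8.0", "dropoff_community_area": "32.0"}
) == 40.0
```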
The data scientist will perform some EDA, feature engineering, and train and evaluate some models.
In this project, the performance of the model is not as important as the process that generates it.
This is a regression problem. For the sake of simplicity, the features used are pickup_community_area and dropoff_community_area, and the target is trip_seconds, with some preprocessing such as converting to minutes.
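A minimal sketch of that preprocessing, under the assumption that rows with missing values are dropped (the real logic lives in preprocessor.py):

```python
import pandas as pd

FEATURES = ["pickup_community_area", "dropoff_community_area"]
TARGET = "trip_seconds"

def preprocess(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    df = df.dropna(subset=FEATURES + [TARGET]).copy()
    X = df[FEATURES].astype(str)          # categorical ids, later fed to a DictVectorizer
    y = df[TARGET].astype(float) / 60.0   # convert seconds to minutes
    return X, y
```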
Go to the sources/development directory and run:
```bash
jupyter notebook
```
In your browser, load model_development.ipynb and run the notebook.
Data is downloaded from the Chicago Taxi Dataset API and stored in the ./data directory in csv and parquet formats.
Data is limited to 2 days so that everything runs faster. You may change the dates in the notebook.
The training month is February 2022 and the validation month is March 2022.
The models used are [GradientBoostingRegressor, ExtraTreesRegressor, LinearRegression, Lasso, Ridge, LinearSVR].
During this phase, MLflow is used for experiment tracking. Check your experiments and models in the MLflow server URL: http://<mlflow_external_ip>:8080
The best three models are registered in the MLflow model registry, and the best one is promoted to the Production stage.
Please note that to deploy the model we do not use the model registry, but take the model directly from the S3 bucket where models are stored as artifacts for each experiment. MLflow autolog is used.
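For illustration, a hedged sketch of this registration step, assuming an experiment named "chicago-taxi", a validation metric logged as "rmse" and a registered-model name "chicago-taxi-regressor" (all illustrative; the notebook's actual code may differ, and MLFLOW_TRACKING_URI must point to the MLflow server):

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name("chicago-taxi")
best_runs = client.search_runs(
    [experiment.experiment_id], order_by=["metrics.rmse ASC"], max_results=3
)
# Register the three best runs in the model registry
versions = [
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "chicago-taxi-regressor")
    for run in best_runs
]
# Promote the best of the three to the Production stage
client.transition_model_version_stage(
    name="chicago-taxi-regressor", version=versions[0].version, stage="Production"
)
```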
From the Jupyter notebook, the following Python modules/scripts are created:
- downloader.py: downloads data from the Chicago Taxi API and stores the datasets in csv format.
- model.py: manages the model (actually an sklearn pipeline with a DictVectorizer and the model): creates the model, preprocesses datasets, fits and predicts.
- preprocessor.py: performs the preprocessing of the datasets and also generates the parquet format.
- trainning_pipeline.py: implements the training pipeline as a Prefect @flow with the following tasks:
  - preprocess_datasets
  - train_models
  - register_models
Basically, the pipeline performs the same tasks as the Jupyter notebook, so MLflow is used and the best models are registered. A skeleton of the flow's structure is shown below.
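A minimal skeleton of that structure (Prefect 2.x; the real tasks in trainning_pipeline.py take parameters and pass data between each other):

```python
from prefect import flow, task

@task
def preprocess_datasets():
    ...  # download and preprocess the datasets (details elided)

@task
def train_models():
    ...  # train the candidate models, tracking them with MLflow

@task
def register_models():
    ...  # register the best models in the MLflow registry

@flow(name="main-flow")
def main_flow():
    preprocess_datasets()
    train_models()
    register_models()

if __name__ == "__main__":
    main_flow()
```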
The training pipeline may be executed manually or as part of a Prefect deployment.
Go to sources/development and run:
```bash
python trainning_pipeline.py
```
Check your flow run in the Prefect server URL: http://<prefect-external-ip>:8080
Make the Prefect deployment:
Note: this does not work on Windows. For some reason, when executing on Windows, the subfolders (/development/) are not created in S3; instead, files are created whose names contain the directory ('\development__init__py'). Also, the agent will try to get the files from the absolute path where prefect_deployment.py resides (the agent does not have this path, since it exists only on the developer machine).
On Linux, everything works fine.
Go to sources/ and run:
```bash
python ./development/prefect_deployment.py
```
This script will create, in the Prefect Orion server, the storage block (a folder in the S3 bucket), the queue and the deployment itself, and will upload the artifacts to the S3 storage block.
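A hedged sketch of what the script does, using the Prefect 2.x API (the bucket path is illustrative; the block, queue and deployment names follow the ones used elsewhere in this document):

```python
from prefect.deployments import Deployment
from prefect.filesystems import S3

from development.trainning_pipeline import main_flow  # import path assumes running from sources/

# Storage block: a folder in the S3 bucket (bucket path is illustrative)
block = S3(bucket_path="stg-chicago-taxi-mmjd/prefect")
block.save("chicago-taxi", overwrite=True)

# Build and register the deployment; this uploads the flow's files to the block
deployment = Deployment.build_from_flow(
    flow=main_flow,
    name="chicago-taxi-deployment",
    work_queue_name="chicago-taxi",
    storage=S3.load("chicago-taxi"),
)
deployment.apply()
```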
Go to Prefect server url and check the deployment, block and queue.
Run the training pipeline via the agent:
Firstly, start the agent:
```bash
prefect agent start chicago-taxi
```
Secondly, to launch the run of the deployment, in another shell (e.g. Git Bash):
- Go to the sources directory and execute pipenv shell
- Run prefect config set PREFECT_API_URL="http://<prefect-external-ip>:8080/api" <- use the external IP of the Prefect server built in the IaC phase
- Run prefect deployment run main-flow/chicago-taxi-deployment
Check that the agent executes the flow.
Model deployment will be part of the CI/CD.
The modules/scripts under the production folder implement the ML prediction server: the model is loaded and predictions are served for requests coming from client apps through the API gateway to the Lambda function. In other words, the predictions are made by the Lambda function.
- chicago_taxi_prediction.py:
  - Initializes the ModelService class.
  - Implements the lambda_handler (the function called by the Lambda service), which uses ModelService to execute predictions:
    - Receives the http request (coming through the API gateway to the Lambda function)
    - Extracts the features from the body of the http request
    - Calls ModelService to make the prediction
    - Creates and sends back the http response with the prediction object.
- model_service.py (class ModelService):
  - Loads the model, whether from a local path, from S3, or a dummy model.
  - Makes predictions.
The Lambda function parameters are passed as environment variables during the creation (Terraform) or update (CD, see below) of the Lambda function, for instance:
```bash
MLFLOW_MODEL_LOCATION="s3"
MLFLOW_EXPERIMENT_ID="4"
MLFLOW_RUN_ID="2f45030a9a4d41888b811884d5e71de1"
MLFLOW_BUCKET_NAME="stg-chicago-taxi-mmjd"
MLFLOW_BUCKET_FOLDER="mlflow"
```
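A condensed, hypothetical sketch of how the handler and ModelService fit together (the ModelService constructor arguments are illustrative; the real code in chicago_taxi_prediction.py and model_service.py handles more cases):

```python
import json
import os

from model_service import ModelService  # loads the model from S3, a local path, or a dummy

# Hypothetical constructor: the real ModelService may take different arguments
model_service = ModelService(
    model_location=os.getenv("MLFLOW_MODEL_LOCATION", "dummy"),
    experiment_id=os.getenv("MLFLOW_EXPERIMENT_ID"),
    run_id=os.getenv("MLFLOW_RUN_ID"),
    bucket_name=os.getenv("MLFLOW_BUCKET_NAME"),
    bucket_folder=os.getenv("MLFLOW_BUCKET_FOLDER"),
)

def lambda_handler(event, context):
    features = json.loads(event["body"])  # body forwarded by the API gateway
    prediction = model_service.predict(features)
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```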
Unit tests are found under the tests/unit_tests directory.
Integration tests are found under the tests/integration_tests directory.
Integration tests are done by creating a Lambda image from the sources under the production directory, running the container, and testing it with a predefined prediction. Everything runs locally; a minimal sketch follows.
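A minimal sketch of that local check, assuming the container was started from the Lambda image with port 8080 published; the invocation path is the standard AWS Lambda runtime-emulator endpoint, and the response payload mirrors the handler sketched earlier:

```python
import json

import requests

# Local Lambda runtime-interface-emulator endpoint
URL = "http://localhost:8080/2015-03-31/functions/function/invocations"

# Event shaped like an API Gateway proxy request
event = {
    "body": json.dumps(
        {"trip_id": 33, "pickup_community_area": "8.0", "dropoff_community_area": "32.0"}
    )
}

response = requests.post(URL, json=event, timeout=30).json()
assert json.loads(response["body"])["prediction"] == 40.0  # dummy model: 8 + 32
```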
These tests, and also the quality checks, are part of the CI workflow that we will see later, but they can be executed manually with the Makefile under the sources directory.
Go to the sources directory and run:
```bash
make integration_tests
```
(this will execute all tests and quality checks)
- Unit tests -> Runs ./unit_tests/run.sh
- Quality checks (isort, black, pylint)
- Build local image
- Integration tests -> runs ./integration_tests/run.sh (also runs the container)
You can also run the test shell scripts manually from the sources directory.
Code correctness and quality are also ensured by pre-commit hooks:
- trailing-whitespace
- end-of-file-fixer
- check-yaml
- check-added-large-files
- isort
- black
- pylint
- pytest
- unit_tests
- integration_tests
Check .pre-commit-config.yaml in the root directory of the repo for further details.
Run the CI workflow:
The CI workflow definition can be found in ci-tests.yml, in the .github/workflows directory.
Please note that CI is set to run on a pull request on the develop branch.
It runs the unit tests and integration tests, and checks that the infrastructure definition is correct with the terraform plan command.
Execute/do:
- git checkout -b <new-branch> develop <- e.g. "Feature1"
- Modify any file in the sources directory
- git add -A
- git commit -m 'test ci/cd' <- pre-commit hooks will be executed
- git push
- Go to github.com, to your forked repo, and...
- Go to Pull requests
- Click on New pull request
- Select base: develop
- Compare: <new-branch>
- Click Create pull request
- Click Create pull request
- Go to Actions
- Check that the CI tests passed successfully
When the infrastructure is built, the Lambda function is initialized with the default parameters, which means that a dummy model is used. As an example, we will update the Lambda function with the model of the latest experiment. Better approaches may be used, such as the MLflow model registry.
The definition of the CD workflow is in cd-deploy.yml, in the .github/workflows directory. It will:
- Create/update the infrastructure
- Build the Lambda image
- Push the Lambda image to ECR
- Update the Lambda function parameters (i.e. load the model)
During the CD phase, the Lambda function will be updated with the following parameters:
```bash
MLFLOW_MODEL_LOCATION="s3", \
MLFLOW_EXPERIMENT_ID=$EXPERIMENT_ID, \
MLFLOW_RUN_ID=$RUN_ID, \
MLFLOW_BUCKET_NAME=$MODEL_BUCKET, \
MLFLOW_BUCKET_FOLDER=$BUCKET_FOLDER
```
where $EXPERIMENT_ID and $RUN_ID are those of the latest experiment, computed during the CD execution.
Execute/do:
- Go to Pull requests
- Click on Merge pull request
- Click on Confirm Merge
- Go to Actions; CD-Deploy shall be executing
After it finishes successfully, test the new model configuration:
Go to the sources/tests directory and run:
```bash
python ./test_api_gateway_lambda.py
```
A front-end user dashboard has been developed to facilitate testing and to access the MLflow and Prefect servers.
To run it, go to the sources\production directory and run:
```bash
streamlit run front-end.py --server.port 8080
```
Open the URL and test.
IMPORTANT NOTE: during the CI/CD phase, for some reason, the MLflow and Prefect servers may have been recreated, because Terraform sees that user_data has changed when it has not. In this case, monitoring will fail at the last step (prefect deployment run main-flow/chicago-taxi-deployment).
In this project, batch monitoring has been implemented using the evidently library from Evidently AI. Data drift and regression performance are covered: a report is created with data drift and regression performance sections. Data drift is tested, and if drift is detected (more than 30% of features with detected data drift), the training pipeline is scheduled (a Prefect agent will take care of the flow execution).
The source code is under the sources\monitoring folder.
Three elements are needed:
- the send_data.py script
- a Mongo database (launched from docker-compose)
- the batch_monitoring.py script
The reference dataset is Taxi_Trips_2022_03.parquet (i.e. March), under the reference_data directory.
To simulate real usage of the system, the send_data.py module performs some tasks (see the sketch after this list):
- Reads from a more recent dataset, Taxi_Trips_2022_04.parquet (i.e. April), iterating row by row, and requests predictions by sending http requests to the API gateway.
- Processes and stores the target values in a csv file, together with the id field of each data record. In other words, when the customer arrives at the destination, the target value (trip duration) is known and stored.
- Stores the predictions in the Mongo database.
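A rough sketch of that loop (the API URL, the Mongo database/collection names and the JSON handling are illustrative; see the actual script for details):

```python
import pandas as pd
import requests
from pymongo import MongoClient

# Illustrative endpoint; use your own API gateway URL
API_URL = "https://<api-gateway-id>.execute-api.eu-west-1.amazonaws.com/api_gateway_stage-chicago-taxi/hello"
collection = MongoClient("mongodb://localhost:27017")["monitoring"]["predictions"]

df = pd.read_parquet("Taxi_Trips_2022_04.parquet")
with open("targets.csv", "w", encoding="utf-8") as targets:
    for record in df.to_dict(orient="records"):
        response = requests.post(API_URL, json=record, timeout=30).json()
        record["prediction"] = response["prediction"]
        collection.insert_one(record)  # store the prediction for batch_monitoring.py
        # The real duration becomes known when the trip ends; store it with the id
        targets.write(f"{record['trip_id']},{float(record['trip_seconds']) / 60}\n")
```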
The batch_monitoring script will (see the sketch after this list):
- Read the csv file with the target values and add them to each record in the Mongo db; the id field is used as the key to match each prediction with its target.
- Read the reference dataset, process the target value (trip duration) and make the predictions.
- Fetch the data from the database.
- Calculate the metrics and generate the report.
- Trigger the training pipeline (i.e. schedule a flow in Prefect) if data drift is detected (more than 30% of features with detected data drift).
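A hedged sketch of the report-and-trigger logic, using evidently's Report API (v0.2+; the project code and the exact result-dict layout may differ between evidently versions, and the MongoDB fetch is elided):

```python
import subprocess

import pandas as pd
from evidently.metric_preset import DataDriftPreset, RegressionPreset
from evidently.report import Report

# Both frames must contain the feature, target and prediction columns
reference_df = pd.read_parquet("reference_data/Taxi_Trips_2022_03.parquet")
current_df = pd.read_parquet("current_data.parquet")  # placeholder for the MongoDB fetch

report = Report(metrics=[DataDriftPreset(), RegressionPreset()])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("monitoring_report.html")

# The first metric of DataDriftPreset is the dataset-level drift summary
drift = report.as_dict()["metrics"][0]["result"]
if drift["share_of_drifted_columns"] > 0.3:
    # Schedule the training pipeline; a running Prefect agent will execute it
    subprocess.run(
        ["prefect", "deployment", "run", "main-flow/chicago-taxi-deployment"],
        check=True,
    )
```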
Monitoring execution:
Note that the prediction service must be up and running.
- Go to ./sources/monitoring
- Run docker-compose up
- Run python ./send_data.py (wait for some predictions to be made)
- Run python ./batch_monitoring.py
Data drift shall be detected and a flow scheduled. You can check it in the Prefect Orion URL. If an agent has been launched, the training pipeline shall run.
The development machine will install Anaconda, docker, docker-compose and pipenv, and will also clone the repo. Note that there is also a setup_vm.sh script. It may be used when the VM is created manually: after a manual git clone, run setup_vm.sh, and once it finishes, all of these programs are installed. There is also a setup_dev_linux.sh that prepares the env vars from Terraform so that the rest of the modules/scripts work correctly with the infrastructure created.
Since we use the latest tag, the same image name and the same ECR, simply running terraform apply will make and push the image, but not update the Lambda function code (the container itself), because the Lambda parameters (those used to create the Lambda function) do not change. Check the Terraform plan in the CI phase. In the CI phase, we set the env vars to the defaults. In CD, we build and push the image again, but do not update the Lambda image. What we do is update the function configuration with the environment vars to update the model; these are not Lambda parameters. To update the Lambda image, we need to use update-function-code --image-uri or do it manually in the console. Note: in CD, the Lambda is always relaunched, even if there is no change in the env vars, so the latest model is used.
- Check the setup_dev scripts. PYTHONPATH is not added. Maybe it shall be export PYTHONPATH? Check in Linux. In Windows it does not work with source ./setup...sh. Workaround: run the scripts before activating the environment, since the script activates the environment.
- In setup_dev_windows_gitbash.sh, pipenv shell prefect.exe... does not work, so we execute prefect before pipenv shell. This means that the main environment shall have prefect installed. In Linux (setup_dev_linux), pipenv shell is run with the prefect... parameter, so it works. Workaround: install prefect in the main environment.
- Decide whether to set up the VM with IaC or with the sh script (after cloning the repo).
- Fix the readme for the VM: after setup_vm, IaC and then setup_dev_linux?
- Fix the setup_vm script: after newgrp, a new shell is opened, so pipenv will not run. Check https://unix.stackexchange.com/questions/18897/problem-while-running-newgrp-command-in-script
- Fix the management of batches (arrays). Keep the id of the trip and so on.
- ML
  - Improve the performance of the model (feature engineering). Hyperparameter tuning. Maybe cross the data with weather info?
  - Use a pipeline also for data preprocessing. OneHotEncoder after the DictVectorizer to process numerical variables read as strings (pickup, dropoff).
- Development
  - Use streamlit in the development server to show an interface to see the IPs, run the pipeline manually and request predictions.
  - Use streamlit to run the batch monitor manually.
  - Preprocess in production, fill NaNs with -1.
  - Use design patterns: factory and strategy.
  - Add sw version and model version over all stages, to keep track of predictions. I.e. the Lambda loads the model and sets the model version in each prediction, as well as the sw version of the Lambda function, which is passed as a parameter that changes on each commit.
  - Make a Makefile in the main repo folder to run the different parts of the code.
  - Don't forget to add "." to PYTHONPATH if it does not exist yet.
  - IaC in the Makefile?
  - Check where data is stored. Better to use a common path under the sources dir.
  - Refactor the model service in production if needed and use it in development.
  - Allow the model service to process batches of predictions.
  - Modify model and preprocessor to use a pipeline or a model, selected by the user? -> The refactor using design patterns (factory and strategy) now uses pipelines only.
  - Use parameters in the calls to main in case the user wishes to run the modules separately from the command line.
- Prefect
  - Pass parameters to Prefect flows/deployments.
  - Prefect agent in EC2 (use a Terraform-managed Prefect agent).
  - Check https://docs.prefect.io/concepts/infrastructure/ for the different options.
  - Use Postgres in Prefect? Prefect recommends SQLite; Postgres is only for heavy workloads.
- CI/CD
  - Manage the CI part when the image is updated with "" parameters, since in CI a dummy model is in use. It does not matter much, since in CD the real model will be used anyway.
- Monitoring
  - Run monitoring in batch mode with a Prefect deployment, scheduled each month.
  - Monitoring in real time, locally.
  - Monitoring in real time, in EC2.
  - Check why evidently raises an error in batch monitoring when all columns are used. Check empty values?
- Best practices
  - Fix problems with isort and black fighting each other.
  - Test localstack: AWS API gateway + ECR + Lambda + S3.
  - Check why the AWS config initialization fails in GitHub Actions if the default profile is set in main.tf.
- IaC
  - Separate the creation of the S3 bucket and the MLflow and Prefect servers from the rest, to avoid recreating them in CD because of the random number generation. Use random number generation again.
  - Separate the dev VM.
  - Manage passwords (e.g. the database one) in AWS.
  - Make user_data in EC2 persistent, so that the EC2 instance still works after a reboot.
  - Use S3 to store datasets?
  - Check no-cache when using pipenv in the Dockerfile: pip --no-cache-dir.
  - If AWS changes the AMI id, the EC2 instances will be recreated, and then the Prefect flows will be deleted. Fix this.
  - If MLflow is recreated, it will detect a database version change and will not run. It is necessary to run mlflow db upgrade <database_uri> to fix it.
  - Set up ingress rules based on the current IP address. Set them in stg.tfvars or with https://stackoverflow.com/questions/46763287/i-want-to-identify-the-public-ip-of-the-terraform-execution-environment-and-add
- MLflow
  - Check why MLflow is unable to get the signature of the models. It actually happens when using a DictVectorizer in the pipeline. In any case, the models load correctly.
Header of a sh script:
```bash
#!/usr/bin/env bash
```
Return the directory where the bash script file is saved:
```bash
$(dirname "$0")
```
Get the last error code:
```bash
ERROR_CODE=$?
```
HTTP POST with curl:
```bash
curl -X POST -H "Content-Type: application/json" \
    -d "{'trip_id': 33, 'pickup_community_area': '8.0', 'dropoff_community_area': '32.0'}" \
    https://8bi0bzeja8.execute-api.eu-west-1.amazonaws.com/api_gateway_stage-chicago-taxi/hello
```
bash -c "string": if the -c option is present, then commands are read from string. If there are arguments after the string, they are assigned to the positional parameters, starting with $0.
if [[ -d .git ]] checks if a directory exists.
if [ -x "$(command -v docker)" ]; then: the command builtin returns the docker command's location, and the -x flag tests that it is there and is executable. [ -x FILE ] is true if FILE exists and is executable.
if [[ -z "${GITHUB_ACTIONS}" ]]; then: -z is true if the length of "STRING" is zero.
command1 && command2: conditional execution; the second command will only execute if the first command has executed successfully, i.e. its exit status is zero.
Add "." to PYTHONPATH if it is not there yet:
```bash
if [ -d "." ] && [[ ":$PYTHONPATH:" != *":.:"* ]]; then
  PYTHONPATH="${PYTHONPATH:+"$PYTHONPATH:"}."
fi
```
To programmatically create a storage block in Prefect:
```python
from prefect.filesystems import S3

block = S3(bucket_path="chicago-taxi-fc4rdz8d")
block.save("example-block")
```
Then, to build a deployment (this will upload the .py files to the S3 bucket):
```bash
prefect deployment build trainning_pipeline.py:main_flow --name test --tag test --storage-block s3/example-block -q test
```
Alternatively, this can be done with a Deployment object, as in prefect_deployment.py.
To create the deployment and queue (in the Prefect Orion server), and also upload the yaml file to S3:
```bash
prefect deployment apply .\main_flow-deployment.yaml
```
To start an agent:
```bash
prefect agent start -q 'test'
```
To run a flow:
```bash
prefect deployment run <FLOW_NAME>/<DEPLOYMENT_NAME>
```
Download an object from AWS S3:
```bash
aws s3api get-object --bucket stg-chicago-taxi-fc4rdz8d --key mlflow/2/e4ff37b7254a408c86826fb2a25573a9/artifacts/model/conda.yaml ./conda.yaml
```
Download a folder from AWS S3:
```bash
aws s3 cp s3://stg-chicago-taxi-fc4rdz8d/mlflow/2/e4ff37b7254a408c86826fb2a25573a9/artifacts/model ./model --recursive
```
Get the latest RUN_ID from the latest S3 partition (in practice, this is generally picked up from a tool like MLflow or a DB):
```bash
export RUN_ID=$(aws s3api list-objects-v2 --bucket ${MODEL_BUCKET_DEV} \
    --query 'sort_by(Contents, &LastModified)[-1].Key' --output=text | cut -f2 -d/)
```
Copy between buckets:
```bash
aws s3 sync s3://${MODEL_BUCKET_DEV} s3://${MODEL_BUCKET_PROD}
```
Update Lambda env vars (https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html):
```bash
aws lambda update-function-configuration --function-name ${LAMBDA_FUNCTION} --environment "Variables=${variables}"
```
Create a default pre-commit config:
```bash
pre-commit sample-config > .pre-commit-config.yaml
pre-commit install
git add .pre-commit-config.yaml
```
Pytest only adds to sys.path the directories where test files are, so you need to add the sources directory with export PYTHONPATH=. from the sources directory. Alternatively, you can run python -m pytest.
- Stop services only: docker-compose stop
- Stop and remove containers, networks...: docker-compose down
- Down and remove volumes: docker-compose down --volumes
- Down and remove images: docker-compose down --rmi <all|local>
Execute an OS command:
```python
import os

command = f"python script.py {YEAR} {MONTH}"
os.system(command)
```
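The same can be done more safely with subprocess, which avoids shell string interpolation:

```python
import subprocess

# Run the script with explicit arguments; check=True raises on a non-zero exit code
subprocess.run(["python", "script.py", str(YEAR), str(MONTH)], check=True)
```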
Display pandas pretty:
```python
from IPython.display import display, HTML
```
Add execute permissions to files:
```bash
git update-index --chmod=+x setup_vm.sh setup_dev_linux.sh setup_dev_windows_gitbash.sh
```
Install git on AWS EC2 Amazon Linux 2:
```bash
sudo yum update -y
sudo yum install git -y
```
Conda cannot create an env on AWS EC2 Amazon Linux 2 (permission error). Fix:
```bash
chown -R 500:500 /home/ec2-user/anaconda3
```
AWS EC2: execute user_data on reboot:
```
Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0

--//
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="cloud-config.txt"

#cloud-config
cloud_final_modules:
- [scripts-user, always]

--//
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="userdata.txt"

#!/bin/bash
/bin/echo "Hello World" >> /tmp/testfile.txt
--//--
```
```bash
if [ -x "$(command -v docker)" ]; then
  echo "Update docker"
  # command
else
  echo "Install docker"
  # command
fi
```
Check the Ubuntu version:
```bash
lsb_release -a
```
Update Linux packages:
```bash
sudo apt-get update    # downloads package information from all configured sources
sudo apt-get upgrade   # downloads and installs the updates for each outdated package and dependency on your system
```