This project is part of the Machine Learning School program.
- The Penguins in Production notebook: An Amazon SageMaker pipeline hosting a multi-class classification model for the Penguins dataset.
- The Pipeline of Digits notebook: A starting notebook for solving the "Pipeline of Digits" assignment.
The goal of this session is to build a simple SageMaker Pipeline with one step to preprocess the Penguins dataset. We'll use a Processing Step with a SKLearnProcessor to execute a preprocessing script.
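As a preview of what that looks like with the SageMaker Python SDK, here is a minimal sketch of a one-step pipeline. The framework version, instance type, S3 locations, and script name are assumptions for illustration; adjust them to match your own account and bucket.

```python
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

session = PipelineSession()
role = sagemaker.get_execution_role()

# SKLearnProcessor runs a plain Python script inside a managed scikit-learn container.
processor = SKLearnProcessor(
    framework_version="1.2-1",    # assumed version; pick one available in your region
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
    sagemaker_session=session,
)

# The Processing Step wires the script to its S3 inputs and outputs.
preprocess_step = ProcessingStep(
    name="preprocess-data",
    step_args=processor.run(
        code="preprocessing.py",    # hypothetical script name
        inputs=[
            ProcessingInput(
                source="s3://your-bucket/penguins/data.csv",    # placeholder S3 location
                destination="/opt/ml/processing/input",
            )
        ],
        outputs=[
            ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
            ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
        ],
    ),
)

pipeline = Pipeline(name="penguins-pipeline", steps=[preprocess_step])
pipeline.upsert(role_arn=role)    # create or update the pipeline definition
# pipeline.start()                # kick off an execution when you're ready
```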
- If you can't access an existing AWS account, set up a new one. Create a user that belongs to the "administrators" User Group, and make sure you enable MFA (Multi-Factor Authentication).
- Set up an Amazon SageMaker domain. Create a new execution role and ensure it has access to the S3 bucket you'll use during this class. You can also specify "Any S3 bucket" if you want the role to access every S3 bucket in your AWS account.
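As an optional sanity check, you can confirm from a notebook that the execution role can reach the bucket you plan to use. The bucket name below is a placeholder.

```python
import boto3
import sagemaker

role = sagemaker.get_execution_role()    # the role attached to your Studio session
print("Execution role:", role)

# head_bucket raises a ClientError if the role can't access the bucket.
boto3.client("s3").head_bucket(Bucket="your-bucket-name")    # placeholder bucket name
```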
- Create a GitHub repository and clone it from inside SageMaker Studio. You'll use this repository to store the code you write during this program.
- Configure your SageMaker Studio session to store your name and email address and cache your Git credentials. You can use the following commands from a Terminal window:
$ git config --global user.name "John Doe"
$ git config --global user.email johndoe@example.com
$ git config --global credential.helper store
- Throughout the course, you will work on the "Pipeline of Digits" project, with the goal of setting up a SageMaker pipeline for a simple computer vision project. For this assignment, open the mnist.ipynb notebook and follow the instructions to prepare the dataset for the project.
- Set up a SageMaker pipeline for the "Pipeline of Digits" project. Create a Processing Step where you split the MNIST dataset into a train and a test set.
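As a starting point, here is a minimal sketch of the preprocessing script such a step could run. It assumes the MNIST data arrives as a single CSV file in the container's input directory; the file names and paths are placeholders, so adapt them to however you prepared the dataset in mnist.ipynb.

```python
# preprocessing.py - hypothetical script for the "Pipeline of Digits" Processing Step.
import os

import pandas as pd
from sklearn.model_selection import train_test_split

BASE_DIR = "/opt/ml/processing"    # standard Processing container layout

if __name__ == "__main__":
    # Assumes the dataset was staged as a single CSV file by the ProcessingInput.
    data = pd.read_csv(os.path.join(BASE_DIR, "input", "mnist.csv"))

    # Hold out 20% of the rows as a test set.
    train, test = train_test_split(data, test_size=0.2, random_state=42)

    os.makedirs(os.path.join(BASE_DIR, "train"), exist_ok=True)
    os.makedirs(os.path.join(BASE_DIR, "test"), exist_ok=True)

    train.to_csv(os.path.join(BASE_DIR, "train", "mnist_train.csv"), index=False)
    test.to_csv(os.path.join(BASE_DIR, "test", "mnist_test.csv"), index=False)
```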