This repository contains the source code of the Machine Learning School program. Fork it to follow along.
If you find any problems with the code or have any ideas on improving it, please open an issue and share your recommendations.
During this program, we'll create a SageMaker Pipeline to build an end-to-end Machine Learning system to solve the problem of classifying penguin species.
Here are the relevant notebooks:
- The Setup notebook: We'll use this notebook at the beginning of the program to set up SageMaker Studio. You only need to go through the code here once.
- The Penguins in Production notebook: This is the main notebook we'll use during the program. Inside, you'll find the code of every session.
During the program, you are encouraged to work on the Pipeline of Digits problem as the main assignment. To make it easier to start, you can use the Pipeline of Digits notebook as a starting point.
- Serving a TensorFlow model from a Flask application: A simple Flask application that serves a multi-class classification TensorFlow model to determine the species of a penguin.
Answering these questions will help you understand the material discussed during the program. Notice that each question could have one or more correct answers.
What will happen if we apply the Scikit-Learn transformation pipeline to the entire dataset before splitting it?
- Scaling will use the dataset's global statistics, leaking the test samples' mean and variance into the training process.
- Imputing the missing numeric values will use the global mean, leading to data leakage.
- It wouldn't work because the transformation pipeline expects multiple sets.
- We will reduce the number of lines of code we need to transform the dataset.
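For intuition, here is a minimal sketch (not the course's actual preprocessing code) of the leakage-free order of operations: split first, then fit the transformation pipeline on the training split only.

```python
# Split first, then fit the transformers on the training split only, so the
# test samples never influence the imputation mean or the scaling statistics.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[39.1, 181.0], [46.5, 195.0], [np.nan, 210.0], [50.0, 220.0]])
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

X_train_transformed = pipeline.fit_transform(X_train)  # statistics from train only
X_test_transformed = pipeline.transform(X_test)        # reuses train statistics
```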
A hospital wants to predict which patients are prone to get a disease based on their medical history. They use weak supervision to label the data using a set of heuristics automatically. What are some of the disadvantages of weak supervision?
- Weak supervision doesn't scale to large datasets.
- Weak supervision doesn't adapt well to changes requiring relabeling.
- Weak supervision produces noisy labels.
- We might be unable to use weak supervision to label every data sample.
When collecting the information about the penguins, the scientists encountered a few rare species. To prevent these rare samples from being left out when splitting the data, they recommended using Stratified Sampling. Which of the following statements about Stratified Sampling are correct?
- Stratified Sampling assigns every population sample an equal chance of being selected.
- Stratified Sampling preserves the data's original distribution of different groups.
- Stratified Sampling requires having a larger dataset compared to Random Sampling.
- Stratified Sampling can't be used when dividing all samples into groups is impossible.
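As a quick illustration, here is a minimal scikit-learn sketch (the data is made up) of how stratification keeps a rare class present in both splits.

```python
# The `stratify` argument preserves the proportion of each species in both
# splits, so the rare species still appears in the test set.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "flipper_length_mm": range(180, 192),
    "species": ["Adelie"] * 8 + ["Chinstrap"] * 4,  # Chinstrap is the rare class
})

train, test = train_test_split(
    df, test_size=0.25, stratify=df["species"], random_state=42
)

print(train["species"].value_counts(normalize=True))  # same 2:1 ratio as the data
print(test["species"].value_counts(normalize=True))
```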
Using more features to build a model will not necessarily lead to better predictions. Which of the following are the drawbacks of adding more features?
- More features in a dataset increase the opportunity for data leakage.
- More features in a dataset increase the opportunity for overfitting.
- More features in a dataset increase the memory necessary to serve a model.
- More features in a dataset increase a model's development and maintenance time.
A bank wants to store every transaction it handles in a set of files in the cloud. Each file will contain the transactions generated in a day. The team managing these files wants to optimize the storage space and downloading speed. What format should the bank use to store the transactions?
- The bank should store the data in JSON format.
- The bank should store the data in CSV format.
- The bank should store the data in Parquet format.
- The bank should store the data in Pandas format.
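For reference, here is a minimal sketch comparing the on-disk size of CSV and Parquet for the same synthetic data; the file names are illustrative.

```python
# Parquet is a compressed, columnar format, so it is usually much smaller on
# disk and faster to read back than row-based text formats like CSV or JSON.
# Writing Parquet with pandas requires `pyarrow` (or `fastparquet`).
import os

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "transaction_id": np.arange(100_000),
    "amount": np.random.rand(100_000) * 1_000,
})

df.to_csv("transactions.csv", index=False)
df.to_parquet("transactions.parquet", index=False)

print("CSV:", os.path.getsize("transactions.csv"), "bytes")
print("Parquet:", os.path.getsize("transactions.parquet"), "bytes")
```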
When you turn on caching, SageMaker tries to find a previous run of your current pipeline step to reuse its output. How does SageMaker decide which previous run to use?
- SageMaker will use the result of the most recent run.
- SageMaker will use the result of the most recent successful run.
- SageMaker will use the result of the first run.
- SageMaker will use the result of the first successful run.
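Here is a minimal sketch of how step caching is typically enabled; the step name and expiration period are illustrative.

```python
# CacheConfig turns on caching for a step; `expire_after` (an ISO 8601
# duration) limits how old a previous run can be to be considered for reuse.
from sagemaker.workflow.steps import CacheConfig

cache_config = CacheConfig(enable_caching=True, expire_after="P30D")

# Pass it when defining a step, for example:
# preprocessing_step = ProcessingStep(
#     name="preprocess-data", ..., cache_config=cache_config
# )
```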
We use an instance of the `PipelineSession` class when configuring the processor. Which of the following statements are correct about `PipelineSession`:
- When we use a `PipelineSession` and call `processor.run()`, the Processing Job will not start immediately. Instead, the job will run later, during the execution of the pipeline.
- When creating a `PipelineSession`, you can specify the default bucket that SageMaker will use during the session.
- The `PipelineSession` manages a session that executes pipelines and jobs locally in a pipeline context.
- The `PipelineSession` is recommended over a regular SageMaker Session when building pipelines.
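Here is a minimal sketch of configuring a processor with a `PipelineSession`; the bucket name and role are placeholders.

```python
# With a PipelineSession, processor.run() doesn't start a Processing Job;
# it returns the arguments for a pipeline step that runs later.
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession(default_bucket="mlschool")  # hypothetical bucket

processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role="arn:aws:iam::123456789012:role/example",  # placeholder role
    sagemaker_session=pipeline_session,
)

# Deferred until the pipeline executes:
# step_args = processor.run(code="preprocessing.py", inputs=[...], outputs=[...])
```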
A research team built a model to detect pneumonia. They annotated every image showing pneumonia as positive and labeled everything else as negative. Their dataset had 120,000 images from 30,000 unique patients. They randomly split it into 80% training and 20% validation. What can you say about this setup?
- The team will never be able to test the model because they didn't create a separate test split.
- This setup will never produce good results because there are many different classes of pneumonia, but the team used binary labels.
- This setup will lead to data leakage because the team split the data randomly, and they had more images than patients.
- There's nothing wrong with this setup.
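For contrast, here is a minimal sketch (with made-up data) of a leakage-free alternative: splitting by patient with scikit-learn's `GroupShuffleSplit`, so images from the same patient never land in both splits.

```python
# GroupShuffleSplit keeps every image of a given patient on one side of the
# split, preventing patient-level leakage between training and validation.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_images = 1_000
patient_ids = np.random.randint(0, 250, size=n_images)  # several images per patient
X = np.random.rand(n_images, 8)  # stand-in for image features

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(X, groups=patient_ids))

# No patient contributes images to both splits.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[val_idx])
```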
Imagine we want to process a large dataset and ensure the Processing Job has enough disk space to download the data from S3. How can we modify the code to allocate enough space for the job?
- We can set the `PipelineDefinitionConfig.volume_size_in_gb` attribute to the amount of space in GB that we need.
- We can set the `ProcessingStep.volume_size_in_gb` attribute to the amount of space in GB that we need.
- We can set the `SKLearnProcessor.volume_size_in_gb` attribute to the amount of space in GB that we need.
- We can set the `Pipeline.volume_size_in_gb` attribute to the amount of space in GB that we need.
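Here is a minimal sketch of how a processor's volume size is typically configured; the role is a placeholder.

```python
# The EBS volume attached to the processing instance is sized with
# `volume_size_in_gb` on the processor.
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    volume_size_in_gb=100,  # enough disk space to download the dataset from S3
    role="arn:aws:iam::123456789012:role/example",  # placeholder role
)
```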
We used version `0.23-1` of the Scikit-Learn container to run the preprocessing script. Which of the following versions are also available for the same framework? To answer this question, you can check the `image_uris.config_for_framework()` function.
- Version `0.20.0`
- Version `0.22-1`
- Version `1.0-1`
- Version `1.2-1`
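Here is a minimal sketch of how you might list the available container versions; the structure of the returned configuration is an assumption worth verifying against your SDK version.

```python
# config_for_framework() returns the container configuration SageMaker uses
# for a framework; its "versions" key lists the available container versions.
from sagemaker import image_uris

config = image_uris.config_for_framework("sklearn")
print(list(config["versions"].keys()))
```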
Why do we use the `SparseCategoricalCrossentropy` loss function to train our model instead of the `CategoricalCrossentropy` function?
- Because our target column contains integer values.
- Because our target column is one-hot encoded.
- Because our target column contains categorical values.
- Because there are a lot of sparse values in our dataset.
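For intuition, here is a minimal TensorFlow sketch showing that the two losses compute the same value and differ only in the target encoding they expect.

```python
# Both losses compute the same cross-entropy; they only differ in how the
# target is encoded: integer labels vs. one-hot vectors.
import tensorflow as tf

logits = [[2.0, 1.0, 0.1]]

sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
dense_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

print(sparse_loss([0], logits).numpy())               # integer-encoded target
print(dense_loss([[1.0, 0.0, 0.0]], logits).numpy())  # one-hot encoded target
```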
When a Training Job finishes, SageMaker automatically uploads the model to S3. Which of the following statements about this process is correct?
- SageMaker automatically creates a `model.tar.gz` file with the entire content of the `/opt/ml/model` directory.
- SageMaker automatically creates a `model.tar.gz` file with any files inside the `/opt/ml/model` directory, as long as those files belong to the model we trained.
- SageMaker automatically creates a `model.tar.gz` file with any new files created inside the container by the training script.
- SageMaker automatically creates a `model.tar.gz` file with the content of the output folder configured in the training script.
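Here is a minimal sketch (not the course's actual training script) of the end of a training script that writes to the model directory.

```python
# Anything the training script writes to /opt/ml/model ends up inside the
# model.tar.gz file that SageMaker uploads to S3 when the job finishes.
import os

model_directory = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
os.makedirs(model_directory, exist_ok=True)

# model.save(model_directory)  # e.g., saving a Keras model

with open(os.path.join(model_directory, "notes.txt"), "w") as f:
    f.write("this file is packaged into model.tar.gz too")
```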
Our pipeline uses "file mode" to provide the Training Job access to the dataset. When using file mode, SageMaker downloads the training data from S3 to a local directory in the training container. Imagine we have a large dataset and don't want to wait for SageMaker to download every time we want to train a model. How can we solve this problem?
- We can train our model with a smaller portion of the dataset.
- We can increase the number of instances and train many models in parallel.
- We can use "fast file mode" to get file system access to S3.
- We can use "pipe mode" to stream data directly from S3 into the training container.
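Here is a minimal sketch of requesting a different input mode for a training channel; the S3 path is a placeholder.

```python
# TrainingInput's `input_mode` selects how the data channel is delivered:
# "File" (download first), "FastFile" (stream with file-system semantics),
# or "Pipe" (stream directly into the container).
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://mlschool/penguins/train",  # hypothetical location
    content_type="text/csv",
    input_mode="FastFile",
)

# estimator.fit({"train": train_input})
```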
Which of the following statements are true about the usage of `max_jobs` and `max_parallel_jobs` when running a Hyperparameter Tuning Job?
- `max_jobs` represents the maximum number of Training Jobs the Hyperparameter Tuning Job will start.
- `max_parallel_jobs` represents the maximum number of Training Jobs that will run in parallel at any given time during a Hyperparameter Tuning Job.
- `max_parallel_jobs` can never be larger than `max_jobs`.
- `max_jobs` can never be larger than `max_parallel_jobs`.
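Here is a minimal sketch of a tuner configured with both limits; the estimator, script, role, and metric regex are placeholders.

```python
# At most `max_jobs` Training Jobs in total; at most `max_parallel_jobs` of
# them running at any given time.
from sagemaker.tensorflow import TensorFlow
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

estimator = TensorFlow(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::123456789012:role/example",  # placeholder role
    instance_type="ml.m5.large",
    instance_count=1,
    framework_version="2.11",
    py_version="py39",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="val_accuracy",
    objective_type="Maximize",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(1e-4, 1e-1)},
    metric_definitions=[{"Name": "val_accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"}],
    max_jobs=6,           # at most 6 Training Jobs in total
    max_parallel_jobs=2,  # at most 2 of them running concurrently
)
```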
Which statements are true about tuning hyperparameters as part of a pipeline?
- Hyperparameter Tuning Jobs that don't use Amazon algorithms require a regular expression to extract the objective metric from the logs.
- When using a Tuning Step as part of a pipeline, SageMaker will create as many Hyperparameter Tuning Jobs as specified by the `HyperparameterTuner.max_jobs` attribute.
- Hyperparameter Tuning Jobs support Bayesian, Grid Search, and Random Search strategies.
- Using a Tuning Step is more expensive than a Training Step.
When registering a model in the Model Registry, we can specify a set of metrics stored with the model. Which of the following are some of the metrics supported by SageMaker?
- Metrics that measure the bias in a model.
- Metrics that help explain a model.
- Metrics that measure the quality of the input data for a model.
- Metrics that measure the quality of a model.
We use the `Join` function to build the error message for the Fail Step. Imagine we want to build an Amazon S3 URI. What would be the output of executing `Join(on='/', values=['s3:/', "mlschool", "/", "12345"])`?
- The output will be `s3://mlschool/12345`
- The output will be `s3://mlschool/12345/`
- The output will be `s3://mlschool//12345`
- The output will be `s3:/mlschool//12345`
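For context, here is a minimal sketch of using `Join` in a Fail Step, similar in spirit to the pipeline's error message; the threshold value is illustrative.

```python
# Join is evaluated at pipeline runtime, concatenating its values with the
# `on` delimiter, much like Python's str.join.
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.functions import Join

fail_step = FailStep(
    name="fail",
    error_message=Join(
        on=" ",
        values=["The model's accuracy was below the threshold of", 0.70],
    ),
)
```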
Which of the following statements are correct about the Condition Step in SageMaker:
- `ConditionComparison` is a supported condition type.
- `ConditionIn` is a supported condition type.
- When using multiple conditions together, the step will succeed if at least one of the conditions returns True.
- When using multiple conditions together, all of them must return True for the step to succeed.
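Here is a minimal sketch of a Condition Step; `evaluation_accuracy`, `register_step`, and `fail_step` are assumed to be defined elsewhere.

```python
# Every condition in `conditions` must evaluate to True for the `if_steps`
# branch to execute; otherwise the `else_steps` branch runs.
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo

condition = ConditionGreaterThanOrEqualTo(
    left=evaluation_accuracy,  # assumed defined earlier (e.g., a JsonGet expression)
    right=0.70,
)

condition_step = ConditionStep(
    name="check-model-accuracy",
    conditions=[condition],
    if_steps=[register_step],  # assumed defined earlier
    else_steps=[fail_step],    # assumed defined earlier
)
```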
Imagine we use a Tuning Step to run 100 Training Jobs. The best model should have the highest validation accuracy, but we mistakenly used "Minimize" as the objective type instead of "Maximize." The consequence is that the index of our best model is 100 instead of 0. How can we retrieve the best model from the Tuning Step?
- We can use `TuningStep.get_top_model_s3_uri(top_k=0)` to retrieve the best model.
- We can use `TuningStep.get_top_model_s3_uri(top_k=100)` to retrieve the best model.
- We can use `TuningStep.get_bottom_model_s3_uri(top_k=0)` to retrieve the best model.
- In this example, we can't retrieve the best model.
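For reference, here is a minimal sketch of the retrieval call; `tuning_step` and the bucket name are assumptions.

```python
# Training Jobs are ranked by the objective metric; with the objective type
# set correctly, `top_k=0` refers to the best model.
model_assets = tuning_step.get_top_model_s3_uri(  # `tuning_step` assumed defined
    top_k=0,
    s3_bucket="mlschool",  # hypothetical bucket holding the training output
)
```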
If the model's accuracy is above the threshold, our pipeline registers it in the Model Registry. Which of the following functions are related to the Model Registry?
- Model versioning: We can use the Model Registry to track different model versions, especially as they get updated or refined over time.
- Model deployment: We can initiate the deployment of a model right from the Model Registry.
- Model metrics: The Model Registry provides insights about a particular model through the registration of metrics.
- Model features: The Model Registry lists every feature used to build the model.
Imagine you created three models using the same Machine Learning framework. You want to host these models using SageMaker Endpoints. Multi-model endpoints provide a scalable and cost-effective solution for deploying many models in the same Endpoint. Which of the following statements are true regarding how Multi-model endpoints work?
- SageMaker dynamically downloads the model from S3 and caches it in memory when you invoke the Endpoint.
- SageMaker automatically unloads unused models from memory when an Endpoint's memory utilization is high, and SageMaker needs to load another model.
- You can dynamically add a new model to your Endpoint without writing any code. To add a model, upload it to the S3 bucket and invoke it through the Endpoint.
- You can use the SageMaker SDK to delete a model from your Endpoint.
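Here is a minimal sketch of invoking a specific model hosted on a multi-model endpoint; the endpoint, model name, and payload are placeholders.

```python
# On a multi-model endpoint, `TargetModel` selects which model artifact
# (relative to the endpoint's S3 prefix) should serve the request.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="penguins-multi-model",  # hypothetical endpoint
    TargetModel="model-two.tar.gz",       # loaded from S3 and cached on first use
    ContentType="text/csv",
    Body="38.6,21.2,191.0,3800.0",
)
print(response["Body"].read())
```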
We enabled Data Capture in the Endpoint configuration. Data Capture is commonly used to record information that can be used for training, debugging, and monitoring. Which of the following statements are true about Data Capture?
- SageMaker can capture the input traffic, the output responses, or both simultaneously.
- If not specified, SageMaker captures 100% of requests.
- The higher the traffic an Endpoint gets, the higher should be the sampling percentage used in the Data Capture configuration.
- SageMaker supports Data Capture on always-running Endpoints but doesn't support it for Serverless Endpoints.
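Here is a minimal sketch of a Data Capture configuration; the S3 destination is a placeholder.

```python
# Capture both the request and the response for half of the traffic.
from sagemaker.model_monitor import DataCaptureConfig

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=50,
    destination_s3_uri="s3://mlschool/monitoring/data-capture",  # hypothetical
    capture_options=["REQUEST", "RESPONSE"],
)

# model.deploy(..., data_capture_config=data_capture_config)
```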
Imagine you expect your model to have long idle periods. You don't want to run an Endpoint continuously; instead, you decide to use a Serverless Endpoint. Which of the following statements are true about Serverless Inference in SageMaker?
- Serverless Inference scales the number of available endpoints to 0 when there are no requests.
- Serverless Inference scales the number of available endpoints to 1 when there are no requests.
- Serverless Inference solves the problem of cold starts when an Endpoint starts receiving traffic.
- For Serverless Inference, you pay for the compute capacity used to process inference requests and the amount of data processed.
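Here is a minimal sketch of a Serverless Inference configuration; the memory size and concurrency values are illustrative.

```python
# A Serverless Endpoint is configured with the memory per worker and the
# maximum number of concurrent invocations; capacity scales down to zero
# when there is no traffic.
from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=5,
)

# predictor = model.deploy(serverless_inference_config=serverless_config)
```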
Imagine you create an Endpoint with two different variants and assign each variant an initial weight of 1. How will SageMaker distribute the traffic between each variant?
- SageMaker will send 100% of requests to both variants.
- SageMaker will send 50% of requests to the first and 50% to the second variants.
- SageMaker will send 100% of requests to the first variant and ignore the second one.
- This scenario won't work because the sum of the initial weights across variants must be 1.
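Here is a minimal sketch of an endpoint configuration with two equally weighted variants; all names are hypothetical.

```python
# Traffic is split proportionally to weight / sum(weights), so two variants
# with a weight of 1 each receive half of the requests.
import boto3

sagemaker_client = boto3.client("sagemaker")

sagemaker_client.create_endpoint_config(
    EndpointConfigName="penguins-two-variants",  # hypothetical names throughout
    ProductionVariants=[
        {
            "VariantName": "first",
            "ModelName": "model-a",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1,
        },
        {
            "VariantName": "second",
            "ModelName": "model-b",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1,
        },
    ],
)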
Which attributes can you control when setting up auto-scaling for a model?
- SageMaker will use the target metric to determine when and how much to scale.
- The minimum capacity indicates the minimum number of instances that should be available.
- The amount of time that should pass after a scale-in or a scale-out activity before another activity can start.
- The algorithm that SageMaker will use to determine how to scale the model.
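Here is a minimal sketch of wiring those attributes together with the Application Auto Scaling API; the endpoint name, capacities, and cooldowns are placeholders.

```python
# Register the variant as a scalable target (min/max capacity), then attach
# a target-tracking policy (target metric plus cooldown periods).
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/penguins-endpoint/variant/AllTraffic"  # hypothetical

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # invocations per instance before scaling out
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # seconds to wait after a scale-in activity
        "ScaleOutCooldown": 60,  # seconds to wait after a scale-out activity
    },
)
```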
To compute the data and the model quality baselines, we use the `train-baseline` and `test-baseline` outputs from the Preprocessing step of the pipeline. Which of the following is why we don't use the `train` and `test` outputs?
- The `train` and `test` outputs are used in the Train and Evaluation steps, and SageMaker doesn't allow the reuse of outputs across a pipeline.
- Computing the two baselines requires the data to be transformed with the Scikit-Learn pipeline we created as part of the Preprocessing step.
- Computing the two baselines requires the data to be in its original format.
- Computing the two baselines requires JSON data, but the `train` and `test` outputs are in CSV format.
You build a computer vision model to recognize the brand and model of luxury handbags. After you deploy the model, one of the most important brands releases a new handbag that your model can't predict. How would you classify this type of model drift?
- Sudden drift.
- Gradual drift.
- Incremental drift.
- Recurring drift.
We use a custom script as part of the creation of the Data Monitoring schedule. Why do we need this custom script?
- This script expands the input data with the fields coming from the endpoint output.
- This script combines the input data with the endpoint output.
- This script prevents the monitoring job from reporting superfluous violations.
- This script expands the list of fields with the data SageMaker needs to detect violations.
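For context, here is a minimal sketch (not the course's actual script) of a Model Monitor record preprocessor; the field names are hypothetical.

```python
# Model Monitor calls preprocess_handler() for every captured record; the
# returned dictionary is what the monitoring job analyzes for violations.
import json

def preprocess_handler(inference_record):
    input_data = inference_record.endpoint_input.data
    output_data = json.loads(inference_record.endpoint_output.data)

    # Combine the captured input with the fields returned by the endpoint.
    record = {f"feature{i}": value for i, value in enumerate(input_data.split(","))}
    record["prediction"] = output_data.get("prediction")  # hypothetical field
    return record
```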
We created a function to randomly generate labels for the data captured by the endpoint. How does SageMaker know which label corresponds to a specific request?
- SageMaker uses the timestamp of the request.
- SageMaker uses the `inference_id` field we send to the endpoint on every request.
- SageMaker uses the `event_id` field we send to the endpoint on every request.
- SageMaker uses the `label_id` field we send to the endpoint on every request.
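Here is a minimal sketch of sending an id with a request; the endpoint name and payload are placeholders.

```python
# The inference_id travels with the request, so SageMaker can later match
# the captured data with the ground-truth label that shares the same id.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="penguins-endpoint",  # hypothetical endpoint
    ContentType="text/csv",
    Body="38.6,21.2,191.0,3800.0",
    InferenceId="42",  # unique id linking this request to its label
)
```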
We use a Transform Step to generate predictions for the test data using our model. When configuring this step, we filter the result from the step using the `output_filter` attribute. Assuming we configure this attribute with the value `$.SageMakerOutput['prediction','groundtruth']`, which of the following statements should be correct about the endpoint?
- The endpoint should return a top-level field named `prediction`.
- The endpoint should return a top-level field named `groundtruth`.
- The endpoint should return a top-level field named `SageMakerOutput`.
- The test dataset should include a field named `groundtruth`.
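For reference, here is a minimal sketch of a batch transform configured with that filter; the model name and S3 paths are placeholders.

```python
# `join_source="Input"` merges each input record with the model's output;
# `output_filter` then keeps only the listed fields of the joined record.
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="penguins",  # hypothetical model name
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://mlschool/transform",  # hypothetical output location
)

transformer.transform(
    data="s3://mlschool/preprocessing/test",  # hypothetical test data
    content_type="application/json",
    join_source="Input",
    output_filter="$.SageMakerOutput['prediction','groundtruth']",
)
```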