This repository contains the source code of the Machine Learning School program. Fork it to follow along.
If you find any problems with the code or have any ideas for improving it, please open an issue and share your recommendations.
During this program, we'll create a SageMaker Pipeline to build an end-to-end Machine Learning system to solve the problem of classifying penguin species.
Here are the relevant notebooks:
- The Setup notebook: We'll use this notebook at the beginning of the program to set up SageMaker Studio. You only need to go through the code here once.
- The Penguins in Production notebook: This is the main notebook we'll use during the program. Inside you'll find the code of every session.
Answering these questions will help you understand the material discussed during this session. Note that each question may have one or more correct answers.
<div style="margin: 30px 0 10px 0;"><span style="font-size: 1.1em; padding:4px; background-color: #b8bf9f; color: #000;"><strong>Question 1.1</strong></span></div>
What will happen if we apply the Scikit-learn transformation pipeline to the entire dataset before splitting it?
- Scaling will use the global statistics of the dataset, leaking the mean and variance of the test samples into the training process.
- Imputing the missing numeric values will use the global mean, leading to data leakage.
- The transformation pipeline expects multiple sets so it wouldn't work.
- We will reduce the number of lines of code we need to transform the dataset.
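The leakage described above is easy to reproduce. The sketch below uses a toy one-column dataset (the data and variable names are illustrative, not from the course) to show how fitting a scaler before splitting lets the test samples influence the statistics used during training:

```python
# Sketch of data leakage when preprocessing before splitting.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)

# Leaky: the scaler is fit on the whole dataset, so the test samples'
# values contribute to the mean and variance used during training.
leaky_scaler = StandardScaler().fit(X)

# Correct: split first, then fit the scaler on the training split only.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)
scaler = StandardScaler().fit(X_train)

# The two scalers learn different statistics; the first one "saw" the test set.
print(leaky_scaler.mean_, scaler.mean_)
```

The same applies to imputers: `fit` the pipeline on the training split only, then `transform` both splits.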
A hospital wants to predict which patients are likely to develop a disease based on their medical history. They use weak supervision to label the data automatically with a set of heuristics. What are some of the disadvantages of weak supervision?
- Weak supervision doesn't scale to large datasets.
- Weak supervision doesn't adapt well to changes requiring relabeling.
- Weak supervision produces noisy labels.
- We might be unable to use weak supervision to label every data sample.
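To make the trade-offs concrete, here is a minimal sketch of weak supervision with heuristic labeling functions. The field names and thresholds are invented for illustration; real systems (e.g., Snorkel-style pipelines) learn to weigh the heuristics instead of taking a simple majority vote:

```python
# Hypothetical labeling functions voting on each patient record.
ABSTAIN, LOW_RISK, HIGH_RISK = -1, 0, 1

def lf_age(patient):
    # Heuristic: older patients are higher risk.
    return HIGH_RISK if patient["age"] > 65 else ABSTAIN

def lf_smoker(patient):
    # Heuristic: smokers are higher risk.
    return HIGH_RISK if patient["smoker"] else ABSTAIN

def lf_normal_bp(patient):
    # Heuristic: normal blood pressure suggests lower risk.
    return LOW_RISK if patient["blood_pressure"] < 120 else ABSTAIN

def weak_label(patient, lfs=(lf_age, lf_smoker, lf_normal_bp)):
    votes = [vote for lf in lfs if (vote := lf(patient)) != ABSTAIN]
    if not votes:
        return ABSTAIN  # some samples end up with no label at all
    return max(set(votes), key=votes.count)  # majority vote: noisy by nature

print(weak_label({"age": 70, "smoker": True, "blood_pressure": 115}))
```

Note how the heuristics can disagree (noisy labels) or all abstain (unlabeled samples), and how a change in the heuristics forces relabeling the whole dataset.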
When collecting information about the penguins, the scientists encountered a few rare species. To prevent these samples from being left out when splitting the data, they recommended using Stratified Sampling. Which of the following statements about Stratified Sampling are correct?
- Stratified Sampling assigns every sample of the population an equal chance of being selected.
- Stratified Sampling preserves the original distribution of different groups in the data.
- Stratified Sampling requires having a larger dataset compared to Random Sampling.
- Stratified Sampling can't be used when it's not possible to divide all samples into groups.
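In Scikit-learn, stratification is one argument away. The sketch below uses a made-up imbalanced dataset (90 samples of a common species, 10 of a rare one) to show that the split preserves the class proportions:

```python
# Stratified splitting on an imbalanced toy dataset.
from collections import Counter
from sklearn.model_selection import train_test_split

X = list(range(100))
y = ["Adelie"] * 90 + ["Chinstrap"] * 10  # 90/10 class imbalance

# stratify=y keeps the 90/10 ratio in both the train and test splits,
# so the rare species is guaranteed to appear in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(Counter(y_test))  # 18 Adelie, 2 Chinstrap: the ratio is preserved
```

With plain random sampling, a 20-sample test set could easily contain zero rare penguins.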
Using more features to build a model will not necessarily lead to better predictions. Which of the following are drawbacks of adding more features?
- More features in a dataset increase the opportunity for data leakage.
- More features in a dataset increase the opportunity for overfitting.
- More features in a dataset increase the memory necessary to serve a model.
- More features in a dataset increase the development and maintenance time of a model.
A bank wants to store every transaction they handle in a set of files in the cloud. Each file will contain the transactions generated in a day. The team that will manage these files wants to optimize the storage space and downloading speed. What format should the bank use to store the transactions?
- The bank should store the data in JSON format.
- The bank should store the data in CSV format.
- The bank should store the data in Parquet format.
- The bank should store the data in Pandas format.
During the program, you are encouraged to work on the Pipeline of Digits problem as the main assignment. To make it easier to get started, you can use the following resources as a starting point:
- Serving a TensorFlow model from a Flask application: A simple Flask application that serves a multi-class classification TensorFlow model to determine the species of a penguin.