diff --git a/README.md b/README.md index 751d749..9829423 100644 --- a/README.md +++ b/README.md @@ -6,12 +6,66 @@ If you find any problems with the code or have any ideas on improving it, please ## Penguins -During this program we'll create a [SageMaker Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) to build an end-to-end Machine Learning system to solve the problem of classifying penguin species. +During this program, we'll create a [SageMaker Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) to build an end-to-end Machine Learning system to solve the problem of classifying penguin species. Here are the relevant notebooks: * The [Setup notebook](penguins/penguins-setup.ipynb): We'll use this notebook at the beginning of the program to setup SageMaker Studio. You only need to go through the code here once. -* The [Penguins in Production notebook](penguins/penguins-cohort.ipynb): This is the main notebook we'll use during the program. Inside you'll find the code of every session. +* The [Penguins in Production notebook](penguins/penguins-cohort.ipynb): This is the main notebook we'll use during the program. Inside you'll find the code of every session. + +## Questions + +Answering these questions will help you understand the material discussed during this session. Notice that each question could have one or more correct answers. + +
Question 1.1
+ +What will happen if we apply the SciKit-Learn transformation pipeline to the entire dataset before splitting it? + +1. Scaling will use the global statistics of the dataset, leaking the mean and variance of the test samples into the training process. +2. Imputing the missing numeric values will use the global mean, leading to data leakage. +3. The transformation pipeline expects multiple sets so it wouldn't work. +4. We will reduce the number of lines of code we need to transform the dataset. + + +
Question 1.2
+ +A hospital wants to predict which patients are prone to get a disease based on their medical history. They use weak supervision to label the data using a set of heuristics automatically. What are some of the disadvantages of weak supervision? + +1. Weak supervision doesn't scale to large datasets. +2. Weak supervision doesn't adapt well to changes requiring relabeling. +3. Weak supervision produces noisy labels. +4. We might be unable to use weak supervision to label every data sample. + + +
Question 1.3
+ +When collecting the information about the penguins, the scientists encountered a few rare species. To prevent these samples from not showing when splitting the data, they recommended using Stratified Sampling. Which of the following statements about Stratified Sampling are correct? + +1. Stratified Sampling assigns every sample of the population an equal chance of being selected. +2. Stratified Sampling preserves the original distribution of different groups in the data. +3. Stratified Sampling requires having a larger dataset compared to Random Sampling. +4. Stratified Sampling can't be used when is not possible to divide all samples into groups. + + +
Question 1.4
+ +Using more features to build a model will not necessarily lead to better predictions. Which of the following are drawbacks from adding more features? + +1. More features in a dataset increases the opportunity for data leakage. +2. More features in a dataset increases the opportunity for overfitting. +3. More features in a dataset increases the memory necessary to serve a model. +4. More features in a dataset increases the development and maintenance time of a model. + + +
Question 1.5
+ +A bank wants to store every transaction they handle in a set of files in the cloud. Each file will contain the transactions generated in a day. The team who will manage these files wants to optimize the storage space and downloding speed. What format should the bank use to store the transactions? + +1. The bank should store the data in JSON format. +2. The bank should store the data in CSV format. +3. The bank should store the data in Parquet format. +4. The bank should store the data in Pandas format. + ## Pipeline of Digits