= 301 Project 01 - Intro to ML - Using Anvil

== Project Objectives

In this project, you will get comfortable using the Anvil platform and running python code in jupyter notebooks. You will also begin using basic functions of the pandas library.

.Learning Objectives
****
- Create and use Anvil sessions
- Create jupyter notebooks
- Load a dataset with pandas
- Perform basic data manipulation with pandas
****

== Dataset


This project will use the following dataset:

- `/anvil/projects/tdm/data/Iris.csv`

== Questions


=== Question 1 (2 points)

Let's begin by starting a new Anvil session. If you do not remember how to do this, please read through https://the-examples-book.com/projects/fall2024/10100/10100-2024-project1[TDM 101's project 1].

Once you have started a new Anvil session, download https://the-examples-book.com/projects/_attachments/project_template.ipynb[the project template] and upload it. Then, open this template in jupyter notebook. Save it as a new file with the following naming convention: `lastname_firstname_project#.ipynb`. For example, `doe_jane_project1.ipynb`.

[NOTE]
====
You may be prompted to select a kernel when opening the notebook. We will use the `seminar` kernel for 301 projects. If needed, you can change the kernel at any time from the kernel dropdown menu.
====

To make sure everything is working, run the following code cell:

[source,python]
----
print("Hello, world!")
----


Your output should be `Hello, world!`. If you see this, you are ready to move on to the next question.

.Deliverables
====
- Output of running the code cell
====

=== Question 2 (2 points)

Now that we have our jupyter notebook set up, let's begin working with the pandas library.

Please read https://www.baeldung.com/cs/ml-flexible-and-inflexible-models[this article] to learn more about flexibility, inflexibility, and interpretability.

.. Please describe the main difference between flexible and interpretable models in your own words.
.. Please read https://the-examples-book.com/starter-guides/data-science/data-modeling/choosing-model/flexibility-interpret[this Examples book page], and choose one model for each that are flexible or interpretable
pandas is a python library that lets us work with datasets in tabular form. It provides functions for loading datasets, manipulating data, and more.

To start out with, let's load the Iris dataset that is located at `/anvil/projects/tdm/data/Iris.csv`.

To do this, you will need to import the pandas library and use the `read_csv` function to load the dataset.

Run the following code cell to load the dataset:

[source,python]
----
import pandas as pd

# load the Iris dataset into a pandas dataframe
df = pd.read_csv('/anvil/projects/tdm/data/Iris.csv')
----

[NOTE]
====
pandas is commonly imported as `pd` for brevity. This is a common convention in the python community. Similarly, `df` (short for dataframe) is often used as a variable for pandas dataframes. It is not required for you to follow either of these conventions, but it is good practice to do so.
====
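
For example, both of the following load the same dataset; the alias and the `df` name are conventions, not requirements (the alternative variable name `iris_data` below is just for illustration):

[source,python]
----
# common convention: alias pandas as pd, store the result in df
import pandas as pd
df = pd.read_csv('/anvil/projects/tdm/data/Iris.csv')

# equivalent, without the alias or the df naming convention
import pandas
iris_data = pandas.read_csv('/anvil/projects/tdm/data/Iris.csv')
----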

Now that our dataset is loaded, let's take a look at the first 5 rows of the dataset. To do this, run the following code cell:

[source,python]
----
df.head()
----

[NOTE]
====
The head function is used to display the first n rows of the dataset. By default, n is set to 5. You can change this by passing an integer to the function. For example, `df.head(10)` will display the first 10 rows of the dataset. This function is useful for quickly inspecting the dataframe to see what the data looks like.
====
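
As a quick illustration (these values of `n` are arbitrary):

[source,python]
----
df.head(10)  # first 10 rows instead of the default 5
df.head(2)   # just the first 2 rows
----

Note that `head` returns a new dataframe and leaves `df` itself unchanged.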

.Deliverables
====
- Output of running the code cells above
====

=== Question 3 (2 points)

An important aspect of our dataframe for machine learning is the shape (rows, columns). As you will learn later, the shape will help us determine what kind of machine learning model will be the best fit, as well as how complex it may be.

To get the shape of the dataframe, run the following code cell:

[source,python]
----
df.shape
----


[NOTE]
====
There are multiple ways to get the number of rows and columns in a dataframe. The shape attribute is the most common way to do this.
However, an alternative way to get the number of rows is `len(df.index)`, which returns the length of the row index. Similarly, `len(df.columns)` returns the number of columns. The `shape` attribute is preferred because it is more concise and returns both the number of rows and columns in one call.
====

This returns a tuple in the form (rows, columns).
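
Because `shape` is an ordinary tuple, you can unpack or index it. Here is a minimal sketch tying it to the alternatives from the note above:

[source,python]
----
rows, cols = df.shape  # unpack the tuple into two variables
print(rows, cols)

print(df.shape[0] == len(df.index))    # True: both give the number of rows
print(df.shape[1] == len(df.columns))  # True: both give the number of columns
----
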
.Deliverables
====
- How many rows are in the dataframe?
- How many columns are in the dataframe?
====

=== Question 4 (2 points)

Now that we have loaded the dataset, let's investigate how we can manipulate the data.

One common operation is to select a subset of the data. This is done with the `iloc` indexer, which lets us select rows and columns by their integer positions.

[NOTE]
====
The `iloc` indexer is extremely powerful, and it can be used in far more ways than we can list here. For a more comprehensive overview, please refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html[the official pandas iloc documentation].
====


To select the first n rows of the dataframe, we can use the `iloc` function with a slice: `df.iloc[:n]`.
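
For instance, a minimal sketch using an arbitrary slice (the value 3 and the variable names are just for illustration):

[source,python]
----
first_three = df.iloc[:3]   # first 3 rows, all columns
print(first_three.shape)    # the first number printed should be 3

two_cols = df.iloc[:3, :2]  # first 3 rows, first 2 columns
print(two_cols.shape)       # (3, 2)
----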

Write code to select the first 10 rows of the dataframe from Question 2 into a new dataframe called `df_subset`. Print the shape of `df_subset` to verify that you have selected the correct number of rows.

.Deliverables
====
- Output of printing the shape of `df_subset`
====

=== Question 5 (2 points)

Another common operation is to remove column(s) from the dataframe. This is done using the `drop` function.


[NOTE]
====
Like `iloc`, the `drop` function is extremely powerful. For a more comprehensive overview of how to use `drop`, please refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html[the official pandas drop documentation].
====

The most readable way to drop a column is by dropping it by name. To drop column(s) by name, you can use the following syntax: `df.drop(['column1_name', 'column2_name', ...], axis=1)`. The `axis=1` argument tells pandas to drop columns, not rows.
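
Here is a minimal sketch on a small throwaway dataframe (the `demo` dataframe and its column names are made up for illustration):

[source,python]
----
import pandas as pd

# a tiny dataframe just to demonstrate drop
demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# drop returns a new dataframe; demo itself is unchanged
demo_smaller = demo.drop(['b', 'c'], axis=1)
print(demo_smaller.shape)  # (2, 1)
----

Note that `drop` returns a new dataframe rather than modifying the original in place.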

Write code to drop the `Id` column from the dataframe, saving the result in a new dataframe called `df_without_id`. Print the shape of `df_without_id` to verify that the column has been removed.

.Deliverables
====
- Output of printing the shape of the dataframe after dropping the `Id` column
====

== Submitting your Work

Once you have completed the questions, save your jupyter notebook. You can then download the notebook and submit it to Gradescope.

.Items to submit
====
* Jupyter Lab notebook with your code, comments, and output for the assignment
** `lastname_firstname_project1.ipynb`

* Submit files through Gradescope
====

[WARNING]
====
You _must_ double-check your `.ipynb` after submitting it to Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double-check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to do this.

You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.
====