8 changes: 8 additions & 0 deletions .gitignore
Expand Up @@ -177,5 +177,13 @@ cython_debug/
# VS Code settings
.vscode/

<<<<<<< HEAD
.DS_Store
lessons/**/.DS_Store
=======

# macOS
.DS_Store
*/.DS_Store
**/.DS_Store
>>>>>>> d4006d6 (Remove tracked .DS_Store; add hypothesis testing images to resources)
105 changes: 68 additions & 37 deletions lessons/01_analysis_intro/05_correlation.md
Expand Up @@ -19,6 +19,17 @@ Correlation shows if two things move together, and how strongly.

---

## Important resources before we move on with the lesson

Before you begin this lesson, take a few minutes to go through the following resource:
🔗 [Exploring Correlation in Python – GeeksforGeeks](https://www.geeksforgeeks.org/data-analysis/exploring-correlation-in-python/)

It provides a quick overview of how correlation works and will help you follow this lesson more easily.

Once you’ve gone through that article, come back here — we’ll build on those concepts step-by-step.

---

## Why Correlation Matters

Correlation is more than just a number; it’s a tool that helps us understand relationships in data.
Expand Down Expand Up @@ -50,16 +61,19 @@ Example: Study Hours vs Exam Scores
```python
import matplotlib.pyplot as plt

# Slightly varied data to make it realistic (not perfectly linear)
study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [50, 55, 60, 65, 70, 75, 80, 85]
exam_scores = [48, 55, 58, 67, 72, 74, 81, 84]

plt.scatter(study_hours, exam_scores)
plt.scatter(study_hours, exam_scores, color='teal')
plt.title("Study Hours vs Exam Scores")
plt.xlabel("Study Hours")
plt.ylabel("Exam Scores")
plt.show()
```
<img width="609" height="393" alt="Screenshot 2025-09-09 at 3 36 50 PM" src="https://github.com/user-attachments/assets/ab63ea79-4619-40d6-96c1-8f95c19e217b" />

![Scatterplot of Two Variables](resources/04_correlations_1_scatter_two_variables.png)


**Interpretation:**
The dots trend upward: as study hours increase, exam scores increase.
Expand All @@ -72,12 +86,15 @@ This shows a positive correlation (more study = better scores).
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Small dataset
# More variety in numbers for realistic relationships
data = {
"hours_study": [1, 2, 3, 4, 5, 6, 7, 8],
"exam_score": [50, 55, 60, 65, 70, 75, 80, 85],
"sleep_hours": [8, 7, 7, 6, 6, 5, 5, 4]
"exam_score": [50, 54, 63, 66, 72, 77, 80, 86],
"sleep_hours": [8.2, 7.8, 7.3, 6.7, 6.2, 5.8, 5.4, 4.9],
"stress_level": [30, 35, 45, 55, 60, 65, 72, 80],
"screen_time": [2.5, 3, 3.5, 4, 4.5, 5.2, 6, 6.5]
}

df = pd.DataFrame(data)
Expand All @@ -86,16 +103,19 @@ df = pd.DataFrame(data)
corr = df.corr()

# Heatmap
sns.heatmap(corr, annot=True, cmap="coolwarm")
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
```
<img width="564" height="429" alt="Screenshot 2025-09-09 at 3 37 09 PM" src="https://github.com/user-attachments/assets/707594dd-3e92-4d51-b351-a73fa12806e1" />

![Correlation Heatmap](resources/04_correlation_2_heatmap.png)


**Interpretation:**
- hours_study and exam_score → strongly positive (close to +1).
- sleep_hours and hours_study → negative correlation (more study, less sleep).
- `hours_study` and `exam_score` → show a **strong positive correlation (close to +1)** — students who study more generally score higher.
- `sleep_hours` and `hours_study` → have a **negative correlation** — students who study more tend to sleep less.
- `stress_level` increases with both `hours_study` and `screen_time`, showing how study pressure and screen exposure might add up.
- The heatmap helps you quickly see **which factors move together** and **which move oppositely**.

✅ Use scatterplots to see two-variable relationships.
✅ Use heatmaps to see relationships across many variables.
Expand All @@ -116,7 +136,7 @@ Where:
- **cov(X, Y)** = covariance between X and Y (how much they vary together)
- **σ_X, σ_Y** = standard deviations of X and Y (how much they vary individually)

👉 Intuitively:
Intuitively:
- The numerator (covariance) measures how much two variables move together.
- The denominator (product of standard deviations) normalizes this, so we get a value between **-1 and +1**.

Expand All @@ -125,7 +145,7 @@ So:
- If X rises exactly in step with Y falling (a perfect straight line) → **r = -1**.
- If X and Y move independently → **r ≈ 0**.

📚 For more details, see the [Pearson correlation documentation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
For more details, see the [Pearson correlation documentation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
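
To make the formula concrete, here is a minimal sketch that computes r by hand from the covariance and the two standard deviations, using the study-hours data from this lesson. The helper name `pearson_r` is an illustrative choice, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r = cov(X, Y) / (sigma_X * sigma_Y), computed by hand."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: covariance (how much X and Y vary together)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Denominator: product of the two standard deviations
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [48, 53, 60, 66, 72, 76, 80, 84]

r = pearson_r(study_hours, exam_scores)
print(round(r, 3))
```

Dividing by `n` in both the covariance and the standard deviations is a choice made here for simplicity; using `n - 1` everywhere gives the same r, because the factor cancels.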

---

Expand All @@ -134,8 +154,9 @@ So:
```python
import numpy as np

# Slightly varied data to make it more realistic
study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [50, 55, 60, 65, 70, 75, 80, 85]
exam_scores = [48, 53, 60, 66, 72, 76, 80, 84]

# Correlation matrix
corr_matrix = np.corrcoef(study_hours, exam_scores)
Expand All @@ -144,8 +165,8 @@ print(corr_matrix)

**Output:**
```
[[1. 1.]
[1. 1.]]
[[1. 0.99]
[0.99 1. ]]
```
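
Because `np.corrcoef` returns a full matrix, a single coefficient has to be pulled out by index. A short sketch, reusing the same data, of how that might look:

```python
import numpy as np

study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [48, 53, 60, 66, 72, 76, 80, 84]

corr_matrix = np.corrcoef(study_hours, exam_scores)

# Diagonal entries are each variable's correlation with itself (always 1).
# The two off-diagonal entries are equal and hold r for the pair.
r = corr_matrix[0, 1]
print(round(r, 2))
```

The matrix is symmetric, so `corr_matrix[1, 0]` gives the same value.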

---
Expand All @@ -155,23 +176,31 @@ print(corr_matrix)
```python
import pandas as pd

# Slightly varied data for realistic correlations
data = {
"hours_study": [1, 2, 3, 4, 5, 6, 7, 8],
"exam_score": [50, 55, 60, 65, 70, 75, 80, 85],
"sleep_hours": [8, 7, 7, 6, 6, 5, 5, 4]
"exam_score": [52, 57, 62, 66, 71, 75, 79, 83],
"sleep_hours": [8.1, 7.8, 7.3, 6.9, 6.4, 6.0, 5.6, 5.1]
}

df = pd.DataFrame(data)

# Correlation matrix
print(df.corr())
```
**Interpretation:**

<img width="554" height="152" alt="Screenshot 2025-09-09 at 3 37 55 PM" src="https://github.com/user-attachments/assets/2c1639e8-f178-40ed-ba8a-42b5b65b6863" />
**Output:**
```
hours_study exam_score sleep_hours
hours_study 1.000000 0.999035 -0.999156
exam_score 0.999035 1.000000 -0.997778
sleep_hours -0.999156 -0.997778 1.000000
```

- hours_study & exam_score → strong positive correlation.
- hours_study & sleep_hours → negative correlation.
- exam_score & sleep_hours → negative correlation.
**Interpretation:**
- `hours_study` & `exam_score` → show a **strong positive correlation** — students who study more generally score higher.
- `hours_study` & `sleep_hours` → have a **negative correlation** — students who study more tend to sleep less.
- `exam_score` & `sleep_hours` → also show a **negative correlation**: in this toy dataset, less sleep goes with higher scores, most likely because both track study time (not a healthy pattern to copy!).
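
When the full matrix is more than you need, pandas can also give one pair at a time or rank columns against a single target. The sketch below reuses the same data; the variable names `r_pair` and `ranked` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "hours_study": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score": [52, 57, 62, 66, 71, 75, 79, 83],
    "sleep_hours": [8.1, 7.8, 7.3, 6.9, 6.4, 6.0, 5.6, 5.1],
})

# Correlation between one pair of columns (a single number)
r_pair = df["hours_study"].corr(df["exam_score"])
print(round(r_pair, 3))

# Correlation of every other column with exam_score,
# ordered by strength (absolute value), strongest first
ranked = (
    df.corr()["exam_score"]
    .drop("exam_score")
    .sort_values(key=abs, ascending=False)
)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top instead of pushing them to the bottom.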

---

Expand All @@ -188,31 +217,32 @@ import numpy as np

# Example 1: Small dataset
study_hours = [1, 2, 3, 4, 5]
exam_scores = [50, 55, 60, 65, 70]
exam_scores = [52, 56, 63, 64, 71] # not perfectly linear

r, p = pearsonr(study_hours, exam_scores)
print("Correlation:", r)
print("p-value:", p)
print("Correlation (small sample):", round(r, 2))
print("p-value (small sample):", round(p, 4))

# Example 2: Larger dataset (100 samples, same trend + noise)
# Example 2: Larger dataset (100 samples, same general trend + noise)
np.random.seed(42)
x = np.linspace(0, 10, 100)
y = 2 * x + np.random.randn(100)
y = 2 * x + np.random.randn(100) * 2 # add random noise

r_large, p_large = pearsonr(x, y)
print("\nCorrelation (large sample):", r_large)
print("\nCorrelation (large sample):", round(r_large, 2))
print("p-value (large sample):", p_large)

```

# Output (Simplified):

```
Correlation: 1.0
p-value: 0.0
Correlation (small sample): 0.98
p-value (small sample): 0.0026

Correlation (large sample): 0.99
p-value (large sample): very close to 0
Correlation (large sample): 0.96
p-value (large sample): 3.4845523605597184e-54
```
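
The p-value depends on the sample size as well as on r. As a rough sketch, the standard t-test for Pearson's r (which assumes roughly normal data) can be written out directly; `p_value_for_r` is a hypothetical helper name, and `scipy.stats.t.sf` supplies the tail probability.

```python
import math
from scipy import stats

def p_value_for_r(r, n):
    """Two-sided p-value for Pearson's r with n paired samples (t-test, n - 2 df)."""
    t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t_stat), df=n - 2)

# The same weak correlation (r = 0.1) at increasing sample sizes
for n in (10, 100, 10_000):
    print(n, p_value_for_r(0.1, n))
```

With r fixed at 0.1, the p-value shrinks as n grows, which is why a very small p-value on a large dataset does not by itself mean the relationship is strong.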

---

## Positive, Negative & No Correlation Examples
Expand Down Expand Up @@ -263,7 +293,7 @@ Output shows correlation close to 0.
Correlation values range between **-1 and +1**.
To build intuition, here’s what scatterplots look like with different correlation strengths (using 100 noisy data points for each case):

> 🔔 **Reminder:** Try running this code yourself in VS Code and see how the plots look. Playing with the numbers is the best way to build intuition!
**Reminder:** Try running this code yourself in VS Code and see how the plots look. Playing with the numbers is the best way to build intuition!

```python
import numpy as np
Expand All @@ -287,17 +317,17 @@ for i, r in enumerate(correlations):

plt.tight_layout()
plt.show()
```

---


## Pitfalls of Correlation

- **Correlation ≠ Causation**: Ice cream sales and drowning incidents rise together in summer, but neither causes the other; hot weather drives both.
- **Outliers Can Mislead**: One strange data point can distort correlation.
- **Large Datasets**: Tiny effects can look “significant” in huge datasets.

👉 Always visualize and think logically before trusting correlation numbers.
Always visualize and think logically before trusting correlation numbers.
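
To see the outlier pitfall in numbers, here is a small sketch with made-up data (the values are invented purely for illustration): eight points with no trend, then the same points plus one extreme value.

```python
import numpy as np

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 3, 6, 4, 5, 3, 6, 4]  # made-up values with no clear trend

r_before = np.corrcoef(x, y)[0, 1]

# Add a single extreme point far away from the rest
x_out = x + [30]
y_out = y + [30]
r_after = np.corrcoef(x_out, y_out)[0, 1]

print("without outlier:", round(r_before, 2))
print("with outlier:   ", round(r_after, 2))
```

One added point is enough to move r from near zero to near +1, which a scatterplot would reveal immediately.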

---

Expand Down Expand Up @@ -325,9 +355,10 @@ sns.lmplot(x="x", y="y", col="dataset", data=df,
plt.suptitle("Anscombe's Quartet: Same Correlation, Very Different Data", y=1.05)
plt.show()
```

---

## Quick Quiz 🎯
## Check for understanding:

**Q1. If the correlation between study hours and exam scores is +0.9, what does that mean?**
<details>