Code-the-Dream-School · Shubham-Agarwall · Oct 16, 2025 · Oct 16, 2025 · Oct 28, 2025 · Oct 28, 2025
diff --git a/.gitignore b/.gitignore
@@ -177,5 +177,8 @@ cython_debug/
 # VS Code settings
 .vscode/
 
 .DS_Store
 lessons/**/.DS_Store
+
+# macOS
+**/.DS_Store
diff --git a/lessons/01_analysis_intro/05_correlation.md b/lessons/01_analysis_intro/05_correlation.md
@@ -19,6 +19,17 @@ Correlation shows if two things move together, and how strongly.
 
 ---
 
+## Important Resources Before You Start
+
+Before you begin this lesson, take a few minutes to go through the following resource:  
+🔗 [Exploring Correlation in Python – GeeksforGeeks](https://www.geeksforgeeks.org/data-analysis/exploring-correlation-in-python/)
+
+It provides a quick overview of how correlation works and will help you follow this lesson more easily.
+
+Once you’ve gone through that article, come back here — we’ll build on those concepts step-by-step.
+
+---
+
 ## Why Correlation Matters
 
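-Correlation is more than just a number, it’s a tool that helps us understand relationships in data.
+Correlation is more than just a number; it’s a tool that helps us understand relationships in data.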
@@ -50,16 +61,19 @@ Example: Study Hours vs Exam Scores
 ```python
 import matplotlib.pyplot as plt
 
+# Slightly varied data to make it realistic (not perfectly linear)
 study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
-exam_scores = [50, 55, 60, 65, 70, 75, 80, 85]
+exam_scores = [48, 55, 58, 67, 72, 74, 81, 84]
 
-plt.scatter(study_hours, exam_scores)
+plt.scatter(study_hours, exam_scores, color='teal')
 plt.title("Study Hours vs Exam Scores")
 plt.xlabel("Study Hours")
 plt.ylabel("Exam Scores")
 plt.show()
 ```
-<img width="609" height="393" alt="Screenshot 2025-09-09 at 3 36 50 PM" src="https://github.com/user-attachments/assets/ab63ea79-4619-40d6-96c1-8f95c19e217b" />
+
+![Scatterplot of Two Variables](resources/04_correlation_1_scatter_two_variables.png)
+
 
 **Interpretation:**  
-The dots go upwards together, as study hours increase, exam scores increase.  
+The dots trend upward from left to right: as study hours increase, exam scores increase.  
@@ -72,12 +86,15 @@ This shows a positive correlation (more study = better scores).
 ```python
 import pandas as pd
 import seaborn as sns
+import matplotlib.pyplot as plt
 
-# Small dataset
+# More variety in numbers for realistic relationships
 data = {
     "hours_study": [1, 2, 3, 4, 5, 6, 7, 8],
-    "exam_score": [50, 55, 60, 65, 70, 75, 80, 85],
-    "sleep_hours": [8, 7, 7, 6, 6, 5, 5, 4]
+    "exam_score": [50, 54, 63, 66, 72, 77, 80, 86],
+    "sleep_hours": [8.2, 7.8, 7.3, 6.7, 6.2, 5.8, 5.4, 4.9],
+    "stress_level": [30, 35, 45, 55, 60, 65, 72, 80],
+    "screen_time": [2.5, 3, 3.5, 4, 4.5, 5.2, 6, 6.5]
 }
 
 df = pd.DataFrame(data)
@@ -86,16 +103,19 @@ df = pd.DataFrame(data)
 corr = df.corr()
 
 # Heatmap
-sns.heatmap(corr, annot=True, cmap="coolwarm")
+sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
 plt.title("Correlation Heatmap")
 plt.show()
 ```
-<img width="564" height="429" alt="Screenshot 2025-09-09 at 3 37 09 PM" src="https://github.com/user-attachments/assets/707594dd-3e92-4d51-b351-a73fa12806e1" />
+
+![Correlation Heatmap](resources/04_correlation_2_heatmap.png)
 
 
 **Interpretation:**  
-- hours_study and exam_score → strongly positive (close to +1).  
-- sleep_hours and hours_study → negative correlation (more study, less sleep).  
+- `hours_study` and `exam_score` → **strong positive correlation (close to +1)**: students who study more generally score higher.  
+- `sleep_hours` and `hours_study` → **strong negative correlation (close to -1)**: students who study more tend to sleep less.  
+- `stress_level` rises with both `hours_study` and `screen_time` (positive correlations). That tells us they move together in this sample, not that studying or screen time causes stress.  
+- The heatmap helps you quickly see **which factors move together** and **which move in opposite directions**.  
 
 ✅ Use scatterplots to see two-variable relationships.  
 ✅ Use heatmaps to see relationships across many variables.
@@ -116,7 +136,7 @@ Where:
 - **cov(X, Y)** = covariance between X and Y (how much they vary together)  
 - **σ_X, σ_Y** = standard deviations of X and Y (how much they vary individually)  
 
-👉 Intuitively:  
+Intuitively:  
 - The numerator (covariance) measures how much two variables move together.  
 - The denominator (product of standard deviations) normalizes this, so we get a value between **-1 and +1**.  
 
@@ -125,7 +145,7 @@ So:
 - If X rises when Y falls → **r = -1**.  
 - If X and Y move independently → **r ≈ 0**.  
 
-📚 For more details, see the [Pearson correlation documentation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).  
+For more details, see the [Wikipedia article on the Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).  
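
To make the formula concrete, here is a minimal sketch that computes r by hand as covariance divided by the product of the standard deviations. `pearson_r` is a hypothetical helper written for this illustration (not a library function), and the five data points are a small made-up sample.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed as cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: covariance, the average product of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Denominator: the two standard deviations (how much each varies on its own)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

# Illustrative data: scores rise with study hours, but not perfectly
study_hours = [1, 2, 3, 4, 5]
exam_scores = [52, 56, 63, 64, 71]

r = pearson_r(study_hours, exam_scores)
print(round(r, 2))  # 0.98
```

Dividing by `n` or by `n - 1` gives the same r, because the factor cancels between the numerator and the denominator.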
 
 ---
 
@@ -134,8 +154,9 @@ So:
 ```python
 import numpy as np
 
+# Slightly varied data to make it more realistic
 study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
-exam_scores = [50, 55, 60, 65, 70, 75, 80, 85]
+exam_scores = [48, 53, 60, 66, 72, 76, 80, 84]
 
 # Correlation matrix
 corr_matrix = np.corrcoef(study_hours, exam_scores)
@@ -144,8 +165,8 @@ print(corr_matrix)
 
 **Output:**  
 ```
-[[1. 1.]
- [1. 1.]]
+[[1.     0.9948]
+ [0.9948 1.    ]]
 ```
 
 ---
@@ -155,23 +176,31 @@ print(corr_matrix)
 ```python
 import pandas as pd
 
+# Slightly varied data for realistic correlations
 data = {
     "hours_study": [1, 2, 3, 4, 5, 6, 7, 8],
-    "exam_score": [50, 55, 60, 65, 70, 75, 80, 85],
-    "sleep_hours": [8, 7, 7, 6, 6, 5, 5, 4]
+    "exam_score": [52, 57, 62, 66, 71, 75, 79, 83],
+    "sleep_hours": [8.1, 7.8, 7.3, 6.9, 6.4, 6.0, 5.6, 5.1]
 }
+
 df = pd.DataFrame(data)
 
 # Correlation matrix
 print(df.corr())
 ```
-**Interpretation:** 
 
-<img width="554" height="152" alt="Screenshot 2025-09-09 at 3 37 55 PM" src="https://github.com/user-attachments/assets/2c1639e8-f178-40ed-ba8a-42b5b65b6863" />
+**Output:**  
+```
+             hours_study  exam_score  sleep_hours
+hours_study     1.000000    0.999035    -0.999156
+exam_score      0.999035    1.000000    -0.997778
+sleep_hours    -0.999156   -0.997778     1.000000
+```
 
-- hours_study & exam_score → strong positive correlation.  
-- hours_study & sleep_hours → negative correlation.  
-- exam_score & sleep_hours → negative correlation.  
+**Interpretation:**  
+- `hours_study` & `exam_score` → **strong positive correlation** (r ≈ 0.999): students who study more generally score higher.  
+- `hours_study` & `sleep_hours` → **strong negative correlation** (r ≈ -0.999): students who study more tend to sleep less.  
+- `exam_score` & `sleep_hours` → also a **strong negative correlation** (r ≈ -0.998): in this dataset, higher scores go with less sleep. That does not mean sleeping less raises scores; both simply track study hours here.  
 
 ---
 
@@ -188,31 +217,32 @@ import numpy as np
 
 # Example 1: Small dataset
 study_hours = [1, 2, 3, 4, 5]
-exam_scores = [50, 55, 60, 65, 70]
+exam_scores = [52, 56, 63, 64, 71]  # not perfectly linear
 
 r, p = pearsonr(study_hours, exam_scores)
-print("Correlation:", r)
-print("p-value:", p)
+print("Correlation (small sample):", round(r, 2))
+print("p-value (small sample):", round(p, 4))
 
-# Example 2: Larger dataset (100 samples, same trend + noise)
+# Example 2: Larger dataset (100 samples, same general trend + noise)
 np.random.seed(42)
 x = np.linspace(0, 10, 100)
-y = 2 * x + np.random.randn(100)
+y = 2 * x + np.random.randn(100) * 2  # add random noise
 
 r_large, p_large = pearsonr(x, y)
-print("\nCorrelation (large sample):", r_large)
+print("\nCorrelation (large sample):", round(r_large, 2))
 print("p-value (large sample):", p_large)
-
+```
 
-# Outout(Simplified):
+**Output (simplified):**
 
 ```
-Correlation: 1.0
-p-value: 0.0
+Correlation (small sample): 0.98
+p-value (small sample): 0.0026
 
-Correlation (large sample): 0.99
-p-value (large sample): very close to 0
+Correlation (large sample): 0.96
+p-value (large sample): 3.4845523605597184e-54
 ```
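
One way to see why sample size matters so much for the p-value is the t statistic commonly used to test whether r differs from zero, t = r · sqrt((n − 2) / (1 − r²)) with n − 2 degrees of freedom. The sketch below assumes that standard test. `t_statistic` is a hypothetical helper (not a SciPy function), and r = 0.5 is a made-up value held fixed while n changes.

```python
import math

def t_statistic(r, n):
    """t statistic for testing r != 0 with n paired observations (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Same hypothetical correlation, three different sample sizes
t_values = {n: t_statistic(0.5, n) for n in (5, 30, 100)}

for n, t in t_values.items():
    print(f"n = {n:3d}  t = {t:.2f}")
```

A larger t corresponds to a smaller p-value, so the same r looks more "significant" as n grows.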
+
 ---
 
 ## Positive, Negative & No Correlation Examples
@@ -263,7 +293,7 @@ Output shows correlation close to 0.
 Correlation values range between **-1 and +1**.  
 To build intuition, here’s what scatterplots look like with different correlation strengths (using 100 noisy data points for each case):
 
-> 🔔 **Reminder:** Try running this code yourself in VS Code and see how the plots look. Playing with the numbers is the best way to build intuition!
+**Reminder:** Try running this code yourself in VS Code and see how the plots look. Playing with the numbers is the best way to build intuition!
 
 ```python
 import numpy as np
@@ -287,17 +317,17 @@ for i, r in enumerate(correlations):
 
 plt.tight_layout()
 plt.show()
+```
 
 ---
 
-
 ## Pitfalls of Correlation
 
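-- **Correlation ≠ Causation**: Ice cream sales ≠ drowning.  
+- **Correlation ≠ Causation**: Ice cream sales and drownings rise together in summer, but neither causes the other; hot weather drives both.  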
 - **Outliers Can Mislead**: One strange data point can distort correlation.  
 - **Large Datasets**: Tiny effects can look “significant” in huge datasets.  
 
-👉 Always visualize and think logically before trusting correlation numbers.
+Always visualize and think logically before trusting correlation numbers.
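
The outlier pitfall is easy to try out. The sketch below uses made-up numbers: eight points with no trend at all, then the same points plus one extreme value. `pearson_r` is a hypothetical helper defined here so the example runs on its own.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from sums of deviations (hypothetical helper for this sketch)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: y bounces around with no relationship to x
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 3, 6, 4, 5, 3, 6, 4]
r_without = pearson_r(x, y)

# Add a single extreme point far away from the rest
r_with = pearson_r(x + [30], y + [30])

print("r without outlier:", round(r_without, 2))
print("r with outlier:   ", round(r_with, 2))
```

With these numbers, r goes from 0 to about 0.96 because of one point, which is why a scatterplot is worth checking before trusting the coefficient.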
 
 ---
 
@@ -325,9 +355,10 @@ sns.lmplot(x="x", y="y", col="dataset", data=df,
 plt.suptitle("Anscombe's Quartet: Same Correlation, Very Different Data", y=1.05)
 plt.show()
 ```
+
 ---
 
-## Quick Quiz 🎯
+## Check for Understanding
 
 **Q1. If the correlation between study hours and exam scores is +0.9, what does that mean?**  
 <details>

diff --git a/lessons/01_analysis_intro/resources/04_correlation_1_scatter_two_variables.png b/lessons/01_analysis_intro/resources/04_correlation_1_scatter_two_variables.png
diff --git a/lessons/01_analysis_intro/resources/04_correlation_2_heatmap.png b/lessons/01_analysis_intro/resources/04_correlation_2_heatmap.png