Add a practice exercise to observe the effect of sample size and number of simulated samples on the distribution of sample means.
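
For reviewers, here is a minimal standalone sketch of the behavior the new exercise is meant to surface. It does not use the NHANES data: the population is synthetic (normally distributed heights with an assumed mean of 168 and standard deviation of 10), and the helper name `simulate_sample_means` is hypothetical, not something defined in the notebook.

```python
import numpy as np


def simulate_sample_means(population, sample_size, num_samples, seed=0):
    """Draw num_samples samples without replacement and return their means."""
    rng = np.random.default_rng(seed)
    means = np.empty(num_samples)
    for i in range(num_samples):
        sample = rng.choice(population, size=sample_size, replace=False)
        means[i] = sample.mean()
    return means


# Synthetic stand-in for the height column (assumed values, not NHANES data).
population = np.random.default_rng(42).normal(loc=168.0, scale=10.0, size=5000)

# Compare the spread of the sample means for the sample sizes used in the exercise.
spread_by_size = {}
for n in (5, 50, 500):
    means = simulate_sample_means(population, n, num_samples=2000)
    spread_by_size[n] = means.std(ddof=1)
    print(f"sample size {n:>3}: standard deviation of sample means = {spread_by_size[n]:.3f}")
```

The spread of the sample means should shrink as the sample size grows, roughly in proportion to one over the square root of the sample size. Raising `num_samples` does not change that spread; it only makes the histogram of means smoother.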

lcnature · lcnature · commit 0cf1062ab3fa · 2024-09-24T02:09:02.000-04:00
Adding explanation of quantile
diff --git a/notebooks/06-Sampling.ipynb b/notebooks/06-Sampling.ipynb
@@ -29,6 +29,30 @@
         "adult_nhanes_data = adult_nhanes_data.dropna(subset=['StandingHeightCm']).rename(columns={'StandingHeightCm': 'Height'})"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Now let's draw a sample of 50 individuals from the dataset, and calculate its mean.\n",
+        "Try executing the next cell several times. What do you see?"
+      ],
+      "metadata": {
+        "id": "t_pKb6uq7qsX"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "sample_size = 50\n",
+        "sample = adult_nhanes_data.sample(sample_size)\n",
+        "print('Sample mean:', sample['Height'].mean())\n",
+        "print('Sample standard deviation:', sample['Height'].std())"
+      ],
+      "metadata": {
+        "id": "FN_DN2Lo7qCb"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -55,11 +79,14 @@
         "\n",
         "# set up a variable to store the result\n",
         "sampling_results = pd.DataFrame({'mean': np.zeros(num_samples)})\n",
-        "\n",
+        "print('A zero-filled data frame to be filled with sample means:')\n",
+        "print(sampling_results)\n",
         "for sample_num in range(num_samples):\n",
         "    sample = adult_nhanes_data.sample(sample_size)\n",
         "    sampling_results.loc[sample_num, 'mean'] = sample['Height'].mean()\n",
-        "#-"
+        "#-\n",
+        "print(f'Means of {num_samples} samples:')\n",
+        "print(sampling_results)"
       ]
     },
     {
@@ -103,9 +130,23 @@
         "    loc=sampling_results['mean'].mean(),\n",
         "    scale=sampling_results['mean'].std())\n",
         "plt.plot(x_values, normal_values, color='r')\n",
-        "#+"
+        "#+\n",
+        "print('Standard deviation of the sample means:', sampling_results['mean'].std())"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Now, can you rerun the sampling simulation above, making one of the following changes each time?\n",
+        "\n",
+        "- Change the sample size to 5 or 500. What difference do you observe in the distribution of sample means?\n",
+        "\n",
+        "- Change the number of samples drawn to 50,000. Does the histogram look closer to a normal distribution?"
+      ],
+      "metadata": {
+        "id": "p5J5iklPDqhu"
+      }
+    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -125,7 +166,8 @@
       },
       "outputs": [],
       "source": [
-        "plt.hist(adult_nhanes_data['AnnualFamilyIncome'])"
+        "plt.hist(adult_nhanes_data['AnnualFamilyIncome'])\n",
+        "plt.show()"
       ]
     },
     {
@@ -192,7 +234,8 @@
       "source": [
         "adult_income_data = adult_nhanes_data.dropna(subset=['AnnualFamilyIncome'])\n",
         "family_income_sampling_dist = sample_and_return_mean(adult_income_data, 'AnnualFamilyIncome')\n",
-        "_ = plt.hist(family_income_sampling_dist['mean'], 100)"
+        "_ = plt.hist(family_income_sampling_dist['mean'], 100)\n",
+        "plt.show()"
       ]
     },
     {
@@ -201,7 +244,11 @@
         "id": "O3FH7bGx7SjX"
       },
       "source": [
-        "This distribution looks nearly normal.  We can also use a quantile-quantile, or \"Q-Q\" plot, to examine this.  We will plot two Q-Q plots; on the left we plot one for the original data, and on the right we plot one for the sampling distribution of the mean."
+        "This distribution looks nearly normal.  We can also use a quantile-quantile, or \"Q-Q\" plot, to examine this.  \n",
+        "\n",
+        "A quantile is the value below which a certain percentage of all the scores fall. For example, the 5th percentile is the value below which 5% of the scores fall. If two distributions have the same shape, then their corresponding quantiles should fall along a straight line when plotted against each other.\n",
+        "\n",
+        "We will plot two Q-Q plots; on the left we plot one for the original data, and on the right we plot one for the sampling distribution of the mean."
       ]
     },
     {