|
29 | 29 | "adult_nhanes_data = adult_nhanes_data.dropna(subset=['StandingHeightCm']).rename(columns={'StandingHeightCm': 'Height'})" |
30 | 30 | ] |
31 | 31 | }, |
| 32 | + { |
| 33 | + "cell_type": "markdown", |
| 34 | + "source": [ |
| 35 | + "Now let's draw a sample of 50 individuals from the dataset and calculate its mean and standard deviation.\n", |
| 36 | + "Try executing the next cell repeatedly. What do you see?" |
| 37 | + ], |
| 38 | + "metadata": { |
| 39 | + "id": "t_pKb6uq7qsX" |
| 40 | + } |
| 41 | + }, |
| 42 | + { |
| 43 | + "cell_type": "code", |
| 44 | + "source": [ |
| 45 | + "sample_size = 50\n", |
| 46 | + "sample = adult_nhanes_data.sample(sample_size)\n", |
| 47 | + "print('Sample mean:', sample['Height'].mean())\n", |
| 48 | + "print('Sample standard deviation:', sample['Height'].std())" |
| 49 | + ], |
| 50 | + "metadata": { |
| 51 | + "id": "FN_DN2Lo7qCb" |
| 52 | + }, |
| 53 | + "execution_count": null, |
| 54 | + "outputs": [] |
| 55 | + }, |
32 | 56 | { |
33 | 57 | "cell_type": "markdown", |
34 | 58 | "metadata": { |
|
55 | 79 | "\n", |
56 | 80 | "# set up a variable to store the result\n", |
57 | 81 | "sampling_results = pd.DataFrame({'mean': np.zeros(num_samples)})\n", |
58 | | - "\n", |
| 82 | + "print('A zero-filled data frame that will hold the sample means:')\n", |
| 83 | + "print(sampling_results)\n", |
59 | 84 | "for sample_num in range(num_samples):\n", |
60 | 85 | " sample = adult_nhanes_data.sample(sample_size)\n", |
61 | 86 | " sampling_results.loc[sample_num, 'mean'] = sample['Height'].mean()\n", |
62 | | - "#-" |
| 87 | + "#-\n", |
| 88 | + "print('Means of', num_samples, 'samples:')\n", |
| 89 | + "print(sampling_results)" |
63 | 90 | ] |
64 | 91 | }, |
65 | 92 | { |
|
103 | 130 | " loc=sampling_results['mean'].mean(),\n", |
104 | 131 | " scale=sampling_results['mean'].std())\n", |
105 | 132 | "plt.plot(x_values, normal_values, color='r')\n", |
106 | | - "#+" |
| 133 | + "#+\n", |
| 134 | + "print('Standard deviation of the sample means:', sampling_results['mean'].std())" |
107 | 135 | ] |
108 | 136 | }, |
| 137 | + { |
| 138 | + "cell_type": "markdown", |
| 139 | + "source": [ |
| 140 | + "Now, can you redo the sampling simulation above, making one of the following changes each time?\n", |
| 141 | + "\n", |
| 142 | + "- Change the sample size to 5 or 500. What difference do you observe in the distribution of sample means?\n", |
| 143 | + "\n", |
| 144 | + "- Change the number of samples drawn to 50,000. Does the histogram look closer to a normal distribution?" |
| 145 | + ], |
| 146 | + "metadata": { |
| 147 | + "id": "p5J5iklPDqhu" |
| 148 | + } |
| 149 | + }, |
109 | 150 | { |
110 | 151 | "cell_type": "markdown", |
111 | 152 | "metadata": { |
|
125 | 166 | }, |
126 | 167 | "outputs": [], |
127 | 168 | "source": [ |
128 | | - "plt.hist(adult_nhanes_data['AnnualFamilyIncome'])" |
| 169 | + "plt.hist(adult_nhanes_data['AnnualFamilyIncome'])\n", |
| 170 | + "plt.show()" |
129 | 171 | ] |
130 | 172 | }, |
131 | 173 | { |
|
192 | 234 | "source": [ |
193 | 235 | "adult_income_data = adult_nhanes_data.dropna(subset=['AnnualFamilyIncome'])\n", |
194 | 236 | "family_income_sampling_dist = sample_and_return_mean(adult_income_data, 'AnnualFamilyIncome')\n", |
195 | | - "_ = plt.hist(family_income_sampling_dist['mean'], 100)" |
| 237 | + "_ = plt.hist(family_income_sampling_dist['mean'], 100)\n", |
| 238 | + "plt.show()" |
196 | 239 | ] |
197 | 240 | }, |
198 | 241 | { |
|
201 | 244 | "id": "O3FH7bGx7SjX" |
202 | 245 | }, |
203 | 246 | "source": [ |
204 | | - "This distribution looks nearly normal. We can also use a quantile-quantile, or \"Q-Q\" plot, to examine this. We will plot two Q-Q plots; on the left we plot one for the original data, and on the right we plot one for the sampling distribution of the mean." |
| 247 | + "This distribution looks nearly normal. We can also use a quantile-quantile, or \"Q-Q\" plot, to examine this. \n", |
| 248 | + "\n", |
| 249 | + "A quantile is the value below which a given percentage of all the scores fall. For example, the 5th percentile is the value below which 5% of the scores fall. If two distributions have the same shape, then their corresponding quantiles should fall along a straight line.\n", |
| 250 | + "\n", |
| 251 | + "We will plot two Q-Q plots; on the left we plot one for the original data, and on the right we plot one for the sampling distribution of the mean." |
205 | 252 | ] |
206 | 253 | }, |
207 | 254 | { |
|
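The sample-size exercise added in this diff can be previewed without the NHANES data. The following is a minimal sketch, assuming a synthetic normal population as a stand-in for the `Height` column; the helper `sd_of_sample_means`, the population parameters, and the choice of 2000 repetitions are all illustrative and not part of the notebook. It compares the observed standard deviation of the sample means with the usual `sigma / sqrt(n)` approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the notebook's Height column (synthetic, not NHANES data).
population = rng.normal(loc=168.0, scale=10.0, size=100_000)


def sd_of_sample_means(values, sample_size, num_samples, rng):
    """Draw num_samples samples without replacement and return the SD of their means."""
    means = np.empty(num_samples)
    for i in range(num_samples):
        means[i] = rng.choice(values, size=sample_size, replace=False).mean()
    return means.std(ddof=1)


observed_sd = {}
for sample_size in (5, 50, 500):
    observed_sd[sample_size] = sd_of_sample_means(population, sample_size, 2000, rng)
    # Approximation for the standard error of the mean: sigma / sqrt(n).
    predicted = population.std() / np.sqrt(sample_size)
    print(f"n={sample_size:>3}: observed SD of means = {observed_sd[sample_size]:.3f}, "
          f"sigma/sqrt(n) = {predicted:.3f}")
```

Larger samples should give a narrower distribution of sample means, which is the pattern the exercise asks readers to look for in the histogram.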
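The quantile explanation added to the Q-Q plot cell can be checked numerically. This is a rough sketch on synthetic data (the arrays `normal_a`, `normal_b`, and `skewed` are hypothetical and unrelated to the NHANES variables): it computes a 5th percentile, then compares matching quantiles of two samples, which are the points a Q-Q plot would draw. The correlation of those points is used here only as a crude measure of how close they lie to a straight line.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, illustrative samples (hypothetical; not the NHANES variables).
normal_a = rng.normal(loc=0.0, scale=1.0, size=10_000)
normal_b = rng.normal(loc=170.0, scale=10.0, size=10_000)
skewed = rng.exponential(scale=1.0, size=10_000)

# The 5th percentile is the value that about 5% of the observations fall below.
p5 = np.quantile(normal_b, 0.05)
fraction_below = np.mean(normal_b < p5)

# Matching quantiles (1st to 99th percentile) of each pair: the points of a Q-Q plot.
probs = np.linspace(0.01, 0.99, 99)
q_a = np.quantile(normal_a, probs)
corr_same_shape = np.corrcoef(q_a, np.quantile(normal_b, probs))[0, 1]
corr_diff_shape = np.corrcoef(q_a, np.quantile(skewed, probs))[0, 1]

print(f"fraction of normal_b below its 5th percentile: {fraction_below:.3f}")
print(f"quantile correlation, normal vs normal: {corr_same_shape:.4f}")
print(f"quantile correlation, normal vs skewed: {corr_diff_shape:.4f}")
```

Two normal samples with different means and scales still have the same shape, so their quantiles line up almost perfectly; the skewed sample bends away from the line.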