Commit 0cf1062

Adding a practice exercise to observe the effect of sample size and the number of simulations on the distribution of sample means.
Adding an explanation of quantiles.
1 parent b42f019 commit 0cf1062

1 file changed

+53
-6
lines changed

notebooks/06-Sampling.ipynb

Lines changed: 53 additions & 6 deletions
@@ -29,6 +29,30 @@
2929
"adult_nhanes_data = adult_nhanes_data.dropna(subset=['StandingHeightCm']).rename(columns={'StandingHeightCm': 'Height'})"
3030
]
3131
},
32+
{
33+
"cell_type": "markdown",
34+
"source": [
35+
"Now let's draw a sample of 50 individuals from the dataset, and calculate its mean.\n",
36+
"Try to execute the next cell repeatedly. What do you see?"
37+
],
38+
"metadata": {
39+
"id": "t_pKb6uq7qsX"
40+
}
41+
},
42+
{
43+
"cell_type": "code",
44+
"source": [
45+
"sample_size = 50\n",
46+
"sample = adult_nhanes_data.sample(sample_size)\n",
47+
"print('Sample mean:', sample['Height'].mean())\n",
48+
"print('Sample standard deviation:', sample['Height'].std())"
49+
],
50+
"metadata": {
51+
"id": "FN_DN2Lo7qCb"
52+
},
53+
"execution_count": null,
54+
"outputs": []
55+
},
3256
{
3357
"cell_type": "markdown",
3458
"metadata": {
@@ -55,11 +79,14 @@
5579
"\n",
5680
"# set up a variable to store the result\n",
5781
"sampling_results = pd.DataFrame({'mean': np.zeros(num_samples)})\n",
58-
"\n",
82+
"print('A zero-filled data frame to be filled with sample means:')\n",
83+
"print(sampling_results)\n",
5984
"for sample_num in range(num_samples):\n",
6085
" sample = adult_nhanes_data.sample(sample_size)\n",
6186
" sampling_results.loc[sample_num, 'mean'] = sample['Height'].mean()\n",
62-
"#-"
87+
"#-\n",
88+
"print(f'Means of {num_samples} samples:')\n",
89+
"print(sampling_results)"
6390
]
6491
},
6592
{
@@ -103,9 +130,23 @@
103130
" loc=sampling_results['mean'].mean(),\n",
104131
" scale=sampling_results['mean'].std())\n",
105132
"plt.plot(x_values, normal_values, color='r')\n",
106-
"#+"
133+
"#+\n",
134+
"print('Standard deviation of the sample means:', sampling_results['mean'].std())"
107135
]
108136
},
137+
{
138+
"cell_type": "markdown",
139+
"source": [
140+
"Now redo the sampling simulation above, making one of the following changes each time:\n",
141+
"\n",
142+
"- Change the sample size to 5 or 500. What difference do you observe in the distribution of sample means?\n",
143+
"\n",
144+
"- Change the number of samples drawn to 50,000. Does the histogram appear closer to a normal distribution?"
145+
],
146+
"metadata": {
147+
"id": "p5J5iklPDqhu"
148+
}
149+
},
109150
{
110151
"cell_type": "markdown",
111152
"metadata": {
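The practice cell added in this hunk can be tried outside the notebook with a small sketch. It assumes a synthetic, roughly normal population as a stand-in for the `Height` column (the real NHANES data is not loaded here), and the helper name `sample_means` is hypothetical. Samples are drawn without replacement, which matches the default behaviour of `DataFrame.sample`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 'Height' column (not the real NHANES data).
population = rng.normal(loc=168.0, scale=10.0, size=5000)


def sample_means(values, sample_size, num_samples, rng):
    """Draw num_samples samples without replacement and return their means."""
    return np.array([
        rng.choice(values, size=sample_size, replace=False).mean()
        for _ in range(num_samples)
    ])


spread_by_size = {}
for sample_size in (5, 50, 500):
    means = sample_means(population, sample_size, num_samples=2000, rng=rng)
    spread_by_size[sample_size] = means.std(ddof=1)
    # Rough theoretical value, ignoring the finite-population correction.
    expected = population.std(ddof=1) / np.sqrt(sample_size)
    print(f"n={sample_size:3d}  sd of sample means={spread_by_size[sample_size]:.3f}  "
          f"sigma/sqrt(n)={expected:.3f}")
```

In this sketch the spread of the sample means shrinks as the sample size grows, roughly in proportion to one over the square root of the sample size. Raising `num_samples` does not shrink that spread; it mainly gives a smoother estimate of the same distribution.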
@@ -125,7 +166,8 @@
125166
},
126167
"outputs": [],
127168
"source": [
128-
"plt.hist(adult_nhanes_data['AnnualFamilyIncome'])"
169+
"plt.hist(adult_nhanes_data['AnnualFamilyIncome'])\n",
170+
"plt.show()"
129171
]
130172
},
131173
{
@@ -192,7 +234,8 @@
192234
"source": [
193235
"adult_income_data = adult_nhanes_data.dropna(subset=['AnnualFamilyIncome'])\n",
194236
"family_income_sampling_dist = sample_and_return_mean(adult_income_data, 'AnnualFamilyIncome')\n",
195-
"_ = plt.hist(family_income_sampling_dist['mean'], 100)"
237+
"_ = plt.hist(family_income_sampling_dist['mean'], 100)\n",
238+
"plt.show()"
196239
]
197240
},
198241
{
@@ -201,7 +244,11 @@
201244
"id": "O3FH7bGx7SjX"
202245
},
203246
"source": [
204-
"This distribution looks nearly normal. We can also use a quantile-quantile, or \"Q-Q\" plot, to examine this. We will plot two Q-Q plots; on the left we plot one for the original data, and on the right we plot one for the sampling distribution of the mean."
247+
"This distribution looks nearly normal. We can also use a quantile-quantile, or \"Q-Q\" plot, to examine this.\n",
248+
"\n",
249+
"A quantile is the value below which a certain percentage of all the scores fall. For example, the 5th percentile is the value below which 5% of the scores fall. If two distributions have the same shape, then plotting their corresponding quantiles against each other should give points that fall close to a straight line.\n",
250+
"\n",
251+
"We will plot two Q-Q plots; on the left we plot one for the original data, and on the right we plot one for the sampling distribution of the mean."
205252
]
206253
},
207254
{

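The quantile explanation added in the last hunk can be checked numerically. The sketch below is illustrative only: the two data sets are synthetic (hypothetical stand-ins for a roughly normal variable and a right-skewed one), and the helper `qq_correlation` is not part of the notebook. It compares observed quantiles with standard normal quantiles and reports how close to linear the relationship is, which is what a Q-Q plot shows visually.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: one roughly normal sample and one right-skewed sample.
normal_like = rng.normal(loc=168.0, scale=10.0, size=2000)
skewed = rng.exponential(scale=50000.0, size=2000)

# The 5th percentile is the value below which about 5% of the scores fall.
p5 = np.percentile(normal_like, 5)
share_below = np.mean(normal_like < p5)
print("5th percentile:", round(float(p5), 1))
print("share of scores below it:", share_below)


def qq_correlation(values):
    """Correlate observed quantiles with standard normal quantiles."""
    probs = np.linspace(0.01, 0.99, 99)
    observed = np.quantile(values, probs)
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.corrcoef(theoretical, observed)[0, 1]


print("Q-Q correlation, normal-like data:", qq_correlation(normal_like))
print("Q-Q correlation, skewed data:", qq_correlation(skewed))
```

A correlation very close to 1 corresponds to Q-Q points lying near a straight line; the skewed sample should score noticeably lower, reflecting the curvature its Q-Q plot would show.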