forked from bvkrauth/is4e
-
Notifications
You must be signed in to change notification settings - Fork 0
/
05-Basic-data-analysis.Rmd
1178 lines (913 loc) · 48.4 KB
/
05-Basic-data-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Basic data analysis with Excel {#basic-data-analysis-with-excel}
Once we have [cleaned our data](#basic-data-cleaning-with-excel) we are ready
to start analyzing it. Our analysis can include viewing the data, summarizing
it in tables, and visualizing it in charts.
This chapter will demonstrate the use of Excel to construct ***univariate***
statistics and charts, i.e., statistics and charts that describe a single
variable.
::: {.goals data-latex=""}
**Chapter goals**
In this chapter, we will learn how to:
1. Use Excel tools for viewing larger data tables, including:
- Sorting
- Filtering
- Freezing panes
2. Calculate and interpret univariate summary statistics by hand:
- Sample size (count)
- Sample average
- Sample median
- Sample variance and standard deviation
3. Calculate univariate summary statistics in Excel.
- Sample size (count)
- Sample average
- Sample range, percentiles, and median
- Sample variance and standard deviation
4. Construct and interpret simple and binned frequency tables in Excel.
5. Construct and interpret time series (line) graphs in Excel.
6. Construct and interpret bar graphs and histograms in Excel.
:::
To prepare for this chapter, please review the chapter on
[basic data cleaning with Excel](#basic-data-cleaning-with-excel).
## Viewing large data files
The first step in any data analysis is to literally look at the data.
::: example
**Historical employment data for Canada**
Our analysis in this chapter will use historical employment data for Canada from
January 1976 through January 2021.
1. Set up a project folder.
2. Download the file at
[https://bookdown.org/bkrauth/IS4E/sampledata/CanEmpHist.xlsx](sampledata/CanEmpHist.xlsx) and put it in your raw data subfolder.
3. Make a working copy in your project folder.
- I called mine `CanEmpHist_WC.xlsx`.
4. Open your working copy. It will look something like this:
![*Canadian employment data, 1976-2021*](bin/CanadaEmploymentHistorical.png)
There is a worksheet titled *Raw data* that contains the original data as
obtained from Statistics Canada, along with information on its sources.
Our main data set for analysis is in the worksheet *Data for Analysis*, which
includes the following variables:
- **MonthYr**: the month and year of the observation.
- **Population**: the civilian, non-institutionalized working-age population
of Canada at that time, in thousands.
- **Employed**: the total number employed in the population, in thousands.
- **Unemployed**: the total number employed in the population, in thousands.
- **LabourForce**: the sum of Employed and Unemployed.
- **NotInLabourForce**: the difference between Population and LabourForce.
- **UnempRate**: the percentage of the labour force that is unemployed. As
before, it is calculated and stored as a decimal (ranging from 0.0 to 1.0)
but displayed as a percentage (ranging from 0% to 100%).
- **LFPRate**: the percentage of the population that is in the labour force. As
before, it is calculated and stored as a decimal (ranging from 0.0 to 1.0)
but displayed as a percentage (ranging from 0% to 100%).
- **Party**: the political party in control of the Federal government. If the
party in control changed during the month, it is listed as "Transfer".
- **PrimeMinister**: the name of the Prime Minister. If the prime minister
changed during the month, it is listed as "Transfer".
- **AnnPopGrowth**: the rate of population growth over the previous 12 months,
calculated as a proportion and displayed as a percentage. Note that this
variable is blank for the first 12 months of the data set.
Note that this data set covers over 500 months, and is much larger than the
single month of data we cleaned earlier.
:::
Most data sets of interest are very large. For example, surveys from
Statistics Canada can have hundreds of observations and hundreds of thousands of
variables, and companies and governments often work with transactions-level data
that includes millions of observations. Humans are smart, but not smart enough
to fully understand a large data set without some kind of simplification or
"dimension reduction." Instead of trying to directly interpret thousands or
millions of numbers, we will need to calculate and view a relatively small
number of ***statistics*** based on the data. A statistic is just a number
calculated from data.
But before we calculate statistics for our data set, we can use a few Excel
tools - sorting, filtering and freezing - to view selected parts of our
data set.
### Sorting
***Sorting*** allows us to re-order the rows of our data based on
the value in one or more of the columns. Since order does not matter
with tidy data, we can sort in whatever way we like without changing
the content of our data.
::: fyi
Before sorting data, always make sure that the order of rows does not convey
important information.
:::
::: example
**Sorting the employment data**
The original data are ordered by date. But suppose we wanted it to be ordered by
unemployment rate (with the highest rate on top) instead. Here's all we need to
do:
1. Select any cell in the **UnempRate** column.
2. Select `Home > Sort and Filter > Sort Largest to Smallest`.
As you can see, the data set is now sorted by unemployment rate. We can follow
similar steps to put the data set back in date order:
1. Select any cell in the **MonthYr** column.
2. Select `Home > Sort and Filter > Sort Oldest to Newest`.
Notice that Excel can tell whether a column contains numbers, texts or dates,
and will sort accordingly.
:::
You can sort in ascending or descending order, and you can sort on multiple
columns by selecting the "Custom sort" option.
### Filtering
***Filtering*** allows us to hide some observations so we can look at a
particular subset of observations that we are interested in. The hidden
observations are still there, so this is a much safer option than deleting
observations.
::: example
**Filtering the employment data**
Suppose we only want to see data from when Paul Martin was the Prime Minister.
Here's what we can do:
1. Select `Home > Sort and Filter > Filter`. If you look at the column headers
in your sheet you will see that they have become drop-down boxes.
3. Click on the drop-down box for **PrimeMinister**, then uncheck every box
but the one marked `Paul Martin`.
At this point, only the months when Paul Martin was the Prime Minister appear in
the table. Don't worry, the other ones haven't gone anywhere.
We can undo the filter and remove the drop-down boxes by selecting
`Home > Sort and Filter > Filter` again.
:::
You can filter on more complex criteria, and you can combine sorting and
filtering.
### Freezing panes
***Freezing panes*** keeps some rows and/or columns visible regardless of which
cell is currently selected. This allows us to work with large tables while
keeping the top row (variable names) and/or the first column (observation IDs)
visible.
::: example
**Freezing panes**
Go down to row 50 or so in your worksheet. Notice that you cannot see the
variable names in row 1 any more. We can make sure the first row stays
visible by doing the following:
1. Go back to cell A1.
2. Select `View > Freeze Panes > Freeze Top Row`.
Now go back down to row 50 or so. You will see that the top row is still
displayed and you can see the variable names.
To undo this, select `View > Freeze Panes > Unfreeze Panes`.
:::
Instead of freezing the first row, you can freeze the first column, or you can
freeze any number of rows and columns.
## Summary statistics
We will start by calculating some commonly-used ***univariate*** statistics,
which are statistics that describe a single variable in isolation. In Section
\@ref(univariate-graphs-in-excel) we will construct some commonly-used
univariate graphs.
Multivariate statistics describe the relationships among multiple variables,
and will be covered in Chapter \@ref(multivariate-data-analysis).
A table of ***summary statistics*** reports various univariate statistics for
each variable in our data set. For example, our table of summary statistics
might look like this:
| Statistic | *(variable name 1)* | *(variable name 2)* | *(etc.)* |
|:----------------|:-------------------:|---------------------|---------|
| Count | *(count of valid observations)*
| Average | *(average)*
| StdDev | *(standard deviation)*
| Min | *(minimum value observed)*
| 10th percentile |
| Median |
| 90th percentile |
| Max | *(maximum value observed)*
You saw many of these words - standard deviation, percentile, median, etc. -
in Chapter \@ref(random-variables), but I need to be clear on something: even
though the names are the same, the concepts are *not* exactly the same.
The ***count*** or ***sample size*** is the number $n$ of observations with
valid (numeric) values for the variable. It is calculated in Excel using the
`COUNT()` function.
The ***sample average*** is a measure of central tendency in data, and is
calculated:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$
or in Excel using the `AVERAGE()` function.
The ***sample standard deviation*** is a measure of variation in the data,
and is calculated:
$$s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}$$
or in Excel using the `STDEV.S()` function.
The ***sample minimum*** is the lowest value observed in the data, and the
***sample maximum*** is the highest value observed in the data. They can be
calculated in Excel using the `MIN()` and `MAX()` functions.
The ***sample median*** is another measure of central tendency, and is defined:
\begin{align}
\hat{m} &= \begin{cases}
x_{[(n/2) + (1/2)]} & \textrm{if $n$ is odd} \\
\frac{x_{[n/2]} + x_{[(n/2) + 1]}}{2} & \textrm{if $n$ is even} \\
\end{cases}
\end{align}
where $x_{[k]}$ is the value of $x_i$ for the $k$th-lowest observation in the
data set. This formula looks complicated, but it's just a way of describing how
you were probably taught to calculate medians in grade school. For example:
- The sample median of $D_3 = (4,1,3)$ is $\hat{m} = 3$.
- The sample median of $D_4 = (4,1,3,2)$ is $\hat{m} = 2.5$.
It can be calculated in Excel using the `MEDIAN()` function.
The ***sample 10th percentile*** is a number that is above 10\% of observations
in the data, and the ***sample 90th percentile*** is a number that is above
90\% of observations. Either can be calculated in Excel using the `PERCENTILE.INC()`
function. I will not ask you to calculate these by hand.
Note that the sample average/median/percentiles/etc. describe
*the values observed in a set of data*, and are calculated from that data set.
The mean/median/percentiles we learned about in Chapter \@ref(random-variables)
describe *the probability distribution of a random variable* and are calculated
from that probability distribution. As you might imagine, these two sets of
concepts are distinct but related. We will discuss how they are related in
Chapter \@ref(statistics). For now, just keep in mind that they are distinct.
### Constructing the table
We are going to create a nice table of summary statistics for some of our
variables.
::: example
**Table setup**
To set up the table:
1. Create a new blank worksheet.
- You can do this by clicking on the !["+"](bin/AddWorksheetButton.png){height=16pt}
button at the bottom of the screen.
2. Rename the worksheet *Summary Statistics*.
- You can do this by double-clicking on the tab for your new worksheet. A
dialog box will open for you to enter the new name.
3. Fill in the first column (cells A1:A9) as in the table above.
:::
The next step is to fill in the row of variable names. We could type them in,
but let's do something more sophisticated and flexible: use a formula to pull
the variable names in from the original data set.
::: example
**Getting variable names from the data**
To fill in the variable name for column B:
1. Go to cell B1.
2. Type `=` but *don't* hit `<enter>` yet.
3. Select the tab for the *Data for Analysis* worksheet, and then select cell G1
in the *Data for Analysis* worksheet.
- The formula bar now says `='Data for Analysis'!G1`
4. Select `<enter>`.
- You should now be back in cell B1 in the Summary statistics worksheet.
- Cell B1 should display **UnempRate** (the contents of cell G1 in
*Data for Analysis*).
What have we done here? We have constructed a formula that successfully
references a cell in another worksheet. Note that the relative references are
even copied over correctly.
:::
Excel has many built-in functions to calculate summary statistics.
::: example
**Summary statistics for UnempRate**
To fill in the statistics for column B:
1. Use the `COUNT()` function to report the observation count in cell B2.
- We will want to use an absolute reference for the rows, and a relative
reference for the columns, so the formula should be
`=COUNT('Data for Analysis'!G$2:G$542)`.
2. Use the `AVERAGE()` function to report the average unemployment rate in cell
B3.
3. Use the `STDEV.S()` function to report the standard deviation of the
unemployment rate in cell B4.
- There is another built-in Excel function called `STDEV.P()` that uses a
slightly different formula for the standard deviation:
$$s_x^p = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2}$$
We will discuss the difference between these two statistics later.
4. Use the `MIN()` function to report the minimum unemployment rate in cell B5.
5. Use the `PERCENTILE.INC()` function to report the 10th percentile of the
unemployment rate in cell B6.
- Warning: Despite its name, the `PERCENTILE.INC()` function takes a quantile
(between 0 and 1) rather than a percentile (between 0 and 100) as its argument.
6. Use the `MEDIAN()` function to report the median unemployment rate in cell
B7.
7. Use the `PERCENTILE.INC()` function to report the 90th percentile of the
unemployment rate in cell B8.
8. Use the `MAX()` function to report the maximum unemployment rate in cell B9.
We now have a table that reports all of the major summary statistics calculated
for the unemployment rate.
:::
We would also like to calculate summary statistics for other variables.
Fortunately, we have set up our table in a way that makes that easy.
::: example
**Summary statistics for other variables**
To fill in summary statistics for other variables:
1. Copy the contents of cells B1:B9 to cells C1:F9.
Now column C contains summary statistics for the labour force participation
rate (**LFPRate**), and columns D through F contain summary statistics for some
of the other variables in our data set.
:::
If we wanted to, we could set up the table to calculate summary statistics for
all of the variables. But let's just stop with these, and move on to making the
table look a little nicer.
### Cleaning up the table
Our table now has all of the information we need, but it is still kind of ugly.
Let's make it look nice and presentable.
::: example
**Cell size**
The first problem is that the columns may be too narrow or too wide. If a column
is too narrow, some of its values will display as "####". If it is too wide, it
will take up too much room.
There are many different ways to adjust row heights and column widths, but here
is the simplest:
1. Select the whole sheet by clicking on the
![select sheet](bin/SelectSheetButton.png){height=16pt}
button in its upper left corner.
2. Select `Home > Format > AutoFit Column Width` from the menu.
Not only will this make everything fit, it will automatically adjust the width
as necessary when anything changes.
:::
::: example
**Error codes**
The second problem is that the non-numeric variables report error codes like
`#DIV/0!` or `#NUM!` for several statistics, and nonsense values for others.
It is usually better to leave something blank than have it display meaningless,
confusing or incorrect information. So let's just delete those columns:
1. Select columns D and E.
2. Select `Home > Delete > Delete Sheet Columns`.
:::
::: example
**Display formatting**
To make the display formatting a little nicer.
1. Put the top row in **bold**.
2. Put the first column in *italics*.
3. Adjust the number display formats to look nice. Remember that the number
display format has no effect on the number itself.
- Leave the counts as they are.
- Display the unemployment rate, LFP rate and population growth rate in
percentages, rounded to one decimal place.
4. Feel free to play around with colors, fonts, etc. to get a table that you
like.
5. Once you have the table you like, save your working copy.
We now have a nice table of summary statistics:
![*Table of summary statistics*](bin/SummaryStatisticsTable.png){width=50%}
that we could put into a Word or PowerPoint document, and share with an audience.
:::
## Frequency tables
Another way of looking at a single variable is to construct a
***frequency table***. A frequency table describes the distribution of values
for a single variable.
### Simple frequency tables
Frequency tables are at their simplest and most useful when the underlying
variable is discrete, categorical, or even non-numeric. The frequency table for
this kind of variable looks like this:
| variablename | Count | Percentage |
|:-----------------|:------------:|:------------:|
| *(value1)* | *(# value1)* | *(% value1)* |
| *(value2)* | *(# value2)* | *(% value2)* |
| *(value3)* | *(# value3)* | *(% value3)* |
| *etc.* | | |
where:
- The **(variablename)** column lists all possible values of the variable.
- The **Count** column reports the number of observations in which the
variable matches that value or range of values.
- The **Percentage** column reports the count as a percentage of the
total number of observations.
The Excel functions `COUNTIF()` and `COUNTIFS()` can be used to construct the
counts. I will use `COUNTIFS()` which has a pair of arguments:
- The first argument `criteria_range` gives the range containing the data we
want to describe.
- The second argument `criteria` gives the criteria we want to match.
The kind of criteria you can use are most easily described by examples:
- The formula `=COUNTIFS(A1:A5,"Hello")` returns a count of the number of cells
in the range A1:A5 that contain the string "Hello".
- The formula `=COUNTIFS(A1:A5,5)` returns a count of the number of cells that
contain the number 5.
- The formula `=COUNTIFS(A1:A5,">0")` returns a count of the number of cells in
the range A1:A5 that contain a number greater than zero.
- The formula `=COUNTIFS(A1:A5,B1)` returns a count of the number of cells in
the range A1:A5 that satisfy the criterion given in cell B1.
::: example
**A simple frequency table**
Let's create a frequency table for the **Party** variable in our data set. It
should look like this:
| Party | Count | Percentage |
|:---------------|:------------------:|:------------------:|
| *Liberal* | *(# Liberal)* | *(% Liberal)* |
| *Conservative* | *(# Conservative)* | *(% Conservative)* |
| *NDP* | *(# NDP)* | *(% NDP)* |
| *Transfer* | *(# Transfer)* | *(% Transfer)* |
Note that I have included a political party (the NDP) that is not observed in
our data set. We start by setting up the table:
1. Create a new sheet named "Party control"
2. Fill in the top row and first column as in the table above.
Next we will fill in values for the **Count** column (B):
3. Fill in cell B2 using the Excel function `COUNTIFS()`.
- The `criteria_range` will be the data in our main data set
(the *Data for Analysis* worksheet) covering the **Party** variable
(cell range I2:I542), or `'Data for Analysis'!I2:I542`.
- The `criteria` we want to match is the contents of the cell
in the current row of column A, or `A2`.
- So the formula to enter in cell B2 is
`=COUNTIFS('Data for Analysis'!I2:I542,A2)`.
- The result displayed should be 302, the number of observations in our data
that have the Party identified as Liberal.
4. We will want to copy the formula in cell B2 to the other cells in column B.
But before we do that, we need to make some of the relative references into
absolute references. In this case, our table will keep the same
`criteria_range` but will want to change the `criteria`, so:
- Change the formula in cell B2 to
`=COUNTIFS('Data for Analysis'!I$2:I$542,A2)`.
5. Copy cell B2 to cells B3:B4.
- Check the formulas in those cells to make sure they look right.
Next, we need to fill in the **Percentage** column (C):
6. To do that we just need to divide each cell in the Count column by the sum of
all the cells in that column. So the formula in cell C2 should be
`=B2/SUM(B2:B5)`.
7. Change relative references to absolute references as needed.
8. Copy the formula in cell C2 to cells C3:C5.
9. By default, the percentages are displayed as proportions. Change the
display format to percentage, with one decimal place.
10. Make any other formatting changes you like, and save your working copy.
We now have a simple frequency table for the Party variable:
![*Party control frequency table*](bin/PartyControlFrequencyTable.png){width=50%}
:::
### Binned frequency tables
We can also construct frequency tables for continuous variables or discrete
variables with many possible values, but doing so is a little more complicated.
We cannot just construct a table with a row for each possible value, since there
are many possible values.
Instead, we divide the data's range of possible or observed values into a set of
sub-ranges or ***bins***. Then we can calculate and report the number or
percentage of observations that fall within each bin. A binned frequency table
looks like this:
| From | To | Count | Percentage |
|:----------|:--------:|:-----:|:----------:|
| *(from1)* | *(to1)* | | |
| *(from2)* | *(to2)* | | |
| etc. | | | |
In constructing bins, we need to apply some good judgment, and keep in mind a
few requirements and considerations:
- We need the bins to cover the full range of the data. In particular:
- The lower bound of the lowest bin should be lower than the lowest value in
the data.
- The upper bound of the highest bin should be higher than the highest value
in the data.
- Each bin's upper bound should be the lower bound of the next bin.
- Boundaries should be addressed in a consistent manner, so that each
observation falls into exactly one bin.
- We often want the bins to be equally sized.
- But that isn't always the case. See the unemployment rate table in the
example below; if it used equally sized bins, most of the bins would be
empty.
- We often want the upper and lower bounds of the bins to be nice round numbers.
- The number of bins is a matter for judgment, and depends on what kind of
patterns we are aiming to find in the data:
- We may miss broad patterns if we have too many bins.
- We may miss potentially interesting details if we have too few bins.
- The solution is to explore multiple options, and see what patterns you can
find out.
Again, we will use `COUNTIFS()` to construct the counts. But we will need to
take advantage of a feature of `COUNTIFS()` I have not yet mentioned: it takes
multiple `criteria_range` and `criteria` arguments, allowing it to make multiple
comparisons (that's the difference between `COUNTIF()` and `COUNTIFS()`).
::: example
**A binned frequency table**
Let's create the following table for the unemployment rate variable:
| From | To | Count | Percentage |
|:------|:------:|:-----:|:----------:|
| 0.0% | 5.0% | | |
| 5.0% | 7.5% | | |
| 7.5% | 10.0% | | |
| 10.0% | 15.0% | | |
| 15.0% | 100.0% | | |
where **Count** is the number of observations for **UnempRate** that are greater
than that row's *From* value and less than or equal to the row's *To* value, and
*Percentage* is the count as a percentage of the total. We start by setting up
the table:
1. Create a new sheet titled *Unemployment frequency*.
2. Copy the first row and first two columns from the table above.
- Enter the **From** and **To** values as decimals (0 to 1) and display
them as percentages (0% to 100%).
Next, we fill in the first count in cell C2. This will be a somewhat complex
formula, so we will build it in stages:
1. Let's start by counting the observations with UnempRate greater than zero:
- The `criteria_range1` value should be the data range for the UnempRate
variable, i.e, `'Data for Analysis'!G2:G542`.
- The `criteria1` value should be is the string ">0".
- So the formula should be `=COUNTIFS('Data for Analysis'!G2:G542, ">0")`
2. Next, let's add the criterion that UnempRate is less than or equal to 5%.:
- The `criteria_range2` value should be the same as `criteria_range1`.
- The `criteria2` value should be is the string "<=0.05".
- So the formula should be
`=COUNTIFS('Data for Analysis'!G2:G542, ">0", 'Data for Analysis'!G2:G542, "<=0.05")`.
3. Finally, let's have the formula retrieve the `criteria1` and `criteria2`
values from cells A2 and B2.
- This enhancement is not strictly required but it has two big advantages:
(1) We will be able to copy this formula into other cells.
(2) We can change the bins in columns A and B, and the calculation
will automatically adjust.
- We can use the `CONCAT()` function to construct the criteria:
- `CONCAT(">",A2)` will return ">0".
- `CONCAT("<=",B2)` will return "<=0.05".
- So the formula is
`=COUNTIFS('Data for Analysis'!G2:G542,CONCAT(">",A2),'Data for Analysis'!G2:G542,CONCAT("<=",B2))`.
Cell C2 should now display the number of months (zero) in which the Canadian
unemployment rate was between 0% and 5%. To finish up the table:
4. Update the formula in cell C2 to use absolute references where appropriate.
5. Copy the contents of cell C2 to cells C3:C6.
6. Fill in the **Percentage** column (C) with the appropriate formula and change
its display format to Percent.
7. Make any other formatting changes you like, and save your working copy.
We now have our binned frequency table:
![*Binned frequency table for* **UnempRate**](bin/UnemploymentBinnedFrequencyTable.png){width=50%}
:::
::: {.fyi data-latex=""}
The `FREQUENCY()` function is another way to create a frequency table. However,
this function is a tricky one to learn, and Microsoft recently changed how it
works. So we will skip it.
:::
## Univariate graphs in Excel
Another way to explore our data is by ***visualizing*** it: constructing and
viewing graphs. In this section, we will construct and view three graphs that
are useful for understanding a single variable: the time series graph, the
bar/column graph, and the histogram. We will also discuss some basic principles
for producing presentation-quality graphs.
### Charts in Excel
Excel calls graphs ***charts***. Excel charts have three main components:
1. The ***data source***. This is the table containing the data used to
construct the graph.
- The individual columns in the data source are called ***series***.
2. The ***chart type***. There are many chart types in Excel, but the ones we
will use are:
- ***Line*** graphs have categories on the horizontal axis, and the value of
a variable/series on the vertical axis. Values are depicted by a line
connecting the values of the series at each category.
- ***Column*** charts have categories on the horizontal axis, and the value
of a variable or series on the vertical axis. Values are depicted by a
vertical bar for each value of the series.
- ***Bar*** charts are just column charts with the axes reversed.
- ***Scatter*** or ***XY*** charts plot the value of one series against the
value of another series. Values can be depicted as a set of points, as a
connected line, or both.
3. The ***chart elements***, which are the different pieces of the chart.
- Chart elements can be added, removed, or modified.
- The available elements vary across chart types, but can include axes,
titles, legends, gridlines, and text labels.
The usual workflow here is to select the data, choose a chart type, and then
modify chart elements until the chart looks the way you want it.
### Time series (line) graphs
A ***time series graph*** plots one or more variables at multiple points in
time. Conventionally, time is on the horizontal axis and the variable's value
is on the vertical axis. For example, we will generate a time series graph that
looks like this:
![*Time series graph of* **UnempRate**](bin/TimeSeriesGraphFinal.png){width=50%}
In Excel, time series graphs can be implemented using either the Line chart type
or the Scatter chart type. Line graphs are simpler, so we will start with that.
We will learn how to make scatter plots in Chapter \@ref(using-r).
The first step in creating an Excel chart is to select the data source and graph
type.
::: example
**Creating a time series graph for UnempRate**
We want to create a time series graph for our variable **UnempRate**.
1. Select any cell in the *Data for analysis* table.
2. Select `Insert > Recommended Charts` from the menu.
3. The dialog box will show a few recommended charts. None of them are what
we want, so select the `All Charts` tab at the top to see the full set
of options.
4. Select `Line` to see the line graph options.
5. The dialog box gives you several different types of line graph. Select the
picture that looks like this:
![Recommended](bin/RecommendedLineGraph.png){width=50%}
and select `OK`.
We have our time series graph (it looks like the picture above). Unfortunately,
it has a line for *every* variable in our data source. We only want to plot
one variable, so let's get rid of the others:
1. Select `Design > Select Data` which will open the `Select Data Source`
dialog box.
2. Uncheck the check box next to every series in the "Legend Entries (Series)"
box except **UnempRate**:
![*Time graph series of* **UnempRate**](bin/TimeSeriesGraphInitial.png){width=50%}
This looks like the graph we want.
:::
### Creating presentation-quality graphs
The graph we have right now is perfectly fine for exploratory data analysis -
using graphs and other tools to better understand the data. However, if we want
to convey our understanding to others we need to fine-tune the
***presentation*** of our data. Like other elements of a presentation, graphs
should be clear, informative, concise, and professional in appearance.
The field of ***data visualization*** was pioneered by the statistician Edward
Tufte and studies how best to visually convey information. A full discussion of
data visualization is beyond the scope of this course, but we can talk about and
apply a few basic ideas:
1. The information provided should be clear.
2. All necessary information should be provided.
3. Repeated, irrelevant or misleading elements (what Tufte calls "chartjunk")
should be removed.
4. Related information should be visually "close" and required eye movement
should be minimized.
5. Color and formatting should be used to simplify interpretation rather than to
complicate it.
5. Accessibility should be taken into account. For example:
- Color is not always available to readers: Some readers are color
blind, projectors often get colors "wrong", or your document
may get printed on a black-and-white printer.
- Always include non-color visual cues (for example dashed lines or
different points with varying shape) so that readers who cannot
distinguish color (for whatever reason) can tell different series apart.
- Consider using a color palette that is colorblind-friendly.
- A simple way to do this is to avoid using red and green or blue and
yellow as contrasting colors, since these are the two most common forms
of colorblindness.
- The geographer Cynthia Brewer has developed a set of
colorblind-friendly palettes known as ColorBrewer. They are available
at https://colorbrewer2.org/.
- Readers in their 50's and above often have difficulty reading small text.
- Avoid squeezing important text into a tight space.
- Generate graphics that can be re-scaled.
- Readers may have more general and severe visual impairments.
- Provide alt text for all images.
::: example
**Preparing our graph for presentation**
Let's start by removing unnecessary chart elements.
1. Select the chart. This will cause the `Chart Design` and `Format` menu items
to appear in the menu.
2. To remove the horizontal gridlines, select
`Chart Design > Add Chart Element > Gridlines > Primary Major Horizontal`.
3. To remove the legend, select
`Chart Design > Add Chart Element > Legend > None`.
Next, let's modify the title to be more informative. Right now, it is not clear
from the graph what country's unemployment rate this is.
1. Select the title.
2. Type in the new title, and hit the `<enter>` key.
- I went with the title **Canadian unemployment, 1976-present**
- I also made it **boldface**.
Next, let's modify the horizontal axis to be a little less busy:
1. Double-click on the horizontal axis. The `Format Axis` box should appear to
the right.
- It may take you more than one try to double-click on the correct object.
2. Select `Axis Options`.
- You can see all of the choices Excel made here based on what it sees in the
data.
- Feel free to play around with these options. You can return each option to
its original/default state by clicking on that option's `Reset` button.
3. Change the major units to either 4 Years or 8 Years, whichever one you like
better.
Finally, let's add a text label for our time series:
1. Select the chart.
2. Select `Format > Insert Text Box` from the menu. The cursor will change into
one that looks like this:
3. Draw the text box at the desired location on the chart and enter the text
"Unemployment rate"
4. Resize and move the box to just above the line for the time series.
5. Change the color of the text to the same color as the line.
This isn't really necessary, and violates our principle of avoiding repeated
information. But it is useful when we are graphing multiple time series in the
same chart, as it is more direct than a legend.
Finally, we want to add alt text for the visually impaired.
1. Select the chart.
2. Select `Format > Alt Text`. The Alt Text dialog box will open to the right.
3. Enter the alt text in the box.
- I used *Graph of Canadian unemployment rate from January 1976 to present.*
4. Make any other formatting changes you like, and save your working copy.
The graph will now look like the one above. The graph here is visually clean
and simple, in part because I left out many elements that I could have included:
axis labels (not needed because the units are obvious from context), a fancy
background, etc.
:::
One consideration that often comes up when constructing graphs is whether the
vertical axis should start at zero when the range of data does not reach zero.
In our time series graph of the unemployment rate, Excel included 0% by default
even though the variable never went below 5%. I could have changed the vertical
axis to start at 5%, but decided not to.
There is no strict rule for whether to include zero, but we can consider the
general principle described earlier: our graph should be informative and not
misleading.
::: {.fyi data-latex=""}
**For further reading**
Data visualization skills are valuable in the academic world and in the
professional world. If you are interested in developing your skills
further, you might consider our course
ECON 334: *Data Visualization and Economic Analysis*.
You might also get a book on data visualization, either Kieran Healy's
*Data Visualization* or Cole Nussbaumer Knaflic's *Storytelling with Data*.
Healy's book is aimed at a primarily academic audience using R, while Knaflic's
is aimed at a business audience using Excel. Both of them are practical and easy
reads, with many examples.
:::
### Frequency (bar/column) graphs
***Bar*** graphs are used to represent the frequency distribution of a
categorical variable. Bar graphs use bars of different length to represent the
value of some aggregate variable in each category.
Bar graphs are produced from a frequency table, as they are a visual depiction
of the information in such a table.
They can be produced in Excel using either the Bar chart type or the Column
chart type. The difference between the two is that the bars are horizontal in
the Bar chart type and vertical in the Column chart type.
::: example
**A bar graph of the Party variable**
To construct a bar graph of the **Party** variable:
1. Select any cell in the *Party control* worksheet.
2. Select `Insert > Recommended Charts` from the menu.
3. Select the `All Charts` tab to see the full list of options.
4. Select `Column` and then select `OK`.
The graph will look something like this:
![*Bar graph of* **Party** *variable, default options*](bin/BarGraphInitial.png){width=50%}
As you can see, basic bar graphs are quite simple. There is a bar for each
category (the first column of the table) and variable (the other columns), and
the length of each bar corresponds to its value.
As with our line graph, the current graph contains more information than we
actually want - it shows the count *and* the percentage, and we really only
need one of those.
1. Select the chart.
2. Select `Chart Design > Select Data`.
3. Uncheck the check boxes for Percentage, NDP and Transfer.
We now have a bar graph that looks like this:
![*Bar graph of* **Party** *variable, simplified*](bin/BarGraph2.png){width=50%}
and shows the number of months in which the two largest federal parties were in
government over the time frame of our data.
:::
As with line graphs, we can prepare a bar graph for presentation by using
Excel's tools with an eye towards the principles of effective presentation
graphics.
::: example
**Cleaning up our bar graph**
We can do a few easy things to simplify and clarify our bar graph:
1. Change the title.
2. Remove the gridlines.
3. Remove the legend.
4. Add the vertical axis title **Months in control** to clarify the units.
Another thing we can do is use color and branding to convey information: each
major Canadian political party has a distinctive color as part of its brand: red
for the Liberals, blue for the Conservatives, and orange for the NDP.
1. Double-click on the Liberal bar.
2. Change its color to red.
3. Double-click on the Conservative bar.
4. Change its color to blue. The Conservatives use two shades of blue, so pick
whichever one you like best.
Finally, the purpose of a bar graph is to enable the viewer to compare
magnitudes, which requires looking at the *top* of each bar. But you may notice
that the category labels are at the *bottom* of each bar. Following the
principle that we don't want to make the reader's eye do any extra work, let's
put those labels on top.
1. Select the chart.
2. Select `Chart Design > Add Chart Element > Data Labels > More data label options`.
You will notice the numbers 302 and 233 (the heights of the two bars) appear
above the bars, and the `Format Data Labels` box will appear to the right.
3. Select `Label options` from the dialog box.
4. Uncheck `Value` and check `Category Name`. The category names Liberal and
Conservative will now appear above the respective bar.
5. Select `Chart Design > Add Chart Element > Axes > Primary Horizontal` to
eliminate the category names from the bottom of each bar.
6. You might also consider changing the color of the two labels to match the
color of the corresponding bar.
7. Make any other formatting changes you like, and save your working copy.
Our final graph looks like this:
![*Bar graph of* **Party** *variable, final version*](bin/BarGraphFinal.png){width=50%}
:::
One of the keys to effective bar graph design is to keep the graph simple and
clean, and to avoid "chartjunk."
First, bar graphs should *always* start at zero. The size of the bar is meant
to visually represent the value of the variable it is depicting. Using any
origin other than zero can cause relative sizes to be misleading.
::: example
**Why bar graphs should always start at zero**
Suppose we start our **Party** bar graph at 200 rather than zero. We will get
this:
![*Bar graph for* **Party**, *starting at 200*](bin/BarGraphNonzeroOrigin.png){width=50%}
The relative size of these bars is very misleading. The bar for Liberal is three
times as big as the bar for Conservative, even though the number it represents
is only 29% higher.
:::
Pie graphs are an alternative way of depicting relative frequencies, and are
also available in Excel along with 3-dimensional variations on both pie and bar
graphs:
| Graph type | Example |
|:-------------|:-------------------------------------------------------------------:|
| Pie graph | ![*Pie chart of* **Party** *variable*](bin/PieChart.png){width=30%} |
| 3D pie graph | ![3D pie chart of Party variable](bin/3DPieChart.png){width=30%} |
| 3D bar graph | ![3D bar chart of Party variable](bin/3DBarChart.png){width=30%} |
However, most data visualization experts recommend against their use. Research
on how people process visual information usually find that they are *less*
informative in practice. People are much better at evaluating the relative size
of two lines or rectangles than they are at evaluating the relative size of two
pie slices. They are also much better at assessing relative distances in two
dimensions than they are in three dimensions.
### Histograms
A ***histogram*** is just a frequency bar/column graph for a continuous
variable. It is constructed by sorting the variable's values into some number
of equally-sized bins and then plotting a bar for each bin that represents
counts or percentages.
The histogram is a built-in chart type in Excel.
::: example
**Constructing a histogram for UnempRate**
To construct a histogram of the **UnempRate** variable:
1. Select column G in the *Data for Analysis* worksheet.
- Notice that for the other charts we could just select any cell in this
worksheet. For some reason, this won't work with the histogram chart type.
2. Select `Insert > Recommended Charts" from the menu bar.
3. Select the `All Charts` tab to see the full set of options.
3. Select `Histogram` and then `OK`.
Our graph will appear:
![*Histogram of* **UnempRate**, *with default options*](bin/HistogramInitial.png){width=50%}
:::
There are two main choices in histogram design:
- How many bins and how should they be determined?
- Should we depict frequencies as counts or percentages?