This chapter introduces some of the most important structures for storing and working with data: vectors, matrices, lists, and data frames.
7
7
8
-
## {{< fa bullseye >}} Objectives
8
+
## {{< fa bullseye >}} Objectives {.intro}
9
9
10
10
- Understand the differences between lists, vectors, data frames, matrices, and arrays in R and Python
11
11
- Be able to use location-based indexing in R or Python to pull out subsets of a complex data object
12
12
13
+
::: {.callout-caution .intro}
13
14
## Python Package Installation
14
15
15
16
You will need the `numpy` and `pandas` packages for this section. Pick one of the following ways to install python packages:
@@ -39,8 +40,9 @@ In a python chunk (or the python terminal), you can run the following command. T
39
40
40
41
:::
41
42
43
+
:::
42
44
43
-
## Data Structures Overview
45
+
## Data Structures Overview {.intro}
44
46
45
47
In @sec-basic-var-types, we discussed 4 different data types: strings/characters, numeric/double/floats, integers, and logical/booleans. As you might imagine, things are about to get more complicated.
46
48
@@ -53,114 +55,36 @@ Data **structures** are more complex arrangements of information, but they are s
53
55
| N-D | array ||
54
56
55
57
::: callout-warning
58
+
### Opinionated Structures
59
+
56
60
Those of you who have taken programming classes that were more computer science focused will realize that I am leaving out a lot of information about lower-level structures like pointers.
57
61
I'm making a deliberate choice to gloss over most of those details in this chapter, because it's already hard enough to learn two languages' worth of data structures at a time.
58
62
In addition, R doesn't have pointers [No Pointers in R, @matloffArtProgrammingTour2011], so leaving out this material in Python streamlines teaching both languages, at the cost of oversimplifying some Python concepts.
59
63
If you want to read more about the Python concepts I'm leaving out, check out @frippAnswerPythonPandas2016.
60
-
:::
61
-
62
-
63
-
## Lists
64
-
65
-
A **list** is a one-dimensional column of heterogeneous data - the things stored in a list can be of different types.
66
-
67
-

68
-
69
-
::: panel-tabset
70
-
### R {.unnumbered}
71
-
72
-
```{r list-r}
73
-
x <- list("a", 3, FALSE)
74
-
x
75
-
```
76
64
77
-
### Python {.unnumbered}
78
-
79
-
```{python list-py}
80
-
x = ["a", 3, False]
81
-
x
82
-
```
83
65
:::
84
66
85
-
The most important thing to know about lists, for the moment, is how to pull things out of the list. We call that process **indexing**.
67
+
In any data structure, it's important to be able to pull smaller pieces of data out of the structure.
68
+
We do this via **indexing**.
86
69
87
-
### Indexing
70
+
There are three main approaches to accessing information using indexes:
88
71
89
-
Every element in a list has an **index** (a location, indicated by an integer position)[^05-vectors-1].
72
+
1. Object Names
73
+
In some cases, components of a data structure are **named** and can be accessed using those names.
90
74
91
-
[^05-vectors-1]: Throughout this section (and other sections), lego pictures are rendered using https://www.mecabricks.com/en/workshop. It's a pretty nice tool for building stuff online!
75
+
2. Location
76
+
Think of a location index as accessing the nth item in a list, or accessing cell A5 in an Excel spreadsheet - you have strict directions as to what row/column or item to get.
92
77
93
-
::: panel-tabset
94
-
#### R concept {.unnumbered}
95
-
96
-
In R, we count from 1.
97
-
98
-

99
-
100
-
#### R code {.unnumbered}
78
+
3. Logical Indexing
79
+
In a logical index, you access all items in a structure for which a condition is TRUE. This would be like making a list of family members, and then assigning bedtimes using a statement like "all of the children go to bed at 8pm" - first you decide whether a person is a child, and then you can assign the appropriate bedtime if child is true.
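
As a rough illustration of the three approaches in Python, the sketch below uses only built-in types. The `bedtimes` dictionary, the `family` and `is_child` lists, and all of their values are made up for this example. Note that Python counts positions from 0.

```python
# Hypothetical family data, invented for illustration only.
bedtimes = {"Alex": "10pm", "Sam": "8pm", "Riley": "8pm"}
family = ["Alex", "Sam", "Riley"]
is_child = [False, True, True]

# 1. Object names: look a value up by its name (here, a dictionary key).
by_name = bedtimes["Sam"]

# 2. Location: ask for the item at a given position (Python counts from 0).
by_location = family[0]

# 3. Logical indexing: keep only the items whose condition is True.
children = [person for person, child in zip(family, is_child) if child]

print(by_name, by_location, children)
```

Plain lists have no built-in logical indexing, so the list comprehension stands in for it here; numpy arrays and pandas Series accept a logical vector directly inside the square brackets.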
101
80
102
-
```{r, error = T}
103
-
x <- list("a", 3, FALSE)
81
+
In both R and Python, we will primarily use square brackets to index different data types.
82
+
When the data type is rectangular (has both rows and columns), we will use `[row, column]` syntax -- that is, `[1, 3]` says access the first row, third column.
83
+
When the data type is a vector, we will use `[item]` indexing.
104
84
105
-
x[1] # This returns a list
106
-
x[1:2] # This returns multiple elements in the list
85
+
Another important difference to keep in mind is that in R, items are 1-indexed -- that is, the first item in a list `x` is `x[1]`.
86
+
In Python, on the other hand, items are 0-indexed -- the first item in a list `y` is `y[0]`.
107
87
108
-
x[[1]] # This returns the item
109
-
x[[1:2]] # This doesn't work - you can only use [[]] with a single index
110
-
```
111
-
112
-
In R, list indexing with `[]` will return a list with the specified elements.
113
-
114
-
To actually retrieve the item in the list, use `[[]]`. The only downside to `[[]]` is that you can only access one thing at a time.
115
-
116
-
#### Python concept {.unnumbered}
117
-
118
-
In Python, we count from 0.
119
-
120
-

121
-
122
-
#### Python code {.unnumbered}
123
-
124
-
```{python}
125
-
x = ["a", 3, False]
126
-
127
-
x[0]
128
-
x[1]
129
-
x[0:2]
130
-
```
131
-
132
-
In Python, we can use single brackets to get an object or a list back out, but we have to know how **slices** work. Essentially, in Python, `0:2` indicates that we want objects 0 and 1, but want to stop at 2 (not including 2). If you use a slice, Python will return a list; if you use a single index, python just returns the value in that location in the list.
133
-
:::
134
-
135
-
We'll talk more about indexing as it relates to vectors, but indexing is a general concept that applies to just about any multi-value object.
136
-
137
-
### Concatenation
138
-
139
-
Another important thing to know about lists is how to combine them.
140
-
If I have rosters for two classes and I want to make a list of all of my students, I need to somehow merge the two lists together.
list(dict.fromkeys(students)) # get only unique names
162
-
```
163
-
:::
164
88
165
89
## Vectors
166
90
@@ -384,7 +308,31 @@ animals[~good_pets] # equivalent to using bad_pets
384
308
```
385
309
:::
386
310
387
-
### Math with Vectors
311
+
312
+
#### Logical Operations on Vectors {.intermediate .advanced}
313
+
314
+
Indexing with logical vectors is an extremely powerful technique -- so much so that it is worthwhile to quickly review how logical operations can be combined.
315
+
In both R and Python, we can operate on logical vectors with standard operators -- AND, OR, and NOT.
316
+
317
+
::: panel-tabset
318
+
319
+
##### R
320
+
321
+
```{R}
322
+
# This set of code converts pi to a character,
323
+
# splits the string into single-character pieces,
324
+
# converts it back to numbers,
325
+
# and removes the NA resulting from converting '.' to a number.
326
+
x <- na.omit(as.numeric(strsplit(as.character(pi), "")[[1]]))
327
+
```
328
+
329
+
##### Python
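
A possible Python counterpart to the R setup, sketched with plain lists so that it needs no extra packages. The names `digits`, `is_even`, and `is_big` are chosen for this example and do not come from the original chapter, and the number of digits Python prints for pi may differ from the R version. With numpy boolean arrays, the `&`, `|`, and `~` operators perform the same element-wise AND, OR, and NOT.

```python
import math

# Split the decimal representation of pi into single digits,
# skipping the "." (which is not a digit).
digits = [int(ch) for ch in str(math.pi) if ch.isdigit()]

# Two logical vectors describing each digit.
is_even = [d % 2 == 0 for d in digits]
is_big = [d > 4 for d in digits]

# Element-wise AND, OR, and NOT.
even_and_big = [e and b for e, b in zip(is_even, is_big)]
even_or_big = [e or b for e, b in zip(is_even, is_big)]
not_even = [not e for e in is_even]

# Use a logical vector as an index: keep digits where the condition holds.
selected = [d for d, keep in zip(digits, even_and_big) if keep]
print(selected)
```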
330
+
331
+
332
+
:::
333
+
334
+
335
+
### Math with Vectors {.intermediate .advanced}
388
336
389
337
In order to talk about mathematical operations on (numerical) vectors, we first need to consider different ways we could combine vectors.
390
338
If the vectors are the same length, we could perform mathematical operations on the elements (and if they're not the same length we could come up with some convention to coerce them to be the same length).
@@ -422,6 +370,7 @@ When using numeric vectors, the element-wise operations are the same for vectors
422
370
: Element-wise mathematical operators in R and Python {#tbl-math-operators2}
423
371
424
372
::: panel-tabset
373
+
425
374
##### R
426
375
```{r}
427
376
a <- c(1:5)
@@ -454,7 +403,10 @@ a ** b
454
403
455
404
:::
456
405
406
+
407
+
457
408
#### Vector-to-Scalar Operations
409
+
458
410
Let's cover a few built-in or commonly-used vector summary operations here, focusing on those which are most useful for statistics.
459
411
460
412
Function | R | Python
@@ -584,9 +536,6 @@ pd.DataFrame(list(itertools.product(A, B))) # data frame
584
536
585
537
:::
586
538
587
-
### Logical Operations on Vectors
588
-
589
-
XXX TODO XXX
590
539
591
540
### Reviewing Types
592
541
@@ -679,6 +628,110 @@ x[x2xor3]
679
628
```
680
629
:::
681
630
631
+
632
+
## Lists
633
+
634
+
A **list** is a one-dimensional column of heterogeneous data - the things stored in a list can be of different types.
635
+
636
+

637
+
638
+
::: panel-tabset
639
+
### R {.unnumbered}
640
+
641
+
```{r list-r}
642
+
x <- list("a", 3, FALSE)
643
+
x
644
+
```
645
+
646
+
### Python {.unnumbered}
647
+
648
+
```{python list-py}
649
+
x = ["a", 3, False]
650
+
x
651
+
```
652
+
:::
653
+
654
+
The most important thing to know about lists, for the moment, is how to pull things out of the list. We call that process **indexing**.
655
+
656
+
### Indexing
657
+
658
+
Every element in a list has an **index** (a location, indicated by an integer position)[^05-vectors-1].
659
+
660
+
[^05-vectors-1]: Throughout this section (and other sections), lego pictures are rendered using https://www.mecabricks.com/en/workshop. It's a pretty nice tool for building stuff online!
661
+
662
+
::: panel-tabset
663
+
#### R concept {.unnumbered}
664
+
665
+
In R, we count from 1.
666
+
667
+
{fig-alt="A set of 5 bricks on a virtual lego board. The first brick is a blue 1x1 labeled with a 1. The second brick is a green 1x2, labeled with a 2. The third brick is a pink 1x4 labeled with a 3. The fourth brick is a yellow 1x2, labeled with a 4. The fifth brick is a pink 2x2 labeled with a 5. This represents a list, as the blocks are of different types (sizes)."}
668
+
669
+
#### R code {.unnumbered}
670
+
671
+
```{r, error = T}
672
+
x <- list("a", 3, FALSE)
673
+
674
+
x[1] # This returns a list
675
+
x[1:2] # This returns multiple elements in the list
676
+
677
+
x[[1]] # This returns the item
678
+
x[[1:2]] # This doesn't work - you can only use [[]] with a single index
679
+
```
680
+
681
+
In R, list indexing with `[]` will return a list with the specified elements.
682
+
683
+
To actually retrieve the item in the list, use `[[]]`. The only downside to `[[]]` is that you can only access one thing at a time.
684
+
685
+
#### Python concept {.unnumbered}
686
+
687
+
In Python, we count from 0.
688
+
689
+
{fig-alt="A set of 5 bricks on a virtual lego board. The first brick is a blue 1x1 labeled with a 0. The second brick is a green 1x2, labeled with a 1. The third brick is a pink 1x4 labeled with a 2. The fourth brick is a yellow 1x2, labeled with a 3. The fifth brick is a pink 2x2 labeled with a 4. This represents a list, as the blocks are of different types (sizes)."}
690
+
691
+
#### Python code {.unnumbered}
692
+
693
+
```{python}
694
+
x = ["a", 3, False]
695
+
696
+
x[0]
697
+
x[1]
698
+
x[0:2]
699
+
```
700
+
701
+
In Python, we can use single brackets to get an object or a list back out, but we have to know how **slices** work. Essentially, in Python, `0:2` indicates that we want objects 0 and 1, but want to stop at 2 (not including 2). If you use a slice, Python will return a list; if you use a single index, Python just returns the value in that location in the list.
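
A short sketch of how slice boundaries behave, using a made-up list `letters` (not from the chapter). The start of a slice is included, the end is excluded, and a slice always returns a list, even when it contains a single item.

```python
letters = ["a", "b", "c", "d", "e"]  # hypothetical example data

first_two = letters[0:2]     # items at positions 0 and 1; position 2 is excluded
single_item = letters[1]     # a single index returns the item itself
single_slice = letters[1:2]  # a one-item slice still returns a list

print(first_two, single_item, single_slice)
```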
702
+
:::
703
+
704
+
We'll talk more about indexing as it relates to vectors, but indexing is a general concept that applies to just about any multi-value object.
705
+
706
+
### Concatenation
707
+
708
+
Another important thing to know about lists is how to combine them.
709
+
If I have rosters for two classes and I want to make a list of all of my students, I need to somehow merge the two lists together.
data = pd.read_html("https://worldpopulationreview.com/states")[0]
956
+
957
+
data = pd.read_csv("https://raw.githubusercontent.com/srvanderplas/stat-computing-r-python/main/data/population2024.csv")
904
958
905
959
population2024 = pd.Series(data['2024 Population'].values, index = data['State'].values).sort_values()
906
960
population2023 = pd.Series(data['2023 Population'].values, index = data['State'].values).sort_values()
907
-
populationCensus = pd.Series(data['2020 Population'].values, index = data['State'].values).sort_values()
961
+
population2020 = pd.Series(data['2020 Population'].values, index = data['State'].values).sort_values()
908
962
909
-
population2024.head()
910
-
population2023.head()
911
-
populationCensus.head()
963
+
population2024.head(10)
964
+
population2023.head(10)
965
+
population2020.head(10)
912
966
```
913
967
914
968
The only problem is that by doing this, we've now lost the ordering that matched across all 3 vectors.
915
-
Pandas Series are great for this, because they use labels that allow us to reconstitute which value corresponds to which label, but in R or even in numpy arrays, vectors don't inherently come with labels.
916
-
In these situations, sorting by one value can actually destroy the connection between two vectors!
969
+
Pandas Series are great for showing this problem, because their labels let us recover which value belongs to which state, but in R or even in numpy arrays, vectors don't inherently come with labels.
970
+
In these situations, sorting by one value can actually destroy the connection between two vectors, in a way that you might not even notice!
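
The sketch below illustrates the problem with plain Python lists, which carry no labels. The region names and numbers are invented for illustration and are not real population figures. Sorting each list separately breaks the positional link between them, while sorting the paired values together keeps it.

```python
# Made-up values for three hypothetical regions (not real data).
regions = ["North", "South", "West"]
pop_2020 = [300, 100, 200]
pop_2024 = [310, 260, 250]

# Sorting each list on its own loses the region-to-value pairing:
# position 0 now holds South's 2020 value but West's 2024 value.
sorted_2020 = sorted(pop_2020)
sorted_2024 = sorted(pop_2024)

# Sorting the zipped triples by the 2020 value keeps each region
# together with both of its own values.
paired = sorted(zip(regions, pop_2020, pop_2024), key=lambda row: row[1])

print(sorted_2020, sorted_2024)
print(paired)
```

Nothing in `sorted_2020` or `sorted_2024` signals that the two lists no longer line up, which is why this kind of error is easy to miss.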