Commit ff8729e

author
Susan Vanderplas
committed
Update to remove problem with site update
1 parent 787c4d4 commit ff8729e

File tree

2 files changed

+164
-110
lines changed

_freeze/part-gen-prog/03-data-struct/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

part-gen-prog/03-data-struct.qmd

Lines changed: 162 additions & 108 deletions
Original file line numberDiff line numberDiff line change
@@ -5,11 +5,12 @@
55

66
This chapter introduces some of the most important structures for storing and working with data: vectors, matrices, lists, and data frames.
77

8-
## {{< fa bullseye >}} Objectives
8+
## {{< fa bullseye >}} Objectives {.intro}
99

1010
- Understand the differences between lists, vectors, data frames, matrices, and arrays in R and Python
1111
- Be able to use location-based indexing in R or Python to pull out subsets of a complex data object
1212

13+
::: {.callout-caution .intro}
1314
## Python Package Installation
1415

1516
You will need the `numpy` and `pandas` packages for this section. Pick one of the following ways to install Python packages:
@@ -39,8 +40,9 @@ In a python chunk (or the python terminal), you can run the following command. T
3940

4041
:::
4142

43+
:::
4244

43-
## Data Structures Overview
45+
## Data Structures Overview {.intro}
4446

4547
In @sec-basic-var-types, we discussed 4 different data types: strings/characters, numeric/double/floats, integers, and logical/booleans. As you might imagine, things are about to get more complicated.
4648

@@ -53,114 +55,36 @@ Data **structures** are more complex arrangements of information, but they are s
5355
| N-D | array | |
5456

5557
::: callout-warning
58+
### Opinionated Structures
59+
5660
Those of you who have taken programming classes that were more computer science focused will realize that I am leaving out a lot of information about lower-level structures like pointers.
5761
I'm making a deliberate choice to gloss over most of those details in this chapter, because it's already hard enough to learn 2 languages worth of data structures at a time.
5862
In addition, R doesn't have pointers [No Pointers in R, @matloffArtProgrammingTour2011], so leaving out this material in Python streamlines teaching both languages, at the cost of oversimplifying some Python concepts.
5963
If you want to read more about the Python concepts I'm leaving out, check out @frippAnswerPythonPandas2016.
60-
:::
61-
62-
63-
## Lists
64-
65-
A **list** is a one-dimensional column of heterogeneous data - the things stored in a list can be of different types.
66-
67-
![A lego list: the bricks are all different types and colors, but they are still part of the same data structure.](../images/gen-prog/lego-list.png)
68-
69-
::: panel-tabset
70-
### R {.unnumbered}
71-
72-
```{r list-r}
73-
x <- list("a", 3, FALSE)
74-
x
75-
```
7664

77-
### Python {.unnumbered}
78-
79-
```{python list-py}
80-
x = ["a", 3, False]
81-
x
82-
```
8365
:::
8466

85-
The most important thing to know about lists, for the moment, is how to pull things out of the list. We call that process **indexing**.
67+
In any data structure, it's important to be able to pull smaller pieces of data out of the structure.
68+
We do this via **indexing**.
8669

87-
### Indexing
70+
There are three main approaches to accessing information using indexes:
8871

89-
Every element in a list has an **index** (a location, indicated by an integer position)[^05-vectors-1].
72+
1. Object Names
73+
In some cases, components of a data structure are **named** and can be accessed using those names.
9074

91-
[^05-vectors-1]: Throughout this section (and other sections), lego pictures are rendered using https://www.mecabricks.com/en/workshop. It's a pretty nice tool for building stuff online!
75+
2. Location
76+
Think of a location index as accessing the nth item in a list, or accessing cell A5 in an Excel spreadsheet - you have strict directions as to what row/column or item to get.
9277

93-
::: panel-tabset
94-
#### R concept {.unnumbered}
95-
96-
In R, we count from 1.
97-
98-
![An R-indexed lego list, counting from 1 to 5](../images/gen-prog/list-indexing-r.png)
99-
100-
#### R code {.unnumbered}
78+
3. Logical Indexing
79+
In a logical index, you access all items in a structure for which a condition is TRUE. This would be like making a list of family members, and then assigning bedtimes using a statement like "all of the children go to bed at 8pm" - first you decide whether a person is a child, and then you can assign the appropriate bedtime if child is true.
10180

102-
```{r, error = T}
103-
x <- list("a", 3, FALSE)
81+
In both R and Python, we will primarily use square brackets to index different data types.
82+
When the data type is rectangular (has both rows and columns), we will use `[row, column]` syntax -- that is, `[1, 3]` says access the first row, third column.
83+
When the data type is a vector, we will use `[item]` indexing.
10484

105-
x[1] # This returns a list
106-
x[1:2] # This returns multiple elements in the list
85+
An important difference to keep in mind is that in R, items are 1-indexed -- that is, the first item in a list `x` is `x[1]`.
86+
In Python, on the other hand, items are 0-indexed -- the first item in a list `y` is `y[0]`.
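
To make the three approaches concrete, here is a minimal Python sketch using only built-in types. The family members and bedtimes are made up for illustration, and the variable names are my own. Note that plain Python lists don't accept a logical vector inside square brackets the way numpy arrays and pandas Series do, so the logical case is written as a list comprehension here.

```python
# Hypothetical toy data, invented for this illustration
bedtimes = {"Ana": "10pm", "Ben": "8pm", "Cy": "8pm"}  # named components
family = ["Ana", "Ben", "Cy"]                          # ordered items
is_child = [False, True, True]                         # one logical value per person

# 1. Object names: look up a component by its name
by_name = bedtimes["Ben"]

# 2. Location: Python counts from 0, so the first item is at position 0
first = family[0]
third = family[2]

# 3. Logical indexing: keep only the items whose condition is True
children = [person for person, child in zip(family, is_child) if child]

print(by_name)
print(first, third)
print(children)
```

The same three ideas show up again with vectors, matrices, and data frames; only the syntax for expressing them changes.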
10787

108-
x[[1]] # This returns the item
109-
x[[1:2]] # This doesn't work - you can only use [[]] with a single index
110-
```
111-
112-
In R, list indexing with `[]` will return a list with the specified elements.
113-
114-
To actually retrieve the item in the list, use `[[]]`. The only downside to `[[]]` is that you can only access one thing at a time.
115-
116-
#### Python concept {.unnumbered}
117-
118-
In Python, we count from 0.
119-
120-
![A python-indexed lego list, counting from 0 to 4](../images/gen-prog/list-indexing-py.png)
121-
122-
#### Python code {.unnumbered}
123-
124-
```{python}
125-
x = ["a", 3, False]
126-
127-
x[0]
128-
x[1]
129-
x[0:2]
130-
```
131-
132-
In Python, we can use single brackets to get an object or a list back out, but we have to know how **slices** work. Essentially, in Python, `0:2` indicates that we want objects 0 and 1, but want to stop at 2 (not including 2). If you use a slice, Python will return a list; if you use a single index, Python just returns the value in that location in the list.
133-
:::
134-
135-
We'll talk more about indexing as it relates to vectors, but indexing is a general concept that applies to just about any multi-value object.
136-
137-
### Concatenation
138-
139-
Another important thing to know about lists is how to combine them.
140-
If I have rosters for two classes and I want to make a list of all of my students, I need to somehow merge the two lists together.
141-
142-
::: panel-tabset
143-
#### R
144-
```{r}
145-
class1 <- c("Benjamin Sisko", "Odo", "Julian Bashir", "Jadzia Dax", "Miles O'Brien", "Quark", "Kira Nerys", "Elim Garak")
146-
class2 <- c("Jean-Luc Picard", "William Riker", "Geordi La Forge", "Worf", "Miles O'Brien", "Beverly Crusher", "Deanna Troi", "Data")
147-
148-
students <- c(class1, class2)
149-
students
150-
151-
unique(students) # get only unique names
152-
```
153-
154-
#### Python
155-
```{python}
156-
class1 = ["Benjamin Sisko", "Odo", "Julian Bashir", "Jadzia Dax", "Miles O'Brien", "Quark", "Kira Nerys", "Elim Garak"]
157-
class2 = ["Jean-Luc Picard", "William Riker", "Geordi La Forge", "Worf", "Miles O'Brien", "Beverly Crusher", "Deanna Troi", "Data"]
158-
159-
students = class1 + class2
160-
students
161-
list(dict.fromkeys(students)) # get only unique names
162-
```
163-
:::
16488

16589
## Vectors
16690

@@ -384,7 +308,31 @@ animals[~good_pets] # equivalent to using bad_pets
384308
```
385309
:::
386310

387-
### Math with Vectors
311+
312+
#### Logical Operations on Vectors {.intermediate .advanced}
313+
314+
Indexing with logical vectors is an extremely powerful technique -- so much so that it is worthwhile to quickly review how logical operations can be combined.
315+
In both R and Python, we can operate on logical vectors with standard operators -- AND, OR, and NOT.
316+
317+
::: panel-tabset
318+
319+
##### R
320+
321+
```{R}
322+
# This set of code converts pi to a character,
323+
# splits the string into single-character pieces,
324+
# converts it back to numbers,
325+
# and removes the NA resulting from converting '.' to a number.
326+
x <- na.omit(as.numeric(strsplit(as.character(pi), "")[[1]]))
327+
```
328+
329+
##### Python
330+
331+
332+
:::
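
As a rough sketch of how AND, OR, and NOT combine in Python, the example below uses a numpy array holding the same digits of pi that the R code above produces (typed in by hand here, which is my own shortcut rather than a conversion of the R approach). The conditions `is_even` and `is_big` are names I chose for illustration. In numpy, the element-wise operators are `&` (AND), `|` (OR), and `~` (NOT).

```python
import numpy as np

# Digits of pi, entered by hand to mirror the R vector x
x = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9])

# Each comparison produces a logical (boolean) array the same length as x
is_even = x % 2 == 0
is_big = x > 4

even_and_big = x[is_even & is_big]  # AND: both conditions are True
even_or_big = x[is_even | is_big]   # OR: at least one condition is True
not_even = x[~is_even]              # NOT: flips True and False

print(even_and_big)
print(even_or_big)
print(not_even)
```

If you write the conditions inline instead of storing them first, wrap each one in parentheses, as in `x[(x % 2 == 0) & (x > 4)]`, because `&` and `|` bind more tightly than comparisons like `==` and `>`.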
333+
334+
335+
### Math with Vectors {.intermediate .advanced}
388336

389337
In order to talk about mathematical operations on (numerical) vectors, we first need to consider different ways we could combine vectors.
390338
If the vectors are the same length, we could perform mathematical operations on the elements (and if they're not the same length we could come up with some convention to coerce them to be the same length).
@@ -422,6 +370,7 @@ When using numeric vectors, the element-wise operations are the same for vectors
422370
: Element-wise mathematical operators in R and Python {#tbl-math-operators2}
423371

424372
::: panel-tabset
373+
425374
##### R
426375
```{r}
427376
a <- c(1:5)
@@ -454,7 +403,10 @@ a ** b
454403

455404
:::
456405

406+
407+
457408
#### Vector-to-Scalar Operations
409+
458410
Let's cover a few built-in or commonly-used vector summary operations here, focusing on those which are most useful for statistics.
459411

460412
Function | R | Python
@@ -584,9 +536,6 @@ pd.DataFrame(list(itertools.product(A, B))) # data frame
584536

585537
:::
586538

587-
### Logical Operations on Vectors
588-
589-
XXX TODO XXX
590539

591540
### Reviewing Types
592541

@@ -679,6 +628,110 @@ x[x2xor3]
679628
```
680629
:::
681630

631+
632+
## Lists
633+
634+
A **list** is a one-dimensional column of heterogeneous data - the things stored in a list can be of different types.
635+
636+
![A lego list: the bricks are all different types and colors, but they are still part of the same data structure.](../images/gen-prog/lego-list.png)
637+
638+
::: panel-tabset
639+
### R {.unnumbered}
640+
641+
```{r list-r}
642+
x <- list("a", 3, FALSE)
643+
x
644+
```
645+
646+
### Python {.unnumbered}
647+
648+
```{python list-py}
649+
x = ["a", 3, False]
650+
x
651+
```
652+
:::
653+
654+
The most important thing to know about lists, for the moment, is how to pull things out of the list. We call that process **indexing**.
655+
656+
### Indexing
657+
658+
Every element in a list has an **index** (a location, indicated by an integer position)[^05-vectors-1].
659+
660+
[^05-vectors-1]: Throughout this section (and other sections), lego pictures are rendered using https://www.mecabricks.com/en/workshop. It's a pretty nice tool for building stuff online!
661+
662+
::: panel-tabset
663+
#### R concept {.unnumbered}
664+
665+
In R, we count from 1.
666+
667+
![An R-indexed lego list, counting from 1 to 5](../images/gen-prog/list-indexing-r.png){fig-alt="A set of 5 bricks on a virtual lego board. The first brick is a blue 1x1 labeled with a 1. The second brick is a green 1x2, labeled with a 2. The third brick is a pink 1x4 labeled with a 3. The fourth brick is a yellow 1x2, labeled with a 4. The fifth brick is a pink 2x2 labeled with a 5. This represents a list, as the blocks are of different types (sizes)."}
668+
669+
#### R code {.unnumbered}
670+
671+
```{r, error = T}
672+
x <- list("a", 3, FALSE)
673+
674+
x[1] # This returns a list
675+
x[1:2] # This returns multiple elements in the list
676+
677+
x[[1]] # This returns the item
678+
x[[1:2]] # This doesn't work - you can only use [[]] with a single index
679+
```
680+
681+
In R, list indexing with `[]` will return a list with the specified elements.
682+
683+
To actually retrieve the item in the list, use `[[]]`. The only downside to `[[]]` is that you can only access one thing at a time.
684+
685+
#### Python concept {.unnumbered}
686+
687+
In Python, we count from 0.
688+
689+
![A python-indexed lego list, counting from 0 to 4](../images/gen-prog/list-indexing-py.png){fig-alt="A set of 5 bricks on a virtual lego board. The first brick is a blue 1x1 labeled with a 0. The second brick is a green 1x2, labeled with a 1. The third brick is a pink 1x4 labeled with a 2. The fourth brick is a yellow 1x2, labeled with a 3. The fifth brick is a pink 2x2 labeled with a 4. This represents a list, as the blocks are of different types (sizes)."}
690+
691+
#### Python code {.unnumbered}
692+
693+
```{python}
694+
x = ["a", 3, False]
695+
696+
x[0]
697+
x[1]
698+
x[0:2]
699+
```
700+
701+
In Python, we can use single brackets to get an object or a list back out, but we have to know how **slices** work. Essentially, in Python, `0:2` indicates that we want objects 0 and 1, but want to stop at 2 (not including 2). If you use a slice, Python will return a list; if you use a single index, Python just returns the value in that location in the list.
702+
:::
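
The difference between a single index and a slice is easy to miss, so here is a small sketch that reuses the list `x` from the Python code above (the other variable names are just for illustration). A slice always returns a list, even when it contains only one item, which is similar in spirit to `[]` versus `[[]]` in R.

```python
x = ["a", 3, False]

single = x[0]       # single index: returns the item itself
one_slice = x[0:1]  # slice: returns a list, even with just one item in it
two = x[0:2]        # items at positions 0 and 1; stops before position 2
rest = x[1:]        # leaving off the end means "through the last item"

print(single)
print(one_slice)
print(two)
print(rest)
```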
703+
704+
We'll talk more about indexing as it relates to vectors, but indexing is a general concept that applies to just about any multi-value object.
705+
706+
### Concatenation
707+
708+
Another important thing to know about lists is how to combine them.
709+
If I have rosters for two classes and I want to make a list of all of my students, I need to somehow merge the two lists together.
710+
711+
::: panel-tabset
712+
#### R
713+
```{r}
714+
class1 <- c("Benjamin Sisko", "Odo", "Julian Bashir", "Jadzia Dax", "Miles O'Brien", "Quark", "Kira Nerys", "Elim Garak")
715+
class2 <- c("Jean-Luc Picard", "William Riker", "Geordi La Forge", "Worf", "Miles O'Brien", "Beverly Crusher", "Deanna Troi", "Data")
716+
717+
students <- c(class1, class2)
718+
students
719+
720+
unique(students) # get only unique names
721+
```
722+
723+
#### Python
724+
```{python}
725+
class1 = ["Benjamin Sisko", "Odo", "Julian Bashir", "Jadzia Dax", "Miles O'Brien", "Quark", "Kira Nerys", "Elim Garak"]
726+
class2 = ["Jean-Luc Picard", "William Riker", "Geordi La Forge", "Worf", "Miles O'Brien", "Beverly Crusher", "Deanna Troi", "Data"]
727+
728+
students = class1 + class2
729+
students
730+
list(dict.fromkeys(students)) # get only unique names
731+
```
732+
:::
733+
734+
682735
## Matrices
683736

684737
A **matrix** is the next step after a vector - it's a set of values arranged in a two-dimensional, rectangular format.
@@ -887,7 +940,7 @@ Let's see what happens when we work with the data above as a set of vectors/Seri
887940
```{python read-state-pops, cache = T}
888941
import pandas as pd
889942
890-
data = pd.read_html("https://worldpopulationreview.com/states")[0]
943+
data = pd.read_csv("https://raw.githubusercontent.com/srvanderplas/stat-computing-r-python/main/data/population2024.csv")
891944
list(data.columns) # get names
892945
893946
# Create a few population series
@@ -900,20 +953,21 @@ Suppose that we want to sort each population vector by the population in that ye
900953

901954
```{python vector-analysis-python, dependson = 'read-state-pops'}
902955
import pandas as pd
903-
data = pd.read_html("https://worldpopulationreview.com/states")[0]
956+
957+
data = pd.read_csv("https://raw.githubusercontent.com/srvanderplas/stat-computing-r-python/main/data/population2024.csv")
904958
905959
population2024 = pd.Series(data['2024 Population'].values, index = data['State'].values).sort_values()
906960
population2023 = pd.Series(data['2023 Population'].values, index = data['State'].values).sort_values()
907-
populationCensus = pd.Series(data['2020 Population'].values, index = data['State'].values).sort_values()
961+
population2020 = pd.Series(data['2020 Population'].values, index = data['State'].values).sort_values()
908962
909-
population2024.head()
910-
population2023.head()
911-
populationCensus.head()
963+
population2024.head(10)
964+
population2023.head(10)
965+
population2020.head(10)
912966
```
913967

914968
The only problem is that by doing this, we've now lost the ordering that matched across all 3 vectors.
915-
Pandas Series are great for this, because they use labels that allow us to reconstitute which value corresponds to which label, but in R or even in numpy arrays, vectors don't inherently come with labels.
916-
In these situations, sorting by one value can actually destroy the connection between two vectors!
969+
Pandas Series are great for showing this problem, because they use labels that allow us to reconstitute which value corresponds to which label, but in R or even in numpy arrays, vectors don't inherently come with labels.
970+
In these situations, sorting by one value can actually destroy the connection between two vectors, in a way that you don't even notice!
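
Here is a small sketch of that failure with plain Python lists, which have no labels at all. The labels and numbers are made up purely for illustration and are not taken from the population data. Sorting each list separately scrambles which value belongs to which label; computing one ordering and applying it to every list keeps them aligned.

```python
# Hypothetical, made-up values; position i in each list refers to the same unit
states = ["A", "B", "C"]
pop_then = [30, 10, 20]
pop_now = [25, 35, 15]

# Sorting each list on its own: the first position now describes
# two different units (B in one list, C in the other), and nothing warns us
sorted_then = sorted(pop_then)
sorted_now = sorted(pop_now)

# Safer: work out the ordering once, then apply it to every list
order = sorted(range(len(states)), key=lambda i: pop_now[i])
states_by_now = [states[i] for i in order]
then_by_now = [pop_then[i] for i in order]
now_by_now = [pop_now[i] for i in order]

print(sorted_then, sorted_now)
print(states_by_now, then_by_now, now_by_now)
```

Keeping related vectors aligned by hand like this works, but it is tedious and easy to get wrong.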
917971

918972
#### R
919973
