-
Notifications
You must be signed in to change notification settings - Fork 25
/
Copy pathjoins.qmd
115 lines (74 loc) · 3.46 KB
/
joins.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
---
output: html_document
editor_options:
chunk_output_type: console
---
# Joining Data Tables
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
pkg_ver(c("dplyr"))
```
----
`dplyr` provides functions for joining data from one table to another allowing you to combine information based on shared keys. Four joining functions, each with distinct behavior, are: `left_join`, `right_join`, `inner_join`, and `full join`.
To demonstrate these functions, we'll be joining two dataframes: `df` and `dj`.
```{r, message=FALSE}
df <- data.frame( x = c(1, 23, 4, 43, 2, 17),
y = c("a", "b", "b", "b", "a", "d"),
stringsAsFactors = FALSE)
df
```
```{r, message=FALSE}
dj <- data.frame( z = c("apple", "pear", "orange"),
y = c("a", "b", "c"),
stringsAsFactors = FALSE)
dj
```
In the examples that follow, we will join both tables by the shared key `y`. A shared key is a column (or set of columns) that appears in both tables and is used to match rows when joining them. It acts like a common identifier that links related data together. Note that the column names do not need to match (see [note](#note1) at the bottom of this page).
## Left join
In this example, if a join element in `df` does not exist in `dj`, `NA` will be assigned to column `z`. In other words, all elements in `df` will exist in the output regardless if a matching element is found in `dj`. Note that the output is sorted in the same order as `df` (the *left* table).
```{r, warning=FALSE}
library(dplyr)
left_join(df, dj, by="y")
```
<img src="img/left_join.png" style="width: 40%; height: auto;">
## Right join
If a join element in `df` does not exist in `dj`, that element is removed from the output. A few additional important notes follow:
+ All elements in `dj` appear at least once in the output (even if they don't have a match in `df` in which case an `NA` value is added);
+ The output table is sorted in the order in which the `y` elements appear in `dj`;
+ Element `y` will appear as many times as there are matching `y`'s in `df`.
```{r, warning=FALSE}
right_join(df, dj, by="y")
```
<img src="img/right_join.png" style="width: 40%; height: auto;">
## Inner join
In this example, **only** matching elements in both `df` and `dj` are saved in the output. This is basically an "intersection" of both tables.
```{r, warning=FALSE}
inner_join(df, dj, by="y")
```
<img src="img/inner_join.png" style="width: 40%; height: auto;">
## Full join
In this example, **all** elements in both `df` and `dj` are present in the output. For non-matching pairs, `NA` values are supplied. This is basically a "union" of both tables.
```{r warning=FALSE}
full_join(df, dj, by="y")
```
<img src="img/full_join.png" style="width: 40%; height: auto;">
## Joins in a piping operation
The aforementioned joining functions can be used with pipes. For example:
```{r}
df %>%
left_join(dj, by = "y")
```
## A note about shared key names {#note1}
If the shared keys have different names in each table, you need to specify the `by =` argument as `by = c("left_col" = "right_col")`. For example,
```{r, message=FALSE}
library(dplyr)
df <- data.frame( x = c(1, 23, 4, 43, 2, 17),
y1 = c("a", "b", "b", "b", "a", "d"),
stringsAsFactors = FALSE)
dj <- data.frame( z = c("apple", "pear", "orange"),
y2 = c("a", "b", "c"),
stringsAsFactors = FALSE)
left_join(df, dj, by = c("y1" = "y2"))
```