Author: Mark
Rieke
License:
MIT
{nplyr}
is a grammar of nested data manipulation that allows users to
perform dplyr-like manipulations on data
frames nested within a list-col of another data frame. Most dplyr verbs
have nested equivalents in nplyr. A (non-exhaustive) list of examples:
nest_mutate()
is the nested equivalent ofmutate()
nest_select()
is the nested equivalent ofselect()
nest_filter()
is the nested equivalent offilter()
nest_summarise()
is the nested equivalent ofsummarise()
nest_group_by()
is the nested equivalent ofgroup_by()
As of version 0.2.0, nplyr also supports nested versions of some tidyr functions:
nest_drop_na()
is the nested equivalent ofdrop_na()
nest_extract()
is the nested equivalent ofextract()
nest_fill()
is the nested equivalent offill()
nest_replace_na()
is the nested equivalent ofreplace_na()
nest_separate()
is the nested equivalent ofseparate()
nest_unite()
is the nested equivalent ofunite()
nplyr is largely a wrapper for dplyr. For the most up-to-date information on dplyr please visit dplyr’s website. If you are new to dplyr, the best place to start is the data transformation chapter in R for data science.
You can install the released version of nplyr from CRAN or the development version from github with the devtools or remotes package:
# install from CRAN
install.packages("nplyr")
# install from github
devtools::install_github("markjrieke/nplyr")
To get started, we’ll create a nested column for the country data within each continent from the gapminder dataset.
library(nplyr)
gm_nest <-
gapminder::gapminder_unfiltered %>%
tidyr::nest(country_data = -continent)
gm_nest
#> # A tibble: 6 × 2
#> continent country_data
#> <fct> <list>
#> 1 Asia <tibble [578 × 5]>
#> 2 Europe <tibble [1,302 × 5]>
#> 3 Africa <tibble [637 × 5]>
#> 4 Americas <tibble [470 × 5]>
#> 5 FSU <tibble [139 × 5]>
#> 6 Oceania <tibble [187 × 5]>
dplyr can perform operations on the top-level data frame, but with nplyr, we can perform operations on the nested data frames:
gm_nest_example <-
gm_nest %>%
nest_filter(country_data, year == max(year)) %>%
nest_mutate(country_data, pop_millions = pop/1000000)
# each nested tibble is now filtered to the most recent year
gm_nest_example
#> # A tibble: 6 × 2
#> continent country_data
#> <fct> <list>
#> 1 Asia <tibble [43 × 6]>
#> 2 Europe <tibble [34 × 6]>
#> 3 Africa <tibble [53 × 6]>
#> 4 Americas <tibble [33 × 6]>
#> 5 FSU <tibble [9 × 6]>
#> 6 Oceania <tibble [11 × 6]>
# if we unnest, we can see that a new column for pop_millions has been added
gm_nest_example %>%
slice_head(n = 1) %>%
tidyr::unnest(country_data)
#> # A tibble: 43 × 7
#> continent country year lifeExp pop gdpPercap pop_millions
#> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
#> 1 Asia Afghanistan 2007 43.8 31889923 975. 31.9
#> 2 Asia Azerbaijan 2007 67.5 8017309 7709. 8.02
#> 3 Asia Bahrain 2007 75.6 708573 29796. 0.709
#> 4 Asia Bangladesh 2007 64.1 150448339 1391. 150.
#> 5 Asia Bhutan 2007 65.6 2327849 4745. 2.33
#> 6 Asia Brunei 2007 77.1 386511 48015. 0.387
#> 7 Asia Cambodia 2007 59.7 14131858 1714. 14.1
#> 8 Asia China 2007 73.0 1318683096 4959. 1319.
#> 9 Asia Hong Kong, China 2007 82.2 6980412 39725. 6.98
#> 10 Asia India 2007 64.7 1110396331 2452. 1110.
#> # … with 33 more rows
nplyr also supports grouped operations with nest_group_by()
:
gm_nest_example <-
gm_nest %>%
nest_group_by(country_data, year) %>%
nest_summarise(
country_data,
n = n(),
lifeExp = median(lifeExp),
pop = median(pop),
gdpPercap = median(gdpPercap)
)
gm_nest_example
#> # A tibble: 6 × 2
#> continent country_data
#> <fct> <list>
#> 1 Asia <tibble [58 × 5]>
#> 2 Europe <tibble [58 × 5]>
#> 3 Africa <tibble [13 × 5]>
#> 4 Americas <tibble [57 × 5]>
#> 5 FSU <tibble [44 × 5]>
#> 6 Oceania <tibble [56 × 5]>
# unnesting shows summarised tibbles for each continent
gm_nest_example %>%
slice(2) %>%
tidyr::unnest(country_data)
#> # A tibble: 58 × 6
#> continent year n lifeExp pop gdpPercap
#> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Europe 1950 22 65.8 7408264 6343.
#> 2 Europe 1951 18 65.7 7165515 6509.
#> 3 Europe 1952 31 65.9 7124673 5210.
#> 4 Europe 1953 17 67.3 7346100 6774.
#> 5 Europe 1954 17 68.0 7423300 7046.
#> 6 Europe 1955 17 68.5 7499400 7817.
#> 7 Europe 1956 17 68.5 7575800 8224.
#> 8 Europe 1957 31 67.5 7363802 6093.
#> 9 Europe 1958 18 69.6 8308052. 8833.
#> 10 Europe 1959 18 69.6 8379664. 9088.
#> # … with 48 more rows
More examples can be found in the package vignettes and function documentation.
If you notice a bug, want to request a new feature, or have recommendations on improving documentation, please open an issue in the package repository.