Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3439: [R] R language bindings for Feather format #2947

Closed
wants to merge 15 commits into from

Conversation

romainfrancois
Copy link
Contributor

library(arrow, warn.conflicts = FALSE)
library(tibble)

tib <- tibble(x = 1:10, y = rnorm(10), z = letters[1:10])
tib
#> # A tibble: 10 x 3
#>        x       y z    
#>    <int>   <dbl> <chr>
#>  1     1  0.315  a    
#>  2     2 -1.16   b    
#>  3     3  0.0980 c    
#>  4     4  0.836  d    
#>  5     5 -1.12   e    
#>  6     6  0.118  f    
#>  7     7  2.74   g    
#>  8     8 -0.323  h    
#>  9     9  0.291  i    
#> 10    10  1.40   j
tf <- tempfile()
write_feather(tib, tf)

# reads back into an arrow::Table
tab1 <- read_feather(tf)
tab1
#> arrow::Table

# convert that to tibble
as_tibble(tab1)
#> # A tibble: 10 x 3
#>        x       y z    
#>    <int>   <dbl> <chr>
#>  1     1  0.315  a    
#>  2     2 -1.16   b    
#>  3     3  0.0980 c    
#>  4     4  0.836  d    
#>  5     5 -1.12   e    
#>  6     6  0.118  f    
#>  7     7  2.74   g    
#>  8     8 -0.323  h    
#>  9     9  0.291  i    
#> 10    10  1.40   j
identical(tib, as_tibble(tab1))
#> [1] TRUE

# read it through reticulate
library(reticulate)
pa <- import("pyarrow")

# read straight to a data frame via pyarrow.feather.read_feather
tab2 <- pa$feather$read_feather(tf)
# just so that we cna compare
attr(tab2, "pandas.index") <-  NULL
tab2
#>     x          y z
#> 1   1  0.3148670 a
#> 2   2 -1.1583740 b
#> 3   3  0.0980070 c
#> 4   4  0.8360379 d
#> 5   5 -1.1171395 e
#> 6   6  0.1177422 f
#> 7   7  2.7407804 g
#> 8   8 -0.3229787 h
#> 9   9  0.2910557 i
#> 10 10  1.4015148 j

identical(tib, as_tibble(tab2))
#> [1] TRUE

Created on 2018-11-12 by the reprex package (v0.2.1.9000)

@wesm
Copy link
Member

wesm commented Nov 12, 2018

Very nice. How's the performance compare for a ~1GB dataset with the current feather package?

@wesm
Copy link
Member

wesm commented Nov 12, 2018

BTW I frequently cannot read or interpret the emojis in your commit messages, would it be ok to use normal text instead? :)

@romainfrancois
Copy link
Contributor Author

No idea about the performance. Most of the work is done by the C++ library. I'll check that.

@jameslamb
Copy link
Contributor

This is so exciting!

@romainfrancois for context on @wesm 's question, see this: wesm/feather#294

current feather has a hard upper limit of 2.4GB per column

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK. I would expect the performance to be worse than library(feather) because memory mapping is not being used

r/R/feather.R Outdated Show resolved Hide resolved
r/R/feather.R Outdated Show resolved Hide resolved
@wesm
Copy link
Member

wesm commented Nov 12, 2018

FYI, when I switched the Feather stuff to run off the Arrow codebase, things got a lot faster

http://wesmckinney.com/blog/feather-arrow-future/

feather_perf

Might try to replicate these benchmarks in R to see. We're missing some optimizations in R, though, like multithreaded conversions (there's JIRAs up for this)

@romainfrancois
Copy link
Contributor Author

library(arrow, warn.conflicts = FALSE)
library(tibble)

tib <- tibble(x = 1:10, y = rnorm(10), z = letters[1:10])
tf <- tempfile()
write_feather(tib, tf)

tab <- read_feather(tf, mmap = TRUE)
tab
#> arrow::Table
as_tibble(tab)
#> # A tibble: 10 x 3
#>        x      y z    
#>    <int>  <dbl> <chr>
#>  1     1  0.253 a    
#>  2     2  0.988 b    
#>  3     3  1.54  c    
#>  4     4 -0.684 d    
#>  5     5 -0.666 e    
#>  6     6 -0.763 f    
#>  7     7  0.833 g    
#>  8     8  2.11  h    
#>  9     9  0.887 i    
#> 10    10  1.41  j

Created on 2018-11-12 by the reprex package (v0.2.1.9000)

@wesm
Copy link
Member

wesm commented Nov 12, 2018

Cool, I will review again when I can, I'll be interested to see the perf numbers; it might be useful to start a directory of benchmarks where we can periodically check the performance of things

@romainfrancois
Copy link
Contributor Author

library(lobstr)
library(tibble)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
#> The following objects are masked from 'package:base':
#> 
#>     array, table
library(feather)
#> 
#> Attaching package: 'feather'
#> The following objects are masked from 'package:arrow':
#> 
#>     read_feather, write_feather

n <- 5e7
tib <- tibble(x = rnorm(n), y = rnorm(n), z = 1:n + 1L, a = sample(letters, n, replace = TRUE))
lobstr::obj_size(tib)
#> 1,400,002,664 B

write_feather(tib, "data.feather")

fs::file_info("data.feather")
#> # A tibble: 1 x 18
#>   path         type   size permissions modification_time   user  group
#>   <fs::path>   <fct> <fs:> <fs::perms> <dttm>              <chr> <chr>
#> 1 data.feather file  1.16G rw-r--r--   2018-11-13 15:19:50 roma… staff
#> # … with 11 more variables: device_id <dbl>, hard_links <dbl>,
#> #   special_device_id <dbl>, inode <dbl>, block_size <dbl>, blocks <dbl>,
#> #   flags <int>, generation <dbl>, access_time <dttm>, change_time <dttm>,
#> #   birth_time <dttm>

bench::mark(check = FALSE,
  feather::read_feather("data.feather"),

  arrow::read_feather("data.feather"),
  as_tibble(arrow::read_feather("data.feather"))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 10
#>   expression    min  mean median      max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:> <bch> <bch:> <bch:tm>     <dbl> <bch:byt> <dbl> <int>
#> 1 "feather:…  5.34s 5.34s  5.34s    5.34s     0.187     1.3GB     2     1
#> 2 "arrow::r… 1.38ms 7.3ms  1.6ms 426.94ms   137.       8.72MB     3   104
#> 3 "as_tibbl…  2.53s 2.53s  2.53s    2.53s     0.395     1.3GB     0     1
#> # … with 1 more variable: total_time <bch:tm>

Created on 2018-11-13 by the reprex package (v0.2.1.9000)

@romainfrancois
Copy link
Contributor Author

so arrow::read_feather() is pretty fast as this just makes an arrow::Table: no copy of data to R data structures.

if we add data copy we get the 3rd line, still twice as fast as feather, but not that impressive.

However, this is not using altrep at all yet, which it really should so that the memory of the columns of the arrow::Table are reused by the R object.

@wesm
Copy link
Member

wesm commented Nov 13, 2018

Well, being twice as fast is pretty good, as in theory the libraries are doing the same thing. Can you run a benchmark where about 50% of the values are NA?

@romainfrancois
Copy link
Contributor Author

library(lobstr)
library(tibble)
library(arrow)
library(feather)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

n <- 5e7
tib <- tibble(x = rnorm(n), y = rnorm(n), z = 1:n + 1L, a = sample(letters, n, replace = TRUE))
tib <- tib %>%
  mutate_all(~{ .[sample(length(.), length(.)/2L)] <- NA ; .})

lobstr::obj_size(tib)
#> 1,400,002,720 B
summarise_all(tib, ~sum(is.na(.)) / n())
#> # A tibble: 1 x 4
#>       x     y     z     a
#>   <dbl> <dbl> <dbl> <dbl>
#> 1   0.5   0.5   0.5   0.5

arrow::write_feather(tib, "data-na50.feather")

fs::file_info("data-na50.feather")
#> # A tibble: 1 x 18
#>   path              type   size permissions modification_time   user  group
#>   <fs::path>        <fct> <fs:> <fs::perms> <dttm>              <chr> <chr>
#> 1 data-na50.feather file  1.16G rw-r--r--   2018-11-13 15:46:20 roma… staff
#> # … with 11 more variables: device_id <dbl>, hard_links <dbl>,
#> #   special_device_id <dbl>, inode <dbl>, block_size <dbl>, blocks <dbl>,
#> #   flags <int>, generation <dbl>, access_time <dttm>, change_time <dttm>,
#> #   birth_time <dttm>

bench::mark(check = FALSE,
  feather = feather::read_feather("data-na50.feather"),

  # arrow::read_feather("data-na50.feather"),
  arrow = as_tibble(arrow::read_feather("data-na50.feather"))
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 x 10
#>   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>
#> 1 feather    4.15s 4.15s  4.15s 4.15s     0.241     1.3GB     1     1
#> 2 arrow      3.51s 3.51s  3.51s 3.51s     0.285    1.31GB     1     1
#> # … with 1 more variable: total_time <bch:tm>

But at the moment all this: happens in one thread, e.g.

      // first copy all the data
      std::copy_n(p_values, n, data.begin() + start);

      if (null_count) {
        // then set the sentinel NA
        arrow::internal::BitmapReader bitmap_reader(array->null_bitmap()->data(),
                                                    array->offset(), n);

        for (size_t i = 0; i < n; i++, bitmap_reader.Next()) {
          if (bitmap_reader.IsNotSet()) {
            data[i + start] = default_value<RTYPE>();
          }
        }
      }

There's room for improvement here, even with copies (no altrep). But this is for later. We focus on the features first I suppose.

@wesm
Copy link
Member

wesm commented Nov 13, 2018

Multithreading (https://issues.apache.org/jira/browse/ARROW-3315) will help a lot

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you!

SetDescription = function(description) ipc___feather___TableWriter__SetDescription(self, description),
SetNumRows = function(num_rows) ipc___feather___TableWriter__SetNumRows(self, num_rows),
Append = function(name, values) ipc___feather___TableWriter__Append(self, name, values),
Finalize = function() ipc___feather___TableWriter__Finalize(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a rationale for using 3 underscores in these methods like this (instead of 1 or 2)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of:

  • 3 underscores is for namespaces, so ipc:: becomes ipc___
  • 2 underscores is for methods of a class, so TableWriter.Finalize becomes TableWriter__Finalize
  • 1 underscore is sometimes used in the functions themselves

As of now, this is all generated manually, but maybe some day I'll sit down and have a way to generate these things (the R6 class, and the C++ functions) from a simple format. Something like what RcppR6 does, although IIRC we would need additional tooling for things like shared_ptr or unique_ptr, ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, is this an internal convention or an Rcpp thing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not. There are no convention, and also no way (IIRC) to export a function that lives in a namespace. That's just something semi recognizable and easier to generate automatically if at some point we go that way (à la RcppR6)

#' @return an arrow::Table
#'
#' @export
read_feather <- function(file, ...){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that the semantics of this function are different from feather::read_feather (it returns an arrow::Table instead of data.frame), if we care

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it would be a nice incentive for people to abandon the feather package in favor of arrow if we could say arrow::read_feather() / arrow::write_feathe()r are drop-in replacements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the goals of arrow is to make Array structures palpable in R, which feather didn't do. But yeah I guess we need both, i.e.

  • a function that returns an arrow::Table, not sure how to call this one.
  • a function, perhaps read_feather if this need to look like feather::read_feather that returns a tibble by calling as_tibble on the result of the first one.

Right now, it makes sense to go straight to tibble, but soon enough this (or other) packages will have more tooling to manipulate and compute directly on arrow memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add an as_tibble = TRUE argument to read_feather so this can be toggled on and off

@wesm
Copy link
Member

wesm commented Nov 14, 2018

Hmm, CI is failing, I just restarted the job. Where do we stand on #2770? It would be nice to be able to easily run the unit tests locally on a machine with Docker

@jameslamb
Copy link
Contributor

@wesm I'm still stuck on #2770 trying to figure out what part of the build isn't recognizing / using that ABI flag :/

@romainfrancois
Copy link
Contributor Author

It's usually very benign, e.g. here it's Rd consistency https://travis-ci.org/apache/arrow/jobs/454386405#L2169

I think the rate limit problem is fixed now (@jimhester can you confirm?) so perhaps we can revert #2846 ?

@romainfrancois
Copy link
Contributor Author

No idea what that new error is, but it's not R related.

@wesm
Copy link
Member

wesm commented Nov 14, 2018

Merging. CI failed this time due to rate limit

The downloaded source packages are in
	‘/tmp/RtmptNE6cU/downloaded_packages’
2.92s$ Rscript -e 'deps <- devtools::dev_package_deps(dependencies = NA);devtools::install_deps(dependencies = TRUE);if (!all(deps$package %in% installed.packages())) { message("missing: ", paste(setdiff(deps$package, installed.packages()), collapse=", ")); q(status = 1, save = "no")}'
Error: HTTP error 403.
  API rate limit exceeded for 35.224.112.202. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)
  Rate limit remaining: 0
  Rate limit reset at: 2018-11-14 14:15:45 UTC
Execution halted

@wesm wesm closed this in b541888 Nov 14, 2018
@jimhester
Copy link

If you delete the cache for the R builds it should fix the rate limiting, it looks like it is still using an older version of the remotes package, which is what devtools uses under the hood.

@javierluraschi
Copy link
Contributor

@romainfrancois Looks like this PR makes the R arrow bindings depend on an unreleased-version of arrow, the workaround is to build arrow from source. Is this intentional?

@wesm
Copy link
Member

wesm commented Nov 14, 2018

Well, it wouldn't be appropriate for the R library to release without the C++ libraries also releasing (unless R starts releasing separately... I would advise against for now). So for the time being we should be developing master-to-master until the next Arrow release

wesm pushed a commit that referenced this pull request Jan 4, 2019
The main entry point is the `csv_read()` function, all it does is create a `csv::TableReader` with the `csv_table_reader()` generic and then `$Read()` from it.

as in the #2947 for feather format, `csv_table_reader` is generic with the methods:
 - arrow::io::InputStream: calls the TableReader actor with the other options
 - character and fs_path: depending on the `mmap` option (TRUE by default) it opens the file with `mmap_open()` of `file_open()` and then calls the other method.

``` r
library(arrow)
tf <- tempfile()
readr::write_csv(iris, tf)

tab1 <- csv_read(tf)
tab1
#> arrow::Table
as_tibble(tab1)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <chr>
#>  1          5.1         3.5          1.4         0.2 setosa
#>  2          4.9         3            1.4         0.2 setosa
#>  3          4.7         3.2          1.3         0.2 setosa
#>  4          4.6         3.1          1.5         0.2 setosa
#>  5          5           3.6          1.4         0.2 setosa
#>  6          5.4         3.9          1.7         0.4 setosa
#>  7          4.6         3.4          1.4         0.3 setosa
#>  8          5           3.4          1.5         0.2 setosa
#>  9          4.4         2.9          1.4         0.2 setosa
#> 10          4.9         3.1          1.5         0.1 setosa
#> # … with 140 more rows
```

<sup>Created on 2018-11-13 by the [reprex package](https://reprex.tidyverse.org) (v0.2.1.9000)</sup>

Author: Romain Francois <romain@purrple.cat>

Closes #2949 from romainfrancois/ARROW-3760/csv_reader and squashes the following commits:

951e9f5 <Romain Francois> s/csv_read/read_csv_arrow/
7770ec5 <Romain Francois> not using readr:: at this point
bb13a76 <Romain Francois> rebase
83b5162 <Romain Francois> s/file_open/ReadableFile/
959020c <Romain Francois> No need to special use mmap for file path method
6e74003 <Romain Francois> going through CharacterVector makes sure this is a character vector
2585501 <Romain Francois> line breaks for readability
0ab8397 <Romain Francois> linting
09187e6 <Romain Francois> Expose arrow::csv::TableReader, functions csv_table_reader() + csv_read()
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Jan 10, 2019
The main entry point is the `csv_read()` function, all it does is create a `csv::TableReader` with the `csv_table_reader()` generic and then `$Read()` from it.

as in the apache#2947 for feather format, `csv_table_reader` is generic with the methods:
 - arrow::io::InputStream: calls the TableReader actor with the other options
 - character and fs_path: depending on the `mmap` option (TRUE by default) it opens the file with `mmap_open()` of `file_open()` and then calls the other method.

``` r
library(arrow)
tf <- tempfile()
readr::write_csv(iris, tf)

tab1 <- csv_read(tf)
tab1
#> arrow::Table
as_tibble(tab1)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <chr>
#>  1          5.1         3.5          1.4         0.2 setosa
#>  2          4.9         3            1.4         0.2 setosa
#>  3          4.7         3.2          1.3         0.2 setosa
#>  4          4.6         3.1          1.5         0.2 setosa
#>  5          5           3.6          1.4         0.2 setosa
#>  6          5.4         3.9          1.7         0.4 setosa
#>  7          4.6         3.4          1.4         0.3 setosa
#>  8          5           3.4          1.5         0.2 setosa
#>  9          4.4         2.9          1.4         0.2 setosa
#> 10          4.9         3.1          1.5         0.1 setosa
#> # … with 140 more rows
```

<sup>Created on 2018-11-13 by the [reprex package](https://reprex.tidyverse.org) (v0.2.1.9000)</sup>

Author: Romain Francois <romain@purrple.cat>

Closes apache#2949 from romainfrancois/ARROW-3760/csv_reader and squashes the following commits:

951e9f5 <Romain Francois> s/csv_read/read_csv_arrow/
7770ec5 <Romain Francois> not using readr:: at this point
bb13a76 <Romain Francois> rebase
83b5162 <Romain Francois> s/file_open/ReadableFile/
959020c <Romain Francois> No need to special use mmap for file path method
6e74003 <Romain Francois> going through CharacterVector makes sure this is a character vector
2585501 <Romain Francois> line breaks for readability
0ab8397 <Romain Francois> linting
09187e6 <Romain Francois> Expose arrow::csv::TableReader, functions csv_table_reader() + csv_read()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants