-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3731: MVP to read parquet in R library #3230
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this initial work. I've added some comments inline.
I need to have a look at the parquet api on the python side as we try to be as compatible as possible.
This needs
- more low level api too, i.e. classes such as
parquet::arrow::FileReader
exposed on the R side. - a way to write parquet files too
- unit tests
I'll come back here with more comments once I have more information on both python and C++ api for parquet.
r/R/read_parquet.R
Outdated
tables = lapply(files, function(f) { | ||
return (as_tibble(shared_ptr(`arrow::Table`, read_parquet_file(f)))) | ||
}) | ||
do.call('rbind', tables) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd probably use vctrs::vec_rbind(!!!tables)
or dplyr::bind_rows(!!!tables)
instead of do.call
+ rbind
.
arrow
currently does not depend on dplyr
so I guess vctrs::vec_rbind(!!!tables)
is preferred. @hadley
r/R/read_parquet.R
Outdated
#' @param files a vector of filenames | ||
#' @export | ||
read_parquet = function(files) { | ||
tables = lapply(files, function(f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'd probably use purrr::map()
or even purrr::map_dfr()
r/R/read_parquet.R
Outdated
#' @export | ||
read_parquet = function(files) { | ||
tables = lapply(files, function(f) { | ||
return (as_tibble(shared_ptr(`arrow::Table`, read_parquet_file(f)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could have only one parquet reader instead of one per file. But I need to read more about the parquet api first.
r/configure
Outdated
@@ -49,7 +49,8 @@ if [ "$INCLUDE_DIR" ] || [ "$LIB_DIR" ]; then | |||
elif [ "$PKGCONFIG_CFLAGS" ] || [ "$PKGCONFIG_LIBS" ]; then | |||
echo "Found pkg-config cflags and libs!" | |||
PKG_CFLAGS=${PKGCONFIG_CFLAGS} | |||
PKG_LIBS=${PKGCONFIG_LIBS} | |||
#PKG_LIBS=${PKGCONFIG_LIBS} | |||
PKG_LIBS="-larrow -lparquet" # hard coded! what is the right way to do this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should rather be addressed by updating the PKG_CONFIG_NAME
higher
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having this seems to work for me:
PKG_CONFIG_NAME="arrow parquet"
This needs an update to the README.Rmd
(and README.md
) to have -DARROW_PARQUET=ON
, i.e.:
cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
r/src/parquetfilereader.cpp
Outdated
// [[Rcpp::export]] | ||
std::shared_ptr<arrow::Table> read_parquet_file(std::string filename) { | ||
std::shared_ptr<arrow::io::ReadableFile> infile; | ||
PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the name is a big hint, I'm not sure what the macro does and if it should perhaps be replaced by something that plays well with Rcpp.
Seems like the python version of We can/should probably do the same here and maybe have an |
r/R/read_parquet.R
Outdated
@@ -0,0 +1,10 @@ | |||
#' Read parquet file from disk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs the license headers, you can just copy from any other R file.
https://travis-ci.org/apache/arrow/jobs/470349974#L493
r/src/parquetfilereader.cpp
Outdated
@@ -0,0 +1,38 @@ | |||
// // Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cpp files must adhere to the style. You can enforce that with
./r/lint.sh --fix
@jeffwong-nflx I've incorporated your commits and then some in this branch on my fork: This needs more work until this can be merged to the upstream repo, but if you want to continue working on this, please send pull requests to that branch. |
Looking more at the python api: https://arrow.apache.org/docs/python/parquet.html#reading-and-writing-single-files it looks like Then, we'd need something like the For multiple files, we should follow the partitioned dataset api of https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files |
I would recommend that you do not spend much energy implementing the ParquetDataset logic in R. I would like to port this from Python to C++, so we can maintain a common implementation https://issues.apache.org/jira/browse/ARROW-3764 |
It might be a good idea for someone to do that here in January so that it can be available to R and Ruby as soon as possible |
Definitely. Most of what the R package does is rely on the C++ api and make things available to R. I just assumed that the python code did that for ParquetDataset too. We'll wait for a C++ implementation to mature. |
I submitted a PR to @jeffwong-nflx fork to incorporate some changes on this PR: https://github.com/jeffwong-nflx/arrow/pull/2 |
@jeffwong-nflx would you mind if Romain picks up this branch? I think Romain can submit a new PR with your changes |
I can go either way. It all depends if @jeffwong-nflx wants to do more work, in which case I can guide/help and work on other things in the meantime. The other nicety is that it might give me ideas for a "how to contribute to the arrow R package" document. |
Follow up to initial parquet support
@romainfrancois let's consolidate onto your branch. This minimum functionality to read parquet files into R was what I was looking for in my application - I don't know if I can commit to getting the R library to have feature parity with the python library. |
Fair enough. What I'm proposing is that we merge this asap, and then I'll work out in another branch/pr, etc ... additional things for more parquet support. To merge we need to fix the last small issues, mostly about files that have not been removed (not sure why).
travis should pass then, and we can merge this PR to the apache repo. @wesm things are easier if we don't merge any other R PR for now (otherwise because of the centrality of the generated |
@jeffwong-nflx in addition, if you like, would you add a line in the DESCRIPTION file and mark yourself as a https://github.com/apache/arrow/blob/master/r/DESCRIPTION#L6 Something like:
|
@romainfrancois I added all the items to my fork, hope Travis CI passes! |
Travis CI failed, but it looks like the R part passed, and the python part failed. @wesm |
The problem looks unrelated. I restarted the job. We'll see what happens. Perhaps we'll need to rebase. |
Codecov Report
@@ Coverage Diff @@
## master #3230 +/- ##
==========================================
+ Coverage 88.18% 88.55% +0.37%
==========================================
Files 535 540 +5
Lines 72873 73104 +231
==========================================
+ Hits 64260 64740 +480
+ Misses 8506 8257 -249
Partials 107 107
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@@ -4,6 +4,7 @@ Version: 0.0.0.9000 | |||
Authors@R: c( | |||
person("Romain", "François", email = "romain@rstudio.com", role = c("aut", "cre")), | |||
person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")), | |||
person("Jeffrey", "Wong", email = "jeffreyw@netflix.com", role = c("ctb")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that we need to maintain these bylines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily.
I am contributing to [Arrow 3731](https://issues.apache.org/jira/browse/ARROW-3731). This PR has the minimum functionality to read parquet files into an arrow::Table, which can then be converted to a tibble. Multiple parquet files can be read inside `lapply`, and then concatenated at the end. Steps to compile 1) Build arrow and parquet c++ projects 2) In R run `devtools::load_all()` What I could use help with: The biggest challenge for me is my lack of experience with pkg-config. The R library has a `configure` file which uses pkg-config to figure out what c++ libraries to link to. Currently, `configure` looks up the Arrow project and links to -larrow only. We need it to also link to -lparquet. I do not know how to modify pkg-config's metadata to let it know to link to both -larrow and -lparquet Author: Jeffrey Wong <jeffreyw@netflix.com> Author: Romain Francois <romain@purrple.cat> Author: jeffwong-nflx <jeffreyw@netflix.com> Closes apache#3230 from jeffwong-nflx/master and squashes the following commits: c67fa3d <jeffwong-nflx> Merge pull request #3 from jeffwong-nflx/cleanup 1df3026 <Jeffrey Wong> don't hard code -larrow and -lparquet 8ccaa51 <Jeffrey Wong> cleanup 75ba5c9 <Jeffrey Wong> add contributor 56adad2 <jeffwong-nflx> Merge pull request #2 from romainfrancois/3731/parquet-2 7d6e64d <Romain Francois> read_parquet() only reading one parquet file, and gains a `as_tibble` argument e936b44 <Romain Francois> need parquet on travis too ff260c5 <Romain Francois> header was too commented, renamed to parquet.cpp 9e1897f <Romain Francois> styling etc ... 456c5d2 <Jeffrey Wong> read parquet files 22d89dd <Jeffrey Wong> hardcode -larrow and -lparquet
I am contributing to Arrow 3731. This PR has the minimum functionality to read parquet files into an arrow::Table, which can then be converted to a tibble. Multiple parquet files can be read inside
lapply
, and then concatenated at the end.Steps to compile
devtools::load_all()
What I could use help with:
The biggest challenge for me is my lack of experience with pkg-config. The R library has a
configure
file which uses pkg-config to figure out what c++ libraries to link to. Currently,configure
looks up the Arrow project and links to -larrow only. We need it to also link to -lparquet. I do not know how to modify pkg-config's metadata to let it know to link to both -larrow and -lparquet