
ARROW-3731: MVP to read parquet in R library #3230

Closed
wants to merge 11 commits into from
Conversation

@jeffwong-nflx (Contributor)

I am contributing to Arrow 3731. This PR has the minimum functionality to read parquet files into an arrow::Table, which can then be converted to a tibble. Multiple parquet files can be read inside lapply, and then concatenated at the end.

Steps to compile

  1. Build the arrow and parquet C++ projects
  2. In R, run devtools::load_all()

What I could use help with:
The biggest challenge for me is my lack of experience with pkg-config. The R library has a configure file which uses pkg-config to figure out which C++ libraries to link against. Currently, configure looks up the Arrow project and links to -larrow only. We need it to also link to -lparquet, but I do not know how to modify pkg-config's metadata so that it links to both -larrow and -lparquet.

@kou kou changed the title MVP to read parquet in R library [Arrow 3731] ARROW-3731: MVP to read parquet in R library Dec 20, 2018
@romainfrancois (Contributor) left a comment

Thanks for this initial work. I've added some comments inline.

I need to have a look at the parquet API on the Python side, as we try to be as compatible as possible.

This needs

  • a lower-level API too, i.e. classes such as parquet::arrow::FileReader exposed on the R side.
  • a way to write parquet files too
  • unit tests

I'll come back here with more comments once I have more information on both python and C++ api for parquet.

tables = lapply(files, function(f) {
  return(as_tibble(shared_ptr(`arrow::Table`, read_parquet_file(f))))
})
do.call('rbind', tables)

We'd probably use vctrs::vec_rbind(!!!tables) or dplyr::bind_rows(!!!tables) instead of do.call + rbind.

arrow currently does not depend on dplyr so I guess vctrs::vec_rbind(!!!tables) is preferred. @hadley

#' @param files a vector of filenames
#' @export
read_parquet = function(files) {
tables = lapply(files, function(f) {

We'd probably use purrr::map() or even purrr::map_dfr().

#' @export
read_parquet = function(files) {
tables = lapply(files, function(f) {
return (as_tibble(shared_ptr(`arrow::Table`, read_parquet_file(f))))

Maybe we could have only one parquet reader instead of one per file. But I need to read more about the parquet api first.

r/configure Outdated
@@ -49,7 +49,8 @@ if [ "$INCLUDE_DIR" ] || [ "$LIB_DIR" ]; then
elif [ "$PKGCONFIG_CFLAGS" ] || [ "$PKGCONFIG_LIBS" ]; then
echo "Found pkg-config cflags and libs!"
PKG_CFLAGS=${PKGCONFIG_CFLAGS}
PKG_LIBS=${PKGCONFIG_LIBS}
#PKG_LIBS=${PKGCONFIG_LIBS}
PKG_LIBS="-larrow -lparquet" # hard coded! what is the right way to do this?

I think this should rather be addressed by updating PKG_CONFIG_NAME higher up in the script.


Having this seems to work for me:

PKG_CONFIG_NAME="arrow parquet"

This needs an update to the README.Rmd (and README.md) to have -DARROW_PARQUET=ON, i.e.:

cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off

// [[Rcpp::export]]
std::shared_ptr<arrow::Table> read_parquet_file(std::string filename) {
std::shared_ptr<arrow::io::ReadableFile> infile;
PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(

Although the name is a big hint, I'm not sure what the macro does, or whether it should be replaced by something that plays well with Rcpp.

@romainfrancois (Contributor)

Seems like the python version of parquet.read_table returns an arrow::Table: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table

We can/should probably do the same here and maybe have an as_tibble argument like in read_feather to convert to tibble if set to TRUE.

@@ -0,0 +1,10 @@
#' Read parquet file from disk

This needs the license headers; you can just copy them from any other R file.
https://travis-ci.org/apache/arrow/jobs/470349974#L493

@@ -0,0 +1,38 @@
// // Licensed to the Apache Software Foundation (ASF) under one

The cpp files must adhere to the style. You can enforce that with

./r/lint.sh --fix

@romainfrancois (Contributor)

@jeffwong-nflx I've incorporated your commits and then some in this branch on my fork:
https://github.com/romainfrancois/arrow/tree/3731/parquet

This needs more work until this can be merged to the upstream repo, but if you want to continue working on this, please send pull requests to that branch.

@romainfrancois (Contributor)

Looking more at the Python API (https://arrow.apache.org/docs/python/parquet.html#reading-and-writing-single-files), it looks like parquet.read_table reads a single file, which would simplify the code here. Users may iterate over files themselves, but that does not belong in the read_parquet function. Maybe the function should be called parquet_read_table instead of read_parquet. Unfortunately we cannot nest packages in R, but once arrow is done, one thing we could do is make a small convenience parquet package so that we could write e.g. parquet::read_table().

Then, we'd need something like the ParquetFile and ParquetWriter classes at the r level: https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing

For multiple files, we should follow the partitioned dataset api of https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files

@wesm (Member)

wesm commented Jan 2, 2019

I would recommend that you do not spend much energy implementing the ParquetDataset logic in R. I would like to port this from Python to C++, so we can maintain a common implementation https://issues.apache.org/jira/browse/ARROW-3764

@wesm (Member)

wesm commented Jan 2, 2019

It might be a good idea for someone to do that here in January so that it can be available to R and Ruby as soon as possible.

@romainfrancois (Contributor)

Definitely. Most of what the R package does is rely on the C++ api and make things available to R. I just assumed that the python code did that for ParquetDataset too. We'll wait for a C++ implementation to mature.

@romainfrancois (Contributor)

I submitted a PR to @jeffwong-nflx fork to incorporate some changes on this PR: https://github.com/jeffwong-nflx/arrow/pull/2

@wesm (Member)

wesm commented Jan 3, 2019

@jeffwong-nflx would you mind if Romain picks up this branch? I think Romain can submit a new PR with your changes.

@romainfrancois (Contributor)

I can go either way. It all depends if @jeffwong-nflx wants to do more work, in which case I can guide/help and work on other things in the meantime.

The other nicety is that it might give me ideas for a "how to contribute to the arrow R package" document.

@jeffwong-nflx (Contributor, Author)

@romainfrancois let's consolidate onto your branch. This minimum functionality to read parquet files into R was what I was looking for in my application; I don't know whether I can commit to getting the R library to feature parity with the Python library.

@romainfrancois (Contributor)

Fair enough.

What I'm proposing is that we merge this ASAP, and then I'll work out additional parquet support in other branches/PRs.

To merge we need to fix the last small issues, mostly about files that have not been removed (not sure why).

travis should pass then, and we can merge this PR to the apache repo.

@wesm things are easier if we don't merge any other R PRs for now (because of the centrality of the generated RcppExports.* files, doing so typically forces a rebase, and that's not fun. I'm used to it so I don't mind, but I'd rather avoid it on external contributions).

@romainfrancois (Contributor)

@jeffwong-nflx in addition, if you like, you could add a line to the DESCRIPTION file and mark yourself as a contributor ("ctb"), e.g.

https://github.com/apache/arrow/blob/master/r/DESCRIPTION#L6

Something like:

    person("Jeff", "Wong", email = "<your@email>", role = c("ctb")),

@jeffwong-nflx (Contributor, Author)

@romainfrancois I added all the items to my fork, hope Travis CI passes!

@jeffwong-nflx (Contributor, Author)

Travis CI failed, but it looks like the R part passed, and the python part failed. @wesm

@romainfrancois (Contributor)

The problem looks unrelated. I restarted the job. We'll see what happens. Perhaps we'll need to rebase.

@codecov-io

Codecov Report

Merging #3230 into master will increase coverage by 0.37%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #3230      +/-   ##
==========================================
+ Coverage   88.18%   88.55%   +0.37%     
==========================================
  Files         535      540       +5     
  Lines       72873    73104     +231     
==========================================
+ Hits        64260    64740     +480     
+ Misses       8506     8257     -249     
  Partials      107      107
Impacted Files Coverage Δ
cpp/src/gandiva/configuration.h 88.23% <0%> (-11.77%) ⬇️
python/pyarrow/__init__.py 65.78% <0%> (-4.36%) ⬇️
cpp/src/arrow/util/compression_snappy.cc 75% <0%> (-2.78%) ⬇️
cpp/src/arrow/type.h 89.67% <0%> (-1.89%) ⬇️
python/pyarrow/tests/test_array.py 97.28% <0%> (-1.84%) ⬇️
cpp/src/arrow/python/helpers.cc 78.26% <0%> (-1.4%) ⬇️
cpp/src/arrow/util/thread-pool-test.cc 98.14% <0%> (-0.78%) ⬇️
cpp/src/arrow/record_batch.cc 89.84% <0%> (-0.74%) ⬇️
cpp/src/arrow/io/hdfs-internal.cc 33.04% <0%> (-0.58%) ⬇️
cpp/src/plasma/io.cc 74.21% <0%> (-0.4%) ⬇️
... and 126 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1a8c8f0...c67fa3d. Read the comment docs.

@wesm (Member) left a comment

+1

@@ -4,6 +4,7 @@ Version: 0.0.0.9000
Authors@R: c(
person("Romain", "François", email = "romain@rstudio.com", role = c("aut", "cre")),
person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")),
person("Jeffrey", "Wong", email = "jeffreyw@netflix.com", role = c("ctb")),

I don't know that we need to maintain these bylines


Not necessarily.

@wesm wesm closed this in 5723ada Jan 5, 2019
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Jan 10, 2019

Author: Jeffrey Wong <jeffreyw@netflix.com>
Author: Romain Francois <romain@purrple.cat>
Author: jeffwong-nflx <jeffreyw@netflix.com>

Closes apache#3230 from jeffwong-nflx/master and squashes the following commits:

c67fa3d <jeffwong-nflx> Merge pull request #3 from jeffwong-nflx/cleanup
1df3026 <Jeffrey Wong> don't hard code -larrow and -lparquet
8ccaa51 <Jeffrey Wong> cleanup
75ba5c9 <Jeffrey Wong> add contributor
56adad2 <jeffwong-nflx> Merge pull request #2 from romainfrancois/3731/parquet-2
7d6e64d <Romain Francois> read_parquet() only reading one parquet file, and gains a `as_tibble` argument
e936b44 <Romain Francois> need parquet on travis too
ff260c5 <Romain Francois> header was too commented, renamed to parquet.cpp
9e1897f <Romain Francois> styling etc ...
456c5d2 <Jeffrey Wong> read parquet files
22d89dd <Jeffrey Wong> hardcode -larrow and -lparquet
@kou kou mentioned this pull request Feb 22, 2019