Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3731: MVP to read parquet in R library #3230

Closed
wants to merge 11 commits into from
2 changes: 2 additions & 0 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -307,6 +307,8 @@ matrix:
language: r
cache: packages
latex: false
env:
- ARROW_TRAVIS_PARQUET=1
before_install:
# Have to copy-paste this here because of how R's build steps work
- eval `python $TRAVIS_BUILD_DIR/ci/detect-changes.py`
Expand Down
2 changes: 2 additions & 0 deletions r/DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ Version: 0.0.0.9000
Authors@R: c(
person("Romain", "François", email = "romain@rstudio.com", role = c("aut", "cre")),
person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")),
person("Jeffrey", "Wong", email = "jeffreyw@netflix.com", role = c("ctb")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that we need to maintain these bylines

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily.

person("Apache Arrow", email = "dev@arrow.apache.org", role = c("aut", "cph"))
)
Description: R Integration to 'Apache' 'Arrow'.
Expand Down Expand Up @@ -61,6 +62,7 @@ Collate:
'memory_pool.R'
'message.R'
'on_exit.R'
'parquet.R'
'read_record_batch.R'
'read_table.R'
'reexports-bit64.R'
Expand Down
1 change: 1 addition & 0 deletions r/NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,7 @@ export(print.integer64)
export(read_arrow)
export(read_feather)
export(read_message)
export(read_parquet)
export(read_record_batch)
export(read_schema)
export(read_table)
Expand Down
4 changes: 4 additions & 0 deletions r/R/RcppExports.R

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

33 changes: 33 additions & 0 deletions r/R/parquet.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

#' Read parquet file from disk
#'
#' @param file a file path
#' @param as_tibble should the [arrow::Table][arrow__Table] be converted to a tibble.
#' @param ... currently ignored
#'
#' @return a [arrow::Table][arrow__Table], or a data frame if `as_tibble` is `TRUE`.
#'
#' @export
read_parquet <- function(file, as_tibble = TRUE, ...) {
tab <- shared_ptr(`arrow::Table`, read_parquet_file(f))
if (isTRUE(as_tibble)) {
tab <- as_tibble(tab)
}
tab
}
2 changes: 1 addition & 1 deletion r/README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release

# It is important to statically link to boost libraries
cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install
```

Expand Down
61 changes: 16 additions & 45 deletions r/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release

# It is important to statically link to boost libraries
cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install
```

Expand All @@ -38,48 +38,19 @@ tf <- tempfile()
#> # A tibble: 10 x 2
#> x y
#> <int> <dbl>
#> 1 1 -0.255
#> 2 2 -0.162
#> 3 3 -0.614
#> 4 4 -0.322
#> 5 5 0.0693
#> 6 6 -0.920
#> 7 7 -1.08
#> 8 8 0.658
#> 9 9 0.821
#> 10 10 0.539
arrow::write_arrow(tib, tf)

# read it back with pyarrow
pa <- import("pyarrow")
as_tibble(pa$open_file(tf)$read_pandas())
#> # A tibble: 10 x 2
#> x y
#> <int> <dbl>
#> 1 1 -0.255
#> 2 2 -0.162
#> 3 3 -0.614
#> 4 4 -0.322
#> 5 5 0.0693
#> 6 6 -0.920
#> 7 7 -1.08
#> 8 8 0.658
#> 9 9 0.821
#> 10 10 0.539
```

## Development

### Code style

We use Google C++ style in our C++ code. Check for style errors with

```
./lint.sh
```

You can fix the style issues with

#> 1 1 0.0855
#> 2 2 -1.68
#> 3 3 -0.0294
#> 4 4 -0.124
#> 5 5 0.0675
#> 6 6 1.64
#> 7 7 1.54
#> 8 8 -0.0209
#> 9 9 -0.982
#> 10 10 0.349
# arrow::write_arrow(tib, tf)

# # read it back with pyarrow
# pa <- import("pyarrow")
# as_tibble(pa$open_file(tf)$read_pandas())
```
./lint.sh --fix
```
4 changes: 2 additions & 2 deletions r/configure
Original file line number Diff line number Diff line change
Expand Up @@ -26,13 +26,13 @@
# R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'

# Library settings
PKG_CONFIG_NAME="arrow"
PKG_CONFIG_NAME="arrow parquet"
PKG_DEB_NAME="arrow"
PKG_RPM_NAME="arrow"
PKG_CSW_NAME="arrow"
PKG_BREW_NAME="apache-arrow"
PKG_TEST_HEADER="<arrow/api.h>"
PKG_LIBS="-larrow"
PKG_LIBS="-larrow -lparquet"

# Use pkg-config if available
pkg-config --version >/dev/null 2>&1
Expand Down
21 changes: 21 additions & 0 deletions r/man/read_parquet.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

12 changes: 12 additions & 0 deletions r/src/RcppExports.cpp

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

37 changes: 37 additions & 0 deletions r/src/parquet.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>
#include <parquet/exception.h>

// [[Rcpp::export]]
std::shared_ptr<arrow::Table> read_parquet_file(std::string filename) {
std::shared_ptr<arrow::io::ReadableFile> infile;
PARQUET_THROW_NOT_OK(
arrow::io::ReadableFile::Open(filename, arrow::default_memory_pool(), &infile));

std::unique_ptr<parquet::arrow::FileReader> reader;
PARQUET_THROW_NOT_OK(
parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
std::shared_ptr<arrow::Table> table;
PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

return table;
}