Commit 2714097

romainfrancois authored and emkornfield committed
ARROW-3760: [R] Support Arrow CSV reader
The main entry point is the `csv_read()` function. All it does is create a `csv::TableReader` with the `csv_table_reader()` generic and then `$Read()` from it.

As in apache#2947 for the feather format, `csv_table_reader` is generic, with the methods:

- arrow::io::InputStream: calls the TableReader constructor with the other options
- character and fs_path: depending on the `mmap` option (TRUE by default), it opens the file with `mmap_open()` or `file_open()` and then calls the other method.

``` r
library(arrow)

tf <- tempfile()
readr::write_csv(iris, tf)

tab1 <- csv_read(tf)
tab1
#> arrow::Table
as_tibble(tab1)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <chr>
#>  1          5.1         3.5          1.4         0.2 setosa
#>  2          4.9         3            1.4         0.2 setosa
#>  3          4.7         3.2          1.3         0.2 setosa
#>  4          4.6         3.1          1.5         0.2 setosa
#>  5          5           3.6          1.4         0.2 setosa
#>  6          5.4         3.9          1.7         0.4 setosa
#>  7          4.6         3.4          1.4         0.3 setosa
#>  8          5           3.4          1.5         0.2 setosa
#>  9          4.4         2.9          1.4         0.2 setosa
#> 10          4.9         3.1          1.5         0.1 setosa
#> # … with 140 more rows
```

<sup>Created on 2018-11-13 by the [reprex package](https://reprex.tidyverse.org) (v0.2.1.9000)</sup>

Author: Romain Francois <romain@purrple.cat>

Closes apache#2949 from romainfrancois/ARROW-3760/csv_reader and squashes the following commits:

951e9f5 <Romain Francois> s/csv_read/read_csv_arrow/
7770ec5 <Romain Francois> not using readr:: at this point
bb13a76 <Romain Francois> rebase
83b5162 <Romain Francois> s/file_open/ReadableFile/
959020c <Romain Francois> No need to special use mmap for file path method
6e74003 <Romain Francois> going through CharacterVector makes sure this is a character vector
2585501 <Romain Francois> line breaks for readability
0ab8397 <Romain Francois> linting
09187e6 <Romain Francois> Expose arrow::csv::TableReader, functions csv_table_reader() + csv_read()
1 parent 0934cf7 commit 2714097
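
The dispatch chain in the commit message (character path, then file path, then input stream, then reader) can be sketched in base R without the arrow package. This is only an illustration of the S3 pattern, not the package's code: `open_reader`, `read_csv_sketch`, and the `fake_path`, `fake_stream`, and `fake_reader` classes are hypothetical stand-ins, and `utils::read.csv()` stands in for the Arrow C++ reader.

```r
# Hypothetical generic mirroring the shape of csv_table_reader().
open_reader <- function(file, ...) UseMethod("open_reader")

# Anything without a dedicated method is rejected.
open_reader.default <- function(file, ...) stop("unsupported")

# A character path is normalised, tagged as a path, and re-dispatched.
open_reader.character <- function(file, ...) {
  path <- structure(normalizePath(file, mustWork = FALSE), class = "fake_path")
  open_reader(path, ...)
}

# A path is wrapped in a (fake) input stream and re-dispatched.
open_reader.fake_path <- function(file, ...) {
  stream <- structure(list(path = unclass(file)), class = "fake_stream")
  open_reader(stream, ...)
}

# The stream method is the only place a reader is actually constructed.
open_reader.fake_stream <- function(file, ...) {
  structure(
    list(read = function() utils::read.csv(file$path, stringsAsFactors = FALSE)),
    class = "fake_reader"
  )
}

# An existing reader is returned untouched.
open_reader.fake_reader <- function(file, ...) file

# The entry point forwards `...` to the generic and reads from the result.
read_csv_sketch <- function(...) open_reader(...)$read()

tf <- tempfile(fileext = ".csv")
utils::write.csv(iris, tf, row.names = FALSE)
df <- read_csv_sketch(tf)
```

Because every method funnels into the next one, the entry point accepts a path, a stream, or an already constructed reader through the same call.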

13 files changed, +488 -0 lines changed

r/DESCRIPTION

Lines changed: 1 addition & 0 deletions
@@ -55,6 +55,7 @@ Collate:
     'array.R'
     'buffer.R'
     'compute.R'
+    'csv.R'
     'dictionary.R'
     'feather.R'
     'io.R'

r/NAMESPACE

Lines changed: 11 additions & 0 deletions
@@ -39,6 +39,11 @@ S3method(buffer,default)
 S3method(buffer,integer)
 S3method(buffer,numeric)
 S3method(buffer,raw)
+S3method(csv_table_reader,"arrow::csv::TableReader")
+S3method(csv_table_reader,"arrow::io::InputStream")
+S3method(csv_table_reader,character)
+S3method(csv_table_reader,default)
+S3method(csv_table_reader,fs_path)
 S3method(length,"arrow::Array")
 S3method(names,"arrow::RecordBatch")
 S3method(print,"arrow-enum")
@@ -92,6 +97,10 @@ export(boolean)
 export(buffer)
 export(cast_options)
 export(chunked_array)
+export(csv_convert_options)
+export(csv_parse_options)
+export(csv_read_options)
+export(csv_table_reader)
 export(date32)
 export(date64)
 export(decimal)
@@ -111,6 +120,7 @@ export(mmap_open)
 export(null)
 export(print.integer64)
 export(read_arrow)
+export(read_csv_arrow)
 export(read_feather)
 export(read_message)
 export(read_record_batch)
@@ -141,6 +151,7 @@ importFrom(glue,glue)
 importFrom(purrr,map)
 importFrom(purrr,map2)
 importFrom(purrr,map_int)
+importFrom(rlang,abort)
 importFrom(rlang,dots_n)
 importFrom(rlang,list2)
 importFrom(rlang,warn)

r/R/RcppExports.R

Lines changed: 20 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

r/R/csv.R

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' @include R6.R
+
+`arrow::csv::TableReader` <- R6Class("arrow::csv::TableReader", inherit = `arrow::Object`,
+  public = list(
+    Read = function() shared_ptr(`arrow::Table`, csv___TableReader__Read(self))
+  )
+)
+
+`arrow::csv::ReadOptions` <- R6Class("arrow::csv::ReadOptions", inherit = `arrow::Object`)
+`arrow::csv::ParseOptions` <- R6Class("arrow::csv::ParseOptions", inherit = `arrow::Object`)
+`arrow::csv::ConvertOptions` <- R6Class("arrow::csv::ConvertOptions", inherit = `arrow::Object`)
+
+#' read options for the csv reader
+#'
+#' @param use_threads Whether to use the global CPU thread pool
+#' @param block_size Block size we request from the IO layer; also determines the size of chunks when use_threads is `TRUE`
+#'
+#' @export
+csv_read_options <- function(use_threads = TRUE, block_size = 1048576L) {
+  shared_ptr(`arrow::csv::ReadOptions`, csv___ReadOptions__initialize(
+    list(
+      use_threads = use_threads,
+      block_size = block_size
+    )
+  ))
+}
+
+#' Parsing options
+#'
+#' @param delimiter Field delimiter
+#' @param quoting Whether quoting is used
+#' @param quote_char Quoting character (if `quoting` is `TRUE`)
+#' @param double_quote Whether a quote inside a value is double-quoted
+#' @param escaping Whether escaping is used
+#' @param escape_char Escaping character (if `escaping` is `TRUE`)
+#' @param newlines_in_values Whether values are allowed to contain CR (`0x0d`) and LF (`0x0a`) characters
+#' @param ignore_empty_lines Whether empty lines are ignored. If false, an empty line represents a single empty value (assuming a one-column CSV file)
+#' @param header_rows Number of header rows to skip (including the first row containing column names)
+#'
+#' @export
+csv_parse_options <- function(
+  delimiter = ",", quoting = TRUE, quote_char = '"',
+  double_quote = TRUE, escaping = FALSE, escape_char = '\\',
+  newlines_in_values = FALSE, ignore_empty_lines = TRUE,
+  header_rows = 1L
+){
+  shared_ptr(`arrow::csv::ParseOptions`, csv___ParseOptions__initialize(
+    list(
+      delimiter = delimiter,
+      quoting = quoting,
+      quote_char = quote_char,
+      double_quote = double_quote,
+      escaping = escaping,
+      escape_char = escape_char,
+      newlines_in_values = newlines_in_values,
+      ignore_empty_lines = ignore_empty_lines,
+      header_rows = header_rows
+    )
+  ))
+}
+
+#' Conversion Options for the csv reader
+#'
+#' @param check_utf8 Whether to check UTF8 validity of string columns
+#'
+#' @export
+csv_convert_options <- function(check_utf8 = TRUE){
+  shared_ptr(`arrow::csv::ConvertOptions`, csv___ConvertOptions__initialize(
+    list(
+      check_utf8 = check_utf8
+    )
+  ))
+}
+
+#' CSV table reader
+#'
+#' @param file file
+#' @param read_options, see [csv_read_options()]
+#' @param parse_options, see [csv_parse_options()]
+#' @param convert_options, see [csv_convert_options()]
+#' @param ... additional parameters.
+#'
+#' @export
+csv_table_reader <- function(file,
+  read_options = csv_read_options(),
+  parse_options = csv_parse_options(),
+  convert_options = csv_convert_options(),
+  ...
+){
+  UseMethod("csv_table_reader")
+}
+
+#' @importFrom rlang abort
+#' @export
+csv_table_reader.default <- function(file,
+  read_options = csv_read_options(),
+  parse_options = csv_parse_options(),
+  convert_options = csv_convert_options(),
+  ...
+) {
+  abort("unsupported")
+}
+
+#' @export
+`csv_table_reader.character` <- function(file,
+  read_options = csv_read_options(),
+  parse_options = csv_parse_options(),
+  convert_options = csv_convert_options(),
+  ...
+){
+  csv_table_reader(fs::path_abs(file),
+    read_options = read_options,
+    parse_options = parse_options,
+    convert_options = convert_options,
+    ...
+  )
+}
+
+#' @export
+`csv_table_reader.fs_path` <- function(file,
+  read_options = csv_read_options(),
+  parse_options = csv_parse_options(),
+  convert_options = csv_convert_options(),
+  ...
+){
+  csv_table_reader(ReadableFile(file),
+    read_options = read_options,
+    parse_options = parse_options,
+    convert_options = convert_options,
+    ...
+  )
+}
+
+#' @export
+`csv_table_reader.arrow::io::InputStream` <- function(file,
+  read_options = csv_read_options(),
+  parse_options = csv_parse_options(),
+  convert_options = csv_convert_options(),
+  ...
+){
+  shared_ptr(`arrow::csv::TableReader`,
+    csv___TableReader__Make(file, read_options, parse_options, convert_options)
+  )
+}
+
+#' @export
+`csv_table_reader.arrow::csv::TableReader` <- function(file,
+  read_options = csv_read_options(),
+  parse_options = csv_parse_options(),
+  convert_options = csv_convert_options(),
+  ...
+){
+  file
+}
+
+#' Read csv file into an arrow::Table
+#'
+#' Use arrow::csv::TableReader from [csv_table_reader()]
+#'
+#' @param ... Used to construct an arrow::csv::TableReader
+#' @export
+read_csv_arrow <- function(...) {
+  csv_table_reader(...)$Read()
+}
+
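
To make the `delimiter`, `quote_char`, and `double_quote` parse options more concrete, here is a toy single-line splitter in base R. It is a sketch of what those options mean, not Arrow's parser: `split_csv_line` is a hypothetical helper, it assumes quoting is always enabled, and it ignores escaping, embedded newlines, and type conversion.

```r
# Illustrative only: splits one CSV line into fields.
split_csv_line <- function(line, delimiter = ",", quote_char = '"',
                           double_quote = TRUE) {
  chars <- strsplit(line, "", fixed = TRUE)[[1]]
  n <- length(chars)
  fields <- character()
  current <- ""
  in_quotes <- FALSE
  i <- 1L
  while (i <= n) {
    ch <- chars[[i]]
    if (in_quotes) {
      if (ch == quote_char) {
        if (double_quote && i < n && chars[[i + 1L]] == quote_char) {
          # A doubled quote inside a quoted value is one literal quote.
          current <- paste0(current, quote_char)
          i <- i + 1L
        } else {
          # Otherwise the quote closes the quoted section.
          in_quotes <- FALSE
        }
      } else {
        # Delimiters inside quotes are ordinary characters.
        current <- paste0(current, ch)
      }
    } else if (ch == quote_char) {
      in_quotes <- TRUE
    } else if (ch == delimiter) {
      # An unquoted delimiter ends the current field.
      fields <- c(fields, current)
      current <- ""
    } else {
      current <- paste0(current, ch)
    }
    i <- i + 1L
  }
  c(fields, current)
}

split_csv_line('a,"b,c","say ""hi"""')
# returns c("a", "b,c", "say \"hi\"")
split_csv_line("1;2;3", delimiter = ";")
# returns c("1", "2", "3")
```

With `double_quote = FALSE`, the same splitter treats every quote character as opening or closing a quoted section, so `'"a""b"'` yields the single field `ab` instead of `a"b`.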

r/man/csv_convert_options.Rd

Lines changed: 14 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

r/man/csv_parse_options.Rd

Lines changed: 33 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

r/man/csv_read_options.Rd

Lines changed: 16 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

r/man/csv_table_reader.Rd

Lines changed: 24 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

r/man/read_csv_arrow.Rd

Lines changed: 14 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
