Conversation

@wesm wesm commented Mar 23, 2020

This is based on top of ARROW-7979, so I will need to rebase once that is merged.

Excluding the changes from ARROW-7979, this patch is a substantial code reduction in Feather-related code. I removed a lot of cruft from the V1 implementation and made things a lot simpler without altering the user-facing functionality.

To summarize:

  • V2 is exactly the Arrow IPC file format, with the option for the experimental "trivial" body buffer compression implemented in ARROW-7979. read_feather functions distinguish the files based on the magic bytes at the beginning of the file ("FEA1" versus "ARROW1")
  • An ipc::feather::WriteProperties struct has been introduced to allow setting the file version, the chunksize (large tables are broken up into smaller chunks when writing), the compression type, and the compression level (compressor-specific)
  • LZ4 and ZSTD are the only codecs intended to be supported (also in line with mailing list discussion about IPC compression). The default is LZ4 unless -DARROW_WITH_LZ4=OFF in which case it's uncompressed
  • Unit tests in Python now test both versions
  • R tests are only running the V2 version. I'll need some help adding options to set the version as well as the compression type and compression level
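As a quick illustration of the magic-byte dispatch mentioned above, a reader can distinguish V1 from V2 by peeking at the start of the file. This is a hypothetical helper, not a library function:

```python
# Hypothetical helper (not part of pyarrow): dispatch on the magic bytes.
import os
import tempfile

def feather_version(path):
    """Return 1 for Feather V1 ('FEA1') or 2 for V2, i.e. an Arrow IPC file ('ARROW1')."""
    with open(path, "rb") as f:
        head = f.read(6)
    if head.startswith(b"FEA1"):
        return 1
    if head.startswith(b"ARROW1"):
        return 2
    raise ValueError("not a Feather file")

# Demo with hand-written magic bytes; a real file would come from write_feather.
path = os.path.join(tempfile.mkdtemp(), "fake.feather")
with open(path, "wb") as f:
    f.write(b"ARROW1\x00\x00")
print(feather_version(path))  # 2
```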

Since 0.17.0 is likely to be released without formalizing IPC compression, I will plan to support an "ARROW:experimental_compression" metadata member in 0.17.0 Feather files.

Other notes:

  • Column decompression is currently serial. I'll work on making this parallel ASAP as it will impact benchmarks significantly.
  • Compression (both chunk-level and column-level) is also serial. Write performance would be much improved, especially at higher compression levels, by compressing at least at the column level in parallel and writing chunks to disk concurrently. I will open a follow-up JIRA about this

wesm commented Mar 23, 2020

I'll work on some rough compression benchmarks with LZ4 and ZSTD using the datasets in https://ursalabs.org/blog/2019-10-columnar-perf/ to see how things are looking as far as file size and load times.

@nealrichardson

The R build would be fixed by running cd r && make doc (or, not to beat a dead horse, by merging #6411)

wesm commented Mar 23, 2020

OK, I ran some simple benchmarks on my laptop (8-core i9 processor -- note that BOTH compression and decompression are SINGLE-THREADED currently) with the Fannie Mae and NYC Taxi datasets. The cases are:

  • Uncompressed IPC file
  • ZSTD with level 1 and 10
  • LZ4 (no configurable compression level)

Note that these ".feather" files are exactly Arrow IPC files.

See file sizes:

-rw------- 1 wesm wesm  638796850 Mar 23 17:49 2016Q4_lz4_None.feather
-rw-r--r-- 1 wesm wesm 1942965715 Dec  9 21:04 2016Q4.txt
-rw------- 1 wesm wesm 5045771154 Mar 23 17:49 2016Q4_uncompressed_None.feather
-rw------- 1 wesm wesm  395365770 Mar 23 17:49 2016Q4_zstd_10.feather
-rw------- 1 wesm wesm  524043698 Mar 23 17:49 2016Q4_zstd_1.feather
-rw-r--r-- 1 wesm wesm 2728058790 Mar 23 16:46 yellow_tripdata_2010-01.csv
-rw------- 1 wesm wesm 1175453106 Mar 23 17:51 yellow_tripdata_2010-01_lz4_None.feather
-rw------- 1 wesm wesm 2505808570 Mar 23 17:50 yellow_tripdata_2010-01_uncompressed_None.feather
-rw------- 1 wesm wesm  651796626 Mar 23 17:51 yellow_tripdata_2010-01_zstd_10.feather
-rw------- 1 wesm wesm  821963122 Mar 23 17:50 yellow_tripdata_2010-01_zstd_1.feather
  • ZSTD achieves 90+% compression ratio on Fannie Mae and 60-75% on NYC Taxi
  • LZ4 has lower compression ratio, 87% on Fannie Mae and 54% on NYC Taxi
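Those percentages are reductions relative to the uncompressed IPC file; for the Fannie Mae sizes listed above:

```python
# Reduction relative to the uncompressed IPC file, using the Fannie Mae sizes.
uncompressed = 5045771154
zstd10 = 395365770
lz4 = 638796850

zstd10_reduction = 1 - zstd10 / uncompressed   # about 0.92
lz4_reduction = 1 - lz4 / uncompressed         # about 0.87
print(round(zstd10_reduction, 3), round(lz4_reduction, 3))
```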

Here are the load times (note again that decompression is single-threaded):

# Load to Arrow Table

('fanniemae', None, None, 0.016037464141845703)  # MEMORY MAP
('fanniemae', 'zstd', 1, 2.873652696609497)
('fanniemae', 'zstd', 10, 1.4272994995117188)
('fanniemae', 'lz4', None, 0.8573403358459473)
('nyctaxi', None, None, 0.004047393798828125)  # MEMORY MAP
('nyctaxi', 'zstd', 1, 2.758413553237915)
('nyctaxi', 'zstd', 10, 2.0100138187408447)
('nyctaxi', 'lz4', None, 0.8240318298339844)

# Load to Arrow Table, convert to pandas

('fanniemae', None, None, 2.4117162227630615)
('fanniemae', 'zstd', 1, 5.116245985031128)
('fanniemae', 'zstd', 10, 3.9139928817749023)
('fanniemae', 'lz4', None, 3.5294902324676514)
('nyctaxi', None, None, 7.1993725299835205)
('nyctaxi', 'zstd', 1, 10.147839069366455)
('nyctaxi', 'zstd', 10, 8.913217782974243)
('nyctaxi', 'lz4', None, 8.480979204177856)

This looks pretty excellent to me: a huge benefit to users, with less than a 2x slowdown even when conversion to pandas.DataFrame is included.

wesm commented Mar 25, 2020

This is rebased and should be easier to review now.

@jorisvandenbossche left a comment

Some small comments on the python side

wesm commented Mar 27, 2020

I'll work on fixing the CI issues today. @nealrichardson, if you could help me with exposing the version and compression options in R, that's the last thing needed on the R side, I think.

wesm commented Mar 27, 2020

Some refactoring is required in GLib (I forgot that GLib had Feather bindings). I'll try to do it myself.

@nealrichardson Mar 27, 2020

Suggested change:
- For V2 files, the size of chunks to split the data into. None means use
+ For V2 files, the number of rows each chunk in the file should have.
+ Use a smaller chunksize when you need faster random row access.
+ None means use

wesm commented Mar 27, 2020

@kou I just pushed changes to the GLib and Ruby bindings that follow this PR. I removed the FeatherTableWriter class and a number of reader methods that I removed in this patch without deprecation. My reasoning was that users are primarily interacting with these files as a one-shot operation, where they either read the whole file or a subset of columns using names or indices. If you would like to expose the new ipc::feather::WriteProperties I will leave that to you for follow up work.

Feel free to make any changes you feel are appropriate. I did the bare minimum to get the test suite passing.

@wesm wesm requested a review from kou March 27, 2020 22:22
@nealrichardson

What happens if I try to write to Feather V1 data types that aren't supported? Does it tell me that I should use V2?

wesm commented Mar 27, 2020

@nealrichardson it'll fail with an unsupported type error. I can amend the error message to say "use the V2 format, Luke!"

@nealrichardson

FYI

> test_check("arrow")
── 1. Error: feather read/write round trip (@test-feather.R#74)  ───────────────
Invalid: LZ4 doesn't support setting a compression level.
Backtrace:
 1. arrow:::expect_feather_roundtrip(...)
 2. arrow:::write_fun(tib, tf2)
 3. arrow::write_feather(x, f, compression = "lz4", compression_level = 3)
 4. arrow:::ipc___WriteFeather__RecordBatch(...)

Is this expected? So compression_level should only be a valid option if using zstd?

wesm commented Mar 27, 2020

Is this expected? So compression_level should only be a valid option if using zstd?

Right. I will make it so that the LZ4 compression level is simply ignored instead; raising an error there is a nuisance.

@wesm wesm changed the title ARROW-5510: [C++][Python][R] Implement Feather "V2" using Arrow IPC file format ARROW-5510: [C++][Python][R][GLib] Implement Feather "V2" using Arrow IPC file format Mar 27, 2020
wesm commented Mar 29, 2020

Thanks everyone for the assistance. I'll make a couple of refinements per comments above and then merge this

wesm added 5 commits March 29, 2020 16:00
Compiling again

Draft initial implementation of Feather V2, consolidate files

Refactor unit tests to test V1 and V2, not all passing yet

Port Feather Python tests to use pytest fixture for version, get tests passing

Update R bindings, remove methods and writer class removed in C++

Update feather.fbs
wesm commented Mar 29, 2020

@nealrichardson while I'm waiting for the CI to run, if you have a moment to take a look at my R changes:

  • Accept either data.frame, arrow::Table or arrow::RecordBatch in write_feather
  • Use LZ4 as default compression if it's available (this is the Python and C++ library default also)

@nealrichardson

@wesm LGTM. Only note was that you could probably collapse the data.frame/record batch to table conversion to a single check, but then you just did it :)

wesm commented Mar 30, 2020

+1. Thanks all

@wesm wesm closed this in e03251c Mar 30, 2020
@wesm wesm deleted the feather-v2 branch March 30, 2020 00:58
kou pushed a commit that referenced this pull request Jun 26, 2021
…ild against latest Arrow C++ APIs

**Overview**
* The MEX functions ``featherreadmex`` and ``featherwritemex`` fail to build against the latest Arrow C++ APIs. These changes allow them to successfully build.
* These changes require CMake version 3.20 or later in order to access the latest functionality exposed by [FindMatlab.cmake](https://cmake.org/cmake/help/latest/module/FindMatlab.html). We noticed that some Arrow project components, such as [Gandiva](https://arrow.apache.org/docs/developers/cpp/building.html?highlight=gandiva#cmake-version-requirements), require newer versions of CMake than the core Arrow C++ libraries.  If version 3.20 is too new, we're happy to find an alternative.
* We couldn't find a way to read and write a table description for feather V1 files using the latest APIs. It looks like support for reading and writing descriptions was modified in pull request #6694. For now, we've removed support for table descriptions.

**Testing**
* Built ``featherreadmex`` and ``featherwritemex`` on Windows 10 with Visual Studio 2019
* Built ``featherreadmex`` and ``featherwritemex`` on macOS Big Sur (11.2.3) with GNU Make 3.81
* Built ``featherreadmex`` and ``featherwritemex`` on Debian 10 with GNU Make 4.2.1
* Ran all tests in ``tfeather`` and ``tfeathermex`` on all platforms in MATLAB R2021a

**Future Directions**
* We did not detect the build failures due to the lack of CI integration. We hope to add CI support soon and will follow up with a mailing list discussion to talk through the details.
* These changes are temporary to allow us to have a clean slate to start developing the  [MATLAB Interface to Apache Arrow](https://github.com/apache/arrow/blob/master/matlab/doc/matlab_interface_for_apache_arrow_design.md).
* Eventually we would like to support the full ranges of data types for feather V1 and feather V2.
* In order to modernize the code, we plan to migrate to the [C++ MEX](https://www.mathworks.com/help/matlab/cpp-mex-file-applications.html) and [MATLAB Data Array](https://www.mathworks.com/help/matlab/matlab-data-array.html) APIs.
* We are going to follow up with another pull request to update the README.md to provide more detailed platform-specific development instructions.
* The MATLAB based build system inside of the ``build_support`` folder is out of date.  We are not sure if we want to maintain a separate MATLAB based build system along side the CMake based one. We will follow up on this in the future via the mailing list or Jira.

We acknowledge there is a lot of information in this pull request. In the future, we will work in smaller increments. We felt a bigger pull request was necessary to get back to a working state.

Thanks,
Sarah

Closes #10305 from sgilmore10/ARROW_12730

Lead-authored-by: sgilmore <sgilmore@mathworks.com>
Co-authored-by: sgilmore10 <74676073+sgilmore10@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>