Conversation

@wesm wesm commented Mar 23, 2020

This is based on top of ARROW-7979, so I will need to rebase once that is merged.

Excluding the changes from ARROW-7979, this patch is a substantial code reduction in Feather-related code. I removed a lot of cruft from the V1 implementation and made things a lot simpler without altering the user-facing functionality.

To summarize:

  • V2 is exactly the Arrow IPC file format, with the option for the experimental "trivial" body buffer compression implemented in ARROW-7979. read_feather functions distinguish the files based on the magic bytes at the beginning of the file ("FEA1" versus "ARROW1")
  • An ipc::feather::WriteProperties struct has been introduced to allow setting the file version, the chunksize (large tables are broken up into smaller chunks when writing), the compression type, and the compression level (compressor-specific)
  • LZ4 and ZSTD are the only codecs intended to be supported (also in line with mailing list discussion about IPC compression). The default is LZ4 unless -DARROW_WITH_LZ4=OFF in which case it's uncompressed
  • Unit tests in Python now test both versions
  • R tests are only running the V2 version. I'll need some help adding options to set the version as well as the compression type and compression level
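As a quick illustration of the magic-byte dispatch mentioned above, a reader can distinguish V1 from V2 by peeking at the start of the file. This is a hypothetical helper, not a library function:

```python
# Hypothetical helper (not part of pyarrow): dispatch on the magic bytes.
import os
import tempfile

def feather_version(path):
    """Return 1 for Feather V1 ('FEA1') or 2 for V2, i.e. an Arrow IPC file ('ARROW1')."""
    with open(path, "rb") as f:
        head = f.read(6)
    if head.startswith(b"FEA1"):
        return 1
    if head.startswith(b"ARROW1"):
        return 2
    raise ValueError("not a Feather file")

# Demo with hand-written magic bytes; a real file would come from write_feather.
path = os.path.join(tempfile.mkdtemp(), "fake.feather")
with open(path, "wb") as f:
    f.write(b"ARROW1\x00\x00")
print(feather_version(path))  # 2
```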

Since 0.17.0 is likely to be released without formalizing IPC compression, I will plan to support an "ARROW:experimental_compression" metadata member in 0.17.0 Feather files.

Other notes:

  • Column decompression is currently serial. I'll work on making this parallel ASAP as it will impact benchmarks significantly.
  • Compression (both chunk-level and column-level) is also serial. Write performance would be much improved, especially at higher compression levels, by compressing at least at the column level in parallel and writing chunks to disk concurrently. I will open a follow-up JIRA about this

wesm commented Mar 23, 2020

I'll work on some rough compression benchmarks with LZ4 and ZSTD using the datasets in https://ursalabs.org/blog/2019-10-columnar-perf/ to see how things are looking as far as file size and load times.

@nealrichardson

The R build would be fixed by running cd r && make doc (or, not to beat a dead horse, by merging #6411)

wesm commented Mar 23, 2020

OK, I ran some simple benchmarks on my laptop (8-core i9 processor -- note that BOTH compression and decompression are SINGLE-THREADED currently) with the Fannie Mae and NYC Taxi datasets. The cases are:

  • Uncompressed IPC file
  • ZSTD with level 1 and 10
  • LZ4 (no configurable compression level)

Note that these ".feather" files are exactly Arrow IPC files.

See file sizes:

-rw------- 1 wesm wesm  638796850 Mar 23 17:49 2016Q4_lz4_None.feather
-rw-r--r-- 1 wesm wesm 1942965715 Dec  9 21:04 2016Q4.txt
-rw------- 1 wesm wesm 5045771154 Mar 23 17:49 2016Q4_uncompressed_None.feather
-rw------- 1 wesm wesm  395365770 Mar 23 17:49 2016Q4_zstd_10.feather
-rw------- 1 wesm wesm  524043698 Mar 23 17:49 2016Q4_zstd_1.feather
-rw-r--r-- 1 wesm wesm 2728058790 Mar 23 16:46 yellow_tripdata_2010-01.csv
-rw------- 1 wesm wesm 1175453106 Mar 23 17:51 yellow_tripdata_2010-01_lz4_None.feather
-rw------- 1 wesm wesm 2505808570 Mar 23 17:50 yellow_tripdata_2010-01_uncompressed_None.feather
-rw------- 1 wesm wesm  651796626 Mar 23 17:51 yellow_tripdata_2010-01_zstd_10.feather
-rw------- 1 wesm wesm  821963122 Mar 23 17:50 yellow_tripdata_2010-01_zstd_1.feather
  • ZSTD achieves 90+% compression ratio on Fannie Mae and 60-75% on NYC Taxi
  • LZ4 has lower compression ratio, 87% on Fannie Mae and 54% on NYC Taxi
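Those percentages are reductions relative to the uncompressed IPC file; for the Fannie Mae sizes listed above:

```python
# Reduction relative to the uncompressed IPC file, using the Fannie Mae sizes.
uncompressed = 5045771154
zstd10 = 395365770
lz4 = 638796850

zstd10_reduction = 1 - zstd10 / uncompressed   # about 0.92
lz4_reduction = 1 - lz4 / uncompressed         # about 0.87
print(round(zstd10_reduction, 3), round(lz4_reduction, 3))
```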

Here are the load times (note again that decompression is single-threaded):

# Load to Arrow Table

('fanniemae', None, None, 0.016037464141845703)  # MEMORY MAP
('fanniemae', 'zstd', 1, 2.873652696609497)
('fanniemae', 'zstd', 10, 1.4272994995117188)
('fanniemae', 'lz4', None, 0.8573403358459473)
('nyctaxi', None, None, 0.004047393798828125)  # MEMORY MAP
('nyctaxi', 'zstd', 1, 2.758413553237915)
('nyctaxi', 'zstd', 10, 2.0100138187408447)
('nyctaxi', 'lz4', None, 0.8240318298339844)

# Load to Arrow Table, convert to pandas

('fanniemae', None, None, 2.4117162227630615)
('fanniemae', 'zstd', 1, 5.116245985031128)
('fanniemae', 'zstd', 10, 3.9139928817749023)
('fanniemae', 'lz4', None, 3.5294902324676514)
('nyctaxi', None, None, 7.1993725299835205)
('nyctaxi', 'zstd', 1, 10.147839069366455)
('nyctaxi', 'zstd', 10, 8.913217782974243)
('nyctaxi', 'lz4', None, 8.480979204177856)

This looks pretty excellent to me: a huge benefit to users, with less than a 2x slowdown even when conversion to pandas.DataFrame is included.

wesm commented Mar 25, 2020

This is rebased and should be easier to review now.

@jorisvandenbossche left a comment

Some small comments on the python side

wesm commented Mar 27, 2020

I'll work on fixing the CI issues today. @nealrichardson, if you could help me with exposing the version and compression options in R, that's the last thing needed on the R side, I think.

wesm commented Mar 27, 2020

Some refactoring is required in GLib (I forgot that GLib had Feather bindings). I'll try to do it myself.

@nealrichardson Mar 27, 2020

Suggested change:
- For V2 files, the size of chunks to split the data into. None means use
+ For V2 files, the number of rows each chunk in the file should have.
+ Use a smaller chunksize when you need faster random row access.
+ None means use

wesm commented Mar 27, 2020

@kou I just pushed changes to the GLib and Ruby bindings that follow this PR. I removed the FeatherTableWriter class and a number of reader methods that I removed in this patch without deprecation. My reasoning was that users are primarily interacting with these files as a one-shot operation, where they either read the whole file or a subset of columns using names or indices. If you would like to expose the new ipc::feather::WriteProperties I will leave that to you for follow up work.

Feel free to make any changes you feel are appropriate. I did the bare minimum to get the test suite passing.

@wesm wesm requested a review from kou March 27, 2020 22:22
@nealrichardson

What happens if I try to write to Feather V1 data types that aren't supported? Does it tell me that I should use V2?

wesm commented Mar 27, 2020

@nealrichardson it'll fail with an unsupported type error. I can amend the error message to say "use the V2 format, Luke!"

@nealrichardson

FYI

> test_check("arrow")
── 1. Error: feather read/write round trip (@test-feather.R#74)  ───────────────
Invalid: LZ4 doesn't support setting a compression level.
Backtrace:
 1. arrow:::expect_feather_roundtrip(...)
 2. arrow:::write_fun(tib, tf2)
 3. arrow::write_feather(x, f, compression = "lz4", compression_level = 3)
 4. arrow:::ipc___WriteFeather__RecordBatch(...)

Is this expected? So compression_level should only be a valid option if using zstd?

wesm commented Mar 27, 2020

Is this expected? So compression_level should only be a valid option if using zstd?

Right. I will make it so that the LZ4 compression level is simply ignored instead; raising an error there is a nuisance.

@wesm wesm changed the title ARROW-5510: [C++][Python][R] Implement Feather "V2" using Arrow IPC file format ARROW-5510: [C++][Python][R][GLib] Implement Feather "V2" using Arrow IPC file format Mar 27, 2020
wesm commented Mar 29, 2020

Thanks everyone for the assistance. I'll make a couple of refinements per comments above and then merge this

wesm added 5 commits March 29, 2020 16:00
Compiling again

Draft initial implementation of Feather V2, consolidate files

Refactor unit tests to test V1 and V2, not all passing yet

Port Feather Python tests to use pytest fixture for version, get tests passing

Update R bindings, remove methods and writer class removed in C++

Update feather.fbs
wesm commented Mar 29, 2020

@nealrichardson while I'm waiting for the CI to run, if you have a moment to take a look at my R changes:

  • Accept either data.frame, arrow::Table or arrow::RecordBatch in write_feather
  • Use LZ4 as default compression if it's available (this is the Python and C++ library default also)

@nealrichardson

@wesm LGTM. Only note was that you could probably collapse the data.frame/record batch to table conversion to a single check, but then you just did it :)

wesm commented Mar 30, 2020

+1. Thanks all

@wesm wesm closed this in e03251c Mar 30, 2020
@wesm wesm deleted the feather-v2 branch March 30, 2020 00:58
kou pushed a commit that referenced this pull request Jun 26, 2021
…ild against latest Arrow C++ APIs

**Overview**
* The MEX functions ``featherreadmex`` and ``featherwritemex`` fail to build against the latest Arrow C++ APIs. These changes allow them to successfully build.
* These changes require CMake version 3.20 or later in order to access the latest functionality exposed by [FindMatlab.cmake](https://cmake.org/cmake/help/latest/module/FindMatlab.html). We noticed that some Arrow project components, such as [Gandiva](https://arrow.apache.org/docs/developers/cpp/building.html?highlight=gandiva#cmake-version-requirements), require newer versions of CMake than the core Arrow C++ libraries.  If version 3.20 is too new, we're happy to find an alternative.
* We couldn't find a way to read and write a table description for feather V1 files using the latest APIs. It looks like support for reading and writing descriptions was modified in pull request #6694. For now, we've removed support for table descriptions.

**Testing**
* Built ``featherreadmex`` and ``featherwritemex`` on Windows 10 with Visual Studio 2019
* Built ``featherreadmex`` and ``featherwritemex`` on macOS Big Sur (11.2.3) with GNU Make 3.81
* Built ``featherreadmex`` and ``featherwritemex`` on Debian 10 with GNU Make 4.2.1
* Ran all tests in ``tfeather`` and ``tfeathermex`` on all platforms in MATLAB R2021a

**Future Directions**
* We did not detect the build failures due to the lack of CI integration. We hope to add CI support soon and will follow up with a mailing list discussion to talk through the details.
* These changes are temporary to allow us to have a clean slate to start developing the  [MATLAB Interface to Apache Arrow](https://github.com/apache/arrow/blob/master/matlab/doc/matlab_interface_for_apache_arrow_design.md).
* Eventually we would like to support the full ranges of data types for feather V1 and feather V2.
* In order to modernize the code, we plan to migrate to the [C++ MEX](https://www.mathworks.com/help/matlab/cpp-mex-file-applications.html) and [MATLAB Data Array](https://www.mathworks.com/help/matlab/matlab-data-array.html) APIs.
* We are going to follow up with another pull request to update the README.md to provide more detailed platform-specific development instructions.
* The MATLAB based build system inside of the ``build_support`` folder is out of date.  We are not sure if we want to maintain a separate MATLAB based build system along side the CMake based one. We will follow up on this in the future via the mailing list or Jira.

We acknowledge there is a lot of information in this pull request. In the future, we will work in smaller increments. We felt a bigger pull request was necessary to get back to a working state.

Thanks,
Sarah

Closes #10305 from sgilmore10/ARROW_12730

Lead-authored-by: sgilmore <sgilmore@mathworks.com>
Co-authored-by: sgilmore10 <74676073+sgilmore10@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>