ARROW-5510: [C++][Python][R][GLib] Implement Feather "V2" using Arrow IPC file format #6694
| I'll work on some rough compression benchmarks with LZ4 and ZSTD using the datasets in https://ursalabs.org/blog/2019-10-columnar-perf/ to see how things are looking as far as file size and load times. | 
| The R build would be fixed by running  | 
| OK, I ran some simple benchmarks on my laptop (8-core i9 processor -- note that BOTH compression and decompression are SINGLE-THREADED currently) with the Fannie Mae and NYC Taxi datasets. The cases are: 
 Note that these ".feather" files are exactly Arrow IPC files. See file sizes: 
 Here are the load times (note again that decompression is single-threaded): This looks pretty excellent to me, a huge benefit to users with less than 2x slowdown when considering conversion to  | 
| This is rebased and should be easier to review now | 
Some small comments on the Python side
| I'll work on fixing the CI issues today. @nealrichardson, if you could help me expose the version and compression options in R, I think that is the last thing needed for R. | 
| Some refactoring is required in GLib (I forgot that GLib had Feather bindings). I'll try to do it myself. | 
        
          
python/pyarrow/feather.py (outdated)
          
        
| For V2 files, the size of chunks to split the data into. None means use | |
| For V2 files, the number of rows each chunk in the file should have. | |
| Use a smaller chunksize when you need faster random row access. | |
| None means use | 
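To illustrate the trade-off described in the suggested docstring, here is a minimal sketch in plain Python. It is not the Arrow implementation: `chunk_bounds` and `rows_in_chunk_for` are hypothetical helpers, and the sketch assumes that reaching a single row means reading the whole chunk that contains it.

```python
def chunk_bounds(num_rows, chunksize):
    """Return (start, stop) row ranges for chunks of at most `chunksize` rows."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    return [
        (start, min(start + chunksize, num_rows))
        for start in range(0, num_rows, chunksize)
    ]


def rows_in_chunk_for(row, num_rows, chunksize):
    """Number of rows in the chunk containing `row`.

    Under the assumption stated above, this is how many rows must be
    read to access that one row.
    """
    if not 0 <= row < num_rows:
        raise IndexError("row out of range")
    start = (row // chunksize) * chunksize
    return min(start + chunksize, num_rows) - start
```

With 10 rows and a chunksize of 4, the chunks cover rows 0-3, 4-7 and 8-9. A smaller chunksize means fewer rows are touched per lookup, at the cost of more chunks in the file.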
| @kou I just pushed changes to the GLib and Ruby bindings that follow this PR. I removed the FeatherTableWriter class and a number of reader methods, mirroring the removals made in C++ in this patch, without deprecation. My reasoning was that users primarily interact with these files as a one-shot operation, where they either read the whole file or a subset of columns using names or indices. If you would like to expose the new  Feel free to make any changes you feel are appropriate. I did the bare minimum to get the test suite passing. | 
| What happens if I try to write to Feather V1 data types that aren't supported? Does it tell me that I should use V2? | 
| @nealrichardson it'll fail with an unsupported type error. I can amend the error message to say "use the V2 format, Luke!" | 
| FYI Is this expected? So compression_level should only be a valid option if using zstd? | 
| 
 Right. I will make it so that the LZ4 compression level is ignored anyhow, that is a nuisance. | 
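The behavior agreed on here (a compression level only matters for ZSTD and is ignored for LZ4) could be sketched roughly as below. `effective_compression_level` is a hypothetical helper, not part of the Arrow API, and the codec names `"zstd"`, `"lz4"` and `"uncompressed"` are assumptions made for illustration.

```python
def effective_compression_level(compression, compression_level=None):
    """Hypothetical helper: return the level that would actually be applied.

    Only ZSTD honors a level. For LZ4 the level is silently ignored
    instead of raising, as proposed in this thread.
    """
    if compression == "zstd":
        return compression_level
    if compression in ("lz4", "uncompressed"):
        # Assumption: a level passed alongside these codecs is dropped.
        return None
    raise ValueError(f"unknown compression: {compression!r}")
```

Ignoring the level instead of rejecting it keeps callers from having to special-case LZ4 when they pass the same options for every codec.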
| Thanks everyone for the assistance. I'll make a couple of refinements per comments above and then merge this | 
Compiling again
Draft initial implementation of Feather V2, consolidate files
Refactor unit tests to test V1 and V2, not all passing yet
Port Feather Python tests to use pytest fixture for version, get tests passing
Update R bindings, remove methods and writer class removed in C++
Update feather.fbs
… Feather V1 format
…Z4 as default compression from R if it is available
| @nealrichardson while I'm waiting for the CI to run, if you have a moment to take a look at my R changes: 
 | 
| @wesm LGTM. Only note was that you could probably collapse the data.frame/record batch to table conversion to a single check, but then you just did it :) | 
| +1. Thanks all | 
…ild against latest Arrow C++ APIs

**Overview**

* The MEX functions ``featherreadmex`` and ``featherwritemex`` fail to build against the latest Arrow C++ APIs. These changes allow them to successfully build.
* These changes require CMake version 3.20 or later in order to access the latest functionality exposed by [FindMatlab.cmake](https://cmake.org/cmake/help/latest/module/FindMatlab.html). We noticed that some Arrow project components, such as [Gandiva](https://arrow.apache.org/docs/developers/cpp/building.html?highlight=gandiva#cmake-version-requirements), require newer versions of CMake than the core Arrow C++ libraries. If version 3.20 is too new, we're happy to find an alternative.
* We couldn't find a way to read and write a table description for feather V1 files using the latest APIs. It looks like support for reading and writing descriptions was modified in pull request #6694. For now, we've removed support for table descriptions.

**Testing**

* Built ``featherreadmex`` and ``featherwritemex`` on Windows 10 with Visual Studio 2019
* Built ``featherreadmex`` and ``featherwritemex`` on macOS Big Sur (11.2.3) with GNU Make 3.81
* Built ``featherreadmex`` and ``featherwritemex`` on Debian 10 with GNU Make 4.2.1
* Ran all tests in ``tfeather`` and ``tfeathermex`` on all platforms in MATLAB R2021a

**Future Directions**

* We did not detect the build failures due to the lack of CI integration. We hope to add CI support soon and will follow up with a mailing list discussion to talk through the details.
* These changes are temporary to allow us to have a clean slate to start developing the [MATLAB Interface to Apache Arrow](https://github.com/apache/arrow/blob/master/matlab/doc/matlab_interface_for_apache_arrow_design.md).
* Eventually we would like to support the full range of data types for feather V1 and feather V2.
* In order to modernize the code, we plan to migrate to the [C++ MEX](https://www.mathworks.com/help/matlab/cpp-mex-file-applications.html) and [MATLAB Data Array](https://www.mathworks.com/help/matlab/matlab-data-array.html) APIs.
* We are going to follow up with another pull request to update the README.md to provide more detailed platform-specific development instructions.
* The MATLAB-based build system inside of the ``build_support`` folder is out of date. We are not sure if we want to maintain a separate MATLAB-based build system alongside the CMake-based one. We will follow up on this in the future via the mailing list or Jira.

We acknowledge there is a lot of information in this pull request. In the future, we will work in smaller increments. We felt a bigger pull request was necessary to get back to a working state.

Thanks,
Sarah

Closes #10305 from sgilmore10/ARROW_12730

Lead-authored-by: sgilmore <sgilmore@mathworks.com>
Co-authored-by: sgilmore10 <74676073+sgilmore10@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
This is based on top of ARROW-7979, so I will need to rebase once that is merged.
Excluding the changes from ARROW-7979, this patch is a substantial code reduction in Feather-related code. I removed a lot of cruft from the V1 implementation and made things a lot simpler without altering the user-facing functionality.
To summarize:
* `read_feather` functions distinguish the files based on the magic bytes at the beginning of the file ("FEA1" versus "ARROW1")
* The `ipc::feather::WriteProperties` struct has been introduced to allow setting the file version, as well as chunksize (since large tables are broken up into smaller chunks when writing), compression type, and compression level (compressor-specific)

Since 0.17.0 is likely to be released without formalizing IPC compression, I will plan to support an "ARROW:experimental_compression" metadata member in 0.17.0 Feather files.
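For readers unfamiliar with the magic-byte check mentioned in the summary, a rough sketch in plain Python follows. It is not the code from this patch: `detect_feather_version` is a hypothetical name, and the sketch only inspects the leading bytes ("FEA1" for V1, "ARROW1" for V2 / Arrow IPC files).

```python
import io


def detect_feather_version(source):
    """Hypothetical helper: guess the Feather version from leading magic bytes.

    `source` is any binary file-like object positioned at the start
    of the file.
    """
    magic = source.read(6)
    if magic[:4] == b"FEA1":
        return 1
    if magic == b"ARROW1":
        return 2
    raise ValueError("not a Feather V1 or Arrow IPC file")


# Usage with in-memory stand-ins for real files (the trailing bytes are filler).
v1_like = io.BytesIO(b"FEA1" + b"\x00" * 8)
v2_like = io.BytesIO(b"ARROW1" + b"\x00" * 8)
versions = (detect_feather_version(v1_like), detect_feather_version(v2_like))
```

Dispatching on the leading bytes lets a single reader entry point accept both formats without the caller having to state the version.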
Other notes: