ARROW-7979: [C++] Add experimental buffer compression to IPC write path. Add "field" selection to read path. Migrate some APIs to Result<T>. Read/write Message metadata #6638
Conversation
I'll fix the glib issues and other things tomorrow.
I haven't read a lot of this, so just a few comments.

cpp/src/arrow/ipc/reader.cc (outdated)
There are at most 3 or 4 buffers in an Arrow array, and they have very different sizes. I don't think parallelizing at this level makes sense.
Right. The parallelization needs to happen at the field level, which will require some refactoring. I'll open a follow-up JIRA since it's out of scope for this PR.
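For context, a rough sketch of what field-level (rather than buffer-level) parallelism could look like; `DecompressField` and the surrounding structure are hypothetical, not the actual reader code:

```cpp
#include <future>
#include <memory>
#include <vector>

#include "arrow/array.h"
#include "arrow/status.h"

// Hypothetical helper: decompress the few buffers (validity/offsets/data)
// belonging to a single field. A real implementation would call
// util::Codec::Decompress for each compressed buffer.
arrow::Status DecompressField(arrow::ArrayData* field) { return arrow::Status::OK(); }

// Fan out one task per field instead of per buffer, since each field has at
// most a handful of buffers with very uneven sizes.
arrow::Status DecompressAllFields(
    std::vector<std::shared_ptr<arrow::ArrayData>>* fields) {
  std::vector<std::future<arrow::Status>> tasks;
  tasks.reserve(fields->size());
  for (auto& field : *fields) {
    tasks.push_back(std::async(std::launch::async,
                               [&field] { return DecompressField(field.get()); }));
  }
  for (auto& task : tasks) {
    ARROW_RETURN_NOT_OK(task.get());
  }
  return arrow::Status::OK();
}
```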
This looks great! Lots of comments, mostly code style/nits.

cpp/src/arrow/dataset/file_ipc.cc (outdated)
nit: avoid using Result::Value unless wrapping a Result-returning API into a Status-returning one
```cpp
auto status = ipc::RecordBatchFileReader::Open(std::move(input)).Value(&reader);
if (!status.ok()) {
  return status.WithMessage("Could not open IPC input source '", source.path(),
                            "': ", status.message());
}
```

Suggested:

```cpp
auto maybe_reader = ipc::RecordBatchFileReader::Open(std::move(input));
auto status = maybe_reader.status();
if (!status.ok()) {
  return status.WithMessage("Could not open IPC input source '", source.path(),
                            "': ", status.message());
}
return std::move(maybe_reader).ValueOrDie();
```
Why is using Result::Value bad? It's not obvious from reading the docstrings.
It's not bad, just less preferred, since it mixes the arg-out Status-returning style with Result usage.
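For illustration, a minimal sketch of the two styles, assuming a caller that itself returns Status (the function names here are mine, not Arrow's):

```cpp
#include <memory>
#include <utility>

#include "arrow/io/interfaces.h"
#include "arrow/ipc/reader.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Arg-out style: Result::Value adapts a Result-returning API to the older
// Status-plus-output-parameter convention.
arrow::Status OpenWithValue(std::shared_ptr<arrow::io::RandomAccessFile> input) {
  std::shared_ptr<arrow::ipc::RecordBatchFileReader> reader;
  ARROW_RETURN_NOT_OK(
      arrow::ipc::RecordBatchFileReader::Open(std::move(input)).Value(&reader));
  // ... use reader ...
  return arrow::Status::OK();
}

// Result style: keep the value in a Result and propagate errors with
// ARROW_ASSIGN_OR_RAISE, avoiding the output parameter entirely.
arrow::Status OpenWithResult(std::shared_ptr<arrow::io::RandomAccessFile> input) {
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(std::move(input)));
  // ... use reader ...
  return arrow::Status::OK();
}
```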
Force-pushed from 11fa5ad to 2d17d32.
I think I've addressed almost all of the comments. I'll try to get all the builds passing (C++/Python/R/GLib all passing for me locally).
Hmm, looks like I broke something related to Maps. I'll fix that plus the Windows issue tomorrow. If I could get the thumbs up/down on the big-picture stuff in the meantime, that would be great.
big picture: +1
@lidavidm some of the Flight integration tests failed with this error; do you know what it could be about?
The relevant bit is
There's https://issues.apache.org/jira/browse/ARROW-6528, which is about unit tests; I'll add a related JIRA about fixing this in integration too.
@lidavidm thanks, I'll set the integration test task to run serially for now.
The UBSAN failure is a weird one. It's fixed by the Flatbuffers 1.12 upgrade (ARROW-8178), so I'll get that merged once the builds run and then rebase this.
I want to review this in a few days.
@nealrichardson any advice about the R test failure? The output log didn't give any clues, and all the tests pass for me locally on Linux.
Hmm. It looks like it just dies when loading the package to run the 32-bit version of the tests, but the 64-bit version runs fine.
Cross-referencing these results. The compression looks highly beneficial in terms of reduction in message size for modest decompression time. I would guess that this would yield significant throughput benefits for Flight users. cc @lidavidm, I'll update the mailing list thread about this.
+1
We can merge this once the R problem is fixed.

cpp/src/arrow/flight/client.cc (outdated)
Should we add const IpcWriteOptions& options to GrpcStreamWriter::Open() instead?
Yes, we should do that. It will have to happen after @lidavidm's refactoring goes in. Opening a JIRA: https://issues.apache.org/jira/browse/ARROW-8190
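Purely as a hypothetical sketch of that follow-up (the real GrpcStreamWriter lives in flight/client.cc and its gRPC-specific parameters are omitted here):

```cpp
#include <memory>
#include <utility>

#include "arrow/ipc/options.h"
#include "arrow/status.h"

// Sketch of ARROW-8190: let the Flight stream writer accept IPC write options
// (e.g. compression) instead of always using the defaults. Illustrative only.
class GrpcStreamWriterSketch {
 public:
  static arrow::Status Open(const arrow::ipc::IpcWriteOptions& options,
                            std::unique_ptr<GrpcStreamWriterSketch>* out) {
    *out = std::unique_ptr<GrpcStreamWriterSketch>(new GrpcStreamWriterSketch(options));
    return arrow::Status::OK();
  }

 private:
  explicit GrpcStreamWriterSketch(arrow::ipc::IpcWriteOptions options)
      : options_(std::move(options)) {}
  arrow::ipc::IpcWriteOptions options_;  // consulted when writing record batches
};
```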
Commit messages from this push:
- add empty reader_internal.h
- Some more temporary cutting away / simplification
- Ongoing refactoring
- More refactoring and cleaning of Feather V1
- More cleaning, simplification
- Cleaning fixes
- Get feather.cc compiling again
- Slightly more working
- Feather tests passing again
- Some code reorganization, simplify IPC writing a bit
- Start writer_internal.h
- Start writer_internal.h
- Incomplete refactoring
- Externalize enough of IPC writer machinery to be able to customize for Feather
- Remove overly elaborate externalization of IPC writer internals, implement compression as part of IpcOptions
- Work on serializing custom_metadata in Message
- Fix compilation
- Test empty metadata case also
- iwyu
- Some more work on compression
- complete more of decompression
- Decompression draft
- Implement various deprecations, and get compression tests passing
- Revert Feather changes
Because they are provided in 0.16.0. Keeping them provides backward compatibility.
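For illustration only, the backward-compatibility pattern here is usually a deprecated Status-based shim forwarding to the new Result-based function; `MakeThing` below is a made-up name, not an Arrow API:

```cpp
#include <memory>

#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/macros.h"  // ARROW_DEPRECATED

struct Thing {};

// New Result-returning API.
arrow::Result<std::shared_ptr<Thing>> MakeThing() {
  return std::make_shared<Thing>();
}

// Old 0.16.0-style signature kept as a thin, deprecated shim so existing
// callers keep compiling while they migrate.
ARROW_DEPRECATED("Use the Result-returning MakeThing()")
arrow::Status MakeThing(std::shared_ptr<Thing>* out) {
  return MakeThing().Value(out);
}
```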
This still fails. EDIT: What I wrote about … I'm not sure why this is failing. I'm going to rebase, but we probably need to get some debugging help or instructions on how to run this build locally on Windows.
I built the project in 32-bit mode with
@wesm assuming you have a Windows box with the mingw stuff, see https://github.com/apache/arrow/blob/master/.github/workflows/r.yml#L157-L177
@nealrichardson OK, I'll give it a try tomorrow. We are definitely weak when it comes to Windows issues; very few developers are equipped to debug these kinds of issues locally.
@nealrichardson I don't have the "mingw stuff" (Python uses MSVC to build things); what other machine setup do I have to do outside of installing RTools?
I believe you need to install https://www.msys2.org/. The R Windows GHA uses https://github.com/numworks/setup-msys2 for that; Kou added some scripts
This reverts commit db5a9a8.
I'm working on this, but I have almost no idea what I'm doing. For example, there are so many different shells (normal Windows shell, PowerShell, msys2 shell); how do I get the commands listed at https://github.com/apache/arrow/blob/master/.github/workflows/cpp.yml#L278 to work? Flying blind.
It seems I'm supposed to be using PowerShell. It would be nice at some point to have easy-to-follow source build instructions for R on Windows so the next person who runs into a problem doesn't have to dig through all these CI scripts to figure out what is what.
I'm giving up; I can't figure out how to configure my environment to be able to run https://github.com/apache/arrow/blob/master/ci/scripts/r_windows_build.sh after about an hour of trying. I will need help from someone else to find out why the 32-bit R build is broken.
@wesm I've reproduced the 32-bit Windows R segfault and pushed a skip for the test so that this is unblocked. The code that triggers the segfault is in the IPC-format Datasets code, presumably in the scan operation, because it succeeds in instantiating the dataset. @bkietz maybe you can take a look too and see if you see something suspect there?
Before I forget (high mental load average right now), here's how I reproduced it:

```r
setwd("Z:\\arrow\\r\\tests/")
library("testthat")
test_check("arrow", reporter = "location")
```

That gave a bunch of output, until it crashed and I was dumped back to the command prompt. So the error was something happening after L173 in test-dataset.R.
| "Dev / Source Release and Merge Script" job failure will be fixed by #6711. | 
Thank you @nealrichardson, that's definitely helpful information. I looked at file_ipc.cc and didn't see anything especially concerning. I don't know if it's related, but there are various usages of
One change that this PR makes is making
Beats me. I'll leave it to you to decide whether to merge this and create a follow-up for that, or whether you want to hold this to debug further. If it were me, I'd do the former: IPC-format Datasets aren't a critical feature. And I suspect you don't feel like debugging 32-bit Windows any further either ;)
OK, I'll merge this. Is there someone who can help us get a backtrace with gdb so we can try to find out what is happening?
I apologize for the complexity of this patch; there was some necessary refactoring, and then as I needed to add an `IpcReadOptions` struct, it made sense to port some APIs from Status to Result<T>.

Summary of what's here:
- `MemoryPool*` arguments in IPC functions moved to options structs

Some other small changes:
- `check_metadata` option to `RecordBatch::Equals`
- `Codec::GetCompressionType` method for looking up `Compression::type` given a codec name

I don't have size/perf benchmarks about how the compression helps with file sizes or read performance, so I'll do that next in the course of completing ARROW-5510 ("Feather V2" file format).
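To make the write-path option concrete, here is a hedged usage sketch. It follows the shape of later Arrow releases (`IpcWriteOptions::Defaults()`, a `codec` member, `MakeFileWriter`); the exact names in the version this PR targeted may differ:

```cpp
#include <memory>
#include <string>

#include "arrow/io/file.h"
#include "arrow/ipc/options.h"
#include "arrow/ipc/writer.h"
#include "arrow/record_batch.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/compression.h"

// Write an IPC file whose body buffers are compressed with ZSTD. Assumes the
// post-PR API shape found in later releases; names may differ in this PR.
arrow::Status WriteCompressedIpcFile(const std::shared_ptr<arrow::RecordBatch>& batch,
                                     const std::string& path) {
  auto options = arrow::ipc::IpcWriteOptions::Defaults();
  ARROW_ASSIGN_OR_RAISE(options.codec,
                        arrow::util::Codec::Create(arrow::Compression::ZSTD));

  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeFileWriter(sink, batch->schema(), options));
  ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  return writer->Close();
}
```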