
Conversation


@wesm wesm commented Mar 17, 2020

I apologize for the complexity of this patch; there was some necessary refactoring, and as I needed to add an IpcReadOptions struct, it made sense to port some APIs from Status to Result.

Summary of what's here

  • Add IpcReadOptions struct
  • Rename IpcOptions to IpcWriteOptions
  • Serialize and write custom_metadata field in Message Flatbuffer struct
  • Move MemoryPool* arguments in IPC functions to options structs
  • Add an EXPERIMENTAL compression option to IpcWriteOptions that causes each buffer in a record batch body to be separately compressed with e.g. ZSTD (see the sketch after this list). This is intended for internal use in Feather V2 and is not exported for general use. Based on the mailing list discussion, if we adopt this in the IPC protocol then it can be promoted to non-experimental
  • Add "included_fields" option to select a subset of fields to load from a record batch when doing an IPC load. The motivation for this was to avoid unnecessary decompression when reading a subset of columns from an IPC stream
  • Migrate most IPC read APIs to use Result, deprecate old methods
  • Deprecate Status-returning IPC write method
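
Roughly, the write-side compression and read-side column selection are meant to be used like the sketch below. This is illustrative only: names such as NewFileWriter, the compression field on IpcWriteOptions, and included_fields follow the API as drafted in this PR and may differ in later releases.

```cpp
// Sketch, not taken from this diff; option/factory names are assumptions from this PR.
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>
#include <arrow/util/compression.h>

arrow::Status RoundTrip(const std::shared_ptr<arrow::RecordBatch>& batch,
                        const std::string& path) {
  // Write side: enable the (experimental) per-buffer body compression.
  auto write_options = arrow::ipc::IpcWriteOptions::Defaults();
  write_options.compression = arrow::Compression::ZSTD;  // assumed field name

  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  ARROW_ASSIGN_OR_RAISE(
      auto writer, arrow::ipc::NewFileWriter(sink.get(), batch->schema(), write_options));
  ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  ARROW_RETURN_NOT_OK(writer->Close());

  // Read side: only load (and decompress) the first two fields.
  auto read_options = arrow::ipc::IpcReadOptions::Defaults();
  read_options.included_fields = {0, 1};  // assumed: indices of fields to load

  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(input, read_options));
  ARROW_ASSIGN_OR_RAISE(auto first_batch, reader->ReadRecordBatch(0));
  return arrow::Status::OK();
}
```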

Some other small changes:

  • Add check_metadata option to RecordBatch::Equals
  • Add Codec::GetCompressionType method for looking up Compression::type given a codec name (see the sketch below)
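
For reference, a minimal sketch of how these two helpers might be called (signatures approximate):

```cpp
#include <string>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/util/compression.h>

// check_metadata=true also requires the schemas' key/value metadata to match.
bool EqualWithMetadata(const arrow::RecordBatch& left, const arrow::RecordBatch& right) {
  return left.Equals(right, /*check_metadata=*/true);
}

// Map a codec name such as "zstd" to its Compression::type enum value.
arrow::Result<arrow::Compression::type> LookupCompression(const std::string& name) {
  return arrow::util::Codec::GetCompressionType(name);
}
```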

I don't have size/perf benchmarks about how the compression helps with file sizes or read performance, so I'll do that next in the course of completing ARROW-5510 ("Feather V2" file format)



wesm commented Mar 17, 2020

I'll fix the GLib issues and other things tomorrow

@pitrou pitrou left a comment

I haven't read a lot of this, so just a few comments.

Member

There are at most 3 or 4 buffers in a Arrow array, and they have very different sizes. I don't think parallelizing at this level makes sense.

Member Author

Right. The parallelization needs to happen at the field-level, which will require some refactoring. I'll open a follow up JIRA since it's out of scope for this PR

@bkietz bkietz left a comment

This looks great! Lots of comments, mostly code style/nits

Comment on lines +44 to 46
Member

nit: avoid using Result::Value unless wrapping a result returning API into a status returning API

Suggested change, from:

    auto status = ipc::RecordBatchFileReader::Open(std::move(input)).Value(&reader);
    if (!status.ok()) {
      return status.WithMessage("Could not open IPC input source '", source.path(),
                                "': ", status.message());
    }

to:

    auto maybe_reader = ipc::RecordBatchFileReader::Open(std::move(input));
    auto status = maybe_reader.status();
    if (!status.ok()) {
      return status.WithMessage("Could not open IPC input source '", source.path(),
                                "': ", status.message());
    }
    return std::move(maybe_reader).ValueOrDie();

Member Author

Why is using Result::Value bad? Not obvious from reading the docstrings

Member

It's not bad, just less preferred due to mixing argout Status return and Result usage
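
For context, the two styles being contrasted look roughly like this (sketch, not part of the diff):

```cpp
#include <memory>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>
#include <arrow/result.h>
#include <arrow/status.h>

using arrow::ipc::RecordBatchFileReader;

// Mixing styles: a Result-returning API unwrapped into the argout Status pattern.
arrow::Status OpenWithValue(std::shared_ptr<arrow::io::RandomAccessFile> input) {
  std::shared_ptr<RecordBatchFileReader> reader;
  ARROW_RETURN_NOT_OK(RecordBatchFileReader::Open(std::move(input)).Value(&reader));
  return arrow::Status::OK();
}

// Preferred in new code: stay in Result form and unwrap with ARROW_ASSIGN_OR_RAISE.
arrow::Status OpenWithResult(std::shared_ptr<arrow::io::RandomAccessFile> input) {
  ARROW_ASSIGN_OR_RAISE(auto reader, RecordBatchFileReader::Open(std::move(input)));
  return arrow::Status::OK();
}
```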

@wesm wesm force-pushed the ARROW-7979 branch 2 times, most recently from 11fa5ad to 2d17d32 on March 19, 2020 at 21:53

wesm commented Mar 19, 2020

I think I've addressed almost all of the comments. I'll try to get all the builds passing (C++/Python/R/GLib all passing for me locally)


wesm commented Mar 20, 2020

Hmm looks like I broke something related to Maps

################# FAILURES #################
FAILED TEST: map C++ producing,  C++ consuming

FAILED TEST: map C++ producing,  Java consuming

FAILED TEST: map C++ producing,  JS consuming

FAILED TEST: map Java producing,  C++ consuming

FAILED TEST: map JS producing,  C++ consuming

FAILED TEST: map C++ producing,  C++ consuming

FAILED TEST: map C++ producing,  Java consuming

FAILED TEST: map Java producing,  C++ consuming

I'll fix that plus the Windows issue tomorrow. If I could get the thumbs up/down on the big picture stuff in the meantime that would be great

@bkietz bkietz left a comment

big picture: +1


wesm commented Mar 20, 2020

@lidavidm some of the Flight integration tests failed with this error, do you know what it could be about?

  File "/arrow/dev/archery/archery/integration/tester_cpp.py", line 91, in flight_server
    raise RuntimeError(
RuntimeError: Flight-C++ server did not start properly, stdout:
/opt/conda/envs/arrow/lib/libarrow.so.100(+0x12faf48)[0x7f9280b1ef48]


stderr:
E0320 17:56:33.746761714   10694 socket_utils_common_posix.cc:222] check for SO_REUSEPORT: {"created":"@1584726993.746751914","description":"SO_REUSEPORT unavailable on compiling system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190}
E0320 17:56:33.746901914   10694 server_chttp2.cc:40]        {"created":"@1584726993.746866914","description":"No address added out of total 1 resolved","file":"../src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":395,"referenced_errors":[{"created":"@1584726993.746865414","description":"Failed to add any wildcard listeners","file":"../src/core/lib/iomgr/tcp_server_posix.cc","file_line":342,"referenced_errors":[{"created":"@1584726993.746848214","description":"Address family not supported by protocol","errno":97,"file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":420,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:33625"},{"created":"@1584726993.746865114","description":"Unable to configure socket","fd":5,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1584726993.746862214","description":"Address already in use","errno":98,"file":"../src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":189,"os_error":"Address already in use","syscall":"bind"}]}]}]}
/arrow/cpp/src/arrow/flight/test_integration_server.cc:165:  Check failed: _s.ok() Operation failed: g_server->Init(options)
Bad status: Unknown error: Server did not start properly

@lidavidm

The relevant bit is "Address already in use", so it seems the Flight tests are not safe to run in parallel, even with the port allocation code in the test runner. We should eventually have the server bind on port 0 and have the integration test runner parse the port out of its output to avoid races like this.
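
A sketch of that eventual fix (assumed here, not implemented in this PR): the server binds to port 0, then prints the port the OS actually assigned for the runner to parse.

```cpp
#include <iostream>
#include <arrow/flight/api.h>
#include <arrow/status.h>

// Bind to an OS-assigned ephemeral port and report it on stdout.
arrow::Status ServeOnEphemeralPort(arrow::flight::FlightServerBase* server) {
  arrow::flight::Location location;
  ARROW_RETURN_NOT_OK(arrow::flight::Location::ForGrpcTcp("0.0.0.0", 0, &location));
  arrow::flight::FlightServerOptions options(location);
  ARROW_RETURN_NOT_OK(server->Init(options));
  // The integration runner would parse this line instead of pre-allocating a port.
  std::cout << "Server listening on localhost:" << server->port() << std::endl;
  return server->Serve();
}
```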


lidavidm commented Mar 20, 2020

There's https://issues.apache.org/jira/browse/ARROW-6528, which is about unit tests; I'll add a related JIRA about fixing this in integration too.

https://issues.apache.org/jira/browse/ARROW-8176


wesm commented Mar 20, 2020

@lidavidm thanks, I'll set the integration test task to run serially for now


wesm commented Mar 20, 2020

The UBSAN failure is a weird one. It's fixed by the Flatbuffers 1.12 upgrade ARROW-8178, so I'll get that merged once the builds run and then rebase this

@kou kou left a comment

I want to review this in a few days.


wesm commented Mar 22, 2020

@nealrichardson any advice about the R test failure? The output log didn't give any clues and all the tests pass for me locally on Linux

@nealrichardson

Hmm. It looks like it just dies when loading the package to run the 32-bit version of the tests, but the 64-bit version runs fine.


wesm commented Mar 23, 2020

Cross-referencing these results. The compression looks highly beneficial in terms of reduction in message size for modest decompression time

#6694 (comment)

I would guess that this would yield significant throughput benefits for Flight users. cc @lidavidm

I'll update the mailing list thread about this

@kou kou left a comment

+1

We can merge this once the R problem is fixed.

Member

Should we add const IpcWriteOptions& options to GrpcStreamWriter::Open() instead?

Member Author

Yes, we should do that. It will have to happen after @lidavidm's refactoring goes in. Opening a JIRA https://issues.apache.org/jira/browse/ARROW-8190

wesm added 7 commits March 23, 2020 20:04
add empty reader_internal.h

Some more temporary cutting away / simplification

Ongoing refactoring

More refactoring and cleaning of Feather V1

More cleaning, simplification

Cleaning

fixes

Get feather.cc compiling again

Slightly more working

Feather tests passing again

Some code reorganization, simplify IPC writing a bit

Start writer_internal.h

Start writer_internal.h

Incomplete refactoring

Externalize enough of IPC writer machinery to be able to customize for Feather

Remove overly elaborate externalization of IPC writer internals, implement compression as part of IpcOptions

Work on serializing custom_metadata in Message

Fix compilation

Test empty metadata case also

iwyu

Some more work on compression

complete more of decompression

Decompression draft

Implement various deprecations, and get compression tests passing

Revert Feather changes

wesm commented Mar 24, 2020

This still fails.

EDIT: What I wrote about STOP_OR_VALUE was not right.

I'm not sure why this is failing. I'm going to rebase, but we probably need to get some debugging help or instructions on how to run this build locally on Windows


wesm commented Mar 24, 2020

I built the project in 32-bit mode with -m32 on Linux and the tests pass fine. Not sure how to proceed with debugging

@nealrichardson

@wesm assuming you have a Windows box with the MinGW stuff:

  1. Go to https://cran.r-project.org/bin/windows/ and install R
  2. Follow the Rtools link there and install Rtools 3.5
  3. Build the C++ library with makepkg-mingw following https://github.com/apache/arrow/blob/master/ci/scripts/r_windows_build.sh
  4. Set the file path of the resulting .zip file of the libs as the env var RWINLIB_LOCAL and then install the R package

See https://github.com/apache/arrow/blob/master/.github/workflows/r.yml#L157-L177


wesm commented Mar 24, 2020

@nealrichardson OK, I'll give it a try tomorrow. We are definitely weak when it comes to Windows issues; very few developers are equipped to debug these kinds of issues locally


wesm commented Mar 24, 2020

@nealrichardson I don't have the "MinGW stuff" (Python uses MSVC to build things). What other machine setup do I have to do beyond installing Rtools?

@nealrichardson

I believe you need to install https://www.msys2.org/. The R Windows GHA uses https://github.com/numworks/setup-msys2 for that; Kou added some scripts ci/scripts/msys* that the Ruby jobs use.


wesm commented Mar 24, 2020

I'm working on this, but I have almost no idea what I'm doing. For example, there are so many different shells (normal Windows shell, PowerShell, MSYS2 shell). How do I get the commands listed at

https://github.com/apache/arrow/blob/master/.github/workflows/cpp.yml#L278

to work? Flying blind


wesm commented Mar 24, 2020

It seems I'm supposed to be using PowerShell. It would be nice at some point to have easy-to-follow source build instructions for R on Windows so the next person who runs into a problem doesn't have to dig through all these CI scripts to figure out what is what


wesm commented Mar 24, 2020

I'm giving up; I can't figure out how to configure my environment to run https://github.com/apache/arrow/blob/master/ci/scripts/r_windows_build.sh after about an hour of trying. I will need help from someone else to find out why the 32-bit R build is broken

@nealrichardson

@wesm I've reproduced the 32-bit Windows R segfault and pushed a skip for the test so that this is unblocked. The code that triggers the segfault is in the IPC-format Datasets code, presumably in the scan operation, because it succeeds in instantiating the dataset. @bkietz maybe you can take a look too and see if you see something suspect there?


nealrichardson commented Mar 24, 2020

Before I forget (high mental load average right now), here's how I reproduced it:

  1. Forget msys2 and all that. I used ursa-labs/arrow-r-nightly to build the C++ library for Windows on this branch and publish the artifact to bintray with an arbitrary version number (0.16.0.666666): https://github.com/ursa-labs/arrow-r-nightly/runs/531644004
  2. In my local Windows 10 VM with R and Rtools installed, (a) set the arbitrary version number in r/DESCRIPTION to match the thing I put on bintray, and (b) remotes::install_local("Z:\\arrow\\r") to install dependencies and then the package itself, compiling the bindings from source.
  3. Since this only crashed on 32-bit R, run that from the Command Prompt: C:\Program Files\R\R-3.6.0\bin\i386\R.exe
  4. Since running R with gdb on Windows sounds terrible, I instead ran the tests with more verbosity, using the "location" test reporter, so that I could see where exactly it died:
     setwd("Z:\\arrow\\r\\tests/")
     library("testthat")
     test_check("arrow", reporter="location")

That gave a bunch of output, until it crashed and I was dumped back to the command prompt:

...
Start test: Simple interface for datasets (custom ParquetFileFormat)
  test-dataset.R#124:1 [success]
End test: Simple interface for datasets (custom ParquetFileFormat)

Start test: Hive partitioning
  test-dataset.R#129:1 [success]
  test-dataset.R#130:1 [success]
End test: Hive partitioning

Start test: Partitioning inference
  test-dataset.R#144:1 [success]
  test-dataset.R#145:1 [success]
  test-dataset.R#158:1 [success]
  test-dataset.R#159:1 [success]
End test: Partitioning inference

Start test: IPC/Arrow format data
  test-dataset.R#172:1 [success]
  test-dataset.R#173:1 [success]

C:\Program Files>

So the error was something happening after L173 in test-dataset.R.


kou commented Mar 25, 2020

"Dev / Source Release and Merge Script" job failure will be fixed by #6711.


wesm commented Mar 25, 2020

Thank you @nealrichardson, that's definitely helpful information.

I looked at file_ipc.cc and didn't see anything especially concerning. I don't know if it's related, but there are various usages of Result::ValueOrDie in the datasets implementation; we could turn on logging to see if one of those "shouldn't fail" conditions is occurring somehow


wesm commented Mar 25, 2020

One change this PR makes is turning RecordBatchFileReader into an abstract class with pure virtual methods. I don't see why that would cause a problem, but...

@nealrichardson

Beats me. I'll leave it to you to decide whether to merge this and create a follow-up for that, or whether you want to hold this to debug further. If it were me, I'd do the former: IPC-format Datasets aren't a critical feature. And I suspect you don't feel like debugging 32-bit Windows any further either ;)


wesm commented Mar 25, 2020

OK, I'll merge this. Is there someone who can help us get a backtrace with gdb so we can try to find out what is happening?
