C++ implementation of C data interface #5608

pitrou · 2019-10-08T17:06:37Z

No description provided.

codecov-io · 2019-10-08T22:02:34Z

Codecov Report

Merging #5608 into master will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #5608   +/-   ##
======================================
  Coverage    89.6%   89.6%           
======================================
  Files         102     102           
  Lines        6636    6636           
  Branches     1501    1501           
======================================
  Hits         5946    5946           
  Misses        679     679           
  Partials       11      11

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4204428...f6cba66. Read the comment docs.

kszucs · 2019-10-15T21:06:37Z

@ursabot build

kszucs · 2019-10-15T21:27:51Z

@ursabot build

pitrou · 2019-10-17T20:39:27Z

python/pyarrow/tests/test_cffi.py

@xhochy You may be interested in this proof-of-concept.

pitrou · 2019-11-26T17:04:08Z

@wesm This is basically ready now (assuming the spec is voted on and accepted).

pitrou · 2019-11-26T17:24:43Z

@pravindra Do you think this can help the JNI layer?

emkornfield · 2019-11-27T08:18:59Z

cpp/src/arrow/record_batch.h

Are you intentionally making acopy of the data?

Suggested change

std::vector<std::shared_ptr<Array>> columns() const;

const ArrayVector& columns() const;

emkornfield · 2019-11-27T08:23:58Z

cpp/src/arrow/c/bridge.h

nit: use Result<>?

I'm a bit ambivalent about the multi-out function (as I said on the discussion threads about Result) but for the single-out functions using Result makes sense.

emkornfield · 2019-11-27T08:25:00Z

@pitrou I'll try to take a closer look at this, this week (its a lot of code) if no one else gets to it sooner.

pitrou · 2019-11-27T21:54:25Z

Note to self: need to add support for extension types.

pravindra

@pitrou thanks for pointing this out. I'm looking at this in the context of transferring arrow record batches between java and cpp - both for gandiva and misc readers.

The implementation in this PR has the schema (by this, i mean field names and types) and the buffers (address & len) combined in a single structure (struct ArrowArray). So, passing a record-batch from java to cpp will involve a transfer of the schema and the buffers - this would be inefficient in case of gandiva since we could be transferring millions of record batches all with the same schema. The schema itself is a complex structure (recursive structure), and passing it between java/cpp has a non-trivial cost (must encode in some protobuf/flatbuf, or have jni layer access java classes).
can we instead have a model where we de-link the two i.e have separate ImportSchema and ImportRecordBatch. This way, the ImportRecordBatch can simply pass along two arrays of longs (one for addresses, one for lengths) and is efficient.

We use the latter model in gandiva : schema is passed in a protobuf and the record-batches are passed around as arrays of longs. This is pretty close the IPC model used in arrow too. Can we please consider this model ?

https://github.com/apache/arrow/blob/master/cpp/src/gandiva/jni/jni_common.cc#L505

pitrou · 2019-11-28T10:10:35Z

@pravindra That's an interesting point, thank you.

I should add some benchmarks here to measure the cost of exporting an array or record batch. Which typical number of columns are we talking about?

can we instead have a model where we de-link the two i.e have separate ImportSchema and ImportRecordBatch. This way, the ImportRecordBatch can simply pass along two arrays of longs (one for addresses, one for lengths) and is efficient.

That is a slightly different model. In particular the C data interface reproduces the logical array nesting, and also exposes dictionaries. It makes it more approachable for implementors (no flattening involved), but is less optimized than a simple array of buffer pointers.

pravindra · 2019-11-29T07:43:38Z

Based on benchmark schemas (tpch or tpc-ds), the median should be in the range of 10-15 columns. but, there could be extremes in production usage.

Sorry, I've very little knowledge of R/Python. but, how does the c structure (ArrowArray) get exchanged back and forth between say, python and cpp ? For java -> cpp, I don't think there is a simple/efficient way to do this. Would it be better to standardize on a protobuf or flatbuffer, for all languages ?

@emkornfield @tianchen92 @BryanCutler - any opinions on this ? We already transfer record batches between java and cpp for gandiva, and I think we'll do more of this with the new readers being added in cpp. It'll be nice if we narrow down to one method of exchanging arrow batches between java and cpp.

pitrou · 2019-11-29T11:02:25Z

Sorry, I've very little knowledge of R/Python. but, how does the c structure (ArrowArray) get exchanged back and forth between say, python and cpp ?

Python allocates the structure and passes the pointer to C++. The structure itself is owned by Python. I would be surprised if Java didn't allow to do that. Example:
https://github.com/apache/arrow/pull/5608/files#diff-d398398f3b466c6bd33c5ade6df9d572R33

jacques-n · 2019-11-30T22:26:13Z

I think there's people coming at this from different perspectives. Some are focused on multi-tenant pipelined use cases and others on generally single-tenant and fully (or mostly) addressable datasets. The combination of schema and data in the C API doesn't make sense for the former while it does for the latter.

In general, there are no mechanisms in Java to build C structs so I don't see it as desirable to adopt the C struct pattern over using the existing flatbuf based schema and record header patterns.

There is a need for dealing with cross-boundary release but that can be addressed with a pretty simple add-on mechanism as opposed to rewriting everything.

wesm · 2019-12-01T02:51:24Z

In general, there are no mechanisms in Java to build C structs so I don't see it as desirable to adopt the C struct pattern over using the existing flatbuf based schema and record header patterns.

FWIw, I believe SWIG can generate a wrapper interface to create and manipulate C structs directly. There may be some other tools

jacques-n · 2019-12-02T02:53:54Z

We could use swig or similar but I fail to see the benefit of this other form of communication over the original pattern, for Java at least.

wesm

I read this as carefully as I could. Overall aside from the question about the offset arithmetic with nested arrays (which seemed wrong to me, but I might be misunderstanding it) the unit testing looks thorough / comprehensive. It seems like you could add support for extension types in a follow up PR

wesm · 2019-12-03T20:45:17Z

cpp/src/arrow/array/builder_primitive.h

How interesting

wesm · 2019-12-03T21:31:52Z

cpp/src/arrow/c/bridge.h

I'm a bit ambivalent about the multi-out function (as I said on the discussion threads about Result) but for the single-out functions using Result makes sense.

r/src/py-to-r.cpp

wesm · 2019-12-03T22:40:40Z

cpp/src/arrow/c/bridge.cc

It's not necessary to use the trailing underscores on the members of this struct since it's POD

cpp/src/arrow/c/bridge.cc

wesm · 2019-12-04T01:20:16Z

cpp/src/arrow/c/bridge.cc

Seems like a reasonable compromise, and not too complex in this form. You could also call this root_ or something for clarity

cpp/src/arrow/c/bridge.cc

xhochy · 2019-12-04T15:38:33Z

r/R/py-to-r.R

To support more Arrow structures here, we possibly need to extend the C-Protocol in future a bit more or would the current protocol suffice for implementing py_to_r.pyarrow.lib.Table?

I think the current interface would suffice, as Table would have to be "shredded" into RecordBatches and then each exported over

Another possibility would be to export each chunk of each table column separately to exactly preserve the chunking structure of the table

Shredding seems ok from the API perspective but introduces a lot of unnecessary chunking on most columns. We probably have no consistent way of merging the chunks again on the receiver side.

Exporting each chunk of the columns separately is probably the better thing as it will keep the chunking of the table at the original level but as we don't provide an official API, I would expect similar but slightly varying "proprietary" APIs/implementations to pop up.

@xhochy That's a slight risk indeed. We don't really know if this specification will be widely used, and in which ways. Another possible avenue is to distinguish schema and data representation if performance turns out to be critical for some applications (as mentioned in the JNI discussion). If those issues pop up in real use, it will be a motivation for a "C data interface v2".

bkietz · 2019-12-03T19:53:59Z

cpp/src/arrow/array/builder_primitive.h

I prefer the explicit static cast to the comment.

Sure, but it doesn't work with struct DayMilliseconds, which is why it was changed.

Maybe it'd be helpful to remove the // zeros in favor of that explanation in a comment?

bkietz · 2019-12-03T22:05:01Z

cpp/src/arrow/testing/gtest_util.cc

I'd prefer explicit actual and expected, removes a bit of mental overhead:

Suggested change

void AssertFieldEqual(const Field& lhs, const Field& rhs) {

void AssertFieldEqual(const Field& actual, const Field& expected) {

lhs / rhs is what's used above in AssertSchemaEqual.

bkietz · 2019-12-03T22:05:37Z

cpp/src/arrow/util/key_value_metadata.h

bkietz · 2019-12-03T22:06:19Z

cpp/src/arrow/record_batch.h

Suggested change

std::vector<std::shared_ptr<Array>> columns() const;

const ArrayVector& columns() const;

bkietz · 2019-12-03T22:07:07Z

cpp/src/arrow/record_batch.h

I think instead of silently dropping the struct's null bitmap it'd be better to emit an error

bkietz · 2019-12-03T22:10:28Z

cpp/src/arrow/c/helpers.h

For clarity, maybe define ArrowMarkReleased(struct ArrowArray* array)?

bkietz · 2019-12-03T22:15:50Z

cpp/src/arrow/c/bridge.cc

Instead of defining custom new/delete operators, could you use arrow::stl::allocator<ExportedArrayPrivateData>::allocate, construct?

I'm sorry, I don't understand the suggestion. What does it become in C++?

As a new unit test in arrow/stl_test.cc, for example:

TEST(allocator, Construction) { struct NonDefaultConstructible { NonDefaultConstructible(int i, int j) : i_(i), j_(j) {} int i_, j_; }; // wrap a memory pool in an allocator auto pool = default_memory_pool(); allocator<NonDefaultConstructible> alloc(pool); // allocate and construct an object on pool's heap NonDefaultConstructible* data = alloc.allocate(1); alloc.construct(data, 999, -999); ASSERT_EQ(data->i_, -data->j_); ASSERT_EQ(sizeof(uint64_t), pool->bytes_allocated()); // destruct and free the object from pool's heap alloc.destroy(data); alloc.deallocate(data, 1); ASSERT_EQ(0, pool->bytes_allocated()); }

bkietz · 2019-12-05T17:55:33Z

cpp/src/arrow/c/bridge.cc

Boilerplate could be reduced (I believe clang-format would one-line most of these) by defining ArrayExporter::SetFormat(std::string):

Suggested change

export_.format_ = "tdm";

return SetFormat("tdm");

Doesn't sound terribly better, though.

It'd decrease each of these by 3 lines. Concision doesn't always aid readability but I think in this case it would

emkornfield · 2020-02-03T04:05:14Z

@pitrou can this be closed given that the take 2 one is out?

pitrou · 2020-02-03T09:07:00Z

Closing in favour of PR #6026.

pitrou force-pushed the c-data-interface-impl branch 3 times, most recently from 5a10bd7 to cd0685a Compare October 8, 2019 18:34

brills mentioned this pull request Oct 8, 2019

RFC: TFX Standardized Inputs tensorflow/community#162

Merged

pitrou force-pushed the c-data-interface-impl branch 19 times, most recently from 0b26b42 to fea4ab7 Compare October 15, 2019 20:39

pitrou force-pushed the c-data-interface-impl branch 3 times, most recently from bcc65d1 to 5b1856a Compare October 16, 2019 21:10

pitrou commented Oct 17, 2019

View reviewed changes

python/pyarrow/tests/test_cffi.py Outdated

Copy link

Member Author

pitrou Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xhochy You may be interested in this proof-of-concept.

pitrou force-pushed the c-data-interface-impl branch 2 times, most recently from b966705 to 568d2b8 Compare November 26, 2019 17:02

pitrou marked this pull request as ready for review November 26, 2019 17:03

emkornfield reviewed Nov 27, 2019

View reviewed changes

pravindra reviewed Nov 28, 2019

View reviewed changes

pitrou force-pushed the c-data-interface-impl branch from 568d2b8 to f6cba66 Compare November 29, 2019 11:00

wesm reviewed Dec 4, 2019

View reviewed changes

pitrou force-pushed the c-data-interface-impl branch from f6cba66 to 7574dad Compare December 4, 2019 14:35

xhochy reviewed Dec 4, 2019

View reviewed changes

bkietz reviewed Dec 5, 2019

View reviewed changes

pitrou added 4 commits December 9, 2019 16:19

Draft C++ implementation of C data interface

f67b146

Update flag names

e13f1e5

Address review comments

ee70b56

Return Result in one-output Import functions

639cce5

pitrou force-pushed the c-data-interface-impl branch from 2fa67d9 to 639cce5 Compare December 9, 2019 15:22

pitrou closed this Feb 3, 2020

pitrou deleted the c-data-interface-impl branch February 3, 2020 09:07

	std::vector<std::shared_ptr<Array>> columns() const;
	const ArrayVector& columns() const;

	void AssertFieldEqual(const Field& lhs, const Field& rhs) {
	void AssertFieldEqual(const Field& actual, const Field& expected) {

Uh oh!

C++ implementation of C data interface #5608

C++ implementation of C data interface #5608

Uh oh!

Conversation

pitrou commented Oct 8, 2019

Uh oh!

codecov-io commented Oct 8, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

kszucs commented Oct 15, 2019

Uh oh!

kszucs commented Oct 15, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pitrou commented Nov 26, 2019

Uh oh!

pitrou commented Nov 26, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

emkornfield Nov 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

emkornfield commented Nov 27, 2019

Uh oh!

pitrou commented Nov 27, 2019

Uh oh!

pravindra left a comment

Choose a reason for hiding this comment

Uh oh!

pitrou commented Nov 28, 2019

Uh oh!

pravindra commented Nov 29, 2019

Uh oh!

pitrou commented Nov 29, 2019

Uh oh!

jacques-n commented Nov 30, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wesm commented Dec 1, 2019

Uh oh!

jacques-n commented Dec 2, 2019

Uh oh!

wesm left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov-io commented Oct 8, 2019 •

edited

Loading

emkornfield Nov 27, 2019 •

edited

Loading

jacques-n commented Nov 30, 2019 •

edited

Loading