-
Couldn't load subscription status.
- Fork 3.9k
C++ implementation of C data interface #5608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5a10bd7 to
cd0685a
Compare
Codecov Report
@@ Coverage Diff @@
## master #5608 +/- ##
======================================
Coverage 89.6% 89.6%
======================================
Files 102 102
Lines 6636 6636
Branches 1501 1501
======================================
Hits 5946 5946
Misses 679 679
Partials 11 11Continue to review full report at Codecov.
|
0b26b42 to
fea4ab7
Compare
|
@ursabot build |
1 similar comment
|
@ursabot build |
bcc65d1 to
5b1856a
Compare
python/pyarrow/tests/test_cffi.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xhochy You may be interested in this proof-of-concept.
b966705 to
568d2b8
Compare
|
@wesm This is basically ready now (assuming the spec is voted on and accepted). |
|
@pravindra Do you think this can help the JNI layer? |
cpp/src/arrow/record_batch.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you intentionally making acopy of the data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| std::vector<std::shared_ptr<Array>> columns() const; | |
| const ArrayVector& columns() const; |
cpp/src/arrow/c/bridge.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use Result<>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit ambivalent about the multi-out function (as I said on the discussion threads about Result) but for the single-out functions using Result makes sense.
|
@pitrou I'll try to take a closer look at this, this week (its a lot of code) if no one else gets to it sooner. |
|
Note to self: need to add support for extension types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou thanks for pointing this out. I'm looking at this in the context of transferring arrow record batches between java and cpp - both for gandiva and misc readers.
-
The implementation in this PR has the schema (by this, i mean field names and types) and the buffers (address & len) combined in a single structure (struct ArrowArray). So, passing a record-batch from java to cpp will involve a transfer of the schema and the buffers - this would be inefficient in case of gandiva since we could be transferring millions of record batches all with the same schema. The schema itself is a complex structure (recursive structure), and passing it between java/cpp has a non-trivial cost (must encode in some protobuf/flatbuf, or have jni layer access java classes).
-
can we instead have a model where we de-link the two i.e have separate ImportSchema and ImportRecordBatch. This way, the ImportRecordBatch can simply pass along two arrays of longs (one for addresses, one for lengths) and is efficient.
We use the latter model in gandiva : schema is passed in a protobuf and the record-batches are passed around as arrays of longs. This is pretty close the IPC model used in arrow too. Can we please consider this model ?
https://github.com/apache/arrow/blob/master/cpp/src/gandiva/jni/jni_common.cc#L505
|
@pravindra That's an interesting point, thank you. I should add some benchmarks here to measure the cost of exporting an array or record batch. Which typical number of columns are we talking about?
That is a slightly different model. In particular the C data interface reproduces the logical array nesting, and also exposes dictionaries. It makes it more approachable for implementors (no flattening involved), but is less optimized than a simple array of buffer pointers. |
|
Based on benchmark schemas (tpch or tpc-ds), the median should be in the range of 10-15 columns. but, there could be extremes in production usage. Sorry, I've very little knowledge of R/Python. but, how does the c structure (ArrowArray) get exchanged back and forth between say, python and cpp ? For java -> cpp, I don't think there is a simple/efficient way to do this. Would it be better to standardize on a protobuf or flatbuffer, for all languages ? @emkornfield @tianchen92 @BryanCutler - any opinions on this ? We already transfer record batches between java and cpp for gandiva, and I think we'll do more of this with the new readers being added in cpp. It'll be nice if we narrow down to one method of exchanging arrow batches between java and cpp. |
568d2b8 to
f6cba66
Compare
Python allocates the structure and passes the pointer to C++. The structure itself is owned by Python. I would be surprised if Java didn't allow to do that. Example: |
|
I think there's people coming at this from different perspectives. Some are focused on multi-tenant pipelined use cases and others on generally single-tenant and fully (or mostly) addressable datasets. The combination of schema and data in the C API doesn't make sense for the former while it does for the latter. In general, there are no mechanisms in Java to build C structs so I don't see it as desirable to adopt the C struct pattern over using the existing flatbuf based schema and record header patterns. There is a need for dealing with cross-boundary release but that can be addressed with a pretty simple add-on mechanism as opposed to rewriting everything. |
FWIw, I believe SWIG can generate a wrapper interface to create and manipulate C structs directly. There may be some other tools |
|
We could use swig or similar but I fail to see the benefit of this other form of communication over the original pattern, for Java at least. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read this as carefully as I could. Overall aside from the question about the offset arithmetic with nested arrays (which seemed wrong to me, but I might be misunderstanding it) the unit testing looks thorough / comprehensive. It seems like you could add support for extension types in a follow up PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How interesting
cpp/src/arrow/c/bridge.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit ambivalent about the multi-out function (as I said on the discussion threads about Result) but for the single-out functions using Result makes sense.
cpp/src/arrow/c/bridge.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not necessary to use the trailing underscores on the members of this struct since it's POD
cpp/src/arrow/c/bridge.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a reasonable compromise, and not too complex in this form. You could also call this root_ or something for clarity
f6cba66 to
7574dad
Compare
r/R/py-to-r.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To support more Arrow structures here, we possibly need to extend the C-Protocol in future a bit more or would the current protocol suffice for implementing py_to_r.pyarrow.lib.Table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current interface would suffice, as Table would have to be "shredded" into RecordBatches and then each exported over
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility would be to export each chunk of each table column separately to exactly preserve the chunking structure of the table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shredding seems ok from the API perspective but introduces a lot of unnecessary chunking on most columns. We probably have no consistent way of merging the chunks again on the receiver side.
Exporting each chunk of the columns separately is probably the better thing as it will keep the chunking of the table at the original level but as we don't provide an official API, I would expect similar but slightly varying "proprietary" APIs/implementations to pop up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xhochy That's a slight risk indeed. We don't really know if this specification will be widely used, and in which ways. Another possible avenue is to distinguish schema and data representation if performance turns out to be critical for some applications (as mentioned in the JNI discussion). If those issues pop up in real use, it will be a motivation for a "C data interface v2".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the explicit static cast to the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but it doesn't work with struct DayMilliseconds, which is why it was changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it'd be helpful to remove the // zeros in favor of that explanation in a comment?
cpp/src/arrow/testing/gtest_util.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer explicit actual and expected, removes a bit of mental overhead:
| void AssertFieldEqual(const Field& lhs, const Field& rhs) { | |
| void AssertFieldEqual(const Field& actual, const Field& expected) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lhs / rhs is what's used above in AssertSchemaEqual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
cpp/src/arrow/record_batch.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| std::vector<std::shared_ptr<Array>> columns() const; | |
| const ArrayVector& columns() const; |
cpp/src/arrow/record_batch.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of silently dropping the struct's null bitmap it'd be better to emit an error
cpp/src/arrow/c/helpers.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, maybe define ArrowMarkReleased(struct ArrowArray* array)?
cpp/src/arrow/c/bridge.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of defining custom new/delete operators, could you use arrow::stl::allocator<ExportedArrayPrivateData>::allocate, construct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I don't understand the suggestion. What does it become in C++?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a new unit test in arrow/stl_test.cc, for example:
TEST(allocator, Construction) {
struct NonDefaultConstructible {
NonDefaultConstructible(int i, int j) : i_(i), j_(j) {}
int i_, j_;
};
// wrap a memory pool in an allocator
auto pool = default_memory_pool();
allocator<NonDefaultConstructible> alloc(pool);
// allocate and construct an object on pool's heap
NonDefaultConstructible* data = alloc.allocate(1);
alloc.construct(data, 999, -999);
ASSERT_EQ(data->i_, -data->j_);
ASSERT_EQ(sizeof(uint64_t), pool->bytes_allocated());
// destruct and free the object from pool's heap
alloc.destroy(data);
alloc.deallocate(data, 1);
ASSERT_EQ(0, pool->bytes_allocated());
}
cpp/src/arrow/c/bridge.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boilerplate could be reduced (I believe clang-format would one-line most of these) by defining ArrayExporter::SetFormat(std::string):
| export_.format_ = "tdm"; | |
| return SetFormat("tdm"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't sound terribly better, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd decrease each of these by 3 lines. Concision doesn't always aid readability but I think in this case it would
2fa67d9 to
639cce5
Compare
|
@pitrou can this be closed given that the take 2 one is out? |
|
Closing in favour of PR #6026. |
No description provided.