Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Oct 8, 2019

No description provided.

@pitrou pitrou force-pushed the c-data-interface-impl branch 3 times, most recently from 5a10bd7 to cd0685a Compare October 8, 2019 18:34
@codecov-io
Copy link

codecov-io commented Oct 8, 2019

Codecov Report

Merging #5608 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@          Coverage Diff           @@
##           master   #5608   +/-   ##
======================================
  Coverage    89.6%   89.6%           
======================================
  Files         102     102           
  Lines        6636    6636           
  Branches     1501    1501           
======================================
  Hits         5946    5946           
  Misses        679     679           
  Partials       11      11

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4204428...f6cba66. Read the comment docs.

@pitrou pitrou force-pushed the c-data-interface-impl branch 19 times, most recently from 0b26b42 to fea4ab7 Compare October 15, 2019 20:39
@kszucs
Copy link
Member

kszucs commented Oct 15, 2019

@ursabot build

1 similar comment
@kszucs
Copy link
Member

kszucs commented Oct 15, 2019

@ursabot build

@pitrou pitrou force-pushed the c-data-interface-impl branch 3 times, most recently from bcc65d1 to 5b1856a Compare October 16, 2019 21:10
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xhochy You may be interested in this proof-of-concept.

@pitrou pitrou force-pushed the c-data-interface-impl branch 2 times, most recently from b966705 to 568d2b8 Compare November 26, 2019 17:02
@pitrou pitrou marked this pull request as ready for review November 26, 2019 17:03
@pitrou
Copy link
Member Author

pitrou commented Nov 26, 2019

@wesm This is basically ready now (assuming the spec is voted on and accepted).

@pitrou
Copy link
Member Author

pitrou commented Nov 26, 2019

@pravindra Do you think this can help the JNI layer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you intentionally making acopy of the data?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<std::shared_ptr<Array>> columns() const;
const ArrayVector& columns() const;

Copy link
Contributor

@emkornfield emkornfield Nov 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use Result<>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit ambivalent about the multi-out function (as I said on the discussion threads about Result) but for the single-out functions using Result makes sense.

@emkornfield
Copy link
Contributor

@pitrou I'll try to take a closer look at this, this week (its a lot of code) if no one else gets to it sooner.

@pitrou
Copy link
Member Author

pitrou commented Nov 27, 2019

Note to self: need to add support for extension types.

Copy link
Contributor

@pravindra pravindra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou thanks for pointing this out. I'm looking at this in the context of transferring arrow record batches between java and cpp - both for gandiva and misc readers.

  • The implementation in this PR has the schema (by this, i mean field names and types) and the buffers (address & len) combined in a single structure (struct ArrowArray). So, passing a record-batch from java to cpp will involve a transfer of the schema and the buffers - this would be inefficient in case of gandiva since we could be transferring millions of record batches all with the same schema. The schema itself is a complex structure (recursive structure), and passing it between java/cpp has a non-trivial cost (must encode in some protobuf/flatbuf, or have jni layer access java classes).

  • can we instead have a model where we de-link the two i.e have separate ImportSchema and ImportRecordBatch. This way, the ImportRecordBatch can simply pass along two arrays of longs (one for addresses, one for lengths) and is efficient.

We use the latter model in gandiva : schema is passed in a protobuf and the record-batches are passed around as arrays of longs. This is pretty close the IPC model used in arrow too. Can we please consider this model ?

https://github.com/apache/arrow/blob/master/cpp/src/gandiva/jni/jni_common.cc#L505

@pitrou
Copy link
Member Author

pitrou commented Nov 28, 2019

@pravindra That's an interesting point, thank you.

I should add some benchmarks here to measure the cost of exporting an array or record batch. Which typical number of columns are we talking about?

can we instead have a model where we de-link the two i.e have separate ImportSchema and ImportRecordBatch. This way, the ImportRecordBatch can simply pass along two arrays of longs (one for addresses, one for lengths) and is efficient.

That is a slightly different model. In particular the C data interface reproduces the logical array nesting, and also exposes dictionaries. It makes it more approachable for implementors (no flattening involved), but is less optimized than a simple array of buffer pointers.

@pravindra
Copy link
Contributor

Based on benchmark schemas (tpch or tpc-ds), the median should be in the range of 10-15 columns. but, there could be extremes in production usage.

Sorry, I've very little knowledge of R/Python. but, how does the c structure (ArrowArray) get exchanged back and forth between say, python and cpp ? For java -> cpp, I don't think there is a simple/efficient way to do this. Would it be better to standardize on a protobuf or flatbuffer, for all languages ?

@emkornfield @tianchen92 @BryanCutler - any opinions on this ? We already transfer record batches between java and cpp for gandiva, and I think we'll do more of this with the new readers being added in cpp. It'll be nice if we narrow down to one method of exchanging arrow batches between java and cpp.

@pitrou pitrou force-pushed the c-data-interface-impl branch from 568d2b8 to f6cba66 Compare November 29, 2019 11:00
@pitrou
Copy link
Member Author

pitrou commented Nov 29, 2019

Sorry, I've very little knowledge of R/Python. but, how does the c structure (ArrowArray) get exchanged back and forth between say, python and cpp ?

Python allocates the structure and passes the pointer to C++. The structure itself is owned by Python. I would be surprised if Java didn't allow to do that. Example:
https://github.com/apache/arrow/pull/5608/files#diff-d398398f3b466c6bd33c5ade6df9d572R33

@jacques-n
Copy link
Contributor

jacques-n commented Nov 30, 2019

I think there's people coming at this from different perspectives. Some are focused on multi-tenant pipelined use cases and others on generally single-tenant and fully (or mostly) addressable datasets. The combination of schema and data in the C API doesn't make sense for the former while it does for the latter.

In general, there are no mechanisms in Java to build C structs so I don't see it as desirable to adopt the C struct pattern over using the existing flatbuf based schema and record header patterns.

There is a need for dealing with cross-boundary release but that can be addressed with a pretty simple add-on mechanism as opposed to rewriting everything.

@wesm
Copy link
Member

wesm commented Dec 1, 2019

In general, there are no mechanisms in Java to build C structs so I don't see it as desirable to adopt the C struct pattern over using the existing flatbuf based schema and record header patterns.

FWIw, I believe SWIG can generate a wrapper interface to create and manipulate C structs directly. There may be some other tools

@jacques-n
Copy link
Contributor

We could use swig or similar but I fail to see the benefit of this other form of communication over the original pattern, for Java at least.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read this as carefully as I could. Overall aside from the question about the offset arithmetic with nested arrays (which seemed wrong to me, but I might be misunderstanding it) the unit testing looks thorough / comprehensive. It seems like you could add support for extension types in a follow up PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How interesting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit ambivalent about the multi-out function (as I said on the discussion threads about Result) but for the single-out functions using Result makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary to use the trailing underscores on the members of this struct since it's POD

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a reasonable compromise, and not too complex in this form. You could also call this root_ or something for clarity

@pitrou pitrou force-pushed the c-data-interface-impl branch from f6cba66 to 7574dad Compare December 4, 2019 14:35
r/R/py-to-r.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To support more Arrow structures here, we possibly need to extend the C-Protocol in future a bit more or would the current protocol suffice for implementing py_to_r.pyarrow.lib.Table?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current interface would suffice, as Table would have to be "shredded" into RecordBatches and then each exported over

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility would be to export each chunk of each table column separately to exactly preserve the chunking structure of the table

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shredding seems ok from the API perspective but introduces a lot of unnecessary chunking on most columns. We probably have no consistent way of merging the chunks again on the receiver side.

Exporting each chunk of the columns separately is probably the better thing as it will keep the chunking of the table at the original level but as we don't provide an official API, I would expect similar but slightly varying "proprietary" APIs/implementations to pop up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xhochy That's a slight risk indeed. We don't really know if this specification will be widely used, and in which ways. Another possible avenue is to distinguish schema and data representation if performance turns out to be critical for some applications (as mentioned in the JNI discussion). If those issues pop up in real use, it will be a motivation for a "C data interface v2".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the explicit static cast to the comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but it doesn't work with struct DayMilliseconds, which is why it was changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it'd be helpful to remove the // zeros in favor of that explanation in a comment?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer explicit actual and expected, removes a bit of mental overhead:

Suggested change
void AssertFieldEqual(const Field& lhs, const Field& rhs) {
void AssertFieldEqual(const Field& actual, const Field& expected) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lhs / rhs is what's used above in AssertSchemaEqual.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<std::shared_ptr<Array>> columns() const;
const ArrayVector& columns() const;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of silently dropping the struct's null bitmap it'd be better to emit an error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, maybe define ArrowMarkReleased(struct ArrowArray* array)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of defining custom new/delete operators, could you use arrow::stl::allocator<ExportedArrayPrivateData>::allocate, construct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I don't understand the suggestion. What does it become in C++?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a new unit test in arrow/stl_test.cc, for example:

TEST(allocator, Construction) {
  struct NonDefaultConstructible {
    NonDefaultConstructible(int i, int j) : i_(i), j_(j) {}
    int i_, j_;
  };

  // wrap a memory pool in an allocator
  auto pool = default_memory_pool();
  allocator<NonDefaultConstructible> alloc(pool);

  // allocate and construct an object on pool's heap
  NonDefaultConstructible* data = alloc.allocate(1);
  alloc.construct(data, 999, -999);

  ASSERT_EQ(data->i_, -data->j_);
  ASSERT_EQ(sizeof(uint64_t), pool->bytes_allocated());

  // destruct and free the object from pool's heap
  alloc.destroy(data);
  alloc.deallocate(data, 1);

  ASSERT_EQ(0, pool->bytes_allocated());
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boilerplate could be reduced (I believe clang-format would one-line most of these) by defining ArrayExporter::SetFormat(std::string):

Suggested change
export_.format_ = "tdm";
return SetFormat("tdm");

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't sound terribly better, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd decrease each of these by 3 lines. Concision doesn't always aid readability but I think in this case it would

@pitrou pitrou force-pushed the c-data-interface-impl branch from 2fa67d9 to 639cce5 Compare December 9, 2019 15:22
@emkornfield
Copy link
Contributor

@pitrou can this be closed given that the take 2 one is out?

@pitrou
Copy link
Member Author

pitrou commented Feb 3, 2020

Closing in favour of PR #6026.

@pitrou pitrou closed this Feb 3, 2020
@pitrou pitrou deleted the c-data-interface-impl branch February 3, 2020 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants