ARROW-7913: [C++][Python][R] C++ implementation of C data protocol #6026

pitrou · 2019-12-12T17:30:53Z

TODO:

add extension array support

github-actions · 2019-12-12T17:45:27Z

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

pitrou · 2019-12-12T18:01:34Z

@wesm @emkornfield

jacques-n

A couple of comments. The biggest question I have is about the use pattern and whether we need some concept of iteration.

One other thing, it feels like there are some other changes in the changeset that aren't related to the c data interface (beyond some prereq's to make it work).

cpp/src/arrow/c/abi.h

jacques-n · 2019-12-13T19:01:16Z

python/pyarrow/cffi.py

is there a way to do this without duplication?

It depends on the FFI layer. Some may be able to parse arbitrary header files (such as cpp/src/arrow/c/abi.h in this PR).

cpp/src/arrow/c/abi.h

jacques-n · 2019-12-13T19:07:15Z

cpp/src/arrow/c/abi.h

I can't remember. Do we have required alignment here? Anything we should note in the header?

Alignment and padding specification is simply part of the host system's C ABI. By making this a C struct (not an explicitly defined binary encoding), we follow the C ABI.

I wasn't talking about the alignment of the information in the C struct, I was talking about the alignment of the buffers themselves. Don't we declare elsewhere that buffers should be 8 or 64 byte aligned (?).

In the columnar format, in-memory alignment is a recommendation but not a requirement, so I think the same would hold here. As my intuition, when referencing "other people's memory" in process it would be onerous to require a memory copy to satisfy a pointer alignment requirement.

The IPC message protocol does enforce alignment and padding at the serialization boundary, however.

should that be a flag then? The memory is guaranteed to be aligned. Do all the libraries handle non-aligned memory?

Having a flag seems like a good idea. I believe all the libraries do handle non-aligned memory

Memory alignment nowadays is more an optimization than a hardware requirement, on common architectures at least. x86 happily accepts unaligned memory accesses, and so does ARM64. Legacy architectures like Sparc may have stricter requirements. See also https://news.ycombinator.com/item?id=14769797

I think a flag may just add boilerplate to the implementation (you have to check all buffers are suitably aligned) without any practical value, because consumers are unlikely to do anything with it (why error out if a buffer is unaligned? your CPU / OS combination is still able to handle it fine).

I think that is true for individual values e.g. int32 but not for wider simd instructions. I also think that this influences the compiler's ability to do certain vectorized optimizations. The flag would be optional so if you are using a library or use case that guarantees this, you'd set it. Otherwise, you wouldn't (even if your buffers could be aligned). I don't think anyone would actually scan buffers in case they are aligned (which is what it seems like you describe).

but not for wider simd instructions

That does not seem to be the case for modern instruction sets and implementations. See x86 and ARM. I may be missing some important data point, though.

In any case, which kind of alignment would this flag enforce? 64-bytes alignment?
(for reference, the C++ implementation probably wouldn't set this flag because, while it makes an effort to allocate aligned buffers, it also accepts slice buffers or buffers allocated by foreign runtimes e.g. CPython)

jacques-n · 2019-12-16T18:39:14Z

cpp/src/arrow/c/abi.h

offset and n_buffers seem out of place here. What is the rationale for including these? n_buffers is known by definition so why add it here? offset seems like an optimization that could solved by shifting the pointers rather than providing an offset.

Note, I understand that offset can allow one to avoid bit-shifting but that seems like something we shouldn't support or tell people to do.

The cpp code supports slices of arrays/record-batches without doing a copy, by just tracking the offset (even for the validity buffers). shifting the pointers doesn't work for validity bitmaps.

Yeah, without the offset member, zero-copy slicing is simply not possible. This would be a serious deficiency

This is consciously not supported at the arrow format level (unless my memory fails me). I don't see why we should support it here. This addition would make this coupled to the cpp implementation (and incompatible with other implementations), which it shouldn't be.

It seems clear that we have different opinions about whether this pattern should be supported. I'm strongly against it for both technical reasons and because it isn't supported by the Arrow specification. I'm sure we came to our conclusions through different experiences. Let me provide a little background on my perspective.

The java codebase actually did have this concept at one point (well before it was part of the Arrow project) and we chose to eliminate it as we ultimately came to the conclusion that it made the wrong tradeoffs between efficient calculations and efficient slicing. What we discovered was that any time you were trying to do zero-copy slicing, doing so on an eight value boundary was a much better design choice than have the additional complexity to deal with bit-shifting on the computation side.

I see excluding this attribute/option as a prudent design choice that kernel developers will benefit from. In general, I expect there to be far more computation algorithms and kernels being written than slicing logic. The number of times that computational code becomes more complex on the processing side with the offset field seems to far outweigh any benefits of byte-splitting on the slicing side.

It would be helpful to better understand why you feel this is so crippling? Which specific algorithms do you see impacted by disallowing byte-splitting? How common are these and what do you think the cost is of not supporting this. How does that compare with your sense of the cost of dealing with this in every computational algorithm?

If this data interface were restricted to the domain of computational kernel evaluation, then I would agree with you, but there is a wider diversity of use cases that the project supports. Note that in Dremio the slice-and-copy pattern is less onerous because record batches are generally limited in size to 64K elements and less, so the calculus is different when record batches can be arbitrarily large in general.

So, first we must consider it already as a given that most implementations are going to provide zero-copy slicing as a feature for in-memory data. The question then is how to handle slices with offset that is not a multiple of 8. There are 3 options:

Disallow such slicing at the C interface boundary, forcing copy-materialization for all use cases

Allow slicing, but regularize sliced bitmaps (and the data buffer for sliced BooleanArray) for all computational kernels, i.e. assume an offset of 0 for all kernels

Allow slicing and implement kernels for non-zero offsets

The 2nd option seems like the most practical -- the interface is always zero copy and kernel developers don't bother trying to optimize for non-zero offsets. It does mean that some functions that are unable to act on sliced data must check whether the offset is 0 and if not, convert the input to be offset-0 based (allocating a new bitmap and copying over the bits if necessary)

I'm not going to block up the work over this but it would feel like a shame to me to have a C data interface which is not zero copy all of the time.

I don't think this would be tenable in C++

Why not? Where is the critical need for allowing non-regularized slicing? Requiring that anything that is slicing the data does so div-8 boundary seems like a pretty low-burden constraint. Are you seeing lots of algorithms where this is high burden?

I'm not arguing here for copy materialization, I'm saying we should make non-div-8 slices invalid and have the ability for zero copy slices and computational simplicity.

Requiring that anything that is slicing the data does so div-8 boundary seems like a pretty low-burden constraint

I think the use-case is a user chooses an arbitrary slice. E.g. I have datatable, and I choose to slice rows 1 to N. the library shouldn't throw an exception here, so it is up to implementors to either do the copy or keep the slice index. But perhaps there is a third option?

Where is the critical need for allowing non-regularized slicing?

As background, I started my career in financial analytics where the world is more oriented around time series data where slicing and analytics on large memory-mapped data sets is the general expectation. Think application of rolling time series windows functions. Consider as an example such system oriented around this kind of data https://en.wikipedia.org/wiki/Kdb%2B

This kind of analytics is very different from batch processing SQL based workloads. In particular, with time series data it is not possible to constrain slices to multiples of 8.

Apache Arrow already has a lot of users from the financial analytics world and many of them are going to be using C++. If you start requiring memory copying every time you request an arbitrary window of data (big or small, a window could be 1KB or 1GB) from a large time series you cripple this use case.

jacques-n · 2019-12-16T21:58:36Z

cpp/src/arrow/c/abi.h

I noticed elsewhere that this is declared optional. Why would we make name optional? Isn't that an anti-pattern?

In C++ land, a top-level Arrow type doesn't have a name. There's a distinction between "field" and "type": a field is a type + a name; but a Arrow array only has a type, not a field.

Maybe state that in the definition, that only a top-level name should generally be empty. I don't think all libraries can handle having null names at non schema roots. The language here suggests that it doesn't really matter when most of the time, it should be defined.

Since a null name can easily be interpreted as an empty name, is it a problem (e.g. in Java) if child nodes have empty names?

jacques-n · 2019-12-16T23:59:14Z

cpp/src/arrow/c/abi.h

I don't understand your comment...

ATM, one needs to know schema information to consume or copy buffers. At a minimum I need both how many buffers and how long each buffer is. Right now, the definition includes one but not the other. Why not exclude both (or include both)?

wesm · 2019-12-23T18:53:42Z

To restate a point regarding the offset member at the top level of this discussion: this is not a C++-specific implementation detail. offset is part of the columnar data structure in 5 out of 6 Arrow implementations (since it is required to allow for slicing), Java being the only one that does not provide it.

jacques-n · 2019-12-26T18:34:07Z

To restate a point regarding the offset member at the top level of this discussion: this is not a C++-specific implementation detail. offset is part of the columnar data structure in 5 out of 6 Arrow implementations (since it is required to allow for slicing), Java being the only one that does not provide it.

I replied on my thoughts to this in the inline discussion.

A couple of key points:

It is important to note that this is not part of the formal specification. It's unfortunate that implementations started using this as it is at odds of the project goal of avoiding zero-serialization on the IPC boundary. I'd argue that the individual implementations should be focused on removing this.
In general, it is overly general to state that this disallows zero-copy slicing. One can absolutely still do zero-copy slicing without the offset field. The only requirement is that slicing happen on a byte boundary (which equates to happening at a multiple of 8 logical values). This seems like a pretty low requirement for most situations where one might need to do slicing that greatly simplifies and improves performance of a large number of processing algorithms.

pitrou · 2019-12-26T18:55:11Z

It is worth pointing out that in the current spec proposal, support for a non-zero offset is optional:

Producers MAY specify that they will only produce 0-offset arrays to
ease implementation of consumer code.
Consumers MAY decide not to support non-0-offset arrays, but they
should document this limitation.

Perhaps another wording would be better?

jacques-n · 2020-01-02T19:19:20Z

It is worth pointing out that in the current spec proposal, support for a non-zero offset is optional:

Producers MAY specify that they will only produce 0-offset arrays to
ease implementation of consumer code.
Consumers MAY decide not to support non-0-offset arrays, but they
should document this limitation.

Perhaps another wording would be better?

To clarify, I'm awaiting further feedback from @wesm per my questions above. My preference is to remove the field entirely as I think it starts to create a shadow standard that reaches beyond what the core standard supports.

wesm · 2020-01-07T18:15:49Z

It's unfortunate that implementations started using this as it is at odds of the project goal of avoiding zero-serialization on the IPC boundary. I'd argue that the individual implementations should be focused on removing this.

I don't think this would be tenable in C++ at least. Having zero-copy slicing does not increase the effective amount of computational work or memory allocation involved. You are choosing between doing the bitmap allocate-and-copy eagerly (when invoking arr->Slice) or lazily (at the IPC boundary, what we currently do). There are plenty of cases where the bitmap allocate-and-copy is never needed, and so to force it eagerly and unconditionally seems wasteful.

There is a separate question which is whether kernel developers should care about computational function evaluation against sliced data with offset that is not a multiple of 8. As I said above, I think it would be fine if libraries established the policy of forcing regularization of sliced bitpacked data (validity bitmaps or Boolean data) prior to function evaluation.

wesm · 2020-01-07T18:37:41Z

To summarize, I am OK with removing the offset member but I don't think that leaving it does much harm.

jacques-n · 2020-01-09T23:52:01Z

To summarize, I am OK with removing the offset member but I don't think that leaving it does much harm.

As I've said elsewhere, I'm very much against this.

jacques-n · 2020-01-09T23:57:00Z

Did you guys have a discussion of whether it makes sense to declare these as a packed representation? I've been thinking about reasonable approaches with Java (which is cross platform) and stating that this is packed seems at first glance like it would be cleaner. That being said, it seems like packing declaration still isn't well defined cross platform (at least upon first glance).

emkornfield · 2020-01-10T07:01:27Z

Did you guys have a discussion of whether it makes sense to declare these as a packed representation? I've been thinking about reasonable approaches with Java (which is cross platform) and stating that this is packed seems at first glance like it would be cleaner. That being said, it seems like packing declaration still isn't well defined cross platform (at least upon first glance).

Can you elaborate on the benefit you think packing provides? I'm not sure why it would be cleaner?

jacques-n · 2020-01-10T21:06:26Z

Did you guys have a discussion of whether it makes sense to declare these as a packed representation? I've been thinking about reasonable approaches with Java (which is cross platform) and stating that this is packed seems at first glance like it would be cleaner. That being said, it seems like packing declaration still isn't well defined cross platform (at least upon first glance).

Can you elaborate on the benefit you think packing provides? I'm not sure why it would be cleaner?

In a JVM, you don't want to construct a struct in C code. You want to allocate a direct byte buffer and then fill in the information. If you declare things as packed, you have to worry less about the changes in different platforms to generate this chunk of memory. It still doesn't solve endianess but since we already disallow big endian in the java library that isn't a major concern. (Note that this is all ignoring the release invocation, which I've thought less about.)

wesm · 2020-01-13T16:40:29Z

As I've said elsewhere, I'm very much against this.

As I commented above, the slicing use case is mostly motivated by non-SQL workloads (e.g. time series analytics). We can remove the offset and take the "wait and see" approach but my prediction will be that it will show up as an issue possibly necessitating an interface v2 in the future. I can live with that

wesm · 2020-01-14T16:30:22Z

As a way to break the impasse I suggest the following:

Remove offset member from specification
ExportArray function should materialize sliced bitpacked data. This means this function needs to be able to fail (since it can allocate memory / is not guaranteed to be zero copy)

I plan to start a discussion on the mailing list to see if anyone wants to share information about their time series / non-SQL workloads. I represent some of these by proxy (since Ursa Labs is sponsored partly by some large financial institutions) but it would be good if others would share their use cases directly.

pitrou · 2020-01-15T11:08:55Z

Any other comments? @xhochy @emkornfield @lidavidm @kou

lidavidm · 2020-01-16T03:04:05Z

I am ok with this with or without the slice field. I do think it would be useful, but having the ability to share data between implementations is already an improvement. There is also the possibility that an application which really needs slicing will simply communicate the offsets out-of-band with the data.

Might it be worth reserving space for the offset field?

I don't have much to say about workloads offhand. This functionality hasn't been something that we had thought about or wished for until quite recently; the workload I'm investigating would take advantage of slicing if present but wouldn't be blocked if it isn't there in a first iteration.

emkornfield · 2020-01-17T06:35:26Z

Might it be worth reserving space for the offset field?
A separate API might be cleaner if we need to add this later. So we don't need to force clients to validate the reserved space is equal to zero (otherwise the semantics of the batch would break if we decided to start using that field). I don't have a strong opinion here one way or another.

In regards to general review, please don't wait for me, I'm not sure when I will get to it.

CCing @brills in case he has any thoughts on the proposal from a TFX perspective.

brills · 2020-01-18T00:08:40Z

Being able to pass sliced Arrays through this API is important to our (TFX) use cases, and we hope that this API is light-weight enough so that our own utility functions that takes / outputs Arrow entities can be isolated from pyarrow without worrying about overhead.

We rely on Arrow's zero-copy slicing heavily. We store machine learning training data in Arrow RecordBatches and one common operation is to slice a larger batch into smaller ones, or take slices that meet certain criteria (e.g. "filter by feature value") and these operations may potentially result in "unaligned" slices (may apply to both null bitmaps and the values array).

It looks like that the Arrow C++ implementation is a superset of the Arrow format spec. Could this C-interface spec define an "extension" so that C++ (and others) can implement while Java can ignore (or error out)?

jacques-n · 2020-01-19T23:21:44Z

It looks like that the Arrow C++ implementation is a superset of the Arrow format spec. Could this C-interface spec define an "extension" so that C++ (and others) can implement while Java can ignore (or error out)?

This is exactly what I think we should avoid. It fragments the standard.

I plan to start a discussion on the mailing list to see if anyone wants to share information about their time series / non-SQL workloads. I represent some of these by proxy (since Ursa Labs is sponsored partly by some large financial institutions) but it would be good if others would share their use cases directly.

I don't think the issue here is related to workload type. By introducing offset you're degrading the processing side of Arrow to try to enhance the transport one. From experience, I think that is a bad idea. It goes in contradiction of the goal to support both equally without serialization/deserialization.

I've yet to see or hear about an algorithm that couldn't make a choice to do 8-bit-aligned slices when producing data or splitting it [comments on tfx examples below]. I understand that people may have built stuff that depends on unaligned slicing. We've done it in the past too. That isn't the same as saying that the algorithm couldn't or shouldn't accommodate this constraint.

For clarity for users, I'd be inclined to introduce two interfaces: one that is guaranteed succesful and will never allocate e.g. slice8() and one that may allocate/copy e.g. slice(). People can pick which pattern they want and understand that there is benefit to not building things that slice arbitrarily.

slice a larger batch into smaller ones, or take slices that meet certain criteria (e.g. "filter by feature value")

In the first case, I don't see why you'd be unable to slice on 8-value offsets. In the latter case, some part of your algorithm is packing the values in one category at a time (or else they wouldn't be colocated). In that case, I'd suggest you either start new batches for each category as you're packing.

dhirschfeld · 2020-01-19T23:52:45Z

understand that there is benefit to not building things that slice arbitrarily.

I'm an arrow user coming from the quant-finance area where we operate on large timeseries datasets. The performance of the algorithms we implement are dependant on not copying multiple GB of data. The algorithms fundamentally work on arbitrarily-sized time windows so we can't just take a multiple of 8 without implementing our own offset logic on top of each slice.

@wesm has pretty accurately voiced my concerns. I'm just piping up to say they're not just theoretical.

jacques-n · 2020-01-20T00:56:36Z

The most important point here that I think people are glossing over is: the ability to arbitrarily slice data was not a feature of the specification. Undermining the specification with a new alternative specification fragments the community and the project.

I'm an arrow user coming from the quant-finance area where we operate on large timeseries datasets. The performance of the algorithms we implement are dependant on not copying multiple GB of data. The algorithms fundamentally work on arbitrarily-sized time windows so we can't just take a multiple of 8 without implementing our own offset logic on top of each slice.

@dhirschfeld, thanks for sharing your thoughts and experiences. It's very much appreciated.

There are a couple of things I would say to what I believe you're implying about your usecase (apologize if I don't capture these correctly). It sounds like:

You generate very large contiguous arrays/vectors of data.
The data is pre-ordered by time
You frequently slice to get certain segments of time

I would suggest that (1) is a Bad Idea. (I believe Wes and I disagree on this.) In general, I haven't seen benchmarks that argue for very large chunks of data to be contiguously managed. Once you get beyond a certain size, the cost of indirection on occasion is irrelevant. On the flipside, there are a large number of really good reason to keep batches of data smaller including pipelining and memory fragmentation. If (1) is avoided, the frequency of problematic boundary slicing goes down and you're talking about KBs to copy, not GBs in the unfortunate case you can't work around it. (As an aside, I feel guilty that we didn't enforce this in the specification. It may have made it initially slightly harder for people to adopt arrow but would have ultimately been a favor to everyone--full disclosure, I believe Wes and I disagree on this as well :D )

If (2) is true and the common resolution of (3) is known, I would suggest that you should be batching and slicing your data based on this coarseness already. I understand that you may not do that today and it would be hard work to change. Even so, I don't think that is a good enough reason to change the format.

Furthermore on (3), I rarely see scenarios where people don't further have to apply a predicate mask on data workloads beyond how they are initially organized. If selection predicates are being applied, applying them initially to mask boundary values seems like a simple and consistent solution.

I agree with everyone here that supporting arbitrary offsets makes many problems simpler but I see it is a local maximum, not a global one.

wesm · 2020-01-20T03:48:34Z

@jacques-n I think there is some conflation of the question of data chunking (and large contiguous allocations) with the issue of slicing and 8-record multiples. (That being said, on the subject of large allocations, I maintain my position that maintaining a 32-bit length constraint on arrays/record batches would have closed off certain important user communities from ever adopting Arrow -- many of my sources of funding for Arrow development are coming from parties that rely on this, for example.)

What we are saying about the time series / financial analytics use cases is that requiring 8-record multiples while also ensuring zero copy is not possible, regardless of the chunking or size of the dataset. So when you say users should be "batching and slicing your data based on this coarseness", this is not possible in general -- data slices will start on arbitrary offsets.

Time series data is collected and processed using window-type algorithms (think grouping tick data into 5 minute buckets) -- algorithms determine the window edges then process the data falling inside the window. So what will happen here is that auxiliary data structures will have to be created to provide a zero-copy view onto unsliceable Arrow data, or otherwise a copy must be materialized if you want to deal purely with Arrow constructs.

Note also that analytics will commonly change the data binning rules on the fly (e.g. from 1min to 5min to 30min, 1 hour, 24 hour, etc.), so any predetermined data batch sizes may work for one analysis but not for another.

jacques-n · 2020-01-20T06:07:47Z

@wesm, we seem to continue to be talking past each other again. I greatly respect your opinion and think we'll have to agree to disagree.

I rarely see conversation about many of the things that inform my opinions around this such as fragmentation, pipelining, null evaluation and runtime code generation. I appreciate that others may worry less about those things (or maybe worry about them in different ways) but I continue to see arbitrary slicing as a local maximum that is detrimental to those and many other things.

Given that, I'm still -1 against this merging when it has the offset/arbitrary slices. I see offset as a mistake and in conflict with the Arrow spec. Like any thing that causes serialization/deserialization overhead at the ipc boundary, I think we should try to avoid it lest we weaken the original promises of the project.

pitrou · 2020-01-20T13:05:04Z

I see offset as a mistake and in conflict with the Arrow spec. Like any thing that causes serialization/deserialization overhead at the ipc boundary, I think we should try to avoid it lest we weaken the original promises of the project.

Offset already exists in C++ land and in several other implementations as well, and it won't disappear from them. It already causes overhead at the IPC boundary - this PR doesn't change anything to that.

If selection predicates are being applied, applying them initially to mask boundary values seems like a simple and consistent solution.

What you're saying is that you can already emulate offset behaviour, at little cost - so there seems to be no practical reason for you to oppose it, only an ideological one.

jacques-n · 2020-01-20T14:57:46Z

There are multiple practical reasons to keep interfaces consistent with the spec and keep spec versus implementation concerns separate.

I'm sorry you don't like my perspective but it's based entirely on practical concerns. Having written a complete distributed sql engine on top of Arrow gives one a very practical perspective.

wesm · 2020-01-20T17:15:28Z

It's important to note that the use cases from time series and finance that I'm talking about are notoriously not solved by SQL-based data processing systems at all. This is a common frustration for the financial industry which had led to the development of many proprietary and commercial analytical systems tailored for these needs. At least on the C++ library side, I feel obligated to look after these interests which have not been historically served by the SQL community.

If some of our intended users (e.g. TFX) are not going to be able to use this interface without slicing, I am wondering if we should reframe the problem statement to that of creating a minimalist C interface for third party projects to interoperate with the Arrow C++ library only. Then this becomes an endogenous concern of the C++ project

jacques-n · 2020-01-20T17:41:01Z

It's important to note that the use cases from time series and finance that I'm talking about are notoriously not solved by SQL-based data processing systems at all. This is a common frustration for the financial industry which had led to the development of many proprietary and commercial analytical systems tailored for these needs. At least on the C++ library side, I feel obligated to look after these interests which have not been historically served by the SQL community.

It's important to note that my comments were in response to the suggestion that my perspective wasn't come from a practical place, not to dismiss the realities of different use cases. However, we should try to proceed spec-first as we introduce new use cases to avoid situations where we have walled gardens.

I am wondering if we should reframe the problem statement to that of creating a minimalist C interface for third party projects to interoperate with the Arrow C++ library only. Then this becomes an endogenous concern of the C++ project

I think that we've come to the point where that is a false reframe. You may recall that my initial reservations about this work drove us to frame the problem in that way. I stepped away from the issue. However, during development and early socialization of this work it became clear that separating a C ABI from the core concept of "Arrow compatible" is very difficult, if not impossible. This led us down the path of separating schema from data and the current impasse. My only suggestion at this point is to develop the code in an independent repository outside the auspices of the Arrow project.

wesm · 2020-01-20T17:44:39Z

My only suggestion at this point is to develop the code in an independent repository outside the auspices of the Arrow project.

That seems a bit extreme. I think as an "API" for users of the C++ library, our goal is to provide a minimalistic way for people to create the ArrayData data structure without taking on as much code dependency. As purely a detail of the C++ project with no aspirations beyond that, I do not think this does harm.

jacques-n · 2020-01-21T21:58:10Z

After further reflection, I'm in support of merging this with the inclusion of the offset field. While it may not be desirable for some reasons, it seems like it is better to do that and have this be a generalized interface for adoption than compartmentalize things between sub-codebases.

+1

pitrou · 2020-01-23T16:58:50Z

Rebased.

pitrou · 2020-02-24T16:36:17Z

Superseded by PR #6483

pitrou changed the title ~~Draft C++ implementation of C data interface (Take 2)~~ C++ implementation of C data interface (Take 2) Dec 12, 2019

pitrou force-pushed the c-data-interface-impl2 branch from 6b1670d to 5b65b17 Compare December 12, 2019 17:32

pitrou force-pushed the c-data-interface-impl2 branch from 5b65b17 to 600a163 Compare December 12, 2019 17:56

jacques-n reviewed Dec 13, 2019

View reviewed changes

pitrou force-pushed the c-data-interface-impl2 branch 3 times, most recently from dddb8de to e99baf7 Compare December 16, 2019 17:10

jacques-n reviewed Dec 16, 2019

View reviewed changes

pitrou force-pushed the c-data-interface-impl2 branch from e99baf7 to 16fc6df Compare January 23, 2020 16:58

pitrou mentioned this pull request Feb 3, 2020

C++ implementation of C data interface #5608

Closed

Draft C++ implementation of C data interface (Take 2)

24f4648

pitrou force-pushed the c-data-interface-impl2 branch from 32f55b7 to 24f4648 Compare February 11, 2020 18:49

Merge branch 'master' into c-data-interface-impl2

3baca40

nealrichardson changed the title ~~C++ implementation of C data interface (Take 2)~~ ARROW-7913: [C++][Python][R] C++ implementation of C data protocol Feb 21, 2020

pitrou closed this Feb 24, 2020

pitrou deleted the c-data-interface-impl2 branch February 24, 2020 16:36

asfimport mentioned this pull request Mar 2, 2020

[C++][Python][R] C++ implementation of C data interface #24134

Closed

Uh oh!

ARROW-7913: [C++][Python][R] C++ implementation of C data protocol #6026

ARROW-7913: [C++][Python][R] C++ implementation of C data protocol #6026

Uh oh!

Conversation

pitrou commented Dec 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Dec 12, 2019

Uh oh!

pitrou commented Dec 12, 2019

Uh oh!

jacques-n left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jacques-n Dec 16, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jacques-n Dec 16, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wesm Jan 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pitrou Dec 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wesm commented Dec 23, 2019

Uh oh!

jacques-n commented Dec 26, 2019

pitrou commented Dec 12, 2019 •

edited

Loading

jacques-n Dec 16, 2019 •

edited

Loading

jacques-n Dec 16, 2019 •

edited

Loading

wesm Jan 13, 2020 •

edited

Loading

pitrou Dec 22, 2019 •

edited

Loading

wesm commented Jan 14, 2020 •

edited

Loading

pitrou commented Jan 15, 2020 •

edited

Loading

jacques-n commented Jan 20, 2020 •

edited

Loading