ARROW-7913: [C++][Python][R] C++ implementation of C data protocol #6026
Conversation
(force-pushed 6b1670d to 5b65b17)
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename the pull request title in the following format? See also:
(force-pushed 5b65b17 to 600a163)
A couple of comments. The biggest question I have is about the usage pattern and whether we need some concept of iteration.
One other thing: it feels like there are some changes in the changeset that aren't related to the C data interface (beyond some prerequisites needed to make it work).
python/pyarrow/cffi.py (outdated):
Is there a way to do this without duplication?
It depends on the FFI layer. Some may be able to parse arbitrary header files (such as cpp/src/arrow/c/abi.h in this PR).
cpp/src/arrow/c/abi.h (outdated):
I can't remember. Do we have required alignment here? Anything we should note in the header?
Alignment and padding specification is simply part of the host system's C ABI. By making this a C struct (not an explicitly defined binary encoding), we follow the C ABI.
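For reference (not part of the comment above), this is how the two structs eventually ended up in the standardized C data interface; field names were still under discussion at the time of this PR, but the ABI point stands: these are plain C structs, so layout, alignment and padding follow the platform's C ABI.

```cpp
// The C data interface structs as later standardized (illustrative only).
struct ArrowSchema {
  const char* format;              // type description, e.g. "i" for int32
  const char* name;                // field name, may be NULL
  const char* metadata;            // binary-encoded key/value metadata
  int64_t flags;                   // e.g. nullability, dictionary ordering
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);  // producer-provided destructor
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;                  // the logical offset debated below
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);   // producer-provided destructor
  void* private_data;
};
```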
I wasn't talking about the alignment of the information in the C struct; I was talking about the alignment of the buffers themselves. Don't we declare elsewhere that buffers should be 8- or 64-byte aligned?
In the columnar format, in-memory alignment is a recommendation but not a requirement, so I think the same would hold here. Intuitively, when referencing "other people's memory" in-process, it would be onerous to require a memory copy just to satisfy a pointer alignment requirement.
The IPC message protocol does enforce alignment and padding at the serialization boundary, however.
Should that be a flag then, meaning "the memory is guaranteed to be aligned"? Do all the libraries handle non-aligned memory?
Having a flag seems like a good idea. I believe all the libraries do handle non-aligned memory.
Memory alignment nowadays is more an optimization than a hardware requirement, on common architectures at least. x86 happily accepts unaligned memory accesses, and so does ARM64. Legacy architectures like Sparc may have stricter requirements. See also https://news.ycombinator.com/item?id=14769797
I think a flag may just add boilerplate to the implementation (you have to check all buffers are suitably aligned) without any practical value, because consumers are unlikely to do anything with it (why error out if a buffer is unaligned? your CPU / OS combination is still able to handle it fine).
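To make the boilerplate point concrete, here is a minimal sketch (not from this PR) of the check a producer would have to run before truthfully setting such a flag, assuming the ArrowArray struct shown earlier and a 64-byte boundary:

```cpp
#include <cstdint>

// Sketch: verify that every buffer of an ArrowArray (and of its children,
// recursively) is aligned to a power-of-two boundary, e.g. 64 bytes.
// Assumes the ArrowArray struct definition shown earlier in this thread.
bool BuffersAligned(const struct ArrowArray* array, uintptr_t alignment) {
  for (int64_t i = 0; i < array->n_buffers; ++i) {
    const void* buf = array->buffers[i];
    // A NULL buffer (e.g. an omitted validity bitmap) is trivially fine.
    if (buf && (reinterpret_cast<uintptr_t>(buf) & (alignment - 1)) != 0) {
      return false;
    }
  }
  for (int64_t i = 0; i < array->n_children; ++i) {
    if (!BuffersAligned(array->children[i], alignment)) return false;
  }
  return true;
}
```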
I think that is true for individual values (e.g. int32) but not for wider SIMD instructions. I also think this influences the compiler's ability to do certain vectorized optimizations. The flag would be optional, so if you are using a library or use case that guarantees alignment, you'd set it. Otherwise, you wouldn't (even if your buffers happened to be aligned). I don't think anyone would actually scan buffers to check whether they are aligned (which is what it seems like you describe).
but not for wider simd instructions
That does not seem to be the case for modern instruction sets and implementations. See x86 and ARM. I may be missing some important data point, though.
In any case, which kind of alignment would this flag enforce? 64-bytes alignment?
(for reference, the C++ implementation probably wouldn't set this flag because, while it makes an effort to allocate aligned buffers, it also accepts slice buffers or buffers allocated by foreign runtimes e.g. CPython)
(force-pushed dddb8de to e99baf7)
cpp/src/arrow/c/abi.h (outdated):
offset and n_buffers seem out of place here. What is the rationale for including these? n_buffers is known by definition, so why add it here? offset seems like an optimization that could be solved by shifting the pointers rather than providing an offset.
Note, I understand that offset can allow one to avoid bit-shifting, but that seems like something we shouldn't support or tell people to do.
The C++ code supports slices of arrays/record batches without doing a copy, by just tracking the offset (even for the validity buffers). Shifting the pointers doesn't work for validity bitmaps.
Yeah, without the offset member, zero-copy slicing is simply not possible. This would be a serious deficiency.
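As a sketch (not code from this PR) of why pointer shifting cannot replace the offset for bitmaps: validity is packed one bit per value, so a consumer needs the bit-level offset to locate logical index i, while a shifted pointer can only advance in whole bytes, i.e. 8 values at a time.

```cpp
#include <cstdint>

// With an explicit offset, the validity of logical index i is read as:
inline bool IsValid(const uint8_t* validity, int64_t offset, int64_t i) {
  const int64_t bit = offset + i;               // absolute bit position
  return (validity[bit / 8] >> (bit % 8)) & 1;  // Arrow uses LSB bit order
}
// A shifted pointer can only express offsets that are multiples of 8,
// which is exactly the restriction being debated here.
```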
This is consciously not supported at the Arrow format level (unless my memory fails me). I don't see why we should support it here. This addition would make the interface coupled to the C++ implementation (and incompatible with other implementations), which it shouldn't be.
It seems clear that we have different opinions about whether this pattern should be supported. I'm strongly against it, both for technical reasons and because it isn't supported by the Arrow specification. I'm sure we came to our conclusions through different experiences. Let me provide a little background on my perspective.
The Java codebase actually did have this concept at one point (well before it was part of the Arrow project) and we chose to eliminate it, as we ultimately came to the conclusion that it made the wrong tradeoffs between efficient calculations and efficient slicing. What we discovered was that any time you were trying to do zero-copy slicing, doing so on an eight-value boundary was a much better design choice than taking on the additional complexity of bit-shifting on the computation side.
I see excluding this attribute/option as a prudent design choice that kernel developers will benefit from. In general, I expect far more computation algorithms and kernels to be written than slicing logic. The number of times that computational code becomes more complex with the offset field seems to far outweigh any benefits of byte-splitting on the slicing side.
It would be helpful to better understand why you feel this is so crippling. Which specific algorithms do you see impacted by disallowing byte-splitting? How common are these, and what do you think the cost is of not supporting this? How does that compare with your sense of the cost of dealing with this in every computational algorithm?
If this data interface were restricted to the domain of computational kernel evaluation, then I would agree with you, but there is a wider diversity of use cases that the project supports. Note that in Dremio the slice-and-copy pattern is less onerous because record batches are generally limited to 64K elements or fewer; the calculus is different when record batches can be arbitrarily large in general.
So, first we must take it as a given that most implementations are going to provide zero-copy slicing as a feature for in-memory data. The question then is how to handle slices with an offset that is not a multiple of 8. There are 3 options:
- Disallow such slicing at the C interface boundary, forcing copy-materialization for all use cases
- Allow slicing, but regularize sliced bitmaps (and the data buffer for sliced BooleanArray) for all computational kernels, i.e. assume an offset of 0 for all kernels
- Allow slicing and implement kernels for non-zero offsets
The 2nd option seems the most practical -- the interface is always zero-copy and kernel developers don't bother trying to optimize for non-zero offsets. It does mean that functions unable to act on sliced data must check whether the offset is 0 and, if not, convert the input to be offset-0 based (allocating a new bitmap and copying over the bits if necessary).
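A naive sketch of what this regularization amounts to (illustrative only; a real implementation would copy word-at-a-time with shifts rather than bit-by-bit):

```cpp
#include <cstdint>
#include <vector>

// Materialize an offset-0 copy of a sliced validity bitmap so that
// kernels can always assume offset == 0 (the 2nd option above).
std::vector<uint8_t> RegularizeBitmap(const uint8_t* bitmap,
                                      int64_t offset, int64_t length) {
  std::vector<uint8_t> out((length + 7) / 8, 0);
  for (int64_t i = 0; i < length; ++i) {
    const int64_t bit = offset + i;             // source bit position
    if ((bitmap[bit / 8] >> (bit % 8)) & 1) {
      out[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return out;
}
```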
I'm not going to hold up the work over this, but it would feel like a shame to me to have a C data interface which is not zero-copy all of the time.
I don't think this would be tenable in C++
Why not? Where is the critical need for allowing non-regularized slicing? Requiring that anything slicing the data does so on a div-8 boundary seems like a pretty low-burden constraint. Are you seeing lots of algorithms where this is a high burden?
I'm not arguing here for copy materialization; I'm saying we should make non-div-8 slices invalid and keep both the ability to do zero-copy slices and computational simplicity.
Requiring that anything that is slicing the data does so div-8 boundary seems like a pretty low-burden constraint
I think the use case is a user choosing an arbitrary slice. E.g. I have a datatable and I choose to slice rows 1 to N. The library shouldn't throw an exception here, so it is up to implementors either to do the copy or to keep the slice index. But perhaps there is a third option?
Where is the critical need for allowing non-regularized slicing?
As background, I started my career in financial analytics, where the world is more oriented around time series data and where slicing and analytics on large memory-mapped datasets is the general expectation. Think application of rolling time-series window functions. Consider as an example a system oriented around this kind of data: https://en.wikipedia.org/wiki/Kdb%2B
This kind of analytics is very different from batch-processing SQL workloads. In particular, with time series data it is not possible to constrain slices to multiples of 8.
Apache Arrow already has a lot of users from the financial analytics world, and many of them are going to be using C++. If you start requiring memory copying every time you request an arbitrary window of data (big or small, a window could be 1KB or 1GB) from a large time series, you cripple this use case.
cpp/src/arrow/c/abi.h (outdated):
I noticed elsewhere that this is declared optional. Why would we make name optional? Isn't that an anti-pattern?
In C++ land, a top-level Arrow type doesn't have a name. There's a distinction between "field" and "type": a field is a type plus a name, but an Arrow array only has a type, not a field.
Maybe state in the definition that only a top-level name should generally be empty. I don't think all libraries can handle null names at non-schema roots. The language here suggests that it doesn't really matter, when most of the time it should be defined.
Since a null name can easily be interpreted as an empty name, is it a problem (e.g. in Java) if child nodes have empty names?
cpp/src/arrow/c/abi.h (outdated):
I don't understand your comment...
ATM, one needs to know schema information to consume or copy buffers. At a minimum I need both how many buffers there are and how long each buffer is. Right now, the definition includes one but not the other. Why not exclude both (or include both)?
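For illustration (a sketch, assuming a primitive int32 layout; not code from the PR), buffer sizes are derivable from schema knowledge plus the length and offset fields, which is the usual argument for not storing sizes in the struct:

```cpp
#include <cstdint>

// Minimum byte sizes of the two buffers of an int32 array, computed
// purely from the schema plus the length/offset fields:
int64_t ValidityBytes(int64_t offset, int64_t length) {
  return (offset + length + 7) / 8;                     // 1 bit per value
}
int64_t Int32DataBytes(int64_t offset, int64_t length) {
  return (offset + length) * int64_t(sizeof(int32_t));  // 4 bytes per value
}
// Variable-size layouts (e.g. utf8) additionally need the last entry of
// the offsets buffer to size the data buffer, so schema knowledge is
// required in any case; storing sizes in the struct would be redundant.
```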
To restate a point regarding the offset field:
I replied with my thoughts on this in the inline discussion. A couple of key points:
It is worth pointing out that in the current spec proposal, support for a non-zero offset is optional. Perhaps another wording would be better?
To clarify, I'm awaiting further feedback from @wesm per my questions above. My preference is to remove the field entirely, as I think it starts to create a shadow standard that reaches beyond what the core standard supports.
I don't think this would be tenable in C++ at least. Having zero-copy slicing does not increase the effective amount of computational work or memory allocation involved. You are choosing between doing the bitmap allocate-and-copy eagerly (when slicing) or deferring it until offset-0 data is actually required.
There is a separate question, which is whether kernel developers should care about computational function evaluation against sliced data with an offset that is not a multiple of 8. As I said above, I think it would be fine if libraries established the policy of forcing regularization of sliced bitpacked data (validity bitmaps or Boolean data) prior to function evaluation.
To summarize, I am OK with removing the offset field.
As I've said elsewhere, I'm very much against this.
Did you guys have a discussion of whether it makes sense to declare these as a packed representation? I've been thinking about reasonable approaches for Java (which is cross-platform), and stating that this is packed seems at first glance like it would be cleaner. That being said, it seems that a packing declaration still isn't well defined cross-platform (at least upon first glance).
Can you elaborate on the benefit you think packing provides? I'm not sure why it would be cleaner.
In a JVM, you don't want to construct a struct in C code. You want to allocate a direct byte buffer and then fill in the information. If you declare things as packed, you have to worry less about per-platform differences when generating this chunk of memory. It still doesn't solve endianness, but since we already disallow big-endian in the Java library that isn't a major concern. (Note that this is all ignoring the release invocation, which I've thought less about.)
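To make the packing question concrete, here is a small sketch (hypothetical field names, not the actual interface structs) contrasting natural ABI layout with a packed one; note that #pragma pack is itself a compiler extension (supported by MSVC/GCC/Clang), which echoes the "not well defined cross platform" caveat above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Natural layout: the compiler may insert padding so the pointer lands
// on an 8-byte boundary (on typical 64-bit ABIs).
struct NaturalLayout {
  int32_t n_buffers;
  const void** buffers;
};

// Packed layout: every field offset is pinned, so a foreign runtime
// (e.g. a JVM filling a direct ByteBuffer) could compute offsets without
// knowing the platform ABI.
#pragma pack(push, 1)
struct PackedLayout {
  int32_t n_buffers;
  const void** buffers;  // immediately follows, offset 4
};
#pragma pack(pop)

int main() {
  std::printf("natural offset of buffers: %zu\n", offsetof(NaturalLayout, buffers));
  std::printf("packed  offset of buffers: %zu\n", offsetof(PackedLayout, buffers));
  return 0;
}
```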
As I commented above, the slicing use case is mostly motivated by non-SQL workloads (e.g. time series analytics). We can remove the offset field…
As a way to break the impasse, I plan to start a discussion on the mailing list to see if anyone wants to share information about their time series / non-SQL workloads. I represent some of these by proxy (since Ursa Labs is sponsored partly by some large financial institutions), but it would be good if others would share their use cases directly.
Any other comments? @xhochy @emkornfield @lidavidm @kou |
I am OK with this with or without the slice field. I do think it would be useful, but having the ability to share data between implementations is already an improvement. There is also the possibility that an application which really needs slicing will simply communicate the offsets out-of-band with the data. Might it be worth reserving space for the offset field?
I don't have much to say about workloads offhand. This functionality hasn't been something that we had thought about or wished for until quite recently; the workload I'm investigating would take advantage of slicing if present, but wouldn't be blocked if it isn't there in a first iteration.
In regards to general review, please don't wait for me; I'm not sure when I will get to it. CCing @brills in case he has any thoughts on the proposal from a TFX perspective.
Being able to pass sliced Arrays through this API is important to our (TFX) use cases, and we hope that this API is lightweight enough that our own utility functions that take / output Arrow entities can be isolated from pyarrow without worrying about overhead. We rely on Arrow's zero-copy slicing heavily. We store machine learning training data in Arrow RecordBatches, and one common operation is to slice a larger batch into smaller ones, or take slices that meet certain criteria (e.g. "filter by feature value"); these operations may result in "unaligned" slices (this may apply to both null bitmaps and the values array).
It looks like the Arrow C++ implementation is a superset of the Arrow format spec. Could this C-interface spec define an "extension" that C++ (and others) can implement while Java can ignore (or error out)?
This is exactly what I think we should avoid. It fragments the standard.
I don't think the issue here is related to workload type. By introducing offset you're degrading the processing side of Arrow to try to enhance the transport side. From experience, I think that is a bad idea. It contradicts the goal of supporting both equally, without serialization/deserialization. I've yet to see or hear about an algorithm that couldn't choose to produce or split data on 8-value-aligned slices [comments on the TFX examples below]. I understand that people may have built stuff that depends on unaligned slicing. We've done it in the past too. That isn't the same as saying the algorithm couldn't or shouldn't accommodate this constraint. For clarity for users, I'd be inclined to introduce two interfaces: one that is guaranteed successful and will never allocate, e.g. slice8(), and one that may allocate/copy, e.g. slice(). People can pick which pattern they want and understand that there is benefit to not building things that slice arbitrarily (see the sketch below).
In the first case, I don't see why you'd be unable to slice on 8-value offsets. In the latter case, some part of your algorithm is packing the values one category at a time (or else they wouldn't be colocated). In that case, I'd suggest starting new batches for each category as you're packing.
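A sketch of the two-interface idea (the names slice8/slice come from the comment above; the ArrayView type here is hypothetical, not an Arrow type):

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical view over int32 data with a packed validity bitmap.
struct ArrayView {
  const uint8_t* validity;  // packed validity bits
  const int32_t* values;
  int64_t offset;           // logical start position, in values
  int64_t length;
};

// slice8: guaranteed zero-copy and never allocates, but only accepts
// slices whose absolute start lands on an 8-value boundary, so the
// bitmap pointer itself can be advanced.
ArrayView Slice8(const ArrayView& a, int64_t start, int64_t length) {
  const int64_t abs = a.offset + start;
  if (abs % 8 != 0) throw std::invalid_argument("slice8 requires an 8-value boundary");
  return {a.validity + abs / 8, a.values + abs, /*offset=*/0, length};
}

// slice: accepts any start by recording a logical offset in the view --
// which is exactly what the contested offset field enables. Without such
// a field it would instead have to allocate and regularize the bitmap.
ArrayView Slice(const ArrayView& a, int64_t start, int64_t length) {
  return {a.validity, a.values, a.offset + start, length};
}
```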
I'm an Arrow user coming from the quant-finance area, where we operate on large time series datasets. The performance of the algorithms we implement is dependent on not copying multiple GB of data. The algorithms fundamentally work on arbitrarily-sized time windows, so we can't just take a multiple of 8 without implementing our own offset logic on top of each slice. @wesm has pretty accurately voiced my concerns. I'm just piping up to say they're not just theoretical.
The most important point here that I think people are glossing over is: the ability to arbitrarily slice data was not a feature of the specification. Undermining the specification with a new alternative specification fragments the community and the project.
@dhirschfeld, thanks for sharing your thoughts and experiences. It's very much appreciated. There are a couple of things I would say to what I believe you're implying about your use case (apologies if I don't capture these correctly). It sounds like:
1. You manage data in very large contiguous batches.
2. You slice those batches on time-window boundaries.
3. Those windows have some known common resolutions.
I would suggest that (1) is a Bad Idea. (I believe Wes and I disagree on this.) In general, I haven't seen benchmarks that argue for very large chunks of data to be contiguously managed. Once you get beyond a certain size, the cost of occasional indirection is irrelevant. On the flip side, there are a large number of really good reasons to keep batches of data smaller, including pipelining and memory fragmentation. If (1) is avoided, the frequency of problematic boundary slicing goes down, and you're talking about KBs to copy, not GBs, in the unfortunate case you can't work around it. (As an aside, I feel guilty that we didn't enforce this in the specification. It may have made it initially slightly harder for people to adopt Arrow but would have ultimately been a favor to everyone -- full disclosure, I believe Wes and I disagree on this as well :D)
If (2) is true and the common resolution of (3) is known, I would suggest that you should be batching and slicing your data based on this coarseness already. I understand that you may not do that today and that it would be hard work to change. Even so, I don't think that is a good enough reason to change the format.
Furthermore, on (3), I rarely see scenarios where people don't further have to apply a predicate mask to data workloads beyond how they are initially organized. If selection predicates are being applied, applying them initially to mask boundary values seems like a simple and consistent solution. I agree with everyone here that supporting arbitrary offsets makes many problems simpler, but I see it as a local maximum, not a global one.
@jacques-n I think there is some conflation of the question of data chunking (and large contiguous allocations) with the issue of slicing and 8-record multiples. (That being said, on the subject of large allocations, I maintain my position that a 32-bit length constraint on arrays/record batches would have closed off certain important user communities from ever adopting Arrow -- many of my sources of funding for Arrow development come from parties that rely on this, for example.)
What we are saying about the time series / financial analytics use cases is that requiring 8-record multiples while also ensuring zero copy is not possible, regardless of the chunking or size of the dataset. So when you say users should be "batching and slicing your data based on this coarseness", this is not possible in general -- data slices will start at arbitrary offsets. Time series data is collected and processed using window-type algorithms (think grouping tick data into 5-minute buckets): algorithms determine the window edges, then process the data falling inside the window. So what will happen here is that auxiliary data structures will have to be created to provide a zero-copy view onto unsliceable Arrow data, or otherwise a copy must be materialized if you want to deal purely with Arrow constructs.
Note also that analytics will commonly change the data binning rules on the fly (e.g. from 1min to 5min to 30min, 1 hour, 24 hours, etc.), so any predetermined data batch size may work for one analysis but not for another.
@wesm, we seem to continue talking past each other. I greatly respect your opinion and think we'll have to agree to disagree. I rarely see conversation about many of the things that inform my opinions around this, such as fragmentation, pipelining, null evaluation and runtime code generation. I appreciate that others may worry less about those things (or maybe worry about them in different ways), but I continue to see arbitrary slicing as a local maximum that is detrimental to those and many other things. Given that, I'm still -1 on merging this while it has the offset/arbitrary slices. I see offset as a mistake and in conflict with the Arrow spec. Like anything that causes serialization/deserialization overhead at the IPC boundary, I think we should try to avoid it, lest we weaken the original promises of the project.
Offset already exists in C++ land and in several other implementations as well, and it won't disappear from them. It already causes overhead at the IPC boundary -- this PR doesn't change anything about that.
What you're saying is that you can already emulate …
There are multiple practical reasons to keep interfaces consistent with the spec and to keep spec-versus-implementation concerns separate. I'm sorry you don't like my perspective, but it's based entirely on practical concerns. Having written a complete distributed SQL engine on top of Arrow gives one a very practical perspective.
It's important to note that the use cases from time series and finance that I'm talking about are notoriously not well served by SQL-based data processing systems at all. This is a common frustration for the financial industry, and it has led to the development of many proprietary and commercial analytical systems tailored to these needs. At least on the C++ library side, I feel obligated to look after these interests, which have not historically been served by the SQL community.
If some of our intended users (e.g. TFX) are not going to be able to use this interface without slicing, I am wondering if we should reframe the problem statement as creating a minimalist C interface for third-party projects to interoperate with the Arrow C++ library only. Then this becomes an endogenous concern of the C++ project.
It's important to note that my comments were in response to the suggestion that my perspective didn't come from a practical place, not to dismiss the realities of different use cases. However, we should try to proceed spec-first as we introduce new use cases, to avoid situations where we have walled gardens.
I think we've come to the point where that is a false reframing. You may recall that my initial reservations about this work drove us to frame the problem in that way, and I stepped away from the issue. However, during development and early socialization of this work, it became clear that separating a C ABI from the core concept of "Arrow compatible" is very difficult, if not impossible. This led us down the path of separating schema from data, and to the current impasse. My only suggestion at this point is to develop the code in an independent repository outside the auspices of the Arrow project.
That seems a bit extreme. I think that, as an "API" for users of the C++ library, our goal is to provide a minimalistic way for people to create the C structs.
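For context, the producer-side API that eventually landed in the C++ library (in the follow-up work; shown here only to illustrate the "minimalistic" goal, not as part of this PR) is used roughly like this:

```cpp
#include <arrow/api.h>
#include <arrow/c/bridge.h>  // ExportArray / ImportArray

// Sketch of exporting an arrow::Array through the C data interface.
arrow::Status ExportExample(const std::shared_ptr<arrow::Array>& array) {
  struct ArrowArray c_array;
  struct ArrowSchema c_schema;
  ARROW_RETURN_NOT_OK(arrow::ExportArray(*array, &c_array, &c_schema));
  // Hand the two structs to a foreign consumer; whoever ends up owning
  // them is responsible for invoking the release callbacks when done:
  if (c_array.release) c_array.release(&c_array);
  if (c_schema.release) c_schema.release(&c_schema);
  return arrow::Status::OK();
}
```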
After further reflection, I'm in support of merging this with the inclusion of the offset field. While it may not be desirable for some reasons, it seems better to do that and have this be a generalized interface for adoption than to compartmentalize things between sub-codebases. +1
(force-pushed e99baf7 to 16fc6df)
Rebased.
(force-pushed 32f55b7 to 24f4648)
Superseded by PR #6483