-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
how do we deal with Int32
overflows?
#75
Comments
There is a pretty active arrow mailing list, maybe ask there? |
@wesm would be fantastic if you could help us out here. Essentially we are trying to understand what the correct way to handle more than 2^31 - 1 rows in arrow and feather is. I've read through the arrow spec and that didn't really clear it up for me. |
We've allowed 64-bit Arrow lengths for almost a year. Keeping things under INT32_MAX is a guideline, though, for better compatibility with Java. Could you let me know what problem you're having? Would you all like to join the Apache Arrow project at some point? It would facilitate cross-language collaboration. |
Thanks! So the info in the "Array lengths" section in https://arrow.apache.org/docs/memory_layout.html is just outdated and we should ignore that? Can we also just use 64-bit lengths in Feather and still be fully compatible with the python and R implementation? I'm not entirely clear what joining the arrow project means, but as long as I don't commit to do work it sounds great. I would certainly like to strengthen the cross-language collaboration, and if I can help do that from julia's side of things, I'm game. |
It means building a Julia developer community in the Apache project (https://www.apache.org/foundation/how-it-works.html). It's about community and governance more than anything. As one symptom of working over here by yourselves, you aren't interacting with the other Arrow developers very often. Yes -- use 64-bit lengths. The Feather metadata uses 64 bits. I will open a JIRA about updating that document about array lengths (64-bit permitted, < 32-bit preferred) |
Cool, thanks! And also thanks for the pointer on the Apache project. I totally agree that we (the julia community) should interact much more with things that are going on in these other communities. I'll have to find some time to dig through all the material in that link, but it sure looks super useful. I have some more ambitious plans around all of this here on the Berkeley campus, I'll make sure to reach out if that moves beyond vague ideas. |
Awesome! This makes things incredibly simple, essentially no work is required except to ensure that lengths are I agree that we should try to get involved with Apache, though I don't expect my own interest in this stuff to go beyond my commitment to getting Arrow.jl and Feather.jl up and running reliably. (Am curious to hear @davidanthoff grand plans though 😉) |
Yes, thanks for chiming in here @wesm. In the past, various limitations on julia package registration/installation prevented the code here (or Arrow) from being properly integrated upstream, but improvements on julia master (and soon to be released 0.7/1.0) allow for more flexible arrangements, such as letting the code live side-by-side other implementations. We certainly have interest in integrating efforts with the parent project, so as things progress here, we can work on the integration as well. |
I wouldn't let issues related to the structure of the codebase(s) get in the way. If we need to create you a separate |
Sounds great! I'll work w/ @ExpandingMan and we can get the arrow-julia code he's been working on in a state to integrate. |
OK, the main thing I will ask is that you keep track of whose IP is involved. There is code living in multiple repos from what I can tell, and so the import / clearance process may be somewhat complex (we have to verify copyright and distribution rights for all code imported into an Apache project) |
This sounds great! Thanks Wes I just (happily) watched the process of merging in the go implementation and I think moving the Julia one under the apache umbrella would be great. All: we should put together a todo list of what needs to happen to get this "in a state to integrate." I can likely pitch in, but would spend my time better if there was a specific set of tasks we were working towards |
Arrow.jl was created from scratch by me using only the one arrow formatting document and the existing Feather.jl as reference. I deliberately didn't even look at the C++ source code because my goal was to make as "Julian" a package as possible. I'm not sure what would be involved with |
@wesm , what do you guys do about the offset values (i.e. for string columns)? I don't see any way in the Feather metadata to specify their type. As far as I can tell the C++ feather implementation always seems to make the offset values |
@ExpandingMan Feather is limited to < 2 GB of total string data per column. This could be fixed by changing the Feather format to use the Arrow streaming/file messaging formats (i.e. chunked files instead of monolithic columns) internally. I'd like to do this as soon as we are able to ship R bindings of Arrow. Presumably Julia would follow. It would be nice to have the Julia code in the same place so we can do integration testing of patches |
Ah, I see, so you are maintaining that limit. That was the whole reason we were asking about the Anyway, I had been giving some thought to how to write more data than can fit in memory and I was indeed thinking that multiple feather files would be the absolute simplest thing to do. Fortunately, in Julia implementing that should be really simple, we'll just need to know what the metadata will look like. From our perspective it would be perfectly fine if every file has a FlatBuffer of some kind, reading them in has very low overhead. |
I've been reading up on what happened over on the Python side to deal with arrays with length that excedes
typemax(Int32)
. This is a problem because Arrow specifies that all lengths should be described withInt32
(they say this is to keep people from having contiguous arrays that are too large, but I have to say that it's hard for me to see this as a good justification for this particular case). This issue and this PR sort of make it seem as if they are indeed storing contiguous arrays of length greater thantypemax(Int32)
but they somehow read them in chunks. I've been avoiding digging through their C++ code in great detail, but I was wondering if @quinnj perhaps has some insight on how this issue might be dealt with. It would seem to me that the most sensible solution would be to change the Feather metadata so that a column can be composed of arbitrarily many Arrow "primitive arrays" but that doesn't seem to be what they've done at all.The text was updated successfully, but these errors were encountered: