-
Notifications
You must be signed in to change notification settings - Fork 796
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement StringViewArray
and BinaryViewArray
reading/writing in parquet
#5530
Comments
@mapleFU brought up a good point on #5557 (comment) that I wanted to record here. The observation is that that the special handling StringView / BinaryView will be substantially more code in the parquet decoder. The reason for adding the special case is to avoid a copy For example as I understand it, from the parquet encodings doc
So the data looks like this (length prefix, followed by the bytes):
To make a StringArray, those bytes must be copied to a new buffer so they are contiguous:
However, for a StringView array, the raw bytes can be used without copying
|
In parquet, For some view, this might work for some encoding( like PLAIN and DELTA_LENGTH_BYTE_ARRAY), but might be tricky for DELTA_BYTE_ARRAY. Besides, do we need to "copy" the buffer when we'd like to clone or concat the array? |
This is a good point. Maybe something we could consider is to support setting the desired type directly on the reader Something like let reader = ArrowReaderBuilder::try_new(...)
// override any schema declared in the file
.with_schema(schema)
.build()?; If we supported a similar API for the writer than we could write |
Yes, my view is that:
|
I think the current writer has the very nice property that RecordBatches round trp cleanly -- they can be written to parquet and then re-read and will be equal. Thus, yy personal preference is that the writer should (by default) write the metadata that matches the data it was passed (so in this case specify I think it would be valuable to add an option on the writer to write in "compatiblity" mode (and store string/binary data with StringArray metadata, for example) |
This may differ from how we treat "view"( For both list view and string view ). If we think it's a part of string or list, it might better be string/list in metadata. If we regard it as a completely different type, we should regard it as string-view/list-view For ipc, they're completely different type. For parquet, they stored nothing different, however parquet uses ipc schema here. I believe a new file can be read well, but the nested info might be lost. But anyway this might not severe |
Thanks @mapleFU -- I don't fully understand the concern -- if the embedded metadata in a parquet file says it was written as a |
IMO at least, a release without |
I see -- I think we are both in agreement that the user of the Rust parquet crate should be able to decide how they want to handle this situation. We may just have different opinions on what a better default is. As long as both options are possible, I could be convinced either way on the default value |
Yes, I'm also ok with writing a string view type, and maybe we can generate a test file first |
Update here is that we have pretty fast support for reading into StringView and BinaryView arrays now thanks to @ariesdevil and @XiangpengHao so closing this ticket There is follow on work tracked in #5904 in case anyone wants to try and make it faster |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is part of the larger project to implement
StringViewArray
-- see #5374In #5481 we added support for
StringViewArray
andByteViewArray
.The parquet crate has a reader and writer for reading/writing parquet data to arrow:
Describe the solution you'd like
I would like to be able to read a
StringViewArray
andBinaryViewArray
directly from the reader and writer with no data copies (so the raw byte values are not copied).Describe alternatives you've considered
For example, I think we need to add the support to the writer here
arrow-rs/parquet/src/arrow/arrow_writer/mod.rs
Lines 719 to 732 in f41c2a4
Additional context
The reader/writer already handles
DictionaryArray
s which I think could serve as a model for the view arrays.@ariesdevil reports they are working on this feature #5374 (comment)
The text was updated successfully, but these errors were encountered: