-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Support Arrow IPC Stream Files #18457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Support Arrow IPC Stream Files #18457
Conversation
532ca54 to
99ebe62
Compare
| // correct offset which is a lot of duplicate I/O. We're opting to avoid | ||
| // that entirely by only acting on a single partition and reading sequentially. | ||
| Ok(None) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is perhaps the weightiest decision in this PR. if we want to repartition a file in the ipc stream format then we need to read from the beginning of the file for each partition, or figure out another way to create the ad-hoc equivalent of the ipc file format footer so we can minimize duplicate reads (likely by reading the entire file all the way through once and then caching the result in memory for the execution plan to use for each partition)
| ); | ||
|
|
||
| let meta_len = [meta_len[0], meta_len[1], meta_len[2], meta_len[3]]; | ||
| let meta_len = i32::from_le_bytes(meta_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be possible to (manually) manipulate the file's bytes in such a way that it produces a negative i32 here.
Then below the casting to usize will lead to problems.
What is the reason meta_len to be i32 instead of u32 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe I was referring to the spec: https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format
And saw it say <metadata_size: int32> so I defaulted to i32 🤔
Checking for valid i32 (aka non-negative) does sound reasonable for robustness
| let statistics = &self.projected_statistics; | ||
| Ok(statistics | ||
| .clone() | ||
| .expect("projected_statistics must be set")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it need to panic here ? Would it be better to return an Err ?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we need the panic here, I'm mirroring the other Arrow FileSource. I'm happy to change both but I'll need to track down the implications of this tonight. I was trying to minimize changes in behavior since I'm a new contributor.
Best I can tell it was introduced in #14224
| ); | ||
|
|
||
| let meta_len = [meta_len[0], meta_len[1], meta_len[2], meta_len[3]]; | ||
| let meta_len = i32::from_le_bytes(meta_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe I was referring to the spec: https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format
And saw it say <metadata_size: int32> so I defaulted to i32 🤔
Checking for valid i32 (aka non-negative) does sound reasonable for robustness
| } | ||
|
|
||
| // Custom implementation of inferring schema. Should eventually be moved upstream to arrow-rs. | ||
| // See <https://github.com/apache/arrow-rs/issues/5021> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't fully reviewed this PR, but just curious if you've managed to check if this code has been upstream to arrow-rs by now and we might be able to leverage it's code?
Which issue does this PR close?
register_arrowor similar #16688.Rationale for this change
Currently Datafusion can only read Arrow files if the're in the File format, not the Stream format. I work with a bunch of Stream format files and wanted native support.
What changes are included in this PR?
To accomplish the above, this PR splits the Arrow datasource into two separate implementations (
ArrowStream*andArrowFile*) with a facade on top to differentiate between the formats at query planning time.Are these changes tested?
Yes, there are end-to-end sqllogictests along with tests for the changes within datasource-arrow.
Are there any user-facing changes?
Technically yes, in that we support a new format now. I'm not sure which documentation would need to be updated?