Conversation

@corasaurus-hex
Contributor

Which issue does this PR close?

Rationale for this change

Currently DataFusion can only read Arrow files if they're in the File format, not the Stream format. I work with a bunch of Stream format files and wanted native support.

What changes are included in this PR?

To accomplish the above, this PR splits the Arrow datasource into two separate implementations (ArrowStream* and ArrowFile*) with a facade on top to differentiate between the formats at query planning time.

Are these changes tested?

Yes, there are end-to-end sqllogictests along with tests for the changes within datasource-arrow.

Are there any user-facing changes?

Technically yes, in that we support a new format now. I'm not sure which documentation would need to be updated?

@github-actions github-actions bot added labels: core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), datasource (Changes to the datasource crate) on Nov 3, 2025
// correct offset which is a lot of duplicate I/O. We're opting to avoid
// that entirely by only acting on a single partition and reading sequentially.
Ok(None)
}
Contributor Author

This is perhaps the weightiest decision in this PR. If we want to repartition a file in the IPC stream format, then we need to read from the beginning of the file for each partition, or figure out another way to create an ad-hoc equivalent of the IPC file format footer so we can minimize duplicate reads (likely by reading the entire file all the way through once and then caching the result in memory for each partition of the execution plan to use).
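The constraint above comes from the stream format being a sequence of length-prefixed messages with no trailing index: locating message N requires walking every prefix before it. A minimal sketch (hypothetical framing, not the real Arrow IPC wire layout) of why "seek to partition N" degenerates into "re-read everything before N":

```rust
// Hypothetical framing: each message is a little-endian u32 length prefix
// followed by a payload. Unlike the IPC *file* format, there is no footer
// mapping message index -> byte offset, so finding message `n` means
// walking every prefix before it from the start of the stream.
fn offset_of_message(buf: &[u8], n: usize) -> Option<usize> {
    let mut pos = 0usize;
    for _ in 0..n {
        let len_bytes: [u8; 4] = buf.get(pos..pos + 4)?.try_into().ok()?;
        let len = u32::from_le_bytes(len_bytes) as usize;
        pos = pos.checked_add(4)?.checked_add(len)?;
    }
    Some(pos)
}

fn main() {
    // Two messages: a 3-byte payload followed by a 2-byte payload.
    let buf = [
        3u32.to_le_bytes().as_slice(),
        b"abc".as_slice(),
        2u32.to_le_bytes().as_slice(),
        b"de".as_slice(),
    ]
    .concat();
    assert_eq!(offset_of_message(&buf, 0), Some(0));
    assert_eq!(offset_of_message(&buf, 1), Some(7)); // had to walk past message 0
    assert_eq!(offset_of_message(&buf, 2), Some(13)); // end of stream
}
```

With K partitions this is O(K × file size) of duplicate I/O, which is why the PR returns `Ok(None)` and keeps a single sequentially-read partition.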

);

let meta_len = [meta_len[0], meta_len[1], meta_len[2], meta_len[3]];
let meta_len = i32::from_le_bytes(meta_len);
Member

I think it should be possible to (manually) manipulate the file's bytes in such a way that it produces a negative i32 here.
The cast to usize below would then lead to problems.
What is the reason for meta_len being i32 instead of u32?

Contributor Author

I'm honestly not sure; this came from the code that was there previously. #7962 is the PR that introduced it initially, and I don't see any information about why this choice was made -- @Jefffrey do you recall why i32 instead of u32? I'm happy to change it, but I don't understand the implications.

Contributor

Hmm, maybe I was referring to the spec: https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format

And saw it say <metadata_size: int32> so I defaulted to i32 🤔

Checking for valid i32 (aka non-negative) does sound reasonable for robustness
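A hedged sketch of the validation suggested here (the function name and error type are placeholders, not the PR's actual code): keep decoding the prefix as the spec's int32, but reject negative values before the cast instead of letting `as usize` wrap a negative length into a huge one.

```rust
// Sketch: decode the 4-byte metadata-length prefix as the spec's int32,
// but fail on negative values rather than casting them to usize.
// `usize::try_from(i32)` errors on any negative input, so this replaces a
// silent wrap with an explicit error.
fn read_meta_len(prefix: [u8; 4]) -> Result<usize, String> {
    let meta_len = i32::from_le_bytes(prefix);
    usize::try_from(meta_len)
        .map_err(|_| format!("corrupt Arrow IPC stream: negative metadata length {meta_len}"))
}

fn main() {
    assert_eq!(read_meta_len(16i32.to_le_bytes()), Ok(16));
    assert!(read_meta_len((-1i32).to_le_bytes()).is_err());
}
```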

let statistics = &self.projected_statistics;
Ok(statistics
.clone()
.expect("projected_statistics must be set"))
Member

Does it need to panic here ? Would it be better to return an Err ?!

Contributor Author

I'm not sure if we need the panic here; I'm mirroring the other Arrow FileSource. I'm happy to change both, but I'll need to track down the implications of this tonight. I was trying to minimize changes in behavior since I'm a new contributor.

Best I can tell it was introduced in #14224
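For what the reviewer's suggestion could look like, here is a hypothetical sketch (the struct, field, and `String` error are stand-ins so the example is self-contained; in DataFusion this would return a `DataFusionError` instead):

```rust
// Hypothetical stand-ins for the real types, to illustrate converting the
// `expect` panic into a recoverable error.
#[derive(Clone, Debug, PartialEq)]
struct Statistics {
    num_rows: Option<usize>,
}

struct ArrowSource {
    projected_statistics: Option<Statistics>,
}

impl ArrowSource {
    // `ok_or_else` turns the missing-statistics case into an Err the caller
    // can handle, instead of aborting the process via `expect`.
    fn statistics(&self) -> Result<Statistics, String> {
        self.projected_statistics
            .clone()
            .ok_or_else(|| "projected_statistics must be set before use".to_string())
    }
}

fn main() {
    let unset = ArrowSource { projected_statistics: None };
    assert!(unset.statistics().is_err()); // Err, not a panic

    let set = ArrowSource {
        projected_statistics: Some(Statistics { num_rows: Some(1) }),
    };
    assert!(set.statistics().is_ok());
}
```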


}

// Custom implementation of inferring schema. Should eventually be moved upstream to arrow-rs.
// See <https://github.com/apache/arrow-rs/issues/5021>
Contributor

I haven't fully reviewed this PR, but I'm just curious whether you've checked if this code has been upstreamed to arrow-rs by now, so we might be able to leverage its code?


Development

Successfully merging this pull request may close these issues.

Add support for registering files in the Arrow IPC stream format as tables using register_arrow or similar
