Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change UnionArray constructors #5623

Merged
merged 5 commits into from
May 8, 2024

Conversation

mbrobbel
Copy link
Contributor

Which issue does this PR close?

Closes #5613.

Rationale for this change

It came up in #5585 that the UnionArray constructors are a bit odd when compared to the other array constructors.

What changes are included in this PR?

UnionArray::try_new and UnionArray::new_unchecked

Changed arguments:
From: field_type_ids: &[i8], type_ids: Buffer, value_offsets: Option<Buffer>, child_arrays: Vec<(Field, ArrayRef)>
To: fields: UnionFields, type_ids: ScalarBuffer<i8>, offsets: Option<ScalarBuffer<i32>>, children: Vec<ArrayRef>.

The order of children arrays must match the order of fields in the UnionFields.

The validation in try_new is a bit more strict.

UnionBuilder

It now uses a BTreeMap instead of a HashMap internally to get sorted iteration.

Are there any user-facing changes?

This is a breaking change that changes the UnionArray constructors.

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Apr 10, 2024
@tustvold tustvold added the api-change Changes to the arrow API label Apr 11, 2024
@tustvold
Copy link
Contributor

In light of #5368 I think I would like to sit on this for a bit, and we can look to get this bundled in when we next make a breaking release

@tustvold
Copy link
Contributor

tustvold commented May 8, 2024

I took the liberty of merging up main, making some minor tweaks to avoid hashmaps, and adding some additional tests. I think if you're happy this is then ready to go in (the next release is going to be breaking)

@tustvold tustvold merged commit b25c441 into apache:master May 8, 2024
24 of 25 checks passed
@mbrobbel
Copy link
Contributor Author

I took the liberty of merging up main, making some minor tweaks to avoid hashmaps, and adding some additional tests. I think if you're happy this is then ready to go in (the next release is going to be breaking)

Why would we want to avoid hashmaps/sets here?

It seems the negative type_id check was dropped.

@mbrobbel mbrobbel deleted the array/union/constructor branch May 13, 2024 09:35
@tustvold
Copy link
Contributor

Why would we want to avoid hashmaps/sets here?

They're slower, given there are only 128 possibilities and they'll almost always be clustered towards the lower end, it is faster just to use a vec

It seems the negative type_id check was dropped.

Would you care to expand on this?

@mbrobbel
Copy link
Contributor Author

Why would we want to avoid hashmaps/sets here?

They're slower, given there are only 128 possibilities and they'll almost always be clustered towards the lower end, it is faster just to use a vec

Is there a benchmark? I'm wondering how much slower.

It seems the negative type_id check was dropped.

Would you care to expand on this?

7cd47de#diff-7e118aa8aabb6fbbb2b5dcd508dd5b3219300bcb829c6cbef6cf35146320806aR191-R199

@tustvold
Copy link
Contributor

The check got moved to here - a69cbac#diff-7e118aa8aabb6fbbb2b5dcd508dd5b3219300bcb829c6cbef6cf35146320806aR209

Is there a benchmark? I'm wondering how much slower.

No, but I would expect it to be quite significant. You are comparing the performance of a pointer indirection to a hashmap lookup. Given this check is done for every type in the array, I would expect it to be quite significant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cleanup UnionArray Constructors
2 participants