Skip to content

Conversation

@brancz
Copy link
Contributor

@brancz brancz commented May 5, 2025

Which issue does this PR close?

Rationale for this change

See the above issues. And this is a follow up to

What changes are included in this PR?

Removal of dict_id from the canonical Schema's Field and the resulting repercussions of that.

Are there any user-facing changes?

Yes, the previously declared deprecations and a few more resulting changes that we didn't foresee previously.

The ones we did not previously foresee involve those places where we now need to pipe through an IPC schema where we previously only had the regular Schema (which used to be enough since that carried the dict_id):

  • flight_data_to_arrow_batch (arrow-flight)
  • arrow_data_from_flight_data (arrow-flight; sql)
  • read_record_batch (arrow-ipc)
  • read_dictionary (arrow-ipc)
  • FileDecoder::new (arrow-ipc)

I think these don't have stability guarantees but for the sake of completeness:

  • field_to_json (arrow-integration-test)
  • record_batch_from_json (arrow-integration-test)
  • array_from_json (arrow-integration-test)
  • dictionary_array_from_json (arrow-integration-test)

@tustvold @alamb @thinkharderdev

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels May 5, 2025
@brancz brancz force-pushed the ipc-dict-id-2 branch 5 times, most recently from ac99e01 to f690fc8 Compare May 5, 2025 11:58
@tustvold
Copy link
Contributor

tustvold commented May 5, 2025

It seems unfortunate to have methods requiring multiple different types of schema passed to them.

I'm afraid I don't have the capacity at the moment to disentangle what this PR is actually changing, the flight changes in particular I think will lead to a breaking change that hasn't been communicated via a prior deprecation.

I'm going to have to leave this one to other maintainers to look into, I no longer have the necessary context to follow what the IPC APIs are doing anymore.

@brancz
Copy link
Contributor Author

brancz commented May 5, 2025

In all cases we have the option to remove the “regular” schema, but it would have to mean that we construct it wherever we need the arrow-schema schema, which we can do but it’s a schema ref almost everywhere so it seemed like that was intentional for performance or lifetime reasons. I can definitely make that change if that’s the path we want to try there’s a danger though that it is perf sensitive and then we need to end up breaking the api twice.

@adriangb
Copy link
Contributor

adriangb commented Jun 9, 2025

@brancz I think it would help if we could break this PR up into smaller pieces. I came here looking to change Field::try_merge to drop the dictionary id checks. Is there any way we can disentangle these smaller changes from the larger IPC changes and such?

@brancz
Copy link
Contributor Author

brancz commented Jun 11, 2025

I guess we theoretically could, but that seems inconsistent with the rest of the codebase and it could cause unexpected failure and behavior without the remaining changes of this PR.

@alamb how do you propose we continue with this? I'm happy to rebase and fix the remaining things, if we can agree on the general trajectory of continuing with this strategy.

@alamb
Copy link
Contributor

alamb commented Jun 13, 2025

I guess we theoretically could, but that seems inconsistent with the rest of the codebase and it could cause unexpected failure and behavior without the remaining changes of this PR.

@alamb how do you propose we continue with this? I'm happy to rebase and fix the remaining things, if we can agree on the general trajectory of continuing with this strategy.

I like @adriangb 's suggestion to try to break this PR up into smaller changes as it will be easier to review. Specifically I think it will be easier to achieve what @tustvold describes as:

disentangle what this PR is actually changing

I am finding this PR hard to review because I feel it is hard to understand the implications of the change (on compatibility, performance, etc). Smaller PRs might make that easier

However, there might also be no way to break it up smaller . Some thoughts I had:

  1. changes to flight in one PR
  2. changes to the integration test (to ignore / stop using the ipc fields)
  3. changes to the reader/writer
  4. Then remove the fields

alamb pushed a commit that referenced this pull request Jul 3, 2025
# Which issue does this PR close?

- Closes #7810.

# Rationale for this change

Removes the last batch of functions that can be removed in 56.0.0.

Some deprecated functions remain (esp in arrow-schema and arrow-ipc),
but those will be dealt with elsewhere (#7467)

# What changes are included in this PR?

# Are these changes tested?

Covered by existing tests

# Are there any user-facing changes?

Yes, public functions are removed
@brancz
Copy link
Contributor Author

brancz commented Jul 14, 2025

I split out just the IPC layer changes in #7929. That would have to be merged first in order to do the remaining changes that rely on the IPC layer no longer using the dict_id field.

@brancz
Copy link
Contributor Author

brancz commented Jul 22, 2025

I was able to split this PR into 5 separate PRs so I'm closing this one.

@brancz brancz closed this Jul 22, 2025
alamb pushed a commit that referenced this pull request Jul 22, 2025
…7968)

# Which issue does this PR close?

Closes #6356

# Rationale for this change

Now that #7940 is merged, nothing
useful can be done with the `dict_id` field, therefore, it is now safe
to be removed from this requirement.

This was also split out from:
#7467

# What changes are included in this PR?

No longer require the `dict_id` fields of two `Field`s of schemas being
merged to be equal, as at this point the `dict_id` is only an IPC
concern, and the fact that it is still in the struct definition is just
legacy, marked for removal, we're just going through the proper
procedure of deprecating and replacing the APIs that use it.

# Are these changes tested?

Tests passing.

# Are there any user-facing changes?

No API changes, just a behavior change, that was to be expected and
desired due to the deprecations around the `dict_id` field.

@alamb @adriangb @tustvold
@alamb
Copy link
Contributor

alamb commented Jul 23, 2025

I was able to split this PR into 5 separate PRs so I'm closing this one.

Thank you very much for doing so 🤗

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate parquet Changes to the parquet crate

Projects

None yet

4 participants