Skip to content

Conversation

@Fokko
Copy link
Contributor

@Fokko Fokko commented Dec 24, 2024

A second attempt to fix this (the first attempt in #11604 but that one was not safe).

When a field that's dropped that still part of an older partition spec, we currently get an error. See #11842

This will replace it with the UnknownType which will be read as a null. This will work fine, since when the evaluator encounters a null, it will assume that there is no value:

if (parts == null) {
// the predicate has no partition column
return Expressions.alwaysTrue();
}

I misused the UnknownType here, but I think it is better than returning an arbitrary type, like we do in the UnknownTransform:

@Override
public Type getResultType(Type type) {
// the actual result type is not known
return Types.StringType.get();
}

Unfortunately, my first approach, where we drop the field, is incorrect, since V1 tables rely on the order.

Fixes #4563
Fixes #12738

@github-actions
Copy link

github-actions bot commented Mar 3, 2025

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 3, 2025
@Fokko Fokko removed the stale label Mar 6, 2025
@Fokko Fokko force-pushed the fd-fix-missing-field branch from 842f32b to 83ac170 Compare April 16, 2025 07:04
@Fokko Fokko force-pushed the fd-fix-missing-field branch from 83ac170 to 90f5de3 Compare April 16, 2025 07:19
@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label May 17, 2025
@github-actions github-actions bot removed the stale label May 21, 2025
stevenzwu
stevenzwu previously approved these changes Jul 1, 2025
Copy link
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}

@Test
public void testPartitionTypeDropInactiveFields() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit, this is testing "ignoring" inactiveFields, they aren't dropped right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are dropped from the struct that's being returned, but ignore is maybe a bit friendlier :)

@stevenzwu stevenzwu added this to the Iceberg 1.10.0 milestone Jul 1, 2025
@stevenzwu stevenzwu dismissed their stale review July 1, 2025 19:13

left a comment to be clarified


public static PartitionSpec fromJson(Schema schema, JsonNode json) {
return fromJson(json).bind(schema);
return fromJson(json).bindUnchecked(schema);
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about changing the behavior of this API to bindUnchecked since that skips over validations that still apply when parsing the schema. If I understand the intent,, it's because we're currently failing here

I think I agree that fundamentally when parsing a spec and binding we don't care if a source field exists or not since we'll replace it with unknown later. However, there are some other checks in checkCompatibility that still seem worth running through at this point when the sourceType is not null like making sure that the transform and type does align, the sourcetype isn't something like a struct etc.

It's unfortunately a bit more complicated but should we have a checkCompatibilityIgnoringMissing and a buildIgnoringMissingFields which just does the same checks as checkCompatibility but essentially ignoring if a ifeld is missing, and if it is defined we still do the existing checks? Then we largely keep the same guarantees on this function while solving the issue we set out to solve.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @RussellSpitzer @stevenzwu what do you think?

Copy link
Contributor

@stevenzwu stevenzwu Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unfortunately a bit more complicated but should we have a checkCompatibilityIgnoringMissing and a buildIgnoringMissingFields which just does the same checks as checkCompatibility but essentially ignoring if a ifeld is missing, and if it is defined we still do the existing checks?

this seems like a reasonable approach. We can also just add a boolean arg to the checkCompatibility method. it is a non-public method.

We still need the strict check of source field for the code path of TableMetadata.addPartitionSpecInternal.

TableMetadataParser is the only place using bindUnchecked. I assume it is for the same reason. Maybe the buildChecked can switch to this trimmed down version of checkCompatibility and ignore inactive source fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change to buildUnchecked is required because of a check that was added not long ago: #12887, where someone was previously allowed to create an invalid partition spec.

I've updated the code. Let me know what you think. If we don't want to use buildUnchecked, then we need three variations:

  • buildStrict when a new PartitionSpec is added, we want to make sure that all the fields are there
  • build with allowing missing fields for the inactive specs that contain fields that are dropped.
  • bindUnchecked when we don't want to perform checks at all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TableMetadataParser is the only place using bindUnchecked. I assume it is for the same reason.

This is correct

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thank you @Fokko I think this is a great fix for a long standing issue that we should get in! Thank you @RussellSpitzer @stevenzwu @HonahX for reviewing!

@amogh-jahagirdar amogh-jahagirdar merged commit 8134815 into apache:main Jul 2, 2025
43 checks passed
@Fokko Fokko deleted the fd-fix-missing-field branch August 21, 2025 08:08
@amogh-jahagirdar amogh-jahagirdar mentioned this pull request Sep 2, 2025
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

5 participants