Core, Spark: Handle unknown type during deletes #14356
Conversation
Force-pushed 5d717a1 to 5ab73bf
```java
classes[i] = result.typeId().javaClass();
if (null == sourceType) {
  // When the source field has been dropped we cannot determine the type
  classes[i] = Types.UnknownType.get().typeId().javaClass();
```
Handling it similarly to how it's done in iceberg/api/src/main/java/org/apache/iceberg/PartitionSpec.java, lines 136 to 139 in 5ab73bf:

```java
// When the source field has been dropped we cannot determine the type
if (sourceType == null) {
  resultType = Types.UnknownType.get();
}
```

which was added by 8134815.
I think this should default sourceType to UnknownType.get() and still call getResultType. The reason is that for many transforms, the result type is known; for example, bucket always produces IntegerType.get(). That will result in fewer cases where we end up with an unknown type and fall back to Void.class here.
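A minimal sketch of that suggestion (the helper shape and variable names are assumptions, not the PR's code):

```java
import org.apache.iceberg.PartitionField;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

// Default the source type to UnknownType when the source field was dropped,
// then still ask the transform for its result type; e.g. bucket(...) yields
// IntegerType regardless of the (now unknown) source type.
static Class<?> resultJavaClass(Schema schema, PartitionField field) {
  Type sourceType = schema.findType(field.sourceId());
  if (sourceType == null) {
    sourceType = Types.UnknownType.get();
  }

  return field.transform().getResultType(sourceType).typeId().javaClass();
}
```

Only transforms whose result genuinely depends on the source type (e.g. identity) would still surface UnknownType this way.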
```java
case DECIMAL:
  return ByteBuffer.wrap(((BigDecimal) value).unscaledValue().toByteArray());
case UNKNOWN:
  // underlying type not known
```
The lower/upper bound is converted in:

```java
this.lowerBound = Conversions.fromByteBuffer(primitive, summary.lowerBound());
this.upperBound = Conversions.fromByteBuffer(primitive, summary.upperBound());
```
```java
  return (ByteBuffer) value;
case DECIMAL:
  return ByteBuffer.wrap(((BigDecimal) value).unscaledValue().toByteArray());
case UNKNOWN:
```
The lower/upper bound is converted to a byte buffer in iceberg/core/src/main/java/org/apache/iceberg/PartitionSummary.java, lines 80 to 81 in c07f2aa:

```java
min != null ? Conversions.toByteBuffer(type, min) : null,
max != null ? Conversions.toByteBuffer(type, max) : null);
```
Seems reasonable to me.
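For context, a simplified sketch of what the UNKNOWN case in the serialization path presumably looks like (assumed behavior; only the relevant cases are shown, and the method name is hypothetical):

```java
import java.math.BigDecimal;
import java.nio.ByteBuffer;
import org.apache.iceberg.types.Type;

// An UNKNOWN partition type cannot be serialized, so the conversion yields
// null instead of throwing when PartitionSummary builds the bound buffers.
static ByteBuffer toByteBufferSketch(Type type, Object value) {
  switch (type.typeId()) {
    case DECIMAL:
      return ByteBuffer.wrap(((BigDecimal) value).unscaledValue().toByteArray());
    case UNKNOWN:
      return null; // underlying type not known; no bound can be produced
    default:
      throw new UnsupportedOperationException("Cannot serialize type: " + type);
  }
}
```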
```java
byte[] unscaledBytes = new byte[buffer.remaining()];
tmp.get(unscaledBytes);
return new BigDecimal(new BigInteger(unscaledBytes), decimal.scale());
case UNKNOWN:
```
The lower/upper bound is converted from a byte buffer in:

```java
this.lowerBound = Conversions.fromByteBuffer(primitive, summary.lowerBound());
this.upperBound = Conversions.fromByteBuffer(primitive, summary.upperBound());
```
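Mirroring the write side, a hedged sketch of how the deserialization path might treat UNKNOWN (assumed, not the PR's verbatim code):

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

// When a stored bound belongs to an UNKNOWN type there is nothing meaningful
// to deserialize, so null is returned rather than failing the scan.
static Object fromByteBufferSketch(Type type, ByteBuffer buffer) {
  switch (type.typeId()) {
    case DECIMAL:
      Types.DecimalType decimal = (Types.DecimalType) type;
      byte[] unscaledBytes = new byte[buffer.remaining()];
      buffer.duplicate().get(unscaledBytes);
      return new BigDecimal(new BigInteger(unscaledBytes), decimal.scale());
    case UNKNOWN:
      return null; // the source field was dropped; bounds carry no usable type
    default:
      throw new UnsupportedOperationException("Cannot deserialize type: " + type);
  }
}
```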
```java
Object datum = get(row, i);
ValueWriter<Object> writer = writers[i];

if (NullWriter.INSTANCE.getClass().equals(writer.getClass()) && null != datum) {
```
Otherwise this fails with a ClassCastException, since datum can be a String, which is then cast to the NullWriter's Void.
Instead, what about just changing the type parameter of NullWriter to Object? The argument is ignored anyway so that avoids the CCE and we don't have to special-case it here.
That would be ideal, but it breaks the API, because ValueWriters.nulls() is public. We could deprecate that method and mention that the return type will be changed to ValueWriter<Object> in the next release, wdyt?
You can get around this by casting to a raw unparameterized type:

```java
private static class NullWriter implements ValueWriter<Object> {
  @SuppressWarnings({"unchecked", "rawtypes"})
  private static final ValueWriter<Void> INSTANCE = (ValueWriter) new NullWriter();

  private NullWriter() {}

  @Override
  public void write(Object ignored, Encoder encoder) throws IOException {
    encoder.writeNull();
  }
}
```

The resulting ValueWriter<Void> is more restrictive than ValueWriter<Object>, so it is safe and you don't have to change the API.
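A quick check that callers are unaffected (assuming ValueWriters.nulls() keeps returning the shared INSTANCE):

```java
// Existing call sites keep compiling against the narrower Void signature.
ValueWriter<Void> nullWriter = ValueWriters.nulls();
```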
nice, that would also work, thanks
...ensions/src/test/java/org/apache/iceberg/spark/extensions/TestAlterTablePartitionFields.java (outdated; resolved)
Force-pushed 5ab73bf to 3fe4ee2
```java
}

boolean canContain(Object value) {
  if (Types.UnknownType.get().equals(type)) {
```
Should this be moved after the null and NaN value checks? The null and NaN counts are still valid, right?

This should also have a test update.
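A hedged sketch of that reordering; the class shape and field names here are assumptions, not Iceberg's actual stats holder:

```java
import java.util.Comparator;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.NaNUtil;

// Null/NaN answers stay valid even for a dropped field, so only the bound
// comparison needs the UnknownType early-out.
class FieldStatsSketch {
  private Type type;
  private boolean containsNull;
  private boolean containsNaN;
  private Object lowerBound;
  private Object upperBound;
  private Comparator<Object> comparator;

  boolean canContain(Object value) {
    if (value == null) {
      return containsNull; // null counts are tracked independently of the type
    }

    if (NaNUtil.isNaN(value)) {
      return containsNaN; // NaN counts are likewise still valid
    }

    if (Types.UnknownType.get().equals(type)) {
      return true; // no usable bounds for a dropped source field; assume a match
    }

    return (lowerBound == null || comparator.compare(lowerBound, value) <= 0)
        && (upperBound == null || comparator.compare(value, upperBound) <= 0);
  }
}
```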
Force-pushed 3fe4ee2 to d2441c5
rdblue left a comment:
Looks good once the null writer issue is fixed.
thanks for reviewing @rdblue
Historical background / Problem
There was an issue a while ago where one couldn't drop a partition field and then remove it from a table's schema, as that would always result in an error. This was fixed by #11868, where the field's type was replaced with UnknownType and thus ignored during reads. However, if a user performs a DELETE operation after dropping a partition field and then removing it from the table's schema, the DELETE fails in a few places, because those places don't deal with the UnknownType.

Approach

This PR updates all of the places that were failing during the DELETE operation to properly deal with the UnknownType. I'm currently unsure whether that's the right approach to this issue, so I'm looking for feedback/ideas. I've also added some tests that reproduce the original issue.

fixes #14343